Multi-Omics Data Fusion via a Joint Kernel Learning Model for Cancer Subtype Discovery and Essential Gene Identification

Feng, Jie; Jiang, Limin; Li, Shuhao; Tang, Jijun; Wen, Lan

doi:10.3389/fgene.2021.647141

ORIGINAL RESEARCH article

Front. Genet., 04 March 2021

Sec. Computational Genomics

Volume 12 - 2021 | https://doi.org/10.3389/fgene.2021.647141

This article is part of the Research Topic Artificial Intelligence in Bioinformatics and Drug Repurposing: Methods and Applications View all 13 articles

Multi-Omics Data Fusion via a Joint Kernel Learning Model for Cancer Subtype Discovery and Essential Gene Identification

$\r\nJie Feng$ Jie Feng¹

Limin Jiang^1*

Shuhao Li¹

Jijun Tang^1,2,3

Lan Wen^4*

¹School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
²School of Computational Science and Engineering, University of South Carolina, Columbia, SC, United States
³Key Laboratory of Systems Bioengineering (Ministry of Education), Tianjin University, Tianjin, China
⁴Changsha Municipal Center of Disease Control, Changsha, China

The multiple sources of cancer determine its multiple causes, and the same cancer can be composed of many different subtypes. Identification of cancer subtypes is a key part of personalized cancer treatment and provides an important reference for clinical diagnosis and treatment. Some studies have shown that there are significant differences in the genetic and epigenetic profiles among different cancer subtypes during carcinogenesis and development. In this study, we first collect seven cancer datasets from the Broad Institute GDAC Firehose, including gene expression profile, isoform expression profile, DNA methylation expression data, and survival information correspondingly. Furthermore, we employ kernel principal component analysis (PCA) to extract features for each expression profile, convert them into three similarity kernel matrices by Gaussian kernel function, and then fuse these matrices as a global kernel matrix. Finally, we apply it to spectral clustering algorithm to get the clustering results of different cancer subtypes. In the experimental results, besides using the P-value from the Cox regression model and survival analysis as the primary evaluation measures, we also introduce statistical indicators such as Rand index (RI) and adjusted RI (ARI) to verify the performance of clustering. Then combining with gene expression profile, we obtain the differential expression of genes among different subtypes by gene set enrichment analysis. For lung cancer, GMPS, EPHA10, C10orf54, and MAGEA6 are highly expressed in different subtypes; for liver cancer, CMYA5, DEPDC6, FAU, VPS24, RCBTB2, LOC100133469, and SLC35B4 are significantly expressed in different subtypes.

Introduction

Cancer is the important leading cause of death in the world and is responsible for an estimated 9.6 million deaths in 2018. Unlike most other diseases, cancer is not a sort of single disease but is a group of diseases involving abnormal cell growth with the potential to invade or spread to other parts of the body. In the same type of cancer, patients usually have the same or similar external appearances, but in most cases, universal drugs and universal treatment methods do not produce good prognosis in all cases. The multiple sources of cancer determine its multiple causes, and the same cancer can be composed of many different subtypes. The discovery and identification of cancer subtypes are a key part of personalized cancer treatment and provide an important reference for clinical diagnosis and treatment (de Kruijf et al., 2013).

The Cancer Genome Atlas (TCGA) is the largest open cancer genome database to date initiated by the US government, which aims to catalog and discover major cancer-causing genome alterations in large cohorts of over 30 human tumors through large-scale genome sequencing and integrated multidimensional analyses. It covers a variety of omics expression data including genomics, transcriptomics, copy number variation, DNA methylation, proteomics, and clinical information of follow-up cases (Tomczak et al., 2015; Jiang et al., 2019a), which provide great support for the detection of cancer subtypes by computational methods.

Recently, many methods for cancer subtypes recognition and marker extraction have been proposed (Yeoh et al., 2002; Lapointe et al., 2004; Figueroa et al., 2010; Yang et al., 2017a; Pan et al., 2019). Some models are based on single expression data, including gene expression (Yeoh et al., 2002; Lapointe et al., 2004), microRNA (miRNA) expression (Yang et al., 2017a,b; Liu and Yang, 2018), copy number variation (Pan et al., 2019), and DNA methylation (Figueroa et al., 2010). Lapointe et al. (2004) identified three subclasses of prostate tumors based on distinct patterns of gene expression. Yang et al. (2017a) clustered miRNAs based on Fisher linear discriminant analysis (FDA), using representative cluster member combinations as potential biomarkers. Pan et al. (2019) used copy number variation, a biomarker more likely to be used for cancer diagnosis than mRNA biomarkers, to further reveal differences between various breast cancer subtypes. Figueroa et al. (2010) examined the methylation profiles of 344 patients with acute myeloid leukemia (AML). Clustering of these patients by methylation data segregated patients into 16 groups. Five of these groups defined new AML subtypes. Also, there are methods to analyze and predict cancer subtypes by considering multiple expression data (Shen et al., 2009; Wang et al., 2014; Ge et al., 2016; Jiang et al., 2019b). The iCluster is a latent variable model-based clustering algorithm proposed by Shen et al. (2009). It uses multiple sources of data for integrated analysis to identify tumor subtypes. Similarity network fusion (SNF) is a network fusion method integrating multicomponent data, which was proposed by Wang et al. (2014). SNF builds a similar network of sample pairs on different histological data (gene, methylation, and miRNA) and then integrates the network to predict cancer subtypes.

Furthermore, due to the high dimensionality of research data, we need to find effective and suitable dimensionality reduction methods. Some methods, such as principal component analysis (PCA) and non-negative matrix factorization (NMF), have been used to combine clustering algorithms (Alter et al., 2000; Holter et al., 2000; Brunet et al., 2004). However, for the high-dimensional and non-linear gene data, the performance is not always good. In order to better handle these, we consider a non-linear version of PCA, kernel PCA (Schölkopf et al., 1998), which introduces a non-linear mapping function that can map data in the raw space to high-dimensional space. It can make the distribution of all mapped data linearized and simplified in high-dimensional space, and then PCA can be used to construct features.

In this study, inspired by SNF, we combine gene expression profile with isoform expression profile and DNA methylation expression data. We propose a novel method: first, take kernel PCA to extract features for each profile, then convert them into three similarity kernel matrices, and fuse them into one. Finally, we apply it to spectral clustering algorithm to get the clustering results of different cancer subtypes.

Materials and Methods

We propose a novel method for analyzing various cancer subtypes. First, we rescale the raw expression data by min–max normalization and reduce the dimensionality of data via kernel PCA, with a minimal loss of information. Then, in each cancer dataset, based on the expression of gene profile, isoform profile, and methylation data, we construct three similarity kernel matrices through the Gaussian kernel function and fuse them into a global similarity expression matrix. Finally, the integrated similarity kernel matrix is fed to spectral clustering, and the predictive clusters are identified. The flowchart of our proposed method is shown in Figure 1.

FIGURE 1

Figure 1. The flowchart of our novel method. (A) We first collect and check seven cancer datasets from The Cancer Genome Atlas (TCGA) dataset of the Broad Institute GDAC Firehose. (B) For each dataset, after representing, constructing, fusing, and clustering for gene, isoform, and DNA methylation data, we carry out survival analysis, gene set enrichment analysis, and others.

Data Sources

In our study, all research data are collected from the Broad Institute GDAC Firehose¹ (Center BITGDA, 2016). Firehose is an analytical infrastructure created at the Broad Institute based on the data of TCGA project (Tomczak et al., 2015), which provides genome-scale transcriptome data for various cancers and different levels of processed data for cancer analysis. Firehose gives a corresponding visual web platform, Firebrowse², which can easily access TCGA open access layer data. This greatly lowers the threshold for experimenters to operate the TCGA database and also makes the data for analyzing as consistent as possible. Here, we extract seven common cancer datasets: BLCA, BRCA, COAD, KIDNEY (KICH&KIRC&KIRP), LIHC, LUNG (LUAD&LUSC), and STAD. For each cancer dataset, it consists of gene expression information (gdac_rnaseqv2_genes_RSEM_normalized_Level_3, 2016-02- 18), isoform expression information (gdac_rnaseqv2_isoforms_ RSEM_normalized_Level_3, 2016-02-18), DNA methyla-tion expression information (gdac_Methylation_Preprocess_ mean_Level_3, 2016-02-18), and corresponding clinical information (gdac_Clinical_Pick_Tier1_Level_4, 2016-02-18). The clinical data are used in subsequent survival analyses, while the other three expression profiles are used to construct a suitable global similarity kernel matrix. We check redundant cases (reserve cases with number 01–09) in the four profiles at each cancer datasets and extract all valid cases that contain the above expression information. And then, we obtain the experimental input data for cluster analysis. A summary of our datasets is shown in Table 1.

TABLE 1

Table 1. Description of seven datasets.

Data Representation

Min–Max Normalization

Since we collect high-quality, standardized datasets directly from Firehose, the size of the data values can intuitively reflect the expression abundance of gene, isoform, or methylation. However, to eliminate the influence of digital distribution in three expression profiles and make them fused reasonable, we use min–max normalization to rescale the values. The general formula for a min–max of [0, 1] is defined as Eq. 1:

x {}^{'}= \frac{x - m i n (x)}{m a x (x) - m i n (x)} (1)

where x is an original expression value and x′ is the rescaled value.

Kernel Principal Component Analysis

The kernel PCA is a method for performing a non-linear form of PCA proposed in Schölkopf et al. (1998). Through using kernel PCA, the dimensionality of complex, non-linear features can be reduced effectively. Kernel PCA transforms the raw linear input space R into a high-dimensional feature space F by using some non-linear mapping, like a dot product matrix defined as Eq. 2:

K (x_{i}, x_{j}) = ⟨ Φ (x_{i}), Φ (x_{j}) ⟩ (2)

and calculates the principal components in F. Then compute projections onto the eigenvectors obtained by diagonalizing K to extract the principal components corresponding to the k of K (Schölkopf et al., 1998; Devi et al., 2014). In this study, we use the kernel PCA method and take polynomial kernel defined as Eq. 3:

K_{p o l y} (x_{i}, x_{j}) = {(x_{i}^{T} x_{j} + 1)}^{3} (3)

as the non-linear mapping. We adopt the default parameter, xi and xj are the expression vector of i-th case and j-th, and all the non-zero components are preserved. After performing the above rescale and reduction on the gene, isoform, and methylation expression profiles in the dataset, we gain the necessary input to construct a similarity kernel matrix.

Similarity Kernel Matrix

Kernel Construction

The kernel methods map data points into possibly high-dimensional feature space, where the distribution of all mapped data is linearized and simplified (Vert et al., 2004; Mei and Fei, 2010). Assume mapping function Φ(x), the computation of the inner product ⟨Φ(x_i), Φ(x_j)⟩ in the high-dimensional feature space F can be implemented in the original space R using kernel trick, K(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩, such that no explicit mapping function or even explicit feature representation is required. The size of the matrix used to represent the profile of N cancer cases is always N by N. This allows us to comprehensively consider the expression of three profiles for one specific cancer, which perform a more accurate cluster analysis (Vert et al., 2004). Here, we use the Gaussian kernel and the adjusted parameter γ, as Eq. 4:

K_{g a u s s i a n} (x_{i}, x_{j}) = e x p (- γ {|| x_{i} - x_{j} ||}^{2}) (4)

Kernel Fusion

Data fusion from multidimensional expression profiles has been shown to produce better results than considering a single expression information. Jiang et al. (2019b) and Li et al. (2020) have researched multi-omics data fusion and achieved good results. We fuse three different expression profiles (gene, isoform, and methylation) to construct a global similarity kernel matrix for each cancer. Therefore, we integrate three Gaussian kernel matrices. In our example, we adopt an average fusion strategy as Eq. 5:

K_{f u s e} = \frac{1}{3} (K_{g e n e}, K_{i s o f o r m}, K_{m e t h y l}) (5)

where K_gene, K_isoform, and K_methyl represent the similarity kernel matrix constructed by gene, isoform, and methylation expression profiles, respectively.

Spectral Clustering

Spectral clustering is a clustering method based on graph theory algorithm; the basic idea is to use the similarity matrix of the samples to obtain the feature vector of the feature decomposition for cluster analysis (Von Luxburg, 2007). Because of its excellent algebraic graph foundation, it can get a global loose solution for complex cluster structure (Jia et al., 2014). We use it as the core algorithm for cluster analysis. The process of spectral clustering algorithm is taken as follows. First, based on K_fuse, calculate the Laplacian matrix L. Then, construct the normalized Laplacian matrix D^−1/2LD^−1/2. D is a diagonal matrix whose diagonal element is the sum of the row elements of K_fuse. And compute the eigen vectors y corresponding to the eigen values of D^−1/2LD^−1/2. The matrices composed of corresponding eigen vectors y are standardized on a row basis to form the N_case × N_feature feature matrix Y. Finally, each row in Y is taken as a sample, which is clustered by discrete method to obtain cluster partition C(C₁, C₂, …, C_j). Each partition will represent a cancer subtype. The whole process of spectral clustering can be transformed into solving the optimization problem as Eq. 6:

\begin{matrix} m i n t r (Y^{T} D^{- \frac{1}{2}} L D^{- \frac{1}{2}} Y) \\ s . t . Y^{T} Y = I \end{matrix} (6)

where Y is the eigen matrix for the eigen values of D^−1/2LD^−1/2, D is the degree matrix of K_fuse, and L is the Laplacian matrix of K_fuse.

Results

In this section, we evaluate and compare the performance of our proposed method in multiple dimensions, using P-values and survival curves as the primary criteria and taking indexes such as RI and ARI into consideration. Finally, through gene set enrichment analysis (GSEA), some key genes supporting each subtype are obtained and displayed using heat maps and boxplots.

Evaluation Novel Method

We use the P-value of Cox regression model to evaluate the performance of several key steps of the proposed method (Pölsterl et al., 2015, 2016a,b). It includes applying kernel PCA to reduce the original data dimension, using similarity kernel fusion strategy to obtain feature input, and employing spectral clustering as the core clustering method to obtain the final clustering result. We calculate the P-values for the clusters on the seven datasets. A lower P-value indicates a more significant result. Here, we use 0.05 as the threshold for evaluation.

Performance of Kernel Principal Component Analysis

The kernel PCA is a non-linear version of PCA widely used in linear dimensionality reduction methods. Using kernel PCA, the dimensionality of complex, non-linear features can be reduced effectively. For the features with tens of thousands of dimensions in the original data, it can be reduced to only a few hundred. We compare all datasets, with and without kernel PCA, and the resulting P-values are shown in Table 2. According to the data in Table 2 (the number of clusters in the table is the optimal results with the lowest P-values), kernel PCA greatly improves the performance of the method and makes the results more reliable and stable.

TABLE 2

Table 2. Results with or without kernel principal component analysis (KPCA).

Performance of Fusion Strategy

Here, we compare the performance of using a single kernel directly with the use of a kernel fusion strategy on seven datasets, and the results are shown in Table 3. After using similarity kernel fusion, the P-values on seven datasets have been significantly improved. And for the dataset LIHC, although the performance on single kernel has already performed well, kernel fusion will further enhance the clustering results. We therefore conclude that the strategy for similarity kernel fusion is necessary.

TABLE 3

Table 3. P-values of taking single kernel and fused kernel.

Comparing With Other Methods

We compare the results with those of the method of Li et al. (2020). As shown in Table 4, we find that using kernel PCA for feature reduction, taking weighted fusion strategies instead of complex SKF, and finally generating clustering results from spectral clustering have comparable reliability and stability.

TABLE 4

Table 4. Performance between Li’s and our method.

Clustering Analysis

The Rand index (RI) (Rand, 1971) is an indicator for evaluating clustering performance in statistics, by measuring the similarity between two data clusters. However, a problem with the index is that the expected value of the RI of two random partitions cannot take a constant value (Steinley, 2004). The adjusted RI proposed by Yeung and Ruzzo (2001), which is the corrected-for-chance version of the RI, can effectively avoid RI’s insufficient. We measure both the LUNG and KIDNEY datasets and compare the results obtained by Li et al. (2020). The results are shown in Table 5. From the indicators such as RI and ARI, our clustering is more stable, and the effect is much better than Li’s in KIDNEY and LUNG. Especially on LUNG, although we get a higher P-value, we have a 28% advantage on the ARI.

TABLE 5

Table 5. Rand index (RI) and adjusted Rand index (ARI) on two datasets between Li’s and our method.

Survival Analysis

Survival analysis is a branch of statistics that analyzes the expected duration until an event occurs, such as the death of a cancer patient. We find that patients with subtype 2 of liver cancer (LIHC), subtype 6 of colon cancer (COAD), and subtype 3 of stomach cancer (STAD) have higher mortality. More attention should be paid to these patients. We also see that the average survival time of breast cancer patients (BRCA) and lung cancer patients (LUNG) is longer than that of others. It indicates that these cluster results can be used to guide clinical treatment. The survival curves of all datasets are shown in Figure 2.

FIGURE 2

Figure 2. Survival curves of various subtypes for seven cancer datasets.

Gene Set Enrichment Analysis

GSEA (Mootha et al., 2003; Subramanian et al., 2005; Huang et al., 2020) is an analysis method for genome-wide expression profile chip data, which compares genes with a predefined set of genes. Synthesize the existing information base of gene location, nature, function, biological significance, etc., to build a molecular tag database (MSigDB), in which known genes are identified by chromosomal location, established gene set, and model sequence. Tumor-related gene set and GO gene set and other functional gene sets are grouped and classified. By analyzing the gene expression profile data, we can understand their expression status in a specific functional gene set and whether this expression status has some statistical significance. In this paper, we use Broad Institute’s offline analysis software GSEA_4.0.2, and C4 collection (cancer gene neighborhoods and cancer modules), provided by Broad Institute in the Molecular Signatures Database (MSigDB), which is a computational gene set defined by mining large collections of cancer-oriented microarray data. We analyze the LUNG and LIHC datasets, and we collect the expression data of genes with higher scores on different subtypes. The heat maps drawn are shown in Figure 3.

FIGURE 3

Figure 3. Differential expression of some essential genes in different subtypes on LUNG and LIHC.

Essential Gene Analysis

For each subtype on the datasets LUNG and LIHC, we select the essential gene that can highly distinguish the subtypes. According to the expression of each gene on its dataset, we obtain the box diagrams as shown in Figures 4, 5. We find that GMPS, EPHA10, C10orf54, and MAGEA6 are highly expressed in different subtypes on the dataset LUNG; and CMYA5, DEPDC6, FAU, VPS24, RCBTB2, LOC100133469, and SLC35B4 are significantly expressed in different subtypes on the dataset LIHC, respectively.

FIGURE 4

Figure 4. The boxplots for essential genes supporting each subtype on LUNG.

FIGURE 5

Figure 5. The boxplots for essential genes supporting each subtype on LIHC.

Conclusion

In this paper, we propose a model for accurately predicting cancer subtypes. First, we collect seven cancer datasets from Firehose website, which contained three kinds of expression data (gene expression, isoform expression, and methylation expression). Then we construct three similar kernels for three kinds of expression data, respectively, and we fuse the three kernels into the global one. Finally, the cancer subtypes are discovered by spectral clustering. We take P-value as the overall evaluation criterion, combining with survival curve analysis and GSEA.

In the future, we will also try other machine learning methods or deep learning methods (Kong and Yu, 2018; Ding et al., 2019a,b; Shen et al., 2019; Gao et al., 2020; Lee et al., 2020; Wang et al., 2020), to deal with the problem of small samples and large features of cancer data and predict cancer subtypes more accurately.

Data Availability Statement

The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author/s.

Author Contributions

JF and SL conceived and designed the experiments. JF and LJ performed the experiments and analyzed the data. LW and LJ wrote the manuscript. JT supervised the experiments and reviewed the manuscript. All authors contributed to the article and approved the submitted version.

Funding

This work was supported by a grant from the National Natural Science Foundation of China (NSFC 61902271 and 61972280) and the National Key RD Program of China (2020YFA0908401, 2020YFA0908400, 2018YFC0910405, and 2017YFC0908400).

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

We would like to thank the Broad Institute GDAC Firehose for providing extensive amounts of biological data for the community.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2021.647141/full#supplementary-material

Footnotes

References

Alter, O., Brown, P. O., and Botstein, D. (2000). Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl. Acad. Sci.U.S.A. 97, 10101–10106. doi: 10.1073/pnas.97.18.10101

PubMed Abstract | CrossRef Full Text | Google Scholar

Brunet, J.-P., Tamayo, P., Golub, T. R., and Mesirov, J. P. (2004). Metagenes and molecular pattern discovery using matrix factorization. Proc. Natl. Acad. Sci.U.S.A. 101, 4164–4169. doi: 10.1073/pnas.0308531101

PubMed Abstract | CrossRef Full Text | Google Scholar

Center BITGDA (2016). Analysis-Ready Standardized TCGA Data From Broad GDAC Firehose 2016_01_28 run: Dataset. Cambridge, MA: Broad Institute of MIT and Harvard, doi: 10.7908/C11G0KM9