Incomplete multi-view gene clustering with data regeneration using Shape Boltzmann Machine

https://doi.org/10.1016/j.compbiomed.2020.103965

Abstract

Deciphering patterns in the structure and function of genes can greatly aid the understanding of genetic biology and genomics. The availability of multiple omics data, along with the advent of machine learning techniques, helps medical professionals gain insights into various biological regulations. Gene clustering is one of many computational techniques that can help in understanding gene behavior. More comprehensive and reliable insights can be gained if different modalities/views of biomedical data are considered. In most multi-view cases, however, each view contains some missing data, which leads to incomplete multi-view clustering. In this study, we present a deep Boltzmann machine-based incomplete multi-view clustering framework for gene clustering. We regenerate the data of three NCBI datasets in the incomplete modalities using Shape Boltzmann Machines. The overall performance of the proposed multi-view clustering technique is evaluated using the Silhouette index and the Davies–Bouldin index, and the comparative analysis shows an improvement over state-of-the-art methods. Finally, we perform Welch's t-test to establish that the improvement attained by the proposed incomplete multi-view clustering is statistically significant.

Availability of data and materials

https://github.com/piyushmishra12/IMC.

Introduction

Clustering is a widely used unsupervised machine learning technique. It has applications in many domains, such as the development of optimization models [1], electricity customer classification [2], spectral-spatial classification [3], and MRI segmentation [4]. The majority of existing clustering algorithms rely on a single metric, such as the density, connectivity, or symmetry of the data points. These approaches are highly sensitive to parameter settings [5]. Hence, identifying the best-suited clustering algorithm for a particular dataset is one of the significant challenges in machine learning.

One of the crucial problems in computational biology is understanding the biological functionalities of genes. Recent years have witnessed a sharp rise in research on high-throughput technologies, especially gene expression profiling [6]. Gene expression profiles are very informative, and analyzing them helps us understand the molecular functionalities of genes. Gene expression profiles are therefore used extensively for gene clustering [7], gene classification [8], disease function prediction [9,10], hub gene prediction [11], and many other challenging bioinformatics tasks. Genes with similar expression patterns have similar functionalities [8,12,13]. In recent years, it has been shown that integrating various heterogeneous biomedical data, rather than analyzing gene expression profiles alone, yields more insight into the underlying biological data.

With the advent of high-throughput technologies, there has been an explosion of different kinds of biomedical data, e.g., DNA sequences, amino acid sequences, protein structures, and many more. Each view/modality of the data contains essential information that complements the other views. This has led to a surge of interest in using the information from various views/modalities to solve popular bioinformatics tasks [14–17]. In most real-life multi-view scenarios, information about a few instances is missing in some of these views because the data are collected in different ways.

To overcome this bottleneck, incomplete multi-view clustering (IMC) techniques have become increasingly prominent. One of the prevalent solutions to the IMC problem is Non-negative Matrix Factorization (NMF) [18–20]. NMF-based methods learn a common latent space for complete instances and private latent representations for incomplete instances. However, NMF fails to handle more than two incomplete views. More recently, researchers have used generative adversarial networks (GANs) to generate missing instances [21–23].

Drawing inspiration from the above facts, in this paper we propose a Shape Boltzmann Machine (SBM)-based incomplete multi-view clustering technique for gene partitioning. Here, the Shape Boltzmann Machine [24] is exploited for generating missing instances. The proposed incomplete clustering technique is applied to three real-life NCBI gene expression datasets, and it exhibits better performance than baselines and other state-of-the-art methods. The main contributions of the proposed incomplete multi-view clustering technique are the following:

  1. Conversion to images: Each modality has one dataset, the entirety of which is converted to a single matrix. For the sake of convenience of comprehension and representation, each matrix is referred to as an image.

  2. Regeneration: All images are regenerated with reference to the general image using SBM for a more uniform computation and modality completion.

  3. Generating a common image: A standard image is generated using all the modalities. This helps in using all the modalities equally for the final clustering.

  4. Clustering: After the respective images have been regenerated, clustering is carried out using the K-Means clustering algorithm, and its robustness is verified.
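The last two stages can be illustrated with a minimal sketch. It assumes that every view has already been regenerated into a matrix of the same shape (rows are genes), and that the common image is the element-wise average of the views, as the conclusions state that average pooling is used. The function names, the toy matrices, and the plain Lloyd's-algorithm K-Means below are illustrative assumptions, not the authors' implementation, and the SBM regeneration step itself is not shown.

```python
import random


def average_pool(views):
    """Element-wise mean of equally shaped matrices, one matrix per view."""
    n_rows, n_cols = len(views[0]), len(views[0][0])
    return [
        [sum(view[i][j] for view in views) / len(views) for j in range(n_cols)]
        for i in range(n_rows)
    ]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per row."""
    rng = random.Random(seed)
    centroids = [list(row) for row in rng.sample(rows, k)]
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(row, centroids[c]))
            for row in rows
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [row for row, label in zip(rows, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy placeholder matrices (rows = genes); these are NOT data from the paper.
expression_img = [[0.0, 0.2], [0.1, 0.0], [5.0, 5.1], [5.2, 4.9]]
sequence_img = [[0.3, 0.1], [0.2, 0.3], [4.8, 5.0], [5.1, 5.2]]
pdb_img = [[0.0, 0.0], [0.3, 0.3], [5.2, 4.9], [4.9, 4.9]]

common_img = average_pool([expression_img, sequence_img, pdb_img])
labels = kmeans(common_img, k=2)
print(labels)
```

Because every view contributes with equal weight to the pooled matrix, no single modality dominates the distance computation in the clustering step. On the toy data, the first two genes and the last two genes fall into separate clusters; which cluster receives label 0 depends on the random initialization.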

These stages are discussed in detail in the following sections. The primary reason for following this pipeline is to keep the computation as simple as possible: every stage seeks to minimize redundancy in the workflow. The study draws on concepts from several domains, including rudimentary encoding of gene sequence data, vocabulary-based hashing of Protein Data Bank (PDB) data, and conversion of non-image data for image-based computing.
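The exact encodings are not specified in this excerpt. As one plausible reading, the sketch below maps each nucleotide to a small integer and maps each distinct PDB ID to its index in a vocabulary built from the data, padding every row to a fixed length so that the rows can be stacked into a matrix (an "image"). The mapping, the padding value 0, the function names, and the PDB IDs are all hypothetical placeholders chosen for illustration.

```python
# Assumed ordinal code per nucleotide; 0 is reserved for unknown symbols/padding.
NUCLEOTIDE_CODE = {"A": 1, "C": 2, "G": 3, "T": 4}


def encode_sequence(seq, length):
    """Encode a nucleotide string as a fixed-length list of integers."""
    codes = [NUCLEOTIDE_CODE.get(base, 0) for base in seq.upper()[:length]]
    return codes + [0] * (length - len(codes))


def build_vocabulary(id_lists):
    """Assign each distinct ID a positive integer in order of first appearance."""
    vocab = {}
    for ids in id_lists:
        for pdb_id in ids:
            vocab.setdefault(pdb_id, len(vocab) + 1)
    return vocab


def encode_ids(ids, vocab, length):
    """Encode a list of IDs as a fixed-length list of vocabulary indices."""
    codes = [vocab.get(pdb_id, 0) for pdb_id in ids[:length]]
    return codes + [0] * (length - len(codes))


# Made-up placeholder IDs, not taken from the datasets used in the paper.
gene_pdb_ids = [["1ABC", "2XYZ"], ["2XYZ", "3DEF"]]
vocab = build_vocabulary(gene_pdb_ids)

sequence_row = encode_sequence("ACGTN", length=8)
pdb_rows = [encode_ids(ids, vocab, length=3) for ids in gene_pdb_ids]
print(sequence_row, pdb_rows)
```

A vocabulary lookup is used here instead of Python's built-in `hash`, because string hashes are randomized per process by default and would not give reproducible matrices.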

A significant amount of research has been done in this domain. However, most previous works assume that all the views are complete. The most widely used solutions for incomplete multi-modal clustering rely on non-negative matrix factorization [18–20]. Despite being fairly robust, these methods cannot be employed for clustering with more than two views. This study aims to remove these drawbacks and bottlenecks of previous works.

The rest of the paper is organized as follows. Section 2 gives a brief explanation of the generated datasets and the proposed Shape Boltzmann Machine-based incomplete multi-view clustering. Detailed descriptions of the experimental analysis and the comparative study are given in Section 3. Finally, Section 4 concludes the paper.

Proposed methodology

In this section, we illustrate the details of each essential stage of the proposed Shape Boltzmann Machine-based incomplete multi-view clustering. Here, we discuss the details of the datasets and the preprocessing methods applied to them. Next, we briefly describe the overall presentation of the proposed incomplete multi-view clustering technique, which is divided into two phases. Firstly, we have provided an overview of each view, and in the second phase, we have briefly

Experimental results

This section describes in detail the results obtained from experiments with the proposed methodology. We also report a comparative analysis of the proposed method against different baselines and state-of-the-art methods in terms of two cluster validity indices. Finally, to statistically establish the better performance of the proposed IMC method, we perform Welch's t-test.
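For readers unfamiliar with these tools, the following sketch shows the textbook definitions of the Silhouette index (with Euclidean distance) and of Welch's t statistic with its Welch–Satterthwaite degrees of freedom. It is a minimal illustration under stated assumptions, not the evaluation code used in the paper; the function names and the toy inputs are hypothetical, and no p-value is computed because that requires the Student t distribution.

```python
from math import dist, sqrt
from statistics import mean, variance


def silhouette_index(rows, labels):
    """Mean silhouette coefficient; needs at least two clusters."""
    scores = []
    for i, (row, label) in enumerate(zip(rows, labels)):
        same = [
            dist(row, other)
            for j, (other, other_label) in enumerate(zip(rows, labels))
            if other_label == label and j != i
        ]
        if not same:  # singleton cluster: coefficient is defined as 0
            scores.append(0.0)
            continue
        a = mean(same)  # mean distance to the point's own cluster
        b = min(        # mean distance to the nearest other cluster
            mean([dist(row, r) for r, l in zip(rows, labels) if l == c])
            for c in set(labels) if c != label
        )
        scores.append((b - a) / max(a, b))
    return mean(scores)


def welch_t(sample_a, sample_b):
    """Welch's t statistic and degrees of freedom for two samples."""
    va = variance(sample_a) / len(sample_a)  # squared standard error of A
    vb = variance(sample_b) / len(sample_b)  # squared standard error of B
    t = (mean(sample_a) - mean(sample_b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (
        va ** 2 / (len(sample_a) - 1) + vb ** 2 / (len(sample_b) - 1)
    )
    return t, df


# Toy points and toy samples, NOT results from the paper.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
print(silhouette_index(points, [0, 0, 1, 1]))
print(welch_t([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 6.0, 8.0, 10.0]))
```

A Silhouette index close to 1 indicates compact, well-separated clusters, whereas a lower Davies–Bouldin index is better. Welch's test is appropriate when the two sets of scores (for example, per-run index values of two methods) may have unequal variances.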

Conclusions & future works

This paper presented a Shape Boltzmann Machine-based incomplete multi-view clustering technique for gene partitioning. For each gene instance, we consider three different views: the gene expression profile, the gene sequence, and the gene PDB IDs. Among these three views, the gene sequence view has incomplete data instances. To regenerate the missing instances, we have exploited a Shape Boltzmann Machine (SBM). The common subspace representation is generated using average pooling.

Acknowledgement

Pratik Dutta acknowledges the Visvesvaraya Ph.D. Scheme for Electronics and IT, an initiative of the Ministry of Electronics and Information Technology (MeitY), Government of India, for fellowship support. Dr. Sriparna Saha gratefully acknowledges the Young Faculty Research Fellowship (YFRF) Award, supported by Visvesvaraya Ph.D. Scheme for Electronics and IT, Ministry of Electronics and Information Technology (MeitY), Government of India, being implemented by Digital India Corporation

References (45)

  • A.K. Jain et al., Data clustering: a review, ACM Comput. Surv. (1999)
  • K. Yang et al., A stable gene selection in microarray data analysis, BMC Bioinf. (2006)
  • A.K. Miller et al., Uveal melanoma with histopathologic intratumoral heterogeneity associated with gene expression profile discordance, Ocular Oncol. Pathol. (2017)
  • C.L. Decatur et al., Driver mutations in uveal melanoma: associations with gene expression profile and patient outcomes, JAMA Ophthalmol. (2016)
  • P. Dutta et al., Graph-based hub gene selection technique using protein interaction information: application to sample classification, IEEE J. Biomed. Health Inf. (2019)
  • P. Dutta et al., Ensembling of gene clusters utilizing deep learning and protein-protein interaction information, IEEE ACM Trans. Comput. Biol. Bioinf. (2019)
  • A. Cheerla et al., Deep Learning with Multimodal Representation for Pancancer Prognosis Prediction (2019)
  • M. Kulmanov et al., Deepgo: predicting protein functions from sequence and interactions using a deep ontology-aware classifier, Bioinformatics (2017)
  • D. Sun et al., A multimodal deep neural network for human breast cancer prognosis prediction by integrating multi-dimensional data, IEEE ACM Trans. Comput. Biol. Bioinf. (2018)
  • S. Mitra et al., A multiobjective multi-view cluster ensemble technique: application in patient subclassification, PloS One (2019)
  • S.-Y. Li, Y. Jiang, Z.-H. Zhou, Partial multi-view clustering, in: Twenty-Eighth AAAI Conference on Artificial...
  • H. Zhao, H. Liu, Y. Fu, Incomplete multi-modal visual data grouping, in: IJCAI, pp....