Incomplete multi-view gene clustering with data regeneration using Shape Boltzmann Machine
Introduction
Clustering is a widely used unsupervised machine learning technique. It has applications in many domains, including the development of optimization models [1], electricity customer classification [2], spectral-spatial classification [3], and MRI segmentation [4]. Most existing clustering algorithms rely on a single criterion, such as the density, connectivity, or symmetry of the data points. These approaches are highly sensitive to parameter settings [5]. Hence, identifying the best-suited clustering algorithm for a particular dataset is one of the significant challenges in machine learning.
One of the crucial problems in computational biology is understanding the biological functionalities of genes. Recent years have witnessed a sharp rise in research on high-throughput technologies, especially gene expression profiling [6]. Gene expression profiles are highly informative, and analyzing them helps us understand the molecular functionalities of genes. Gene expression profiles are therefore used extensively for gene clustering [7], gene classification [8], disease function prediction [9,10], hub gene prediction [11], and many other challenging bioinformatics tasks. Genes with similar expression patterns have similar functionalities [8,12,13]. In recent years, it has been shown that, rather than analyzing gene expression profiles alone, integrating heterogeneous biomedical data yields more insight into the underlying biology.
With the advent of high-throughput technologies, there has been an explosion of biomedical data of different kinds, e.g., DNA sequences, amino acid sequences, protein structures, and many more. Each view/modality of the data contains essential information that complements the other views. This has led to a surge of interest in combining information from multiple views/modalities to solve popular bioinformatics tasks [14,15,16,17]. In most real-life multi-view scenarios, however, information about some instances is missing from some of the views because the views are collected in different ways.
To overcome this bottleneck, incomplete multi-view clustering (IMC) techniques are becoming increasingly prominent. One prevalent solution to the IMC problem is Non-negative Matrix Factorization (NMF) [18,19,20]. NMF-based methods learn a common latent space for complete instances and private latent representations for incomplete instances. However, NMF fails to handle more than two incomplete views. More recently, researchers have used generative adversarial networks (GANs) to generate missing instances [21,22,23].
Drawing inspiration from the above facts, in this paper we propose a Shape Boltzmann Machine (SBM) based incomplete multi-view clustering technique for gene partitioning. Here, the Shape Boltzmann Machine [24] is exploited to generate missing instances. The proposed incomplete clustering technique is applied to three real-life NCBI gene expression datasets, and it exhibits better performance than baselines and other state-of-the-art methods. The significant contributions of the proposed incomplete multi-view clustering technique are the following:
1. Conversion to images: Each modality has one dataset, the entirety of which is converted to a single matrix. For the sake of convenience of comprehension and representation, each matrix is referred to as an image.
2. Regeneration: All images are regenerated with reference to the general image using SBM for a more uniform computation and modality completion.
3. Generating a common image: A standard image is generated using all the modalities. This helps in using all the modalities equally for the final clustering.
4. Clustering: After the respective images have been regenerated, clustering is carried out using the K-Means clustering algorithm, and its robustness is verified.
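The last two stages (common image and clustering) can be illustrated with a minimal sketch. It assumes that each view has already been converted to an equally shaped matrix and that the common image is the element-wise average of the views. The toy matrices, the function names `average_pool` and `kmeans`, and the choice of a plain Lloyd-style K-Means are illustrative assumptions, not the authors' implementation.

```python
import random

def average_pool(views):
    """Element-wise mean of equally shaped matrices (lists of rows)."""
    n_rows, n_cols = len(views[0]), len(views[0][0])
    return [
        [sum(view[i][j] for view in views) / len(views) for j in range(n_cols)]
        for i in range(n_rows)
    ]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per row."""
    rng = random.Random(seed)
    centroids = [list(row) for row in rng.sample(rows, k)]
    labels = [0] * len(rows)
    for _ in range(n_iter):
        # Assignment step: each row joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(row, centroids[c]))
            for row in rows
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [row for row, label in zip(rows, labels) if label == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels

# Toy views (made-up numbers): 4 genes x 2 features per view.
expression_view = [[0.1, 0.2], [0.0, 0.1], [0.9, 1.0], [1.0, 0.8]]
sequence_view = [[0.2, 0.1], [0.1, 0.0], [0.8, 0.9], [0.9, 1.0]]
pdb_view = [[0.0, 0.3], [0.2, 0.2], [1.0, 0.7], [0.8, 0.9]]

common_image = average_pool([expression_view, sequence_view, pdb_view])
labels = kmeans(common_image, k=2)
```

On this toy input the first two genes fall into one cluster and the last two into the other. In practice a library implementation of K-Means with several restarts would be a more robust choice than this hand-written loop.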
These stages are discussed in detail in the following sections. The primary motivation for this pipeline is to reduce computational complexity as far as possible, and every stage seeks to minimize redundancy in the workflow. The study draws on concepts from several domains: rudimentary encoding of gene sequence data, vocabulary-based hashing of Protein Data Bank data, and conversion of non-image data for image-based computing, among others.
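The exact encoding and hashing schemes are not given in this excerpt, so the following is only a sketch of what such preprocessing could look like. The integer codes for nucleotides, the zero padding, the vocabulary construction, and the identifiers `1ABC`, `2XYZ`, and `3DEF` are all hypothetical choices made for illustration.

```python
# Assumed mapping: 0 is reserved for padding and unknown symbols.
NUCLEOTIDE_CODES = {"A": 1, "C": 2, "G": 3, "T": 4}

def encode_sequence(seq, length):
    """Map a nucleotide string to a fixed-length list of integer codes."""
    codes = [NUCLEOTIDE_CODES.get(base, 0) for base in seq.upper()[:length]]
    return codes + [0] * (length - len(codes))

def build_vocabulary(id_lists):
    """Assign each distinct identifier a positive integer in order of first appearance."""
    vocab = {}
    for ids in id_lists:
        for identifier in ids:
            vocab.setdefault(identifier, len(vocab) + 1)
    return vocab

def encode_ids(ids, vocab, length):
    """Map a list of identifiers to fixed-length integer codes (0 = padding/unseen)."""
    codes = [vocab.get(identifier, 0) for identifier in ids[:length]]
    return codes + [0] * (length - len(codes))

# Made-up inputs for two genes.
sequence_row = encode_sequence("acgtn", length=8)
pdb_ids_per_gene = [["1ABC", "2XYZ"], ["2XYZ", "3DEF"]]
vocab = build_vocabulary(pdb_ids_per_gene)
pdb_rows = [encode_ids(ids, vocab, length=3) for ids in pdb_ids_per_gene]
```

Stacking one such row per gene gives a fixed-width numeric matrix for each view, which is the form the "image" terminology above refers to.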
A significant amount of research has been done in this domain. However, most previous works assume that all the views are complete. The most widely used solutions for incomplete multi-modal clustering rely on non-negative matrix factorization [18,19,20]. Despite being fairly robust, these methods cannot be employed for clustering with more than two views. This study therefore aims to remove these drawbacks and bottlenecks.
The rest of the paper is organized as follows. Section 2 gives a brief explanation of the generated datasets and the proposed Shape Boltzmann machine-based incomplete multi-view clustering. The detailed descriptions of the experimental analysis and the comparative study are given in Section 3. Finally, the paper concludes in Section 4.
Section snippets
Proposed methodology
In this section, we describe each essential stage of the proposed Shape Boltzmann machine-based incomplete multi-view clustering. Here, we have discussed the details of the data sets and the preprocessing methods applied to them. Next, we have briefly described the overall presentation of the proposed incomplete multi-view clustering technique, which is divided into two phases. Firstly, we have provided an overview of each view, and in the second phase, we have briefly
Experimental results
This section describes in detail the results obtained from experiments with the proposed methodology. We also report a comparative analysis of the proposed method against different baselines and state-of-the-art methods in terms of two cluster validity indices. Finally, to test whether the improvement achieved by the proposed IMC method is statistically significant, we have performed Welch's t-test.
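As a reminder of what Welch's t-test computes, the sketch below evaluates the test statistic and the Welch-Satterthwaite degrees of freedom for two samples with possibly unequal variances. The function name `welch_t` and the two samples are made-up illustrations and do not correspond to any scores reported in the paper.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for Welch's unequal-variance t-test."""
    # Squared standard errors of the two sample means (sample variance / n).
    se_a = variance(sample_a) / len(sample_a)
    se_b = variance(sample_b) / len(sample_b)
    t_stat = (mean(sample_a) - mean(sample_b)) / sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(sample_a) - 1) + se_b ** 2 / (len(sample_b) - 1)
    )
    return t_stat, dof

# Made-up samples, for illustration only.
sample_a = [1.0, 2.0, 3.0, 4.0, 5.0]
sample_b = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, dof = welch_t(sample_a, sample_b)
```

The p-value then follows from the Student's t distribution with `dof` degrees of freedom; a statistics library would normally be used for that step.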
Conclusions & future works
This paper presented a Shape Boltzmann machine-based incomplete multi-view clustering technique for gene partitioning. Here, for a particular gene instance, we consider three different views, i.e., gene expression profile, gene sequence, and gene PDB IDs. Among these three views, we have incomplete data instances for gene sequences. To regenerate the missing instances, we have exploited a Shape Boltzmann Machine (SBM). The common subspace representation is generated using average pooling.
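To give a feel for how a Boltzmann-machine-style model can fill in missing entries, the sketch below uses a plain restricted Boltzmann machine (RBM) with hand-set weights and deterministic mean-field updates. This is a deliberately simplified stand-in: it is not the Shape Boltzmann Machine architecture used in the paper, training is omitted entirely, and the weights, biases, and function name `rbm_fill_missing` are assumptions chosen only to make the behaviour visible.

```python
from math import exp

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def rbm_fill_missing(visible, weights, vis_bias, hid_bias, n_steps=10):
    """Fill None entries of `visible` by mean-field passes through an RBM.

    weights[i][j] connects visible unit i to hidden unit j.
    Observed entries stay clamped; only missing entries are updated.
    """
    missing = [value is None for value in visible]
    state = [0.5 if is_missing else value for value, is_missing in zip(visible, missing)]
    n_visible, n_hidden = len(state), len(hid_bias)
    for _ in range(n_steps):
        # Upward pass: hidden activation probabilities given the visible state.
        hidden = [
            sigmoid(hid_bias[j] + sum(state[i] * weights[i][j] for i in range(n_visible)))
            for j in range(n_hidden)
        ]
        # Downward pass: reconstructed visible probabilities given the hidden units.
        recon = [
            sigmoid(vis_bias[i] + sum(weights[i][j] * hidden[j] for j in range(n_hidden)))
            for i in range(n_visible)
        ]
        state = [r if is_missing else s for s, r, is_missing in zip(state, recon, missing)]
    return state

# Hand-set toy parameters: one hidden unit that switches on when the visible units are on.
weights = [[4.0], [4.0], [4.0]]
vis_bias = [-2.0, -2.0, -2.0]
hid_bias = [-4.0]

filled_on = rbm_fill_missing([1.0, 1.0, None], weights, vis_bias, hid_bias)
filled_off = rbm_fill_missing([0.0, 0.0, None], weights, vis_bias, hid_bias)
```

With these toy parameters the missing third entry is pulled towards the pattern of the observed entries: above 0.5 when the observed units are on, below 0.5 when they are off.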
Acknowledgement
Pratik Dutta acknowledges the Visvesvaraya Ph.D. Scheme for Electronics and IT, an initiative of the Ministry of Electronics and Information Technology (MeitY), Government of India, for fellowship support. Dr. Sriparna Saha gratefully acknowledges the Young Faculty Research Fellowship (YFRF) Award, supported by Visvesvaraya Ph.D. Scheme for Electronics and IT, Ministry of Electronics and Information Technology (MeitY), Government of India, being implemented by Digital India Corporation
References (45)
- et al. Construction of fuzzy models through clustering techniques. Fuzzy Set Syst. (1993)
- et al. Fusion of expression values and protein interaction information using multi-objective optimization for improving gene clustering. Comput. Biol. Med. (2017)
- et al. Selection of genes mediating certain cancers, using a neuro-fuzzy approach. Neurocomputing (2014)
- et al. SEGS: search for enriched gene sets in microarray data. J. Biomed. Inf. (2008)
- et al. Identification of progression markers in B-CLL by gene expression profiling. Exp. Hematol. (2005)
- Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. (1987)
- et al. FCM: the fuzzy c-means clustering algorithm. Comput. Geosci. (1984)
- et al. Comparisons among clustering techniques for electricity customer classification. IEEE Trans. Power Syst. (2006)
- et al. Spectral–spatial classification of hyperspectral imagery based on partitional clustering techniques. IEEE Trans. Geosci. Rem. Sens. (2009)
- et al. MRI segmentation using fuzzy clustering techniques. IEEE Eng. Med. Biol. Mag. (1994)
- Data clustering: a review. ACM Comput. Surv.
- A stable gene selection in microarray data analysis. BMC Bioinf.
- Uveal melanoma with histopathologic intratumoral heterogeneity associated with gene expression profile discordance. Ocular Oncol. Pathol.
- Driver mutations in uveal melanoma: associations with gene expression profile and patient outcomes. JAMA Ophthalmol.
- Graph-based hub gene selection technique using protein interaction information: application to sample classification. IEEE J. Biomed. Health Inf.
- Ensembling of gene clusters utilizing deep learning and protein-protein interaction information. IEEE ACM Trans. Comput. Biol. Bioinf.
- Deep learning with multimodal representation for pancancer prognosis prediction.
- DeepGO: predicting protein functions from sequence and interactions using a deep ontology-aware classifier. Bioinformatics
- A multimodal deep neural network for human breast cancer prognosis prediction by integrating multi-dimensional data. IEEE ACM Trans. Comput. Biol. Bioinf.
- A multiobjective multi-view cluster ensemble technique: application in patient subclassification. PloS One