Multi-view Clustering: A Survey

In the big data era, data are generated from different sources or observed from different views. These data are referred to as multi-view data. Unleashing the power of knowledge in multi-view data is very important in big data mining and analysis. This calls for advanced techniques that consider the diversity of different views while fusing these data. Multi-view Clustering (MvC) has attracted increasing attention in recent years by aiming to exploit complementary and consensus information across multiple views. This paper summarizes a large number of multi-view clustering algorithms, provides a taxonomy according to the mechanisms and principles involved, and classifies these algorithms into five categories, namely, co-training style algorithms, multi-kernel learning, multi-view graph clustering, multi-view subspace clustering, and multi-task multi-view clustering. Therein, multi-view graph clustering is further categorized into graph-based, network-based, and spectral-based methods. Multi-view subspace clustering is further divided into subspace learning-based and non-negative matrix factorization-based methods. This paper not only introduces the mechanisms for each category of methods, but also gives a few examples of how these techniques are used. In addition, it lists some publicly available multi-view datasets. Overall, this paper serves as an introductory text and survey for multi-view clustering.


Yan Yang and Hao Wang are with the School of Information Science and Technology, Southwest Jiaotong University, Chengdu 611756, China. E-mail: yyang@swjtu.edu.cn; cshaowang@gmail.com. Hao Wang is currently a PhD candidate and shares first authorship. To whom correspondence should be addressed. Manuscript received: 2017-08-02; accepted: 2017-

Introduction
In many real-world applications of big data mining and analysis, data are collected from different sources in diverse domains or obtained from various feature collectors. For instance, pictures shared on websites often have corresponding textual tags and descriptions; a specific news item is reported by multiple news organizations; sensor signals are decomposed in the time and frequency domains; the same semantic meaning (e.g., hello) is represented in multiple languages; an image is described by different types of features. All these are referred to as multi-view data. These data exhibit heterogeneous properties but hold potential connections. In other words, in such data, each individual view has its specific property for a particular knowledge discovery task; however, different views often contain complementary information that should be exploited. Therefore, finding a way to exploit this information, in order to uncover the potential value of multi-view data, is very significant in big data research. Applications in real-life data analysis also call for advanced technologies that can deal with data objects with multiple views, in order to bring data mining and knowledge discovery to new heights.
In the past decade, plenty of machine learning technologies have been investigated for dealing with multi-view data. Good surveys on multi-view learning have been conducted in Refs. [1][2][3]. Moreover, Zheng [4] provided an overview of the methodologies for multi-view (cross-domain) data fusion, in which some specific applications were discussed. Existing multi-view learning technologies are roughly divided into supervised learning and unsupervised learning. This paper focuses on one of the unsupervised learning techniques, namely, clustering. Clustering has emerged as a powerful alternative learning tool for exploring the underlying structure of data [5,6], especially in the era of big data [7]. The basic idea of clustering algorithms is to partition a set of data objects according to some criteria, such that similar objects are grouped into the same cluster, and dissimilar objects are divided into different clusters.
Many advanced clustering algorithms have been investigated in the last few decades. Although these clustering algorithms have been very successful to some extent, most of them are only suitable for single-view data. Even concatenating all views into a single view and then applying state-of-the-art clustering algorithms to this single view may not improve the clustering performance, because such an approach is not physically meaningful: each view has its own specific statistical properties. In comparison, Multi-view Clustering (MvC) performs effectively on multi-view data by considering the diversity and complementarity of different views. Early studies on MvC, such as reinforcement clustering for multi-type interrelated data [8], a multi-view version of DBSCAN [9], and two-view versions of EM-based and agglomerative algorithms [10], began approximately in 2003. As an advanced clustering paradigm, MvC has received increasing attention in recent years. Thus far, four workshops [11][12][13][14] and a mini-symposium [15] have been held in conjunction with related international conferences. In the context of MvC, an inherent problem (and also the goal) that all algorithms must address carefully is to find a way to maximize clustering quality within each view, while taking clustering consistency across different views into consideration. Moreover, incomplete multi-view data, where some data objects could be missing their observation on one view (i.e., missing objects) or could be available only for part of their features on that view (i.e., missing features), also pose challenges to MvC.
In this paper, we review a number of representative MvC methods. According to the mechanisms and principles on which these methods are based, we organize and summarize them in five categories:
Co-training style algorithms: This category of methods treats multi-view data by using a co-training strategy. It bootstraps the clustering of different views by using prior or learned knowledge from one another. By iteratively carrying out this strategy, the clustering results of all views converge toward each other, which leads to the broadest consensus across all views.
Multi-kernel learning: This category of methods uses predefined kernels corresponding to different views, and then combines these kernels either linearly or non-linearly in order to improve clustering performance.
Multi-view graph clustering: This category of methods seeks to find a fusion graph (or network) across all views and then uses graph-cut algorithms or other technologies (e.g., spectral clustering) on the fusion graph in order to produce the clustering result.
Multi-view subspace clustering: This category learns a unified feature representation (to be input into a model for clustering) from the feature subspaces of all views by assuming that all views share this representation. Typical models include subspace learning and Non-negative Matrix Factorization (NMF).
Multi-task multi-view clustering: This category treats each view with one task or multiple related tasks, transfers the inter-task knowledge to one another, and exploits multi-task and multi-view relationships in order to improve clustering performance.
We will provide a specific introduction and a few examples for each category in the following sections. Moreover, we also list some widely used multi-view datasets in order to help researchers in this field.
The rest of this paper is organized as follows. In Section 2, we illustrate two related principles that ensure the success of MvC. In Section 3, we provide an overview of earlier and more recent MvC methods in five categories, and enumerate a few examples for each category. Some publicly available datasets are covered in Section 4. Finally, in Section 5, we conclude this paper and discuss challenges and future trends for MvC.
Notations and Definitions: We begin with a description of the notations used in this paper. We state that matrices and vectors throughout this paper are written in uppercase and lowercase letters, respectively. The common notations and corresponding definitions are summarized in Table 1.

Principles of MvC
This section analyzes two significant principles of MvC, namely, the complementary and consensus principles. These two principles partially answer why MvC is effective, what the underlying assumptions are, and above all how MvC should be modeled and performed.
By referring to Ref. [16], we give an illustration on these two principles. Given a data object with two views, this data object is mapped into a latent data space as shown in Fig. 1. From Fig. 1, we can observe that: (1) some ingredients (part A and part C) exist in the individual view, such as part A in view 1 and part C in view 2, i.e., the complementarity of two views, and (2) some ingredients (part B) of the object are shared by both views, i.e., the consensus between two views.
Next, we analyze these two principles as follows:

Complementary principle:
This principle states that multiple views should be employed in order to describe data objects more comprehensively and accurately. In the context of multi-view data, each single view is sufficient for a particular knowledge discovery task. However, different views often contain information complementary to each other. For instance, in the field of image processing, each image is described by different types of features, such as LBP, SIFT, and HOG, where LBP is a powerful texture feature, SIFT is robust to image illumination, noise, and rotation, while HOG is sensitive to marginal information. Therefore, it is necessary to exploit this mutually complementary information underlying multiple views in order to describe these data objects, and to provide deeper insights with regard to the internal clustering.

Fig. 1 Illustration of complementary and consensus principles [16].

Table 1 Notations and definitions.
Notation        Definition
$k$             User-specified number of clusters
$m$             Number of views
$n$             Total number of instances
$n_v$           Number of instances in the v-th view data
$d_v$           Dimensionality of the v-th view data
$\mathcal{X}$   $\mathcal{X} = \{X_1, \ldots, X_m\}$, data matrices with $m$ views
Note: If multi-view data is complete, then $n = n_v$.

Consensus principle:
This principle aims to maximize consistency across multiple distinct views. Based on probably approximately correct (PAC) analysis, Dasgupta et al. [17] proposed a generalization error analysis for the consensus principle. Consider a multi-view dataset $\mathcal{X}$ with two views $X_1$ and $X_2$, and let $f_1$ and $f_2$ denote the hypotheses learned on the two views, respectively. Under some mild assumptions, Dasgupta et al. [17] demonstrated the connection between the consensus of the two hypotheses and their error rates. The connection is formulated as the following inequality:
$P(f_1 \neq f_2) \geq \max\{P_{err}(f_1), P_{err}(f_2)\}$  (1)
From this inequality, we conclude that the probability of disagreement between the two independent hypotheses is an upper bound on the error rate of either hypothesis. Thus, minimizing the disagreement of the two hypotheses minimizes the error rate of each hypothesis; equivalently, maximizing the agreement (or consistency) of the two hypotheses minimizes the error rate of each hypothesis. This is called the maximizing-consistency policy. Co-training [18] is a landmark technology, and is one of the most widely used schemes for multi-view learning. The standard co-training algorithm is trained alternately in order to maximize the mutual agreement on the two views of the unlabeled data, with the learners of the two views providing labeled data to one another. In terms of clustering, De Sa [19] pioneered a two-view spectral clustering algorithm, which was inspired by the idea of minimizing disagreement (the same concept as maximizing consistency). There are also many co-training style MvC algorithms (see Section 3.1).
To sum up, both complementary and consensus principles play important roles in addressing the problem of MvC, and both of them should be considered in order to take full advantage of multi-view data.
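The disagreement that the consensus principle seeks to minimize can be quantified in several ways. The sketch below uses one simple option, the fraction of object pairs that two clusterings treat differently; this particular measure and the function name `pairwise_disagreement` are illustrative assumptions and are not taken from Refs. [17] or [19].

```python
from itertools import combinations

def pairwise_disagreement(labels_a, labels_b):
    """Fraction of object pairs on which two clusterings disagree.

    A pair is a disagreement when one clustering places the two objects in
    the same cluster and the other separates them. Cluster names do not
    matter, so no label matching between views is needed.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must cover the same objects")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 0.0
    disagreements = sum(
        (labels_a[i] == labels_a[j]) != (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return disagreements / len(pairs)

# Hypothetical clusterings of six objects obtained from two views.
view1 = [0, 0, 0, 1, 1, 1]
view2 = ["b", "b", "b", "a", "a", "a"]        # same partition, other names
view2_noisy = ["b", "b", "a", "a", "a", "a"]  # object 2 moved to the other cluster

print(pairwise_disagreement(view1, view2))
print(pairwise_disagreement(view1, view2_noisy))
```

A value of zero means the two views induce the same partition; a consensus-driven method would try to push this quantity down while keeping each view's own clustering quality high.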

Co-training style algorithms
Co-training style algorithms are investigated under the consideration of multi-view consensus. This category of methods intends to maximize the mutual agreement across all views and arrive at their broadest consensus. The general procedure of conventional co-training algorithms is shown in Fig. 2. According to the procedure, the algorithm is trained alternately in order to maximize the consistency of the two distinct views by using prior information or knowledge learned from each other. Note that the success of co-training mainly relies on three assumptions: (1) Sufficiency: Each view is sufficient for the learning task on its own, (2) Compatibility: The target functions in both views predict the same labels for co-occurring features with high probability, and (3) Conditional independence: The views are conditionally independent given the class label. In practice, however, it is usually too hard to satisfy the conditional independence assumption. Thus, several weaker assumptions such as the weak conditional dependence assumption [20], the much weaker "expansion" assumption [21], and the difference assumption [22] have been investigated. In addition, several extended versions of co-training, such as co-EM [23], co-regularization [24], and co-clustering [25], have been studied.
Most of the above-mentioned methods are designed for multi-view data in a semi-supervised learning setting. In unsupervised learning (i.e., clustering), Bickel and Scheffer [10] first studied MvC with a co-training idea, and proposed two kinds of MvC algorithms for text data. One is a multi-view EM algorithm that works alternately between the views, while the other is an agglomerative algorithm inspired by the co-training algorithm. Bickel and Scheffer [10] concluded that the multi-view EM algorithm significantly outperformed the single-view algorithm; however, the agglomerative algorithm led to negative results. Furthermore, they studied the estimation of mixture models with co-EM for multi-view data analysis [26], which contributes to adopting mixture model estimation for multiple views by demonstrating that the co-EM algorithm is a special case of mixture model estimation. Moreover, Tzortzis and Likas [27] put forward a weighted multi-view convex mixture model that automatically assigns weights to views via EM. Assuming that similar data objects are grouped into the same cluster regardless of the view, Kumar and Daumé III [28] proposed a co-training approach for multi-view spectral clustering, where the clusterings of different views are bootstrapped by using complementary information from one another. Kumar et al. [29] further proposed a co-regularized approach for multi-view spectral clustering, which regularizes the eigenvectors of the view-specific graph Laplacians so that the resulting clustering structures are consistent across views. Inspired by the work of Kumar et al. [29], Ye et al. [30] discussed co-regularized kernel K-means for MvC. This method automatically learns the weights of different views from data.
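For two views, the co-regularized spectral clustering objective of Ref. [29] is commonly written in the form below. The notation is assumed here for illustration: $S_v$ is the similarity matrix of view $v$, $D_v$ its diagonal degree matrix, $L_v = D_v^{-1/2} S_v D_v^{-1/2}$ the normalized matrix whose top $k$ eigenvectors form $U_v$, and $\lambda \geq 0$ a trade-off parameter.

```latex
\max_{U_1,\, U_2 \in \mathbb{R}^{n \times k}}
  \operatorname{tr}\left(U_1^{\top} L_1 U_1\right)
  + \operatorname{tr}\left(U_2^{\top} L_2 U_2\right)
  + \lambda \, \operatorname{tr}\left(U_1 U_1^{\top} U_2 U_2^{\top}\right)
\quad \text{s.t.} \quad U_1^{\top} U_1 = I, \; U_2^{\top} U_2 = I
```

The first two terms are the usual single-view spectral clustering objectives, while the third term rewards agreement between the two embeddings; with $\lambda = 0$ the problem decouples into two independent single-view problems.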
In addition, a multiple-view-aware method with a co-training strategy was investigated in order to cluster process executions (traces) [31]; it considers that the traces of an event log are described by multiple trace profiles, and adapts an iterative co-training strategy to the process mining setting. A co-regularized Probabilistic Latent Semantic Analysis (PLSA) model for MvC was developed in Ref. [32]. The central idea behind it is that the sample similarities in the topic space of one view should agree with those of another view. To address the challenge of partial mapping between the views (i.e., incomplete views), multi-view constrained clustering via co-EM with pairwise constraint propagation was studied in Refs. [33,34]. In other words, the methods proposed in Refs. [33,34] used co-EM in order to iteratively estimate the propagations within each view, transfer the given pairwise constraints across views, update the clustering model, and finally learn a unified clustering result for all views. MvC based on co-clustering (simultaneously clustering the objects and features) has also been investigated. For instance, Meng et al. [35] proposed a heterogeneous data co-clustering approach, which not only extends fusion from two views to multiple views, but also weights the features of multiple data sources. Based on matrix decomposition, Sun et al. [36] presented a proximal alternating linearized minimization algorithm. This algorithm can simultaneously decompose multiple data matrices into sparse row and column vectors, and link different views of data with a binary vector, where the binary vector enforces consistency for the row clusters from all views. By simultaneously building similarity matrices, rather than a set of clusters, between the rows and columns of a data matrix, an architecture to learn co-similarities from multi-view datasets was designed in Ref. [37], and was subsequently parallelized in Ref. [38].
Assuming that transferring similarity values (generated from individual data) from one view to others would result in better data clustering, Hussain and Bashir [39] extended the co-similarity based architecture in order to handle multiple datasets with two kinds of integration schemes (i.e., intermediate integration and late integration). In addition, several collaborative MvC approaches have been investigated in Refs. [40,41]. These approaches consist of two phases: a local phase and a collaboration phase. The local phase applies a clustering algorithm to each view, and the collaboration phase refines the clustering of each view by using the clustering results that the local phase produced for the other views.
Example 1: As illustrated in Fig. 3 (using two views for brevity), Kumar and Daumé III [28] first applied the co-training strategy to the problem of multi-view spectral clustering. Unlike semi-supervised learning, there are no labeled data in unsupervised learning settings; therefore, the prototypical co-training algorithms cannot be applied directly to MvC. However, the motivation of co-training remains the same in unsupervised learning problems. In other words, it limits the search to hypotheses (clusterings) that agree with those in other views. Assuming that the true underlying clustering would assign a point to the same cluster, irrespective of the view, as was done in most co-training based MvC approaches, Kumar and Daumé III [28] took the spectral embedding from one view in order to constrain the similarity graph of the other view. By carrying out this process iteratively, the clusterings of the two views tended toward each other.
1: Calculate the graph similarity matrices $S_1$ and $S_2$ for both views.
2: Initialize the graph Laplacian matrices $L_1$ and $L_2$, and the discriminative eigenvectors $U_1$ and $U_2$.
3: Perform spectral embedding on $S_1$ with $U_2$ to get the new similarity matrix $S_1$.
4: Perform spectral embedding on $S_2$ with $U_1$ to get the new similarity matrix $S_2$.
5: Compute the new Laplacian matrices $L_1$ and $L_2$, and the new eigenvectors $U_1$ and $U_2$.
6: Go to Step 3 and repeat for a number of iterations.
Fig. 3 Co-training approach for multi-view spectral clustering [28].
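One pass of Steps 3 to 5 in Fig. 3 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the implementation of Ref. [28]: the "spectral embedding" step is taken to be the projection of a view's similarity matrix onto the span of the other view's top-k eigenvectors followed by symmetrization, the eigenvectors are those of $D^{-1/2} S D^{-1/2}$, and they are approximated by orthogonal iteration. All function names and the toy data are hypothetical.

```python
import math
import random

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    cols = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def normalize(S):
    """Return D^{-1/2} S D^{-1/2}, where D holds the row sums of S."""
    d = [1.0 / math.sqrt(sum(row)) for row in S]
    n = len(S)
    return [[d[i] * S[i][j] * d[j] for j in range(n)] for i in range(n)]

def top_eigenvectors(M, k, iters=200, seed=0):
    """Approximate the k dominant eigenvectors of symmetric M (orthogonal iteration)."""
    rng = random.Random(seed)
    basis = [[rng.uniform(-1.0, 1.0) for _ in range(len(M))] for _ in range(k)]
    for _ in range(iters):
        new_basis = []
        for vec in basis:
            z = [sum(m * v for m, v in zip(row, vec)) for row in M]
            for q in new_basis:  # Gram-Schmidt against the earlier vectors
                dot = sum(a * b for a, b in zip(z, q))
                z = [a - dot * b for a, b in zip(z, q)]
            norm = math.sqrt(sum(a * a for a in z))
            new_basis.append([a / norm for a in z])
        basis = new_basis
    return transpose(basis)  # n x k matrix, one eigenvector per column

def project_similarity(S_own, U_other):
    """Project S_own onto span(U_other), then symmetrize the result."""
    P = matmul(U_other, transpose(U_other))  # n x n projector U U^T
    A = matmul(P, S_own)
    n = len(A)
    return [[0.5 * (A[i][j] + A[j][i]) for j in range(n)] for i in range(n)]

def cotrain_round(S1, S2, k):
    """Each view's similarity matrix is constrained by the other view's embedding."""
    U1 = top_eigenvectors(normalize(S1), k)
    U2 = top_eigenvectors(normalize(S2), k)
    return project_similarity(S1, U2), project_similarity(S2, U1)

def block_similarity(n, split, within=1.0, between=0.1):
    return [[within if (i < split) == (j < split) else between
             for j in range(n)] for i in range(n)]

# Toy data: two clusters {0,1,2} and {3,4,5}; view 1 has one spurious link.
S2 = block_similarity(6, 3)
S1 = block_similarity(6, 3)
S1[2][3] = S1[3][2] = 0.8

new_S1, new_S2 = cotrain_round(S1, S2, k=2)
print(round(S1[2][3], 3), "->", round(new_S1[2][3], 3))
```

In this toy case the clean second view pulls the spurious cross-cluster similarity in the first view down from 0.8 to about 0.333, while within-cluster similarities stay at 1. A full run would repeat the round for a number of iterations and finally cluster the rows of the eigenvector matrices, for example with K-means.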

Multi-kernel learning
Multi-kernel learning was originally developed in order to boost the search space capacity of possible kernel functions (e.g., linear, polynomial, and Gaussian kernels) and thereby achieve good generalization. As kernels in multi-kernel learning naturally correspond to different views, multi-kernel learning has been widely applied in order to deal with multi-view data. The general procedure of multi-kernel learning approaches is shown in Fig. 4, where different predefined kernels are used to deal with different views. Then these kernels are combined either linearly or non-linearly in order to arrive at a unified kernel. In an MvC setting, multi-kernel learning based MvC intends to optimally combine a group of predefined kernels in order to improve clustering performance. In such methods, an essential problem consists of finding a way to choose suitable kernel functions and combine these kernels optimally.
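The linear combination step of this procedure can be sketched as follows. The example assumes one predefined kernel per view and fixed, user-supplied weights; the methods surveyed in this section learn such weights from data, which is not shown here. The function names, kernel parameters, and toy data are illustrative assumptions.

```python
import math

def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, degree=2, coef0=1.0):
    return (linear_kernel(x, y) + coef0) ** degree

def gaussian_kernel(x, y, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def kernel_matrix(kernel, data):
    """n x n matrix of kernel values between all pairs of samples in one view."""
    return [[kernel(x, y) for y in data] for x in data]

def combine_kernels(kernel_matrices, weights):
    """Weighted sum of per-view kernel matrices (weights non-negative, summing to 1)."""
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    n = len(kernel_matrices[0])
    return [[sum(w * K[i][j] for w, K in zip(weights, kernel_matrices))
             for j in range(n)] for i in range(n)]

# Two hypothetical views of the same three objects.
view1 = [[0.0, 0.0], [0.0, 1.0], [3.0, 3.0]]
view2 = [[1.0], [1.0], [-1.0]]

K1 = kernel_matrix(gaussian_kernel, view1)  # Gaussian kernel on view 1
K2 = kernel_matrix(linear_kernel, view2)    # linear kernel on view 2
K = combine_kernels([K1, K2], weights=[0.7, 0.3])
print(K[0][1], K[0][2])
```

Because a non-negative weighted sum of positive semi-definite kernel matrices is again positive semi-definite, the unified matrix `K` can be passed to any kernel-based clustering algorithm, such as kernel K-means or spectral clustering.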
In a single-view scenario, based on maximum margin clustering [42], Zhao et al. [43] presented a multiple kernel clustering algorithm, which can simultaneously find the maximum margin hyperplane, the best clusterings, and the optimal kernels. Du et al. [44] performed a robust K-means (with the $\ell_{2,1}$-norm) in kernel space, and proposed a multiple kernel K-means algorithm, which is able to simultaneously find the best clustering labels, the cluster membership, and the optimal combination of multiple kernels. It is worth stressing that the above-mentioned algorithms can also be used to deal with multi-view data under the framework shown in Fig. 4. In a multi-view scenario, De Sa et al. [45] constructed a custom kernel combination method based on the minimizing-disagreement algorithm [46,47]. Specifically, they generated a multi-partite graph in order to induce a kernel, which was then used for spectral clustering. In fact, this method could be regarded as a variant of kernel canonical correlation analysis, and a generalization of co-clustering and spectral clustering. Moreover, Yu et al. [48] extended classical K-means clustering into Hilbert space, where multi-view data matrices were denoted as kernel matrices and were then combined automatically for data fusion. Similar work has also been done in Ref. [49]. The difference from Ref. [48] is that the kernels were combined in a localized way in order to better capture the sample characteristics of the data. Rather than extending the existing clustering algorithms with a multi-kernel learning setting, Lu et al. [50] studied multiple kernel clustering based on centered kernel alignment (an effective kernel evaluation measure), which was employed in order to unify the two tasks of clustering and multi-kernel learning into a single optimization framework.
Methods with a weighted combination of kernels have also been studied by considering the differences between views (or kernels). For instance, kernel based weighted MvC was investigated in Ref. [51], where the weights of the kernels were assigned according to the information quality of the corresponding views. A systemic MvC approach was proposed in Ref. [52] in order to automatically assign weights for deriving the kernel matrix on each view through an optimization process, where the kernel matrix learning was based on kernel alignment in order to measure the similarity between two kernel matrices. In addition, Liu et al. [53] showed a weighted multiple kernel K-means clustering method with matrix-induced regularization, which could reduce the redundant kernels and enhance the diversity of the predefined kernels. Zhao et al. [54] provided a weighted MvC method with matrix-induced and low-rank regularization. Zhang et al. [55] also presented a weighted MvC algorithm based on improved Gaussian kernels with variable weights.
However, in many applications, it is common that data on some views are not available, or are only partially available, which leads to incomplete multi-view data, as we mentioned in the introduction. To address this issue, Trivedi et al. [56] presented a general approach that allows MvC methods designed for complete views to be applied in the scenario where only one view is complete and the auxiliary views are incomplete. They took kernel CCA based MvC as an example in order to illustrate their idea. De Sa et al. [45] (mentioned above) also stated that their proposed algorithm could calculate sample affinities with missing views. In the setting where no view is complete, Shao et al. [57] proposed a collective kernel learning algorithm in order to infer the hidden sample similarity. The idea behind this approach was to collectively complete the kernel matrices of the incomplete views by optimizing the alignment of the shared instances of those views. In addition, unlike some existing methods, where incomplete kernels were first imputed and then an available multi-kernel clustering algorithm was applied to the imputed kernels, Liu et al. [58] integrated kernel imputation and clustering into a unified learning procedure for incomplete MvC.
Example 2: One challenge of multi-kernel learning consists of choosing appropriate kernel functions (e.g., linear, polynomial, and Gaussian kernels), which map the original low-dimensional space to a high-dimensional space. The general method for multi-view data is to use a linear combination of several kernel functions, in which the weights of the different kernels should be taken into consideration. Moreover, the weights of different views are also an important factor for MvC. To these ends, Zhang et al. [55] developed an auto-weighted multi-kernel MvC algorithm that weights the views and kernels simultaneously. Figure 5 gives an illustration of the proposed algorithm. First, it employs Kernel Principal Component Analysis (KPCA) on each view in order to reduce the dimension of the original data, which results in low-dimensional multi-view data. Then, it applies the designed weighted Gaussian kernel to the low-dimensional multi-view data. This step yields the weight of each view and the cluster centers. After a finite number of iterations, it arrives at the final clustering result. The designed weighted Gaussian kernel [55] is given as Formula (2). Given multi-view data with $m$ views, $n$ samples, and $k$ clusters, the objective function [55] based on K-means and the designed kernel is given as Formula (3), where $c_j^v$ is the cluster center and $\delta_{ij}$ is the indicator. By plugging the designed Gaussian kernel $K(x, y) = \phi(x)\phi(y)$ into Formula (3) [55], it is rewritten as Formula (4). Note that Formula (4) inherits the properties of K-means and kernels, and the designed kernel integrates the advantages of the Gaussian and polynomial kernels.
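The kernel substitution step mentioned in Example 2 relies on a standard identity of kernel K-means, stated here for a single view and without the view and kernel weights of Ref. [55]; it is a generic illustration, not a reproduction of that paper's formulas. The symbols are assumptions for this sketch: $C_j$ is the set of samples currently assigned to cluster $j$, and the center in feature space is $c_j = \frac{1}{|C_j|} \sum_{x_p \in C_j} \phi(x_p)$.

```latex
\left\| \phi(x_i) - c_j \right\|^2
  = K(x_i, x_i)
  - \frac{2}{|C_j|} \sum_{x_p \in C_j} K(x_i, x_p)
  + \frac{1}{|C_j|^2} \sum_{x_p \in C_j} \sum_{x_q \in C_j} K(x_p, x_q)
```

The right-hand side contains only kernel evaluations, so the distance of a sample to a cluster center can be computed without ever forming the mapping $\phi$ explicitly.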

Multi-view graph clustering
Graphs (or networks) are widely used for representing the relationships between objects, where each node corresponds to a data object and each edge depicts the relationship between a pair of objects. In practice, the relationship is often expressed as a similarity or affinity; namely, the input graph matrix is generated from a data similarity matrix. In a multi-view scenario, data objects are captured by multiple graphs. A common assumption is that each individual graph captures only partial information about the data, whereas all graphs share the same underlying clustering structure. Thus, these graphs are able to mutually reinforce each other by consolidating the correlation between the data objects collectively. In general, the graph-based fusion procedure for multi-view data is illustrated in Fig. 6. Multi-view graph clustering aims to find a fusion graph across all views and then uses graph-cut algorithms or other technologies (e.g., spectral clustering) on the fusion graph in order to produce the final clustering result.
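The fuse-then-partition idea can be sketched as follows. Two simplifications are assumed for brevity and are not taken from any cited method: the fusion graph is a fixed weighted average of the per-view similarity graphs, and the graph-cut step is replaced by thresholding the fused graph and taking its connected components. All names and the toy graphs are hypothetical.

```python
def fuse_graphs(graphs, weights):
    """Weighted average of per-view similarity graphs (adjacency matrices)."""
    n = len(graphs[0])
    return [[sum(w * g[i][j] for w, g in zip(weights, graphs))
             for j in range(n)] for i in range(n)]

def connected_components(graph, threshold):
    """Label nodes by connected component, keeping edges with weight >= threshold.

    This is a crude stand-in for a graph-cut or spectral clustering step.
    """
    n = len(graph)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:  # depth-first traversal of the thresholded graph
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and graph[i][j] >= threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

def make_graph(n, edges, default=0.1):
    g = [[1.0 if i == j else default for j in range(n)] for i in range(n)]
    for (i, j), w in edges.items():
        g[i][j] = g[j][i] = w
    return g

# View A contains a spurious strong edge between nodes 1 and 2; view B does not.
view_a = make_graph(4, {(0, 1): 0.9, (2, 3): 0.9, (1, 2): 0.9})
view_b = make_graph(4, {(0, 1): 0.8, (2, 3): 0.8})

fused = fuse_graphs([view_a, view_b], weights=[0.5, 0.5])
print(connected_components(view_a, threshold=0.6))
print(connected_components(fused, threshold=0.6))
```

View A alone merges all four nodes into one cluster because of the spurious edge, whereas the fused graph separates them into two clusters, which illustrates how graphs from different views can reinforce each other.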
In this category, the literature review is organized in three parts, namely, graph-based MvC, network-based MvC, and spectral-based MvC.

Graph-based MvC
Based on multiple similarity graphs, Tang et al. [59] discussed a general clustering problem. The proposed linked matrix factorization method extracted common factors from multiple graphs, which led to various graph-based clustering methods that could be naturally applied to multi-view data. Hussain et al. [60] proposed a multi-view document clustering algorithm, which first applies single-view clustering algorithms to the data matrix of each view in order to generate multiple partitions. Then, it uses these partitions to generate a set of three different similarity matrices, namely, the affinity matrix, the cluster based similarity matrix, and the pair-wise dissimilarity matrix. Finally, it employs an ensemble technique in order to aggregate these matrices, and forms a unified similarity matrix for clustering. Moreover, the impact of different similarity measures (e.g., Pearson and Spearman correlations, Euclidean and Canberra distances) on MvC has been studied in Ref. [61].
Later on, Xue et al. [62] proposed a group-aware multi-view fusion method, which adopts different weights to characterize the pairwise similarity between different groups. Furthermore, even though some MvC methods learn a weight for each graph, such methods have additional parameters. In order to address this challenge, Nie et al. [63] developed a parameter-free multiple graph framework to learn a set of weights automatically for all graphs. In addition, unsupervised feature selection for multi-view data has also been investigated, where the selected features are then used for a clustering task or other learning tasks. For instance, Wei et al. [64] proposed a method, named cross diffused matrix alignment based feature selection, in order to select features for each view by performing alignment on a cross diffused matrix. Then they applied co-regularized spectral clustering [29] to these selected features in order to produce the final results. Moreover, since traditional approaches such as Ref. [63] evaluate the similarity by a predefined or fixed graph Laplacian in each view separately, and neglect the underlying common structures across different views, Hou et al. [65] presented a multi-view unsupervised feature selection algorithm with adaptive similarities and view weights. In total, this feature selection method employs three types of data information, i.e., data similarity, data clustering structure, and the correlation between different views.
On the other hand, methods that incorporate nearest neighbor techniques have also been investigated. For instance, Hamzaoui et al. [66] designed a multi-source Shared Nearest Neighbors (SNN) scheme for multi-modal image clustering. The central idea was to extend the existing SNN-based similarity measures to the case of multiple sources, and then introduce an original automatic source selection step in order to build candidate clusterings. With consideration of the generative and manifold data structure, Wang et al. [67,68] developed a generative model with an ensemble manifold regularization for MvC. Specifically, they constructed a nearest neighbor graph for each view in order to encode the corresponding manifold information, and a multiple graph ensemble regularization framework was designed in order to learn the optimal intrinsic manifold. Then, the manifold regularization term was incorporated into a multi-view topic model based on PLSA, which resulted in a unified objective function. Unlike the above two methods, Zhang and Mao [69] focused on the task of efficiently selecting clustering-consistent neighbors for MvC. The proposed approach used jointly sparse weights in order to filter unreliable neighbors in the union of view-specific neighborhoods by representing each object as a weighted sum of its neighbors under each view. The learned sparse weights were employed in order to generate a similarity graph, and this graph was further utilized for MvC. In addition, Nie et al. [70] introduced a novel multi-view learning model with adaptive neighbors. This model performs semi-supervised classification, local manifold learning, and clustering simultaneously. It modifies the similarity matrix during each iteration until it arrives at the optimal one. Moreover, it automatically allocates the weight coefficient for each view without penalty parameters.
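The shared nearest neighbor idea can be illustrated with a short sketch. The single-source part follows the usual definition of SNN similarity (the number of k-nearest neighbors that two objects have in common). The multi-source part simply adds up the per-source counts; this naive extension is an assumption made for illustration and is not the scheme or the source selection step of Ref. [66]. Function names and data are hypothetical.

```python
def k_nearest_neighbors(dist, i, k):
    """Indices of the k objects closest to object i (excluding i itself)."""
    others = [j for j in range(len(dist)) if j != i]
    others.sort(key=lambda j: dist[i][j])
    return set(others[:k])

def snn_similarity(dist, k):
    """SNN similarity: number of k-nearest neighbors that two objects share."""
    n = len(dist)
    neighbors = [k_nearest_neighbors(dist, i, k) for i in range(n)]
    return [[len(neighbors[i] & neighbors[j]) for j in range(n)]
            for i in range(n)]

def multi_source_snn(dist_per_source, k):
    """Naive multi-source extension: sum the SNN counts over all sources."""
    per_source = [snn_similarity(dist, k) for dist in dist_per_source]
    n = len(per_source[0])
    return [[sum(s[i][j] for s in per_source) for j in range(n)]
            for i in range(n)]

def distances_1d(points):
    return [[abs(a - b) for b in points] for a in points]

# Two hypothetical sources describing the same six objects.
source_a = distances_1d([0, 1, 2, 10, 11, 12])
source_b = distances_1d([0, 2, 1, 20, 22, 21])

total = multi_source_snn([source_a, source_b], k=2)
print(total[0][1], total[0][3])
```

Objects 0 and 1 share a neighbor in both sources, while objects 0 and 3 share none, so the combined SNN similarity is higher for the pair that is consistently close across sources.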
Example 3: It should be noted that most of the existing methods adopt a globally uniform similarity measure over the entire data space. However, for real-world objects such as images, different images have different visual appearances and the visual distribution is also complex. It is difficult to capture the similarity of different objects accurately only by using a globally uniform measure. To solve this problem, Xue et al. [62] presented a group-aware multi-view fusion approach for image clustering. This approach can partition images into different groups with more compact visual cohesiveness, and assign diverse fusion weights for images between and within groups. In comparison to the global fusion methods, this group-aware fusion model provides a more flexible fusion strategy and more effective similarity measures among images. The framework of this method is shown in Fig. 7. Concretely, multiple features such as LBP, GIST, and Centrist were first extracted from images, which constituted three different views. Then, a graph was constructed for each view. Next, all images were divided into different groups, and a fused graph was constructed with the proposed fusion strategy. The intent was to assign different fusion weights for images belonging to different groups and the same fusion weights for the images within the same group. Finally, the fusion weights were learned by solving the formulated objective function, and the clustering results were obtained by performing spectral clustering on this fused graph.
Fig. 7 The framework of group-aware multi-view fusion approach [62].
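The difference between global and group-aware fusion can be sketched as follows. In this illustration the fusion weights depend on the pair of groups that two images belong to, and they are simply given as input; in Ref. [62] they are learned by optimizing an objective function, which is not reproduced here. The function name, the weight table, and the similarity values are hypothetical.

```python
def group_aware_fusion(graphs, groups, pair_weights):
    """Fuse per-view similarity graphs with weights chosen per pair of groups.

    graphs:       list of n x n similarity matrices, one per view
    groups:       groups[i] is the group index of image i
    pair_weights: maps a sorted pair of group indices to one weight per view
    """
    n = len(groups)
    fused = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            key = tuple(sorted((groups[i], groups[j])))
            weights = pair_weights[key]
            fused[i][j] = sum(w * g[i][j] for w, g in zip(weights, graphs))
    return fused

# Three images, two views, two groups (images 0 and 1 share a group).
view_a = [[1.0, 0.6, 0.2],
          [0.6, 1.0, 0.4],
          [0.2, 0.4, 1.0]]
view_b = [[1.0, 0.2, 0.8],
          [0.2, 1.0, 0.6],
          [0.8, 0.6, 1.0]]
groups = [0, 0, 1]

# Hypothetical fusion weights: view A is trusted within group 0,
# view B within group 1, and both equally across the two groups.
pair_weights = {(0, 0): [0.8, 0.2], (0, 1): [0.5, 0.5], (1, 1): [0.1, 0.9]}

fused = group_aware_fusion([view_a, view_b], groups, pair_weights)
print(fused[0][1], fused[0][2])
```

A global fusion method corresponds to the special case in which every entry of `pair_weights` holds the same weight vector; letting the weights vary by group pair is what gives the group-aware model its additional flexibility.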

Network-based MvC
Despite previous successes, most graph-based MvC approaches assume that the same set of data objects is available in every view, so that the relationship between data objects in different views is one-to-one. However, in many real-life applications, such as social networks, literature citation networks, and biological interaction networks, data are collected from different domains and an object in one domain may correspond to multiple objects in another domain, which results in many-to-many mapping relationships. Representing these relationships with networks rather than with graphs may be more appropriate. This is the main reason for distinguishing network-based MvC from graph-based MvC.
Related work on network-based MvC starts from Ref. [71], in which a network-based multi-view graph clustering framework, termed co-regularized graph clustering, is developed. This framework accommodates several key properties, namely, many-to-many mapping relationships, mappings associated with weights, and partial mappings among different networks. However, different networks may have different data distributions, so the assumption in Ref. [71] that all networks share a common clustering structure no longer holds. To relax this assumption, Ni et al. [72] presented a robust and flexible framework, which allows multiple underlying clustering structures across different networks. It treats domain similarity as the main network, and formulates the clustering problem via NMF in the designed network-of-networks setting. Similar work has also been presented in Ref. [73], where the network grouping and underlying clustering detection are coupled and mutually enhanced during the learning process. Furthermore, Liu et al. [74] stated that existing network-based methods tend to focus on the network clustering task itself, but ignore any associations that may be exhibited between clustering results from different domains. Given this, they offered a robust clustering approach that can detect network clusters in multiple domains, as well as their cross-domain associations. In addition, Yu and Zhang [75] studied community detection in multiple social networks, and attempted to find the communities for multiple networks involving both anchor and non-anchor users simultaneously. Wang et al. [76] proposed an MvC algorithm, named multi-view affinity propagation, based on max-product belief propagation. The key point was to establish an MvC model consisting of two components that measure the within-view quality and the explicit clustering consistency across different views, respectively.
Example 4: Most multi-view graph clustering approaches assume that data objects in different views have a strict one-to-one mapping relationship. In many actual applications, however, data are collected from diverse domains (views), where the cross-domain mapping relationship is many-to-many rather than one-to-one. In other words, an object in one domain may correspond to multiple objects in another domain. Moreover, different domains have their inherent data distributions. This breaks another assumption, namely that all views share a common clustering structure. To address these two challenges, Ni et al. [72] developed a robust and flexible multi-network clustering framework that allows many-to-many relationships and multiple underlying clustering structures.
Specifically, Ni et al. [72] modeled each domain similarity as a network, and also modeled the similarity among different domains as a network to regularize the clustering structures in different networks. They defined a global network, named Network of Networks (NoN), as shown in Fig. 8, where the dashed network represents the main network among six domains $\{A, B, C, D, E, F\}$, and each node in the main network corresponds to a domain-specific network denoted by the solid lines. Correspondingly, the clusterings in the main network and domain-specific networks are referred to as main clusterings and domain clusterings, respectively. Given these concepts, they sought to partition the NoN by using a two-phase approach: first, the main network is partitioned, and then the information learned from the main network is incorporated in order to cluster the domain-specific networks.
Fig. 8 Network of networks [72].
The objective function for the main network [72], Formula (5), is expressed in terms of the squared Frobenius norm $\|\cdot\|_F^2$, the main network $G \in \mathbb{R}^{g \times g}$ ($g$ is the number of nodes in the main network), and the factor matrix $H \in \mathbb{R}_+^{g \times k}$ of the main network. The domain-specific network clustering [72], Formula (6), adds a main-cluster-guided regularization term. In this formula, $A_i$ is the domain-specific network corresponding to node $i$ ($i = 1, \ldots, g$) in the main network $G$, $U_i$ is the factor matrix of $A_i$, $V_j$ is the $j$-th hidden factor matrix, and $h_{ij}$ (an element of $H$) indicates to which degree the main node $i$ belongs to the $j$-th main clustering; the two terms are balanced by a regularization parameter. Furthermore, based on this model, they designed a general model, Formula (7) [72], that allows partially aligned domain-specific networks to have different node sizes and a different number of clusterings. In this general model, $P_{ij}$ is the mapping matrix between $U_i$ and $V_j$, and $Q_{ij}$ is defined as $Q_{ij} = P_{ij}(P_{ij})^T$. In the end, the clustering results are derived from the factor matrices $U_1, \ldots, U_g$.

Spectral-based MvC
Spectral clustering is a classic data clustering paradigm. The basic idea is to form a pairwise affinity matrix between all pairs of objects, normalize this affinity matrix, and compute the eigenvectors of the normalized affinity matrix (or, equivalently, of the graph Laplacian). It has been shown that the second eigenvector of the normalized graph Laplacian is a relaxation of a binary indicator vector that minimizes the normalized cut on a graph, which establishes the connection between spectral methods and graph partitioning. In Ref. [19], De Sa developed a spectral clustering algorithm on two independent views, each of which could be fed into a clustering model. This spectral-based MvC algorithm created a bipartite graph with a minimizing-disagreement criterion [46,47] in order to connect the two-view features, and then performed available spectral clustering algorithms on this bipartite graph. Zhou and Burges [77] investigated multi-view spectral clustering by generalizing the normalized cut from a single view to multiple views, considering how to learn a clustering that is close to the optimal solution for all graphs, and further developed a multi-view transductive inference on the basis of multi-view spectral clustering. Similar work has been carried out in Ref. [78], which also intended to find a balanced cut that could separate all similarity graphs well. In addition, Long et al. [79] developed a general model for multi-view unsupervised learning under a distributed framework, with the aim of detecting hidden patterns individually from each representation of multiple views, and of seeking optimal hidden patterns from these discovered patterns. The authors put forward the concept of a mapping function in order to make the patterns from various pattern spaces comparable in this general model. Hence, an optimal pattern was obtained from these various patterns of multiple representations. Instead of committing to one clustering solution, Niu et al. [80] proposed a method that can provide several non-redundant clustering solutions. This method learns non-redundant subspaces for multiple views, and simultaneously produces a clustering solution for each view. To address the issue that data may contain considerable noise, Xia et al. [81] used Markov chains in order to formulate a multi-view spectral clustering model. This model has the flavor of low-rank and sparse decomposition. It first derives a transition probability matrix from each single view, and then uses these matrices in order to form a shared low-rank transition probability matrix. Finally, this shared matrix is input to a standard Markov chain model for clustering.
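As a minimal illustration of the basic single-view procedure outlined at the start of this subsection (affinity matrix, normalization, eigenvectors), the following Python sketch bipartitions a small dataset by the sign of the second eigenvector of the normalized graph Laplacian. The Gaussian affinity, the bandwidth `sigma`, and the function names are assumptions made for illustration only; they are not taken from any of the cited methods.

```python
import numpy as np


def normalized_laplacian(W):
    """Return L = I - D^{-1/2} W D^{-1/2} for a symmetric affinity matrix W."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))  # guard against isolated nodes
    return np.eye(W.shape[0]) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]


def two_way_spectral_cut(X, sigma=1.0):
    """Split the rows of X into two groups (labels 0/1).

    Assumption: a Gaussian (RBF) affinity with bandwidth `sigma` is used;
    any symmetric non-negative affinity could be substituted.
    """
    # Pairwise squared Euclidean distances between all samples.
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # no self-loops

    L = normalized_laplacian(W)
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    _, eigvecs = np.linalg.eigh(L)
    second = eigvecs[:, 1]  # relaxed indicator vector of the normalized cut
    return (second > 0).astype(int)


if __name__ == "__main__":
    X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                  [3.0, 3.0], [3.1, 3.0], [3.0, 3.1]])
    print(two_way_spectral_cut(X))
```

The sign of an eigenvector is arbitrary, so which group receives label 0 may differ between runs or platforms; only the grouping itself is meaningful. For more than two clusters, the usual practice is to run K-means on the first k eigenvectors instead of thresholding a single one.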
To handle large-scale problems and improve computational efficiency, Li et al. [82] offered a multi-view spectral clustering algorithm for large-scale multi-view data. This algorithm uses local manifold fusion in order to fuse heterogeneous features, and bipartite graphs in order to approximate the similarity graphs. Moreover, Chikhi [83] presented a multi-view normalized cuts approach, a parameter-free multi-view spectral clustering algorithm based on spectral partitioning and local refinement. Lu et al. [84] studied convex sparse spectral clustering with sparse regularization for single-view data, and proposed a pairwise sparse spectral clustering for handling multi-view data. However, as the number of views grows, it is scarcely possible to avoid dependencies among views, and these dependencies often mislead predictions. To address these issues, Son et al. [85] extended traditional spectral clustering in order to deal with the dependencies among views. Specifically, they designed a brainstorming process in order to force the information of each view to be shared among all views.
Several MvC methods have also been studied by combining spectral clustering with other technologies. For instance, Huang et al. [86] developed an affinity aggregation spectral clustering algorithm by extending spectral clustering to a setting with multiple available affinities. Shao et al. [87] designed a multi-source MvC framework based on collective spectral clustering with a discrepancy penalty across sources. Note that this method is applicable to incomplete multi-view data. Moreover, Feng et al. [88] introduced multi-view spectral clustering via robust local subspace learning, under the assumption that all views are noisy and are derived from a robust unified subspace. Wang et al. [89] proposed an iterative low-rank based structured optimization method for multi-view spectral clustering, which encodes the local manifold structure of the data from each view-dependent feature space, and arrives at a multi-view agreement through an iterative process. Zhao et al. [90] discussed semi-supervised MvC, and presented a multi-view matrix completion method with a pairwise similarity matrix in order to utilize side information, namely, must-link and cannot-link constraints.
In addition, there are also spectral-based MvC methods for multi-type relational data [91], multi-modal image data [92], and social media data [93], as well as some applications of spectral-based MvC, such as multi-modal brain network inference [94], social circle detection [95], and human microbiome data analysis [96].
Example 5: In general, spectral clustering methods involve two time-consuming steps. The first step, constructing the similarity (or affinity) graph, takes $O(n^2 d)$ time, while the second step, computing the eigen-decomposition, takes $O(kn^2)$ time. Another drawback is that most spectral clustering methods do not provide a natural extension for dealing with the out-of-sample problem. To overcome these two issues, Li et al. [82] proposed a multi-view spectral clustering approach for large-scale multi-view data. In summary, the designed algorithm uses local manifold regularization to fuse heterogeneous features, and approximates the similarity graphs with bipartite graphs in order to improve efficiency. It is also easily extended in order to deal with the out-of-sample problem. First, it generates a few consensus salient points for all views. These salient points are employed in order to capture the manifold of the original views. Then, a bipartite graph is constructed between the raw data points and the salient points for each view. The graphs of all views are fused together with a local manifold regularization term. Finally, it applies a spectral clustering algorithm on the resulting fused graph and outputs the clustering indicator of the salient points, in order to deal with the out-of-sample problem efficiently.
Here, the two important questions that need to be stressed are how to reach consensus across all views, and how to express their relationship. With local manifold learning, the two questions mentioned above are formulated with the following function [82]:
$$\min_{F,\,\alpha_i} \sum_{i=1}^{m} (\alpha_i)^r \, \mathrm{tr}(F^T L_i F), \quad \text{s.t. } F^T F = I,\ \sum_{i=1}^{m} \alpha_i = 1,\ \alpha_i \geq 0, \tag{8}$$
where $\alpha_i$ is the non-negative normalized weight factor for the $i$-th view, $\mathrm{tr}(\cdot)$ is the trace of a matrix, $r$ is a scalar to control the distribution of different weights among different views, $F \in \mathbb{R}^{n \times k}$ is the class indicator matrix, and $L_i$ is the normalized graph Laplacian matrix for the $i$-th view. The normalized graph Laplacian matrix [82] of a view is formulated as
$$L_i = I - D^{-1/2} W D^{-1/2}, \tag{9}$$
where $W \in \mathbb{R}^{n \times n}$ is the adjacency matrix of the graph of that view, and $D \in \mathbb{R}^{n \times n}$ is the degree matrix whose $i$-th diagonal element is $d_{ii} = \sum_{j=1}^{n} w_{ij}$.
Formula (8) aims to provide a consensus result $F$ among all views. This unique consensus eliminates the requirement for computing the local results for each view, and the computational overhead of communicating back and forth between the local results and the global result. To further uncover inter-view relationships, Formula (8) [82] is rewritten as
$$\min_{F^T F = I,\ \alpha_i} \mathrm{tr}(F^T L F), \tag{10}$$
where $L = \sum_{i}^{m} (\alpha_i)^r L_i$. Note that $L$ is regarded as the local manifold fusion of all views. Formula (10) is solved by iterative optimization techniques. The total computational complexity is approximately $O(Tkn^2 + \sum_{i}^{m} mn(d_i)^2)$, where $T$ is the number of iterations.
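One possible way to realize the iterative optimization of Formula (10) is to alternate between the consensus indicator $F$ and the view weights. The Python sketch below is an illustration under stated assumptions, not the procedure of Ref. [82]: it works on dense Laplacians, omits the salient points and the bipartite-graph approximation, and uses a closed-form weight update, $\alpha_i \propto (1/h_i)^{1/(r-1)}$ with $h_i = \mathrm{tr}(F^T L_i F)$, which follows from a Lagrange multiplier argument for $r > 1$. The names `fuse_views`, `n_iter`, and the default `r=2.0` are hypothetical.

```python
import numpy as np


def normalized_laplacian(W):
    """Return I - D^{-1/2} W D^{-1/2} for a symmetric affinity matrix W."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    return np.eye(W.shape[0]) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]


def fuse_views(laplacians, k, r=2.0, n_iter=20):
    """Alternating minimization of sum_i alpha_i^r * tr(F^T L_i F).

    Assumptions (not from the source): dense eigen-decomposition, r > 1,
    and a closed-form alpha update derived with a Lagrange multiplier.
    """
    if r <= 1.0:
        raise ValueError("this sketch requires r > 1")
    m = len(laplacians)
    alpha = np.full(m, 1.0 / m)  # start from uniform view weights
    F = None
    for _ in range(n_iter):
        # Step 1: fix alpha, fuse the Laplacians, and take the k eigenvectors
        # with the smallest eigenvalues as the consensus indicator F.
        L = sum((a ** r) * Li for a, Li in zip(alpha, laplacians))
        _, eigvecs = np.linalg.eigh(L)
        F = eigvecs[:, :k]
        # Step 2: fix F and update alpha; views with a larger trace cost
        # tr(F^T L_i F) receive smaller weights.
        h = np.array([np.trace(F.T @ Li @ F) for Li in laplacians])
        h = np.maximum(h, 1e-12)
        w = (1.0 / h) ** (1.0 / (r - 1.0))
        alpha = w / w.sum()
    return F, alpha


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    views = []
    for _ in range(3):
        A = rng.random((8, 8))
        W = (A + A.T) / 2.0
        np.fill_diagonal(W, 0.0)
        views.append(normalized_laplacian(W))
    F, alpha = fuse_views(views, k=2)
    print(alpha)
```

The rows of the returned `F` would then be clustered with K-means, as in standard spectral clustering. The dense `eigh` call costs $O(n^3)$ and is exactly the bottleneck that the bipartite-graph approximation described in the text is meant to avoid.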

Multi-view subspace clustering
Multi-view subspace clustering learns a new and unified representation for the data of all views, either from multiple subspaces or from a latent space, which makes it easier to deal with high-dimensional data when building clustering models. It has become a hot topic in the field of MvC. The general procedure of multi-view subspace clustering is illustrated in Fig. 9. The unified feature representation is obtained in one of two ways: (1) learning a unified representation from multiple subspaces directly, or (2) first learning a latent space and then arriving at the unified representation. Finally, this unified representation is fed into an off-the-shelf clustering model in order to produce the clustering results. After reviewing the literature on MvC, we divide the multi-view subspace clustering methods into two major types, namely, subspace learning-based and NMF-based (a special case of subspace learning) methods.

Subspace learning-based MvC
Subspace learning-based MvC seeks to find a latent space from multiple low-dimensional subspaces by assuming that data points are drawn from this latent subspace. Here, we extend this concept in order to make its role more general. In this paper, the technologies involved in subspace learning-based MvC include subspace learning, subspace clustering, subspace projection, low-rank approximation, and tensor decomposition.
Assuming that all views are conditionally independent given the clustering labels, Chaudhuri et al. [97] presented a multi-view subspace learning method based on canonical correlation analysis. This method provides auxiliary results for Gaussian mixtures and log-concave distribution mixtures. Guo [98] proposed a convex subspace representation learning method for MvC. The key idea is to detect a shared subspace representation across multiple views, and then adopt standard clustering algorithms on this shared representation. Zhao et al. [99] developed a co-training framework for multi-view subspace clustering. It combined classical K-means and linear discriminant analysis under a co-training scheme, which utilized labels learned automatically in one view in order to generate discriminative subspaces in another. Deng et al. [100] put forward a feature weighting method based on subspace learning, which locally adapts the feature weighting of each group automatically according to the tightness of views. Moreover, Cao et al. [101] stated that exploiting the specific independently constructed matrices is insufficient for the success of MvC, and that exploring the underlying complementarity is of great importance. To this end, they designed the diversity-induced multi-view subspace clustering framework. Concretely, they extended the existing single-view subspace clustering to the multi-view domain, and utilized the Hilbert-Schmidt Independence Criterion as a diversity term in order to explore the complementarity of multi-view representations. However, many studies have focused on combining information rather than on improving the feature representation capability of each view. To solve this problem, Wang et al. [102] presented a framework with an extreme learning machine, and implemented three algorithms on this framework. Unlike the methods in Refs. [98,102] that perform subspace clustering on a common view, Gao et al. [103] performed subspace clustering on every view simultaneously, while guaranteeing the clustering consistency among different views by adopting a common indicator. In addition, Xu et al. [104,105] proposed an MvC method called discriminatively embedded K-means, which embedded the synchronous learning of multiple discriminative subspaces into multi-view K-means clustering in order to formulate a unified framework, while adaptively controlling the coordination between these subspaces. To effectively exploit data correlation consensus among multiple views, subspace clustering with a similarity matrix for multi-view data was studied in Refs. [106,107], where the authors intended to find a correlation or similarity consensus among all views. This was inspired by the idea that, for each view, data objects within the same subspace have large similarity, while data objects from distinct subspaces have small similarity. Rather than using a similarity matrix, Fan et al. [108] introduced global low-rank constraints and local cross-topology preserving constraints into subspace clustering for the purpose of characterizing data correlations. Several methods have also been investigated in combination with other technologies, such as sparse subspace clustering [109,110], low-rank approximation [111], and tensor decomposition [112][113][114][115], to name a few.
Unlike the above mentioned methods or frameworks, which output just a single clustering, Cui et al. [116,117] presented pioneering work on finding alternative and multiple clustering solutions based on subspace learning. They designed an MvC framework in order to find all non-redundant data clustering views, and suggested two methods within this framework, i.e., orthogonal clustering, and clustering in orthogonal subspaces. The difference is that the former seeks orthogonality in the cluster space, while the latter does so in the feature space. Similar work has also been carried out in Ref. [118], where multiple generalizations of the data are provided by using multiple mixture models. Each mixture describes a specific view on the data by using a mixture of Beta distributions in subspace projections. Moreover, Muller et al. [119] presented a short tutorial on this topic.
In a semi-supervised clustering setting, Günnemann et al. [120] developed a new Bayesian framework for semi-supervised MvC based on their previous work [118], which also sought to detect multiple and alternative clusterings. This new framework modeled multi-view data with several multivariate mixture distributions located in subspace projections, and handled prior knowledge in the form of sample-level constraints in order to indicate which objects should or should not be grouped together. Moreover, Yin et al. [121] presented a pairwise sparse subspace representation model for MvC. The designed model harnessed the prior information in order to obtain the view-specific sparse representation, while utilizing the correlation between different views. In addition, Cao et al. [122] put forward a constrained multi-view video face clustering method, which considers both the video face pairwise constraints and the multi-view consistency simultaneously. Unlike some existing clustering methods that only employ these constraints in the clustering phase, this method enforces the pairwise constraints throughout the entire framework, namely, in both sparse subspace representation and spectral clustering.
Incomplete MvC based on subspace learning has also been investigated. For instance, Yin et al. [123,124] proposed an incomplete MvC method, which unified subspace learning, feature selection, and inter-view and intra-view similarity into a single objective function. It learns a latent representation for incomplete multi-view data, where this latent representation serves as an approximation of the normalized indicator matrix. Xu et al. [125] suggested that the key to dealing with the incomplete view problem is to exploit the connections between different views, which enables incomplete views to be restored with the assistance of complete views. They investigated the estimation of incomplete views with the help of information from other observed views through a shared subspace.
Example 6: Multi-view data is often incomplete, namely, data objects have incomplete feature sets. Based on subspace learning, Yin et al. [123,124] studied incomplete multi-view learning for incomplete and unlabeled multi-view data. Figure 10 shows the presented subspace learning model, which learns a unified latent representation for incomplete multi-view data. This model directly optimizes the class indicator matrix, which establishes a bridge for incomplete feature sets. In addition, feature selection is considered in order to deal with high-dimensional and noisy features, and the structures of the inter-view data and intra-view data are preserved in order to enhance learning performance. To this end, an objective function was developed along with an efficient optimization algorithm.
Fig. 10 The overview of the proposed model with two views, i.e., text and image [124]. For the incomplete multi-view dataset, it uses a projection matrix in order to project the original features (text and image) to a latent space, which explicitly captures the clustering structure. Group sparsity is imposed on the projection matrices for feature selection, and inter-view and intra-view data similarities are preserved in order to enhance the model. Finally, the features on this latent space are applied to the clustering task. (Panel labels: Incomplete Multi-view Data; Sparse Projection; Latent Space Learning; Inter-view and intra-view data similarities preserving; Clustering.)
In the $v$-th view, separate data matrices are used for the complete and the partial instances. The model derives a projection matrix $U_v \in \mathbb{R}^{d_v \times k}$ for each view in order to project the original spaces to a unified space. The objective function [124], Formula (11), is minimized subject to the constraints $Y \in \{0,1\}^{n \times k}$ and $Y\mathbf{1}_k = \mathbf{1}_n$, and has three terms: (1) the projection of each incomplete view to the latent space defined by $Y$ by means of the projection matrix; (2) feature selection for each view based on $\ell_{2,1}$-norm regularization; and (3) the preservation of inter-view and intra-view data similarity, defined through the Laplacian matrix $L_{pq}$. Moreover, the constraints imposed on $Y$ guarantee that each example belongs to one group only.

NMF-based MvC
NMF, which was originally investigated as a dimensionality reduction technique [126], has emerged as an effective latent feature learning method. The non-negative constraint leads to a parts-based representation of samples, which accords with psychological and physiological evidence on the cognitive process of the human brain. Let $X \in \mathbb{R}^{d \times n}$ be a non-negative input data matrix in which each column is the feature vector of one sample.
NMF aims to find two non-negative matrices $W \in \mathbb{R}^{d \times p}$ and $H \in \mathbb{R}^{p \times n}$ whose product adequately approximates the original matrix $X$. Here, $W$ is termed the basis matrix (basis space), $H$ is the coefficient matrix (representation features), and $p$ (in general, $p \ll \min\{n, d\}$) denotes the desired reduced dimension. The reconstruction process [126] can be formulated as a Frobenius norm optimization problem, defined as
$$\min_{W \geq 0,\ H \geq 0} \|X - WH\|_F^2. \tag{12}$$
Moreover, many variants of NMF have also been put forward, such as G-orthogonal NMF [127], Regularized NMF [128,129], Convex and Semi-NMF [130], and Multilayer NMF [131]. Li and Ding [132] presented a survey on NMF for clustering, in which more details of NMF can be found.
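The Frobenius-norm reconstruction problem above, i.e., minimizing $\|X - WH\|_F^2$ over non-negative $W$ and $H$, is commonly solved with multiplicative update rules. The Python sketch below shows one such solver as an illustration; the function name `nmf`, the random initialization, the iteration count, and the small constant `eps` are assumptions and are not prescribed by the text.

```python
import numpy as np


def nmf(X, p, n_iter=200, seed=0, eps=1e-10):
    """Approximate a non-negative matrix X (d x n) by W (d x p) times H (p x n).

    Uses the classical multiplicative updates for the squared Frobenius
    loss; `eps` avoids division by zero and is an implementation choice.
    """
    rng = np.random.default_rng(seed)
    d, n = X.shape
    W = rng.random((d, p)) + eps  # basis matrix, one column per basis vector
    H = rng.random((p, n)) + eps  # coefficient matrix, one column per sample
    for _ in range(n_iter):
        # Element-wise updates keep W and H non-negative by construction.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.random((20, 3)) @ rng.random((3, 15))  # non-negative, rank 3
    W, H = nmf(X, p=3)
    print(np.linalg.norm(X - W @ H))
```

For clustering, a common convention is to assign sample $j$ to the component with the largest entry in column $j$ of $H$, for example `H.argmax(axis=0)`.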
In a multi-view scenario, a late integration approach via NMF was studied in Ref. [133]. The proposed approach takes the clustering results generated independently on each available view, constructs an intermediate matrix representation for these clustering results, and performs NMF on this representation in order to reconcile the groups arising from individual views. Unlike the approach presented in Ref. [133], which plugs the clustering results into NMF, Liu et al. [134] developed a new NMF-based MvC framework, which feeds the data directly into NMF and derives a fused representation. Reference [134] formulates a joint matrix factorization with a normalization strategy that pushes the representation of each view toward a common consensus. Moreover, it provides a new insight into applying NMF to MvC. Inspired by this framework, several improvements have also been investigated. For instance, a semi-supervised MvC algorithm based on NMF with a weight for each view was studied in Refs. [16,135], which discovers a partially shared latent representation. With this learned representation of multi-view data, a robust sparse regression model was introduced in order to predict the clustering results. By embedding the similarity matrices of the data points into NMF, He et al. [136] also focused on learning a shared latent representation for MvC. Chang et al. [137] developed a multi-view NMF algorithm and applied it to clothing image clustering, by adopting a new regularization term in order to promote the structural incoherence between the representations of different views. Considering the local geometric structure of each view, and penalizing the disagreement of different views at the same time, Ou et al. [138] proposed another type of multi-view NMF with a patch alignment strategy. Instead of treating the objects or views as distinct positions, Xu et al. [139] introduced a new self-paced learning algorithm with a smoothed weighting scheme, which inherits the merits of the logistic function and provides probabilistic weights.
Many other methods based on the NMF variants have also been investigated. Based on G-orthogonal NMF [127], Cai et al. [140] presented a robust multi-view K-means clustering algorithm for large-scale multi-view data. Qian and Zhai [141] proposed a multi-view unsupervised method for text-image web news data, where image local learning regularized orthogonal NMF was adopted in order to learn pseudo labels, and a robust joint $\ell_{2,1}$-norm was used in order to select discriminative features. Zhao et al. [142] presented a deep matrix factorization framework via Multi-layer NMF [131] for MvC, where Semi-NMF [130] was used to learn a hierarchical representation for multi-view data in a layer-wise manner. Multiple sparse views clustering approaches with the $\ell_{2,1}$-norm and the group $\ell_1$-norm have also been investigated in Refs. [143][144][145]. Moreover, graph (or manifold) regularized NMF for MvC has also attracted attention. Graph regularized NMF [129] is an extension of NMF, which has been shown to improve the quality of the factorization of $X$ based on a manifold assumption, i.e., if two data points are close in the intrinsic geometry, then the representations of these two data points in the new basis space are also close to each other. Motivated by Refs. [129,134], Hidru and Goldenberg [146] and Wang et al. [147] each investigated graph-regularized multi-view NMF-based clustering; the two approaches differ only slightly. Rather than taking graph regularization from unlabeled data, Guan et al. [148] constructed the graph embedding framework through partial label information and considered the sparseness constraints at the same time. Multi-manifold regularized NMF was studied in Refs. [149,150], in which multiple manifolds were combined linearly and two kinds of MvC methods were derived with different strategies. Based on SymNMF [151], Zhang et al. [152] introduced a graph regularized symmetric NMF framework for MvC. Furthermore, graph regularized MvC approaches via concept factorization [153] were discussed in Refs. [154,155].
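To make the manifold assumption behind graph regularized NMF concrete, the single-view objective is commonly written as follows, using the notation $X \approx WH$ of this subsection. The symbols $\lambda$ (trade-off parameter), $S$ (affinity matrix of a nearest neighbor graph over the $n$ samples), and $D$ (its degree matrix) are introduced here for illustration; this is the widely used form and is given as a sketch rather than as the exact formulation of any cited multi-view method.

```latex
\min_{W \ge 0,\; H \ge 0} \; \|X - WH\|_F^2
  + \lambda \, \mathrm{tr}\!\left( H L H^{T} \right),
\qquad L = D - S, \quad d_{ii} = \sum_{j=1}^{n} s_{ij}
```

The trace term equals $\frac{1}{2}\sum_{i,j} s_{ij} \|h_i - h_j\|^2$, where $h_i$ is the $i$-th column of $H$, so samples connected by a large affinity $s_{ij}$ are penalized for having distant representations.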
In the research field of incomplete MvC, Shao et al. [156,157] made some attempts via NMF. The main idea was to incorporate weighted NMF in order to handle the missing objects in each incomplete view, and to push the learned latent representation towards a consensus. Shao et al. [158] also proposed a general framework for incomplete data via tensor modeling and factorization. This framework first uses the kernel matrices in order to generate an initial tensor across all views, and then formulates a joint tensor factorization process with a sparsity constraint. This process is used to iteratively push the initial tensor towards an exploration of the latent factors. Moreover, the late fusion method based on NMF [133] can also handle incomplete views, as its authors note. In addition, Li et al. [159] presented a partial multi-view clustering algorithm (named PVC), which is specifically designed for two-view datasets. It employs NMF in order to learn a latent subspace, in which the samples belonging to the same group are close to each other and the similar samples from the same view should be grouped well. Later on, Qian et al. [160] improved PVC by considering the cluster similarity and manifold preserving constraints. Furthermore, Rai et al. [161] extended PVC to support more than two views and view-specific graph Laplacian regularization. With the help of inter-view constraints (i.e., must-link and cannot-link constraints), Zhang et al. [162] defined a disagreement between each pair of views in order to guide the factorization process.
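The role of weighted NMF in handling missing objects can be illustrated with a generic, single-view sketch in Python. It minimizes $\|M \odot (X - WH)\|_F^2$, where the binary mask $M$ marks observed entries, so that missing samples (all-zero columns of $M$) do not influence the learned basis. This is a textbook-style weighted NMF under assumed names (`weighted_nmf`, `M`, `eps`); it is not the specific algorithm of any cited reference and omits the cross-view consensus term.

```python
import numpy as np


def weighted_nmf(X, M, p, n_iter=300, seed=0, eps=1e-10):
    """Masked NMF: fit X (d x n) by W @ H on entries where M == 1.

    M has the same shape as X; a column of zeros marks a sample that is
    missing in this view. All names here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    d, n = X.shape
    W = rng.random((d, p)) + eps
    H = rng.random((p, n)) + eps
    MX = M * X  # observed part of the data
    for _ in range(n_iter):
        # Multiplicative updates in which only observed entries contribute.
        H *= (W.T @ MX) / (W.T @ (M * (W @ H)) + eps)
        W *= (MX @ H.T) / ((M * (W @ H)) @ H.T + eps)
    return W, H


if __name__ == "__main__":
    rng = np.random.default_rng(2)
    X = rng.random((10, 3)) @ rng.random((3, 12))
    M = np.ones_like(X)
    M[:, [4, 9]] = 0.0  # samples 4 and 9 are missing in this view
    W, H = weighted_nmf(X, M, p=3)
    print(np.linalg.norm(M * (X - W @ H)))
```

In this sketch the columns of `H` that belong to fully missing samples shrink to zero, because nothing in the view constrains them. In a multi-view method, such representations would have to come from the other views, for example through the consensus term mentioned in the text.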
Example 7: Wang et al. [154] proposed an auto-weighted Multi-view Concept Clustering (MvCC) method based on concept factorization with local manifold regularization. The MvCC framework is shown in Fig. 11. In brief, concept factorization [153] is a variant of NMF that can handle data containing negative values and is also easily performed in the kernel space. Furthermore, local manifold regularization is incorporated into the concept factorization process in order to preserve the local geometrical structure of the original data space. The weight of each view is determined automatically, and the given co-normalized scheme makes the fusion meaningful in terms of deriving the common consensus representation. In addition, the clustering results are derived directly from the common consensus representation, without requiring additional clustering steps. This is possible because the consensus matrix is sparse.
Given multi-view data $X = \{X_1, \ldots, X_m\}$, the objective function of MvCC [154], Formula (13), is defined in terms of the following quantities: $W_v$ is the association matrix, $H_v$ is the representation matrix, $H$ is the consensus representation matrix, $L_v$ is the Laplacian matrix, $\omega_v$ is the weight of the $v$-th view, and $\alpha$ and $\beta$ are the trade-off parameters. It is worth stressing that the weight $\omega$ is determined automatically, while the parameters $\alpha$ and $\beta$ are set to suggested empirical values. As the objective function in Formula (13) is not convex over all the variables simultaneously, an alternating iterative algorithm based on multiplicative update rules was developed to optimize it. More details can be found in Ref. [154].

Multi-task multi-view clustering
MvC exploits the consistency and complementarity among different views in order to achieve better clustering quality, as mentioned above. Another concept, namely, multi-task clustering (belonging to the field of multi-task learning [163]), performs multiple related tasks together and utilizes the relationship between these tasks in order to enhance clustering performance for single-view data. By inheriting the properties of both MvC and multi-task clustering, Multi-task Multi-view Clustering (M$^2$vC) treats the data of each individual view with one task or multiple tasks, as shown in Fig. 12. This has received some attention in recent years. The main challenges of M$^2$vC consist of finding a way to model the intra-task (within-task) clustering on each view, and a way to exploit the multi-task and multi-view relationships, while transferring the inter-task (between-task) knowledge from one task to another. Here, we provide a review on M$^2$vC in order to attract further attention and promote research in this area.
By assuming that a common underlying subspace is shared by multiple related tasks, Gu and Zhou [164] proposed a cross-domain multi-task clustering method by treating each view as a task. This method aims to learn such a subspace, through which the knowledge of one task can be transferred to another. Note that the authors also assumed that the dimensionality of the feature vector and the number of clusters are the same for every task. Later on, Zhang and Zhou [165] relaxed these assumptions, and introduced an improved cross-domain multi-task clustering method, which can perform multiple related clustering tasks simultaneously through domain adaptation. Besides, Xie et al. [166] presented a multi-task co-clustering method based on 3-factor NMF. The objective function of this method consists of two parts, i.e., task-specific co-clustering and cross-task feature space regularization. The multi-task clustering method via SymNMF [151] for multi-view data has also been studied in Ref. [167], where several tasks were performed simultaneously with a geometric affine transformation in order to control intra-task and inter-task knowledge sharing. In addition, Wang et al. [168] explored multi-view spectral clustering by using a multi-objective formulation (seen as a multi-task problem), which is solved by Pareto optimization. Wahid et al. [169,170] studied a multi-objective MvC ensemble method based on an evolutionary approach. Zhang et al. [171] developed a multi-task clustering algorithm by transferring the knowledge of instances. The proposed algorithm learns a shared subspace, and constructs a shared nearest neighbor similarity matrix for each individual task. Then, it applies a traditional spectral clustering method on the shared nearest neighbor similarity matrix of each task. Shi et al. [172] incorporated spectral clustering and discriminative analysis into a unified framework by exploiting the correlation information between multiple views, where spectral clustering aims to discover the cluster structure, and the discriminative analysis aims to preserve the structure. Moreover, Zhang et al. [173,174] presented an M²vC framework, which integrates within-view-task clustering, multi-view relationship learning, and multi-task relationship learning. Under this framework, they proposed two M²vC algorithms, i.e., the bipartite graph based M²vC algorithm, and the semi-NMF based M²vC algorithm.
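To make the shared nearest neighbor similarity mentioned for Ref. [171] more concrete, the sketch below computes a generic version of it: the similarity of two samples is the number of k-nearest neighbors they have in common. This is only the textbook definition applied to raw points. Ref. [171] computes it in a learned shared subspace, which is not reproduced here, and the function names and the Euclidean distance are assumptions.

```python
import math

def knn_indices(points, k):
    """For each point, return the set of indices of its k nearest neighbors."""
    neighbors = []
    for i, p in enumerate(points):
        # Sort all other points by Euclidean distance to p.
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbors.append({j for _, j in dists[:k]})
    return neighbors

def snn_similarity(points, k):
    """Shared nearest neighbor similarity matrix (diagonal left at 0).

    S[i][j] counts the neighbors shared by the k-NN lists of i and j.
    """
    nn = knn_indices(points, k)
    n = len(points)
    S = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            S[i][j] = S[j][i] = len(nn[i] & nn[j])
    return S
```

The resulting matrix could then be passed as the affinity matrix to a standard spectral clustering routine, one task at a time.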
Upon reviewing the literature on MvC, we found that some existing work is difficult to assign to any of the five above-mentioned categories. Here, we provide a brief summary, including multi-modal clustering based on Markov random fields [175], multi-view clustering ensembles based on multi-view spectral clustering and multi-view kernel K-means clustering with ensemble technology [176], bi-level weighted MvC based on K-means [177][178][179][180], and multi-view fuzzy clustering [181][182][183][184].
Example 8: In multi-task multi-view settings, the tasks are related through common views. The key step of M²vC is to link the features in the common view in order to integrate the related tasks. In the field of M²vC, Zhang et al. [174] developed a typical M²vC framework based on co-clustering. An illustration of this framework is shown in Fig. 13, where the square region represents the set of data samples, and the circular region represents the set of data features under a view in each task, as described in Ref. [174]. Note that the samples of task 1 and task 2 have a common view, which contains task-shared features (denoted by the light gray overlapping area) and task-specific features (denoted by the light gray non-overlapping area). This framework consists of three components: within-view-task clustering, multi-view relationship learning, and multi-task relationship learning. Under this framework, they proposed two M²vC algorithms. One is the bipartite graph based M²vC algorithm, which only handles data containing non-negative values. The other is the semi-non-negative matrix tri-factorization based M²vC algorithm, which is a general M²vC method, i.e., it can deal with data containing negative or non-negative values.
Given $T$ clustering tasks, each task $t$ is covered by $m_t$ views. $S$ is the index collection of the common views, and $T_i$ is the index collection of all tasks under the common view $i \in S$. Within-view-task clustering treats the data objects in each view of each task with co-clustering, which accomplishes the essential part of the whole algorithm and ensures the preservation of the knowledge available locally at each view of each task in order to avoid negative transfer. Multi-view relationship learning minimizes the disagreement between the clusters of data under each pair of views in each task. Multi-task relationship learning uses co-clustering in order to derive a shared subspace among the related tasks under each common view by assuming that related tasks should share some common or relevant features. The M²vC clustering framework [174] is formalized as a weighted combination of three terms: $H_1$ co-clusters the data objects and features of all the views in each task $t$; $\sum_{j \neq i}^{m_t} H_2$ minimizes the disagreement between the clustering assignments of any two different views in each task $t$; and $\sum_{i \in S} \sum_{t \in T_i} H_3$ obtains the shared subspace under each common view by the same co-clustering method as the first component. The last two terms are weighted by trade-off parameters. Under this framework, two specific clustering algorithms were investigated in Ref. [174].
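The multi-view relationship term can be illustrated with a simple stand-in. The sketch below measures the disagreement between two hard clusterings of the same samples as the squared Frobenius distance between their co-membership matrices, and sums it over all ordered pairs of different views within one task. This proxy is an assumption chosen for illustration: the actual $H_2$ in Ref. [174] is defined on the co-clustering factor matrices, and the function names here are hypothetical.

```python
def co_membership(labels):
    """n x n matrix with 1 where two samples share a cluster, else 0."""
    n = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(n)]
            for i in range(n)]

def disagreement(labels_a, labels_b):
    """Squared Frobenius distance between two co-membership matrices.

    Invariant to a relabeling (permutation) of the cluster indices.
    """
    A = co_membership(labels_a)
    B = co_membership(labels_b)
    return sum((a - b) ** 2
               for row_a, row_b in zip(A, B)
               for a, b in zip(row_a, row_b))

def total_disagreement(view_labels):
    """Sum of pairwise disagreements over all ordered view pairs i != j."""
    m = len(view_labels)
    return sum(disagreement(view_labels[i], view_labels[j])
               for i in range(m) for j in range(m) if j != i)
```

Because the measure compares co-membership rather than raw labels, two views that produce the same partition under different cluster numbering contribute zero disagreement.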

Publicly Available Datasets
To support researchers working in the field of MvC, we summarize some widely used multi-view datasets. For these publicly available datasets, we provide their URLs.
3Sources Dataset: A multi-view text corpus, constructed from news articles from three online news services. The same repository also hosts the Multi-View Twitter Datasets, a collection of Twitter datasets for social network discovery, and the BBC and BBCSport Datasets, two synthetic text datasets originating from BBC News.
WebKb Datasets: These datasets contain web-page data collected from the computer science departments of four universities, yielding four multi-view datasets.
Newsgroup Datasets: These are subsets of the NG20 dataset with three different preprocessings. The description of the subsets, and details on the preprocessing steps can be found in Ref. [ ].
Handwritten Digit Dataset: It consists of features of handwritten numerals (0-9) from the UCI repository.
100leaves Dataset: It contains leaves from one hundred plant species, with sixteen samples per species. For each sample, the shape descriptor, fine scale margin, and the texture histogram are given.
Corel Images Dataset: This dataset consists of image features extracted from a Corel image collection. It provides four sets of features, namely, color histogram, color histogram layout, color moments, and co-occurrence texture.
NUS-WIDE Dataset: A web image dataset with six types of low-level features extracted from the images.
YouTube Video Dataset: This dataset contains approximately $1.2 \times 10^5$ instances, where each instance is described by 13 types of features, and also has its class information.

Conclusion and Discussion
The proliferation of multi-view data calls for advanced clustering technologies that can discover knowledge from multi-view datasets. This paper surveyed most of the existing algorithms and technologies of MvC, and classified these MvC algorithms into five categories, i.e., co-training style algorithms, multi-kernel learning, multi-view graph clustering, multi-view subspace clustering, and multi-task multi-view clustering. For each category of MvC, we not only reviewed the existing algorithms, but also introduced the ideas and technologies behind them, while giving specific illustrative examples.
Although MvC was proposed around 2003, as mentioned in the introduction, there is no criterion for deciding which MvC algorithm is the best, since different methods have their own advantages and disadvantages. In brief, co-training style algorithms can enhance the clusters of different views interactively by exchanging information; however, they are intractable when the number of views is more than three. Kernel-based MvC inherits the advantages of kernel methods, but incurs high computational complexity. Multi-view graph clustering builds on spectral graph theory, but relies on the constructed affinity (or similarity) matrices. Multi-view subspace clustering methods are straightforward to interpret, but depend on initialization. Multi-task multi-view clustering inherits the properties of both multi-task clustering and multi-view clustering; however, it is still in its infancy.
Encouragingly, these technologies are closely related to one another. For example, subspace learning can be performed in the kernel space. It is therefore valuable to develop a general MvC framework that inherits the merits of the different categories.
Below, we would like to highlight a number of challenging problems and future directions in order to encourage more research in MvC. Their solutions will have a fundamental impact on MvC, specifically, on multi-view data fusion, machine learning, and artificial intelligence in general.
Correctness of views: Finding a way of knowing whether a view is correct is crucial for MvC. Since MvC exploits all available views in order to improve clustering performance, incorrect views are very harmful. Although some work balances these views with weights, errors can still propagate from a misleading view to other views. Thus, this problem must be solved or mitigated to a great extent in order to ensure that MvC is effective.
The opportune moment of fusion: Existing MvC adopts three fusion strategies for multi-view data in the clustering process, namely, fusion in the data, fusion in the projected features, and fusion in the results. Most current research on MvC focuses on the second fusion strategy. However, there is no theoretical foundation for deciding which one is the best. Theoretical and methodological research is required in order to uncover their essence.
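As a minimal sketch of the third strategy, fusion in the results, the code below combines the partitions obtained separately on each view through a weighted co-association matrix and then groups samples whose co-association exceeds a threshold. The per-view weights, the threshold value, and the function names are assumptions made for illustration, and the method is a generic ensemble device rather than any specific algorithm surveyed above.

```python
def co_association(partitions, weights=None):
    """Weighted fraction of views in which each pair of samples co-clusters.

    partitions : list of label lists, one per view, all of equal length.
    weights    : optional per-view weights summing to 1 (uniform if omitted).
    """
    m, n = len(partitions), len(partitions[0])
    if weights is None:
        weights = [1.0 / m] * m
    C = [[0.0] * n for _ in range(n)]
    for labels, w in zip(partitions, weights):
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    C[i][j] += w
    return C

def consensus_clusters(C, threshold=0.5):
    """Connected components of the graph linking pairs with C[i][j] > threshold."""
    n = len(C)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and C[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels
```

Lowering the weight of a view that is suspected to be unreliable reduces its influence on the consensus, which relates this sketch to the view-weighting idea discussed under the correctness of views.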
Incomplete MvC: Although some attempts have been made to handle incomplete multi-view data, as mentioned in the section on each category, incomplete MvC is still a challenging problem. In real life, data loss occurs frequently, yet research on incomplete MvC has not been extensive. More effort is expected to be put into the investigation of incomplete MvC.
Multi-task multi-view clustering: This direction is a new trend in the research of MvC; however, it is accompanied by a few challenges, such as finding a way to explore the relationships among different tasks and different views, and finding a way to transfer knowledge between them.
In addition, several widely used datasets were listed in order to provide convenience for future researchers. In summary, this paper serves as a bridge for readers in order to further promote the research of MvC.