Trade-off Between Efficiency and Effectiveness: A Late Fusion Multi-view Clustering Algorithm

Late fusion multi-view clustering (LFMVC) algorithms aim to integrate the base partitions of all individual views into a consensus partition. Base partitions can be obtained by performing kernel k-means clustering on each view. This type of method is not only computationally efficient, but also more accurate than multiple kernel k-means, and is thus widely used in the multi-view clustering context. LFMVC improves computational efficiency by reducing the computational complexity of each iteration from O(n³) to O(n) (where n is the number of samples). However, LFMVC also limits the search space of the optimal solution, meaning that the clustering results obtained are not ideal. Accordingly, in order to extract more information from each base partition and thus improve the clustering performance, we propose a new late fusion multi-view clustering algorithm with a computational complexity of O(n²). Experiments on several commonly used datasets demonstrate that the proposed algorithm converges quickly. Moreover, compared with other late fusion algorithms with a computational complexity of O(n), the actual time consumption of the proposed algorithm does not increase significantly. At the same time, comparisons with several other state-of-the-art algorithms reveal that the proposed algorithm also obtains the best clustering performance.


Introduction
Many real-world datasets are naturally made up of different representations or views. Given that these multiple representations often provide compatible and complementary information, it is clearly preferable to integrate them in order to obtain better performance rather than relying on a single view. However, because data represented by different views can be heterogeneous and biased, the question of how to fully utilize the multi-view information in order to obtain a consensus representation for clustering purposes remains a challenging problem in the field of multi-view clustering (MVC). In recent years, various kinds of algorithms have been developed to provide better clustering performance or higher efficiency; these include multi-view subspace clustering [1][2][3], multi-view spectral clustering [4][5][6], and multiple kernel k-means (MKKM) [7][8][9][10][11]. Among these, MKKM has been widely utilized due to its excellent clustering performance and strong interpretability.
Depending on the stage at which multi-view data fusion occurs, the existing literature on MVC can be roughly divided into two categories. The first category, which can be described as 'early fusion', fuses the information of all views before clustering. Huang et al. [7] learn an optimal kernel matrix as a linear combination of a set of pre-specified kernels for clustering purposes. To reduce the redundancy of the pre-defined kernels, Liu et al. [10] propose an MKKM algorithm with matrix-induced regularization. To further enhance the clustering performance, a local kernel alignment criterion is adopted for MKKM in [10]; moreover, the algorithm proposed in [11] allows the optimal kernel to be found in the neighborhood of the combinational kernels, which increases the search space of the optimal kernel.
Although the above-mentioned early fusion algorithms have greatly improved the clustering accuracy of MKKM in different respects, they also have two drawbacks that should not be overlooked. The first is their intensive computational complexity, usually O(n³) per iteration (where n is the number of samples). The second relates to their over-complicated optimization processes, which increase the risk of the algorithm becoming trapped in bad local minima and thus lead to unsatisfactory clustering performance.
The second category, termed 'late fusion', maximizes the alignment between the consensus clustering indicator matrix and the weighted base clustering indicator matrices under orthogonal transformations, where each base clustering indicator matrix is generated by performing clustering on a single view [12,13]. This type of algorithm can improve both efficiency and clustering performance; however, these methods also simplify the search space of the optimal clustering indicator matrix, which limits the achievable clustering performance. Accordingly, in this paper, we propose a new late fusion framework designed to extract more information from the base clustering indicator matrices. We first regard each base clustering indicator matrix as a feature representation of the samples after they are mapped from the original space to the feature space. Compared with the original features, this representation captures nonlinear relations between data [14]; moreover, this approach can also filter out some of the noise present in the kernel matrix. Therefore, the similarity matrix constructed from the base clustering indicator matrices is superior to the original kernel matrix for clustering purposes.
We refer to the proposed method as 'late fusion multi-view clustering with learned consensus similarity matrix' (LF-MVCS). LF-MVCS jointly optimizes the combination coefficient of the base clustering indicator matrices and the consensus similarity matrix. This consensus similarity matrix is then used as the input of spectral clustering [15] to obtain the final result. Furthermore, we design an effective algorithm for solving the resulting optimization problem, as well as analyzing its computational complexity and convergence. Extensive experiments on 10 multi-view benchmark datasets are then conducted to evaluate the effectiveness and efficiency of our proposed method.
The contributions of this paper can be summarized as follows: The proposed LF-MVCS integrates the information of all views in a late fusion manner. It jointly optimizes the combination coefficients of the base clustering indicator matrices and the consensus similarity matrix. To the best of our knowledge, this is the first time that a unified formulation has been obtained that can better reflect the similarity between samples via late fusion. The proposed optimization formulation has just one parameter to tune and avoids excessive time consumption. We therefore design an alternate optimization algorithm with proven convergence that can efficiently tackle the resulting problem. Our algorithm avoids complex computation owing to the smaller number of optimization variables, meaning that the objective function converges within fewer iterations.

Multiple Kernel K-means
Let $\{x_i\}_{i=1}^{n} \subseteq \mathcal{X}$ be a dataset consisting of $n$ samples, and let $\phi(\cdot): x \in \mathcal{X} \rightarrow \mathcal{H}$ be the feature mapping that transfers $x$ into a reproducing kernel Hilbert space $\mathcal{H}$. Kernel k-means clustering aims to minimize the sum-of-squares loss over the cluster assignment matrix $Z \in \{0,1\}^{n \times k}$, which can be formulated as follows:

$$\min_{Z} \sum_{i=1}^{n} \sum_{c=1}^{k} Z_{ic} \left\| \phi(x_i) - \mu_c \right\|_2^2, \quad \text{s.t.} \quad \sum_{c=1}^{k} Z_{ic} = 1, \tag{1}$$

where $\mu_c = \frac{1}{n_c} \sum_{i=1}^{n} Z_{ic}\, \phi(x_i)$ is the centroid of the $c$-th cluster and $n_c = \sum_{i=1}^{n} Z_{ic}$ is the number of samples in the corresponding cluster.
In order to optimize Eq. (1), we can transform it as follows:

$$\min_{Z} \operatorname{Tr}(K) - \operatorname{Tr}\!\left( L^{\frac{1}{2}} Z^{\top} K Z L^{\frac{1}{2}} \right), \quad \text{s.t.} \quad Z \in \{0,1\}^{n \times k}, \; Z \mathbf{1}_k = \mathbf{1}_n, \tag{2}$$

where $K$ is the kernel matrix with $K_{ij} = \phi(x_i)^{\top} \phi(x_j)$, $L = \operatorname{diag}\!\left(n_1^{-1}, \ldots, n_k^{-1}\right)$, and $\mathbf{1}_s \in \mathbb{R}^{s}$ denotes a vector with all elements being 1. In Eq. (2), the variable $Z$ is discrete, which makes the optimization problem difficult to solve. Dhillon et al. [16] let $H = Z L^{\frac{1}{2}}$ and relax $H$ to take real values, yielding

$$\min_{H} \operatorname{Tr}\!\left( K \left( I_n - H H^{\top} \right) \right), \quad \text{s.t.} \quad H^{\top} H = I_k. \tag{3}$$

Here, $I_k$ is an identity matrix of size $k \times k$. Because $L^{\frac{1}{2}} Z^{\top} Z L^{\frac{1}{2}} = I_k$, we impose an orthogonality constraint on $H$. Finally, the optimal $H$ for Eq. (3) can be obtained by taking the $k$ eigenvectors corresponding to the $k$ largest eigenvalues of $K$. Lloyd's algorithm is usually executed on $H$ to obtain the clustering labels.
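As an illustration of the relaxation above, the following minimal Python sketch (function names are our own, not the paper's code) computes the relaxed $H$ from the top-$k$ eigenvectors of $K$ and then runs Lloyd's algorithm on its rows; a deterministic farthest-point initialisation is used for simplicity.

```python
import numpy as np

def kernel_kmeans_relaxed(K, k, iters=100):
    """Relaxed kernel k-means (Dhillon et al.): the optimal H of Eq. (3)
    consists of the top-k eigenvectors of K; cluster labels are then
    obtained by running Lloyd's algorithm on the rows of H."""
    vals, vecs = np.linalg.eigh(K)        # eigenvalues in ascending order
    H = vecs[:, -k:]                      # n x k, satisfies H^T H = I_k
    # Deterministic farthest-point initialisation of the centres.
    idx = [0]
    for _ in range(k - 1):
        d = ((H[:, None, :] - H[idx][None, :, :]) ** 2).sum(-1).min(1)
        idx.append(int(d.argmax()))
    centers = H[idx]
    for _ in range(iters):                # Lloyd iterations on rows of H
        d = ((H[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        new = np.array([H[labels == c].mean(0) if (labels == c).any()
                        else centers[c] for c in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, H
```

On a kernel with a clear two-block structure, the recovered labels coincide with the blocks.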
In multiple kernel k-means (MKKM), all samples have multiple feature representations via a group of feature mappings $\{\phi_p(\cdot)\}_{p=1}^{m}$. More specifically, each sample can be represented as $\phi_{\alpha}(x) = \left[ \alpha_1 \phi_1(x)^{\top}, \ldots, \alpha_m \phi_m(x)^{\top} \right]^{\top}$, where $\alpha = [\alpha_1, \ldots, \alpha_m]^{\top}$ are the coefficients of the base kernels. The kernel function over the above feature mappings can be written as:

$$K_{\alpha}(x_i, x_j) = \phi_{\alpha}(x_i)^{\top} \phi_{\alpha}(x_j) = \sum_{p=1}^{m} \alpha_p^2\, K_p(x_i, x_j). \tag{4}$$

Assuming that the optimal kernel matrix can be represented as $K_{\alpha} = \sum_{p=1}^{m} \alpha_p^2 K_p$, we obtain the optimization objective of MKKM as follows:

$$\min_{H, \alpha} \operatorname{Tr}\!\left( K_{\alpha} \left( I_n - H H^{\top} \right) \right), \quad \text{s.t.} \quad H^{\top} H = I_k, \; \alpha^{\top} \mathbf{1}_m = 1, \; \alpha_p \ge 0. \tag{5}$$

A two-step iterative algorithm can be used to optimize Eq. (5):
i) Optimize $\alpha$ with fixed $H$. According to the Cauchy-Schwarz inequality, we can obtain the closed-form solution

$$\alpha_p = \frac{1 / \operatorname{Tr}\!\left( K_p (I_n - H H^{\top}) \right)}{\sum_{q=1}^{m} 1 / \operatorname{Tr}\!\left( K_q (I_n - H H^{\top}) \right)}. \tag{6}$$

ii) Optimize $H$ with fixed $\alpha$. With the kernel coefficients $\alpha$ fixed, $H$ can be optimized by taking the $k$ eigenvectors corresponding to the $k$ largest eigenvalues of $K_{\alpha}$.
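The two-step procedure can be sketched as follows. This is an illustrative implementation of the standard MKKM updates under the common parameterization $K_{\alpha} = \sum_p \alpha_p^2 K_p$, not the authors' released code; all names are our own.

```python
import numpy as np

def mkkm(kernels, k, iters=20):
    """Two-step MKKM sketch: alternately update H (top-k eigenvectors of
    the combined kernel K_alpha) and the coefficients alpha (closed-form
    update derived from the Cauchy-Schwarz bound)."""
    m = len(kernels)
    n = kernels[0].shape[0]
    alpha = np.full(m, 1.0 / m)           # uniform initialisation
    for _ in range(iters):
        K_alpha = sum(a ** 2 * K for a, K in zip(alpha, kernels))
        vals, vecs = np.linalg.eigh(K_alpha)
        H = vecs[:, -k:]                  # step ii): update H, alpha fixed
        P = np.eye(n) - H @ H.T
        # step i): alpha_p proportional to 1 / Tr(K_p (I - H H^T)),
        # normalised so the coefficients sum to one.
        t = np.array([np.trace(K @ P) for K in kernels])
        t = np.maximum(t, 1e-12)          # guard against division by zero
        alpha = (1.0 / t) / (1.0 / t).sum()
    return H, alpha
```

A kernel that is well explained by the current partition (small residual trace) receives a larger coefficient, as the test below illustrates with one block-structured kernel and one uninformative identity kernel.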

Multi-View Clustering via Late Fusion Alignment Maximization
Multi-view clustering via late fusion alignment maximization [12] (MVC-LFA) assumes that the clustering indicator matrix obtained from each single view is $H_p$ ($p \in [m]$), while a set of rotation matrices $\{W_p\}_{p=1}^{m}$ is used to maximize the alignment between $H_p$ ($p \in [m]$) and the optimal clustering indicator matrix $H^{*}$. The objective function is as follows:

$$\max_{H^{*}, \{W_p\}, \beta} \operatorname{Tr}\!\left( H^{*\top} \sum_{p=1}^{m} \beta_p H_p W_p \right) + \lambda \operatorname{Tr}\!\left( H^{*\top} F \right), \quad \text{s.t.} \quad H^{*\top} H^{*} = I_k, \; W_p^{\top} W_p = I_k, \; \sum_{p=1}^{m} \beta_p^2 = 1, \; \beta_p \ge 0, \tag{7}$$

where $F$ denotes the average clustering indicator matrix and $\lambda$ is a trade-off parameter. Moreover, $\operatorname{Tr}(H^{*\top} F)$ is a regularization on the optimal clustering indicator matrix that prevents $H^{*}$ from drifting too far from the prior average partition. This formulation assumes that the optimal clustering indicator matrix $H^{*}$ is a linear combination of the base clustering indicator matrices $H_p$ ($p \in [m]$). Readers are encouraged to refer to [12] for more details regarding the optimization of Eq. (7).
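For intuition, the $W_p$-subproblem of Eq. (7) with the other variables fixed is an orthogonal Procrustes problem with the standard SVD-based solution. The helper below is our own sketch of this single building block, not the authors' implementation.

```python
import numpy as np

def lfa_rotation(Hp, Hstar):
    """With H* fixed, the rotation W_p maximising Tr(H*^T Hp W_p)
    subject to W_p^T W_p = I is the orthogonal Procrustes solution
    W_p = U V^T, where U S V^T is the SVD of Hp^T H*."""
    U, _, Vt = np.linalg.svd(Hp.T @ Hstar)
    return U @ Vt
```

If $H_p$ is an exact rotation of $H^{*}$, the recovered $W_p$ maps it back onto $H^{*}$.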
The drawback of MKKM is its excessive computational complexity, which is O(n³). Although MVC-LFA does reduce the computational complexity from O(n³) to O(n), it also over-simplifies the search space of the optimal solution, which in turn limits the clustering performance. We therefore propose a novel late fusion multi-view clustering algorithm in Section Three, which offers a trade-off between clustering efficiency and effectiveness.

Proposed Method
By taking the eigenvectors corresponding to the $k$ largest eigenvalues of each kernel in $\{K_p\}_{p=1}^{m}$, we obtain the clustering indicator matrices $\{H_p\}_{p=1}^{m} \subseteq \mathbb{R}^{n \times k}$. Each row of $H_p$ can be regarded as a feature representation of the corresponding sample in the $k$-dimensional space. To construct the consensus similarity matrix $S$, we first construct the similarity matrix of each view. After normalizing each row of $H_p$ (i.e., dividing it by its 2-norm), the 2-norm of each row of the processed $H_p$ equals 1. We then treat $S^{(p)} = H_p H_p^{\top} \in \mathbb{R}^{n \times n}$ as the similarity matrix of the $p$-th view.
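The per-view similarity construction described above can be sketched in a few lines (the helper name is our own):

```python
import numpy as np

def view_similarity(Hp):
    """Row-normalise the base clustering indicator matrix H_p and form
    the view similarity matrix S^(p) = H_p H_p^T, whose (i, j) entry is
    the cosine similarity of the spectral embeddings of samples i, j."""
    Hn = Hp / np.linalg.norm(Hp, axis=1, keepdims=True)
    return Hn @ Hn.T
```

Since every row of the normalised matrix has unit norm, the resulting similarity matrix is symmetric with unit diagonal and entries bounded by 1.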
Based on the assumption that the optimal consensus similarity matrix resides in the neighborhood of a linear combination of all similarity matrices, our model can be expressed as follows:

$$\min_{S, \alpha} \left\| S - \sum_{p=1}^{m} \alpha_p S^{(p)} \right\|_F^2, \quad \text{s.t.} \quad \alpha^{\top} \mathbf{1}_m = 1, \; \alpha_p \ge 0. \tag{8}$$

The solution to the above formula is clearly equal to a certain $S^{(p)}$, such that the multi-view information is not fully utilized. Accordingly, inspired by [9], we introduce a regularization term capable of improving the diversity of all views for learning purposes. We let $M(p, q)$ measure the correlation between the $p$-th and $q$-th views, where a larger $M(p, q)$ is associated with a higher degree of correlation. Let $M \in \mathbb{R}^{m \times m}$ be the matrix that stores the diversity information of the views, with $M_{pq} = M(p, q)$. Thus, when $M_{pq}$ is large, to avoid the information redundancy caused by similar views, we set the coefficients such that $\alpha_p \alpha_q M_{pq}$ is small. Based on the above, we add the regularization term $\lambda\, \alpha^{\top} M \alpha$ to the formula. At the same time, we aim to have $S$ satisfy the basic properties of a similarity matrix. Hence, we impose the following constraints on $S$: i) every entry of $S$ is nonnegative, i.e., $S_{ij} \ge 0$; ii) the degree of each sample in $S$ is equal to 1, i.e., $S^{\top} \mathbf{1}_n = \mathbf{1}_n$. We accordingly obtain the following formula:

$$\min_{S, \alpha} \left\| S - \sum_{p=1}^{m} \alpha_p S^{(p)} \right\|_F^2 + \lambda\, \alpha^{\top} M \alpha, \quad \text{s.t.} \quad \alpha^{\top} \mathbf{1}_m = 1, \; \alpha_p \ge 0, \; S_{ij} \ge 0, \; S^{\top} \mathbf{1}_n = \mathbf{1}_n. \tag{9}$$

The input of our model is not limited to clustering indicator matrices obtained by KKM; similar base partitions constructed by any clustering algorithm can also be used. The model developed thus far represents a new multi-view clustering framework based on late fusion. Compared with other late fusion algorithms, such as [12,13], our objective formula has three key advantages: Fewer variables need to be optimized, meaning that the total number of iterations is reduced and the algorithm converges quickly.
Other late fusion algorithms hold that the optimal clustering indicator matrix is a neighbor of the average clustering indicator matrix; however, this is a heuristic assumption that may not hold true in some cases. By contrast, our assumptions in Eq. (9) are more reasonable and will be effective in any multi-view clustering task.
We further introduce a regularization item to measure the correlation of views in order to increase both the diversity and the utilization rate of multi-view information.

Alternate Optimization
In the following, we develop an efficient algorithm for solving the optimization problem in Eq. (9). More specifically, we design a two-step algorithm that alternately solves the following subproblems: Optimize α with fixed S.
When $S$ is fixed, the optimization problem in Eq. (9) is equivalent to the following optimization problem:

$$\min_{\alpha} \alpha^{\top} \left( Z + \lambda M \right) \alpha - 2\, b^{\top} \alpha, \quad \text{s.t.} \quad \alpha^{\top} \mathbf{1}_m = 1, \; \alpha_p \ge 0, \tag{10}$$

where $Z_{pq} = \operatorname{Tr}\!\left( S^{(p)\top} S^{(q)} \right)$ and $b_p = \operatorname{Tr}\!\left( S^{\top} S^{(p)} \right)$. Moreover, since both $M$ and $Z$ are positive semidefinite, so is $Z + \lambda M$, meaning that the optimization in Eq. (10) w.r.t. $\alpha$ is a quadratic programming problem with linear constraints. We can solve this using the Matlab QP toolkit.
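A sketch of this $\alpha$-step, substituting SciPy's SLSQP solver for the Matlab QP toolkit mentioned above (function and variable names are our own illustrative choices):

```python
import numpy as np
from scipy.optimize import minimize

def update_alpha(S, S_views, M, lam):
    """alpha-step sketch (cf. Eq. 10): with S fixed, minimise
    ||S - sum_p alpha_p S^(p)||_F^2 + lam * alpha^T M alpha over the
    simplex, posed as a small QP and solved here with SLSQP."""
    m = len(S_views)
    # Quadratic form Z_pq = <S^(p), S^(q)> and linear term b_p = <S, S^(p)>.
    Z = np.array([[np.sum(Sp * Sq) for Sq in S_views] for Sp in S_views])
    b = np.array([np.sum(S * Sp) for Sp in S_views])
    obj = lambda a: a @ (Z + lam * M) @ a - 2 * b @ a
    cons = ({'type': 'eq', 'fun': lambda a: a.sum() - 1},)
    res = minimize(obj, np.full(m, 1.0 / m), bounds=[(0, None)] * m,
                   constraints=cons, method='SLSQP')
    return res.x
```

When the target $S$ coincides with one of the view similarities and the diversity term is switched off, the weight concentrates on that view, as expected.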
Optimize S with fixed α.
When $\alpha$ is fixed, the optimization problem in Eq. (9) w.r.t. $S$ can be rewritten as follows:

$$\min_{S} \left\| S - A \right\|_F^2, \quad \text{s.t.} \quad S_{ij} \ge 0, \; S^{\top} \mathbf{1}_n = \mathbf{1}_n, \tag{11}$$

where $A = \sum_{p=1}^{m} \alpha_p S^{(p)}$. Note that the problem in Eq. (11) is separable across columns, meaning that we can optimize each column of $S$ independently:

$$\min_{S_{:,j}} \left\| S_{:,j} - A_{:,j} \right\|_2^2, \quad \text{s.t.} \quad S_{ij} \ge 0, \; \mathbf{1}_n^{\top} S_{:,j} = 1. \tag{12}$$

Eq. (12) is equivalent to the following function:

$$\min_{S_{:,j}} \frac{1}{2} \left\| S_{:,j} - A_{:,j} \right\|_2^2, \quad \text{s.t.} \quad S_{ij} \ge 0, \; \mathbf{1}_n^{\top} S_{:,j} = 1. \tag{13}$$

Eq. (13) is a problem on the simplex space, and its Lagrangian function is as follows:

$$\mathcal{L}\!\left( S_{:,j}, c, s \right) = \frac{1}{2} \left\| S_{:,j} - A_{:,j} \right\|_2^2 - c \left( \mathbf{1}_n^{\top} S_{:,j} - 1 \right) - s^{\top} S_{:,j}, \tag{14}$$

where $c$ and $s$ are the Lagrangian multipliers.
According to the Karush-Kuhn-Tucker conditions [17], we can obtain the optimal solution $S^{*}$ in the form of Eq. (15):

$$S^{*}_{:,j} = \left( A_{:,j} + c^{*} \mathbf{1}_n \right)_{+}, \tag{15}$$

Here, $(x)_{+}$ is an operator that sets the elements of $x$ that are less than 0 to 0, while $c^{*} \in \mathbb{R}$ is chosen such that $\mathbf{1}_n^{\top} S^{*}_{:,j} = 1$. The algorithm is summarized in detail in Algorithm 1.

Convergence. In each iteration of our proposed algorithm, the optimizations of $\alpha$ and $S$ monotonically decrease the value of the objective in Eq. (9). Moreover, this objective is also lower-bounded by zero. Consequently, our algorithm is guaranteed to converge to a local optimum of Eq. (9).
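The column-wise update is exactly a Euclidean projection onto the probability simplex. The sketch below (our own helpers, following the well-known sort-based algorithm of Duchi et al.) finds the threshold $c^{*}$ and applies it to every column of the combined similarity matrix:

```python
import numpy as np

def project_simplex(a):
    """Project a vector a onto the probability simplex, i.e. compute
    (a + c*)_+ with the scalar c* chosen so the result sums to one
    (sort-based algorithm)."""
    u = np.sort(a)[::-1]                      # sort in descending order
    css = np.cumsum(u)
    rho = np.nonzero(u + (1 - css) / np.arange(1, len(a) + 1) > 0)[0][-1]
    c = (1 - css[rho]) / (rho + 1)
    return np.maximum(a + c, 0)

def update_S(S_views, alpha):
    """S-step sketch: each column of A = sum_p alpha_p S^(p) is projected
    onto the simplex independently, giving a nonnegative S whose columns
    sum to one."""
    A = sum(a * Sp for a, Sp in zip(alpha, S_views))
    return np.column_stack([project_simplex(A[:, j])
                            for j in range(A.shape[1])])
```

The projection is exact and cheap ($O(n \log n)$ per column from the sort), which is what keeps the $S$-step efficient.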

Comparison with State-of-the-Art Algorithms
In our experiments, LF-MVCS is compared with several state-of-the-art multi-view clustering methods, as outlined below.
Average multiple kernel k-means (A-MKKM): The kernel matrix, which is a linear combination of the base kernels with the same coefficients, is taken as the input of the standard kernel K-means algorithm.
Single best kernel k-means (SB-KKM): Standard kernel k-means is performed on each single kernel, after which the best result is reported.
Multiple kernel k-means (MKKM) [7]: The algorithm jointly optimizes the consensus clustering indicator matrix and the kernel coefficients.
Robust multiple kernel k-means (RMKKM) [18]: This algorithm learns a robust low-rank kernel for clustering by capturing the structure of noise in multiple kernels.
Multiple kernel k-means with matrix-induced regularization (MKKM-MR) [9]: This algorithm introduces a matrix-induced regularization designed to reduce the redundancy of the kernels and enhance their diversity.
Optimal neighborhood kernel clustering with multiple kernels (ONKC) [11]: ONKC allows the optimal kernel to reside in the neighborhood of a linear combination of the base kernels, thereby effectively enlarging the search space of the optimal kernel.
Multi-view clustering via late fusion alignment maximization (MVC-LFA) [12]: MVC-LFA proposes maximally aligning the consensus clustering indicator matrix with the weighted base clustering indicator matrix.
Simple multiple kernel k-means (SMKKM) [19]: SimpleMKKM, or SMKKM, re-formulates the MKKM problem as a minimization-maximization problem in the kernel coefficients and the consensus clustering indicator matrix.

Experimental Settings
In our experiments, all base kernels are first centered and then normalized; therefore, for all samples $x_i$ and the $p$-th kernel, we have $K_p(x_i, x_i) = 1$. For all datasets, the true number of clusters is known, and we set this to be the true number of classes. In addition, the parameters of RMKKM, MKKM-MR and MVC-LFA are selected by grid search according to the methods suggested in their respective papers. For the proposed algorithm, the regularization parameter $\lambda$ is chosen from the sufficiently large range $\{2^{-15}, 2^{-12}, \ldots, 2^{15}\}$ by means of grid search. We set the allowable error $\epsilon$ to $1 \times 10^{-4}$, which is the termination condition of Algorithm 1.
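The kernel preprocessing described here can be sketched as follows; this is a minimal illustration of centring in feature space followed by diagonal normalisation, and the paper's exact implementation may differ in detail.

```python
import numpy as np

def center_normalize(K):
    """Centre a kernel matrix in feature space (K -> J K J with the
    centring matrix J = I - 11^T/n), then rescale so that every
    diagonal entry (self-similarity) equals one."""
    n = K.shape[0]
    J = np.eye(n) - np.full((n, n), 1.0 / n)   # centring matrix
    Kc = J @ K @ J
    d = np.sqrt(np.maximum(np.diag(Kc), 1e-12))
    return Kc / np.outer(d, d)
```

After this step every base kernel has unit self-similarity, which puts kernels of different scales on an equal footing before fusion.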
The widely used metrics of clustering accuracy (ACC), normalized mutual information (NMI) and purity are applied to evaluate the clustering performance of each algorithm. For all algorithms, we repeat each experiment 50 times with random initialization (to reduce randomness caused by k-means in the final spectral clustering) and report the average result. All experiments are performed on a desktop with Intel(R) Core(TM)-i7-7820X CPU and 64 GB RAM.
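Of the three metrics, ACC requires matching predicted cluster labels to ground-truth classes. A common implementation uses the Hungarian algorithm, sketched here with SciPy (the helper name is our own):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_acc(y_true, y_pred):
    """Clustering accuracy (ACC): find the best one-to-one mapping
    between predicted clusters and ground-truth classes with the
    Hungarian algorithm, then report the fraction of correctly
    mapped samples."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                   # co-occurrence counts
    row, col = linear_sum_assignment(-cost)  # negate to maximise matches
    return cost[row, col].sum() / len(y_true)
```

Because cluster indices are arbitrary, a perfect partition with permuted labels still scores 1.0 under this matching.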

Experimental Results
Tab. 2 reports the clustering performance of the above-mentioned algorithms on all datasets. From these results, we can make the following observations: First, the proposed algorithm consistently demonstrates the best NMI on all datasets. For example, it exceeds the second-best method (MVC-LFA) by over 8.56% on Flower17. The superiority of our algorithm is also confirmed by the ACC and purity results reported in Tab. 2. As a strong baseline, the recently proposed SMKKM [19] outperforms the other early-fusion multiple kernel clustering methods in the comparison. However, the proposed algorithm significantly and consistently outperforms SMKKM by 9.80%, 3.11%, 2.27%, 10.21%, 3.02%, 6.52%, 4.66%, 4.72%, 7.11% and 5.05% in terms of NMI on all datasets, in the order listed in Tab. 1.
We further study the clustering performance on Cal for each algorithm depending on the number of classes, as shown in Fig. 1. As can be observed from the figure, our algorithm is consistently at the top of all sub-figures when the number of classes varies, indicating that it achieves the best performance.
In summary, our proposed LF-MVCS demonstrates superior clustering performance over all SOTA multi-view clustering algorithms. From the above experiments, we can therefore conclude that the proposed algorithm effectively learns the information from all base clustering indicator matrices, bringing significant improvements to the clustering performance.

Runtime, Convergence and Parameter Sensitivity
Runtime. To investigate the computational efficiency of the proposed algorithm, we record the running times of the various algorithms on the benchmark datasets and report them in Fig. 2. The computational complexity of the proposed LF-MVCS is O(n²); as can be observed, LF-MVCS has slightly higher computational cost than the existing MVC-LFA (with a computational complexity of O(n)). At the same time, however, the runtime of LF-MVCS is lower than that of the other kernel-based multi-view clustering algorithms.
Algorithm Convergence. We next examine the convergence of our algorithm empirically. Fig. 3 shows the variation of the objective in Eq. (9) with the number of iterations on four datasets: Flower17, ProteinFold, Digit, and Cal-30. It can be seen that the objective function converges within very few iterations, demonstrating that having fewer optimization variables allows the algorithm to converge faster, which makes our algorithm more efficient than other algorithms.

Parameter Sensitivity. The proposed algorithm introduces one hyper-parameter, namely the balancing coefficient λ of the diversity regularization term. The performance variation against λ is illustrated in Fig. 4. As the figure shows: i) λ is effective in improving the algorithm performance; ii) the algorithm achieves good performance over a wide range of parameter settings.

Conclusion
This paper proposes a simple but effective algorithm, LF-MVCS, which learns a consensus similarity matrix from all base clustering indicator matrices. We observe that each base clustering indicator matrix provides a better feature representation than the original features and kernels. Based on this observation, we derive a simple novel optimization objective for multi-view clustering, which learns a consensus similarity matrix to facilitate better clustering performance. An efficient algorithm is developed to solve the resulting optimization problem. LF-MVCS significantly outperforms all SOTA methods in terms of clustering performance. Moreover, our algorithm also reduces the computational cost on benchmark datasets when compared with other kernel-based multi-view clustering methods. In future work, we plan to extend our algorithm to a general framework, then use this framework as a platform to revisit existing multi-view algorithms in order to further improve the fusion method.

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.