Consensus Kernel K-Means Clustering for Incomplete Multiview Data

Multiview clustering aims to improve clustering performance through optimal integration of information from multiple views. Though demonstrating promising performance in various applications, existing multiview clustering algorithms cannot effectively handle the view's incompleteness. Recently, one pioneering work was proposed that handled this issue by integrating multiview clustering and imputation into a unified learning framework. While its framework is elegant, we observe that it overlooks the consistency between views, which leads to a reduction in the clustering performance. In order to address this issue, we propose a new unified learning method for incomplete multiview clustering, which simultaneously imputes the incomplete views and learns a consistent clustering result with explicit modeling of between-view consistency. More specifically, the similarity between each view's clustering result and the consistent clustering result is measured. The consistency between views is then modeled using the sum of these similarities. Incomplete views are imputed to achieve an optimal clustering result in each view, while maintaining between-view consistency. Extensive comparisons with state-of-the-art methods on both synthetic and real-world incomplete multiview datasets validate the superiority of the proposed method.


Introduction
The term "multiview data" refers to data that have different sources or modalities. Each source or modality is considered as one "view," and different views have different physical meanings and statistical properties. For example, a web page can be described by the pictures and text it contains, while a news story may be reported by different sites, each with its own viewpoint. A significant number of studies have investigated learning from multiple views [1,2]. Multiview clustering, which is one component of multiview learning, aims at grouping samples by utilizing information from different views. Extensive research has been conducted into multiview clustering; existing methods can be roughly categorized into early fusion approaches and late fusion approaches. Early fusion approaches fuse the multiview information in an early stage of the process and then perform clustering [3-9], while late fusion approaches group data by fusing previously clustered results from separate views [10,11].
However, in real-world applications, some views may be incomplete for a variety of reasons, which hurts the clustering performance of multiview data. For example, in the context of patient grouping, the data from different tests can serve as different views. If a test is too expensive, some patients may be unable to afford it, which leads to an incomplete view for this particular test. Similarly, in webpage clustering, image data and text data are two modalities that represent a page; however, some pages may not contain any images, which makes the data for the image view incomplete.
Existing studies of incomplete multiview clustering can be roughly divided into two categories: subspace methods and imputation methods. The method outlined in [12], which was the first subspace method for incomplete multiview clustering, learns the common subspace of two views via nonnegative matrix factorization. Several variants of this method were proposed following its introduction. In [13], feature learning is integrated into the subspace learning process and the assumption that the data is nonnegative is not required.

Computational Intelligence and Neuroscience
The method proposed in [14] learns a latent global graph representation and the subspace simultaneously by adding a novel Laplacian graph regularization term. The other important category of method for incomplete multiview clustering is imputation methods, which handle incomplete views by filling in the missing parts. The method proposed in [15] fills the kernel of an incomplete view according to the Laplacian regularization of the other complete view. Subsequently, the method proposed in [16] tackles the situation where two views are incomplete by alternately updating one view according to the other view. In [17], the incomplete views are imputed via low rank decomposition. As different views are assumed to be generated from a shared subspace, the data matrices of different views can be decomposed using a common factor. Most of these imputation methods simply execute a conventional multiview clustering algorithm after filling the incomplete views. Most recently, a method was proposed in [18], whereby the imputation is not separated from the multiview clustering process. More specifically, the imputation and the multiple kernel clustering are integrated into a unified procedure for better clustering performance.
Integrating imputation and multiview clustering into a unified learning process makes the imputation better serve the clustering objective. This advantage helps the method in [18] to outperform other methods that perform imputation and clustering separately. However, the disadvantage of the method in [18] is that the multiview clustering solution it proposes overlooks the consistency between views, which may reduce the final clustering performance. In [18], multiview clustering is achieved by learning a linear combination of kernels that reaches the optimal kernel k-means clustering result. Consequently, the linear combination to build the best kernel for clustering is learned without considering the relationships between views. Similarly, the imputation is guided only by the clustering objective and the consistency between views is neglected. However, the consistency between views is one of the inherent properties of multiview data [1]; if this critical property is not considered, the learning of the linear combination of kernels and the imputation in [18] may lead to poor clustering performance. Previous research into multiview clustering has shown that considering the consistency between views helps to boost the performance of multiview clustering [3]. In this study, we wish to build on the advances made in [18] while also considering the consistency between views in order to further improve clustering performance. Therefore, we propose a novel incomplete multiview clustering method that simultaneously fills the incomplete kernels from incomplete views and learns a consistent clustering result. To model the between-view consistency, the similarity between the consistent clustering result and the clustering result of each view is calculated. The consistency between views is measured by the sum of these similarities.
The missing parts of kernels and the consistent clustering result are learned in order to achieve the optimal clustering result in each view while keeping consistency between views. Here, the learning process considers both the data structures within views and the consistent relations between views, which benefits the multiview clustering performance. The proposed objective function is then solved by alternately optimizing a subset of the variables while the others are fixed. Each subproblem either can be solved by means of eigenvector decomposition or has a closed-form solution. To evaluate the performance of the proposed method, we compare it with state-of-the-art methods on three synthetic and one real-world incomplete multiview datasets. Empirical results validate the superiority of the proposed method for incomplete multiview clustering.
The main contributions of this paper can be summarized as follows: (1) We propose a novel incomplete multiview clustering method, which simultaneously learns a consistent clustering decision and fills the incomplete kernels from incomplete views with explicit modeling of between-view consistency.
(2) We design an alternating optimization algorithm to solve the proposed method's optimization problem. Here, the optimization problem is divided into three subproblems, each of which either can be solved by means of eigenvector decomposition or has a closed-form solution.
(3) We also provide thorough convergence analysis of the alternating optimization algorithm, including theoretical proof and empirical validations.

The Proposed Method
Regarding the consistency, we propose that a consistent clustering decision be learned that is similar to each view's kernel k-means clustering result. To handle the incomplete views, we simultaneously fill the incomplete views and learn the consistent clustering decision. In the following subsections, we first introduce the notation used in problem formulation, after which kernel k-means is briefly reviewed. We then outline how a consistent decision might be found. Next, we introduce the objective function of our method to explain how the kernel filling and decision learning processes are integrated. Finally, we analyse the convergence of the proposed algorithm.

Notation.
Assume that the multiview data contain $n$ samples and $m$ views. For clarity, a sample's information in a view is referred to as an instance of the sample in this paper. For incomplete multiview data, a sample's instance in a view could be missing. $S$ is an $n \times m$ zero-one matrix that indicates which instances are missing; when $S_{ip} = 0$, sample $i$'s instance in view $p$ is missing. $S_p$ denotes the $p$th column of $S$. Because our method is based on kernel k-means, we assume that the input multiview data is kernel data. For each view $p$, we have an $n \times n$ kernel matrix $K_p$. The details of how kernel data are built can be found in Section 3.1, where the datasets used in this paper are introduced. In a view $p$, some instances may be missing, which will lead to an incomplete kernel $K_p$. To describe the visible and missing parts of the incomplete kernel $K_p$, we define an operator $K(\mathrm{row}, \mathrm{col})$, which selects the corresponding rows and columns of $K$ according to the zero-one vectors row and col. Figure 1 shows a simple example of the notation with three samples.
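
The selection operator can be illustrated with a minimal NumPy sketch. The function name `select` and the use of `1 - s` as the mask of missing instances are assumptions made for this illustration, not notation taken from the paper.

```python
import numpy as np

def select(K, row, col):
    """Return the block of K picked out by the zero-one vectors row and col."""
    row = np.asarray(row, dtype=bool)
    col = np.asarray(col, dtype=bool)
    return K[np.ix_(row, col)]

# Toy kernel for three samples; the second sample is missing in this view.
K = np.arange(9, dtype=float).reshape(3, 3)
s = np.array([1, 0, 1])             # one column of S (1 = visible, 0 = missing)

visible = select(K, s, s)           # block over the visible samples 0 and 2
missing = select(K, 1 - s, 1 - s)   # block over the missing sample 1
```

Here `visible` is the 2 x 2 block of known kernel entries, while `missing` and the two off-diagonal blocks are the entries that an imputation method has to fill.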

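Kernel k-Means.
Here, kernel k-means refers to the k-means method developed for kernel data. Define a mapping $\phi$ from the input space $\mathcal{X}$ to a reproducing kernel Hilbert space $\mathcal{H}$, and let $Z \in \{0,1\}^{n \times k}$ be the cluster indicator matrix, where $Z_{ic} = 1$ indicates that sample $i$ is in cluster $c$. The k-means objective in kernel space is as follows:
$$\min_{Z} \sum_{i=1}^{n} \sum_{c=1}^{k} Z_{ic} \left\| \phi(x_i) - \mu_c \right\|^2 \quad \text{s.t.} \quad \sum_{c=1}^{k} Z_{ic} = 1, \tag{1}$$
where $\mu_c$ is the average of the samples in cluster $c$. The number of samples in cluster $c$ is $n_c$, so that, with $L = \mathrm{diag}(n_1^{-1}, \ldots, n_k^{-1})$, the equivalent matrix form of (1) is as follows:
$$\min_{Z} \ \mathrm{tr}(K) - \mathrm{tr}\left(L^{1/2} Z^\top K Z L^{1/2}\right) \quad \text{s.t.} \quad Z \mathbf{1}_k = \mathbf{1}_n, \tag{2}$$
where $\mathrm{tr}(\cdot)$ is the trace operator and $\mathbf{1}_\ell$ is an $\ell$-length vector in which all elements are 1.
The discreteness of $Z$ makes (2) difficult to solve. An approximated problem that is easier to solve can be arrived at by relaxing the discreteness constraints on $Z$. By denoting $U = ZL^{1/2}$, the approximated problem can be expressed as follows:
$$\min_{U} \ \mathrm{tr}\left(K (I_n - U U^\top)\right) \quad \text{s.t.} \quad U \in \mathbb{R}^{n \times k},\ U^\top U = I_k. \tag{3}$$
The optimal $U$ can be obtained from the eigenvectors corresponding to the $k$ largest eigenvalues of $K$ [9]. Although $U$ contains the cluster indicator information, k-means should be performed on $U$ to recover the actual clustering label.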
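
As a rough illustration of the relaxed problem, the following sketch takes the eigenvectors of the $k$ largest eigenvalues of a kernel matrix and evaluates the relaxed objective $\mathrm{tr}(K) - \mathrm{tr}(U^\top K U)$. The function name and the toy linear kernel are assumptions for this example; recovering discrete labels would still require running k-means on the rows of `U`, which is omitted here.

```python
import numpy as np

def relaxed_kernel_kmeans(K, k):
    """Return U (top-k eigenvectors of K) and the relaxed objective value."""
    K = (K + K.T) / 2.0                      # guard against slight asymmetry
    eigvals, eigvecs = np.linalg.eigh(K)     # eigenvalues in ascending order
    U = eigvecs[:, -k:]                      # eigenvectors of the k largest
    objective = np.trace(K) - np.trace(U.T @ K @ U)
    return U, objective

# Toy data: two well-separated groups and a linear kernel (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (5, 2)), rng.normal(3.0, 0.1, (5, 2))])
K = X @ X.T
U, obj = relaxed_kernel_kmeans(K, 2)
```

Because this toy kernel has rank at most two, the two leading eigenvectors capture all of its trace and the relaxed objective is numerically zero.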

Finding the Consistent Decision.
To take the consistency between views into account, we propose to find a consistent clustering decision according to the clustering results of different views. Suppose $U_p$ is the eigenvector matrix found by kernel k-means in view $p$. $U_p$, while not the actual clustering label of view $p$, does store the cluster information. Accordingly, we can find a matrix $U^*$ that is consistent with all $U_p$ and then recover the final decision from $U^*$.
To find the consistent $U^*$, it is necessary to define the similarity between $U^*$ and $U_p$. Inspired by [3,19], the similarity is defined as
$$\mathrm{sim}(U_p, U^*) = -\left\| U_p U_p^\top - U^* U^{*\top} \right\|_F^2, \tag{4}$$
where $\|\cdot\|_F$ is the Frobenius norm. Adding the regularization $U^{*\top} U^* = I$ on $U^*$ and recalling that $U_p^\top U_p = I$, we have, up to additive and multiplicative constants,
$$\mathrm{sim}(U_p, U^*) = \mathrm{tr}\left(U_p U_p^\top U^* U^{*\top}\right). \tag{5}$$
There may be other possible definitions of similarity between $U_p$ and $U^*$. However, (4) is chosen because it allows an easy alternating optimization for the proposed method. Since the consistent decision is expected to be similar to the kernel k-means result of each view, we maximize the sum of similarities to find the consistent decision as follows:
$$\max_{U^*} \sum_{p=1}^{m} \mathrm{sim}(U_p, U^*) \quad \text{s.t.} \quad U^{*\top} U^* = I. \tag{6}$$
It is notable that each view is considered to be equally important in (6). If the importance of each view is prior knowledge, we can weigh the views differently and adapt (6) to a weighted sum of similarities. However, in this paper, we maintain the same weight for all views for model simplicity.
Although learning the accurate weights of views is valuable under circumstances where there are some views with heavy noise, it is beyond the scope of this paper.
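
A minimal sketch of how a consistent $U^*$ could be computed when all views are complete, assuming the similarity takes the trace form $\mathrm{tr}(U_p U_p^\top U^* U^{*\top})$ stated with the objective function. Under that assumption the sum of similarities equals $\mathrm{tr}(U^{*\top} (\sum_p U_p U_p^\top) U^*)$, so the maximizer is given by the leading eigenvectors of $\sum_p U_p U_p^\top$. The function names are hypothetical.

```python
import numpy as np

def similarity(U, U_star):
    """Trace-form similarity tr(U U^T U* U*^T) between two orthonormal bases."""
    return np.trace(U @ U.T @ U_star @ U_star.T)

def consensus_embedding(U_list, k):
    """Top-k eigenvectors of sum_p U_p U_p^T, maximizing the summed similarity."""
    M = sum(U @ U.T for U in U_list)
    _, eigvecs = np.linalg.eigh(M)           # ascending eigenvalues
    return eigvecs[:, -k:]

# Toy check: if two views agree exactly, the consensus spans the same subspace.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.normal(size=(6, 2)))  # random orthonormal 6 x 2 basis
U_star = consensus_embedding([U, U], k=2)
sim_value = similarity(U, U_star)
```

In this toy case the similarity reaches its largest possible value, the number of clusters $k = 2$, because both views share one subspace.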

Objective Function.
If all views are complete, it is easy to find the consistent decision by maximizing (6). When some views are incomplete, however, we need to fill the corresponding kernels of those views for kernel k-means. We expect that these filled kernels will lead to better clustering in each view and a consistent decision. In other words, the kernel filling is guided by both the clustering objective in each view and the consistency between views, so the filling procedure considers both the data structure within each view and the relationship between views. To achieve this, we propose the objective function as follows:
$$\min_{\{U_p\},\, U^*,\, \{K_p\}} \ \sum_{p=1}^{m} \mathrm{tr}\left[K_p (I - U_p U_p^\top)\right] - \lambda \sum_{p=1}^{m} \mathrm{sim}(U_p, U^*)$$
$$\text{s.t.} \quad U_p^\top U_p = I, \quad U^{*\top} U^* = I, \quad K_p(S_p, S_p) = K_p^{o}(S_p, S_p), \quad K_p \succeq 0, \quad p = 1, \ldots, m, \tag{7}$$
where $\mathrm{sim}(U_p, U^*) = \mathrm{tr}(U_p U_p^\top U^* U^{*\top})$. $K_p$ is the kernel that needs to be learned, which should be positive semidefinite. $K_p^{o}(S_p, S_p)$ is the visible part of the original kernel data $K_p^{o}$. The third constraint forces $K_p(S_p, S_p)$ to be the same as the original kernel data. However, $K_p(\bar{S}_p, \bar{S}_p)$, $K_p(\bar{S}_p, S_p)$, and $K_p(S_p, \bar{S}_p)$, where $\bar{S}_p = \mathbf{1} - S_p$ marks the missing instances of view $p$, still need to be optimized. It is notable that the objective function consists of two parts. $\sum_{p=1}^{m} \mathrm{tr}[K_p (I - U_p U_p^\top)]$ is the sum of the kernel k-means objectives in each view, and $\sum_{p=1}^{m} \mathrm{sim}(U_p, U^*)$ is the term designed to model between-view consistency. A parameter $\lambda$ is added to balance the importance of single-view clustering performance and the consistency between views.
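
To make the two parts of the objective concrete, the sketch below evaluates the sum of per-view kernel k-means terms minus the weighted consistency term for given kernels and embeddings. It assumes that the balancing parameter multiplies the consistency term and enters with a negative sign in a minimization problem, which matches the description of a lower-bounded, nonincreasing objective later in the text; the name `lam` and the function name are illustrative.

```python
import numpy as np

def objective(K_list, U_list, U_star, lam):
    """Sum over views of tr[K_p (I - U_p U_p^T)] - lam * tr(U_p U_p^T U* U*^T)."""
    n = U_star.shape[0]
    P_star = U_star @ U_star.T
    total = 0.0
    for K, U in zip(K_list, U_list):
        P = U @ U.T
        clustering_term = np.trace(K @ (np.eye(n) - P))
        consistency_term = np.trace(P @ P_star)
        total += clustering_term - lam * consistency_term
    return total

# Tiny worked case: identity kernel, 4 samples, 2 clusters, 2 identical views.
I4 = np.eye(4)
U = I4[:, :2]
value = objective([I4, I4], [U, U], U, lam=0.5)
```

For each view in this toy case the clustering term is 2 and the similarity is 2, so each view contributes 2 - 0.5 * 2 = 1 and the total is 2.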

Remark 1.
Like the method proposed in [18], our method simultaneously fills incomplete kernels and performs multiview clustering. However, there are also major differences between the two methods. In [18], multiview clustering is achieved by learning the best combination of kernels for the best clustering performance, which overlooks the consistency between views. In contrast, our method learns a consensus clustering decision from each view's kernel k-means result, which explicitly models the consistency. More importantly, our method does not simply revise the method in [18] incrementally by adding a consistency regularization term; instead, we propose a new objective function that inherits the advantages of simultaneously performing imputation and multiview clustering.

Remark 2.
The strategy of learning a consistent clustering decision was also applied in [3,19]. The former work is based on spectral clustering, while the latter is based on kernel k-means. However, these works cannot deal with the incomplete multiview situation.

Optimization.
Optimizing all variables of (7) in one step is difficult. Instead, we develop an algorithm in which $U_p$, $U^*$, and $K_p$ are optimized alternately. The optimal solutions of the subproblems can be found easily, and the whole alternating updating process is guaranteed to converge to a local minimum.

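Updating $U_p$.
When we only optimize $U_p$, the subproblem has a similar form to kernel k-means and can be solved by means of eigenvalue decomposition in a similar way. With $\lambda$ denoting the balancing parameter of (7), the subproblem of updating $U_p$ is as follows:
$$\max_{U_p} \ \mathrm{tr}\left(U_p^\top \left(K_p + \lambda U^* U^{*\top}\right) U_p\right) \quad \text{s.t.} \quad U_p^\top U_p = I. \tag{8}$$

Updating $U^*$.
Similarly, the subproblem of updating $U^*$ can be solved by means of eigenvalue decomposition after reformulation. The subproblem of updating $U^*$ is as follows:
$$\max_{U^*} \ \sum_{p=1}^{m} \mathrm{tr}\left(U_p U_p^\top U^* U^{*\top}\right) \quad \text{s.t.} \quad U^{*\top} U^* = I. \tag{9}$$
Equation (9) is equivalent to the following optimization problem:
$$\max_{U^*} \ \mathrm{tr}\left(U^{*\top} \Big(\sum_{p=1}^{m} U_p U_p^\top\Big) U^*\right) \quad \text{s.t.} \quad U^{*\top} U^* = I. \tag{10}$$

Updating $K_p$.
With $V_p = I - U_p U_p^\top$ and $K_p^{o}$ denoting the original kernel data of view $p$, the subproblem of updating $K_p$ is as follows:
$$\min_{K_p} \ \mathrm{tr}(K_p V_p) \quad \text{s.t.} \quad K_p(S_p, S_p) = K_p^{o}(S_p, S_p), \quad K_p \succeq 0. \tag{11}$$
Because $K_p$ is positive semidefinite, $K_p$ can be decomposed as $A_p A_p^\top$, where $A_p$ is a real matrix with $n$ rows [15,18]. If we obtain $A_p$, $K_p$ can be recovered. For clarity, we divide $A_p$ into two parts: $A_p^{V} = A_p(S_p, \mathbf{1})$ and $A_p^{M} = A_p(\bar{S}_p, \mathbf{1})$, where $\bar{S}_p = \mathbf{1} - S_p$ and the all-ones vector selects every column. $A_p^{V}$ is selected according to the indexes of the visible instances in view $p$, and $A_p^{M}$ is selected according to the indexes of the missing instances in view $p$. Therefore, after ordering the visible instances first, the kernel matrix of view $p$ can be divided into four parts as follows:
$$K_p = \begin{bmatrix} K_p^{VV} & K_p^{VM} \\ K_p^{MV} & K_p^{MM} \end{bmatrix} = \begin{bmatrix} A_p^{V} A_p^{V\top} & A_p^{V} A_p^{M\top} \\ A_p^{M} A_p^{V\top} & A_p^{M} A_p^{M\top} \end{bmatrix}. \tag{12}$$
It is notable that $K_p^{VV}$ is the only visible part. The $n \times n$ matrix $V_p$ can be divided into four corresponding parts in a similar way to $K_p$. According to the first constraint in (11), $A_p^{V} A_p^{V\top} = K_p^{o}(S_p, S_p)$. To obtain $A_p^{M}$, we have a problem equivalent to (11) as follows:
$$\min_{A_p^{M}} \ \mathrm{tr}\left( \begin{bmatrix} A_p^{V} \\ A_p^{M} \end{bmatrix}^\top \begin{bmatrix} V_p^{VV} & V_p^{VM} \\ V_p^{MV} & V_p^{MM} \end{bmatrix} \begin{bmatrix} A_p^{V} \\ A_p^{M} \end{bmatrix} \right). \tag{13}$$
Taking the derivative of (13) with respect to $A_p^{M}$ and setting it to zero, we can obtain the closed-form solution for $A_p^{M}$:
$$A_p^{M} = -\left(V_p^{MM}\right)^{-1} V_p^{MV} A_p^{V}. \tag{14}$$
By denoting $V_p^{VM} (V_p^{MM})^{-1}$ as $\hat{V}_p$, the missing parts of $K_p$ can be calculated as
$$K_p^{VM} = -K_p^{VV} \hat{V}_p, \qquad K_p^{MV} = -\hat{V}_p^\top K_p^{VV}, \qquad K_p^{MM} = \hat{V}_p^\top K_p^{VV} \hat{V}_p. \tag{15}$$
The overall optimization process is summarized in Algorithm 1.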
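
The alternating updates can be sketched in NumPy as follows. This is a hedged illustration rather than the paper's Algorithm 1: the initialization from zero-filled kernels, the order of the three updates, the fixed iteration count, the use of a pseudo-inverse in place of a plain inverse, and all function names are assumptions. The kernel fill assumes $V_p = I - U_p U_p^\top$ and the block formulas obtained by minimizing $\mathrm{tr}(K_p V_p)$ over the missing blocks.

```python
import numpy as np

def top_k_eigvecs(M, k):
    """Eigenvectors of the k largest eigenvalues of a symmetric matrix."""
    _, vecs = np.linalg.eigh((M + M.T) / 2.0)
    return vecs[:, -k:]

def fill_kernel(K, U, visible):
    """Closed-form fill of the missing blocks of K for fixed U."""
    v = np.asarray(visible, dtype=bool)
    m = ~v
    if not m.any():
        return K.copy()
    V = np.eye(K.shape[0]) - U @ U.T
    K_vv = K[np.ix_(v, v)]
    # B = (V_mm)^{-1} V_mv; pinv is used so a singular V_mm does not crash.
    B = np.linalg.pinv(V[np.ix_(m, m)]) @ V[np.ix_(m, v)]
    K_new = K.copy()
    K_new[np.ix_(m, v)] = -B @ K_vv
    K_new[np.ix_(v, m)] = -(B @ K_vv).T
    K_new[np.ix_(m, m)] = B @ K_vv @ B.T
    return K_new

def objective(K_list, U_list, U_star, lam):
    """Clustering terms minus lam times the consistency terms."""
    n = U_star.shape[0]
    P_star = U_star @ U_star.T
    return sum(np.trace(K @ (np.eye(n) - U @ U.T))
               - lam * np.trace(U @ U.T @ P_star)
               for K, U in zip(K_list, U_list))

def consensus_kkm(K_list, S, k, lam, n_iter=30):
    """K_list: zero-filled kernels; S: n x m zero-one visibility matrix."""
    K_list = [K.copy() for K in K_list]
    U_list = [top_k_eigvecs(K, k) for K in K_list]
    history = []
    for _ in range(n_iter):
        # Update U*: leading eigenvectors of the summed projection matrices.
        U_star = top_k_eigvecs(sum(U @ U.T for U in U_list), k)
        P_star = U_star @ U_star.T
        # Update each U_p: leading eigenvectors of K_p + lam * U* U*^T.
        U_list = [top_k_eigvecs(K + lam * P_star, k) for K in K_list]
        # Update each K_p: closed-form fill of its missing blocks.
        K_list = [fill_kernel(K, U, S[:, p])
                  for p, (K, U) in enumerate(zip(K_list, U_list))]
        history.append(objective(K_list, U_list, U_star, lam))
    return U_star, K_list, history
```

The returned `U_star` still has to be clustered, for example by k-means on its rows, to obtain labels. Tracking `history` gives a simple way to watch the objective settle from one sweep to the next.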

Convergence Property.
In this subsection, we provide a theoretical proof of the convergence of the proposed optimization algorithm. First, we need to prove that the objective value of (7) is lower-bounded.
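Proof. According to the definitions of the Frobenius norm and the trace, we have the following:
$$\left\| U_p U_p^\top - U^* U^{*\top} \right\|_F^2 = \mathrm{tr}\left(U_p U_p^\top U_p U_p^\top\right) - 2\,\mathrm{tr}\left(U_p U_p^\top U^* U^{*\top}\right) + \mathrm{tr}\left(U^* U^{*\top} U^* U^{*\top}\right) \geq 0.$$
Following the constraints in (7), we have $U_p^\top U_p = I$ and $U^{*\top} U^* = I$, so that both outer trace terms equal the number of clusters $k$ and hence $\mathrm{tr}(U_p U_p^\top U^* U^{*\top}) \leq k$. According to Lemmas 3 and 4, the objective value of (7) is lower-bounded. Moreover, because we obtain the optimal solution to the corresponding subproblem in each step of the alternate updating, the objective value of (7) is nonincreasing during this process. Since the objective value is lower-bounded and nonincreasing, the alternate updating algorithm is guaranteed to converge.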
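
The two bounds behind the convergence argument can be checked numerically with a small sketch. It assumes the objective has the form of per-view terms $\mathrm{tr}[K_p (I - U_p U_p^\top)]$ minus a weighted trace-form similarity; the random matrices below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 8, 3

def random_orthonormal(n, k, rng):
    """Random n x k matrix with orthonormal columns (reduced QR)."""
    Q, _ = np.linalg.qr(rng.normal(size=(n, k)))
    return Q

A = rng.normal(size=(n, n))
K = A @ A.T                                   # random positive semidefinite kernel
U = random_orthonormal(n, k, rng)
U_star = random_orthonormal(n, k, rng)

# Clustering term: trace of a PSD kernel times the projector I - U U^T.
clustering_term = np.trace(K @ (np.eye(n) - U @ U.T))
# Consistency term: squared Frobenius norm of U^T U*, which lies in [0, k].
sim = np.trace(U @ U.T @ U_star @ U_star.T)
```

Since the clustering term is nonnegative and the similarity is at most $k$, an objective of this form cannot fall below $-\lambda m k$ for $m$ views.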

Datasets.
One incomplete multiview dataset and three complete multiview datasets are used in the experiments, as shown in Table 1. 3 Sources, the incomplete multiview dataset, has been compiled from three news sites: BBC, Reuters, and the Guardian. The dataset contains 416 news stories, and articles for some stories are missing from each site. More information about 3 Sources can be found in Table 2. Artificial incomplete multiview data are generated from complete multiview datasets using a random missing mechanism. The details of the generating process can be found in Section 3.3. For Digital (https://github.com/HoiYe/DigitalDataset), Flower 17 (http://www.robots.ox.ac.uk/~vgg/data/flowers/17/), and Flower 102 (http://www.robots.ox.ac.uk/~vgg/data/flowers/102/), precomputed kernel matrices are used. As for 3 Sources, we generate Gaussian kernels with widths set as the mean of sample pair distances.
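
One possible reading of the Gaussian kernel construction is sketched below. The exact functional form is not stated in the text, so the choice $\exp(-d^2 / (2\sigma^2))$ with $\sigma$ equal to the mean Euclidean distance over distinct sample pairs is an assumption, as is the function name.

```python
import numpy as np

def gaussian_kernel_mean_width(X):
    """Gaussian kernel whose width is the mean pairwise Euclidean distance."""
    sq_norms = np.sum(X ** 2, axis=1)
    D2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    D2 = np.maximum(D2, 0.0)                 # clip tiny negatives from rounding
    D = np.sqrt(D2)
    n = X.shape[0]
    sigma = D[np.triu_indices(n, k=1)].mean()   # mean over distinct pairs
    return np.exp(-D2 / (2.0 * sigma ** 2))

# Two points at distance 5: the width is 5 and the off-diagonal is exp(-0.5).
X = np.array([[0.0, 0.0], [3.0, 4.0]])
K = gaussian_kernel_mean_width(X)
```

For an incomplete view, such a kernel can only be computed over the samples whose instances are present; the remaining rows and columns are the ones to be imputed.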

Compared Methods.
The proposed method is compared with three state-of-the-art methods including one of the latest imputation methods and two representative subspace methods. The best clustering result of a single view and the multiview clustering result with zero-filling kernels, as important baselines, are also compared.

Best Result of a Single View (BSV).
We perform clustering with the remaining samples in each view and choose the best. Because the view is incomplete, the missing samples are assigned random labels, after which the overall performance is reported.

Multiple Kernel k-Means (MKKM).
Multiple kernel k-means is applied to the zero-filled kernels.
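
Zero-filling can be sketched as masking every kernel entry that involves a missing instance. The function name and the mask convention (1 = visible, 0 = missing) are assumptions for this illustration.

```python
import numpy as np

def zero_fill(K, s):
    """Set to zero every row and column of K whose instance is missing."""
    s = np.asarray(s, dtype=K.dtype)
    return K * np.outer(s, s)

# Three samples, the second one missing in this view.
K = np.full((3, 3), 2.0)
K_zero = zero_fill(K, [1, 0, 1])
```

Only the entries between two visible samples survive; the imputation methods compared here replace these zeros with learned values instead.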

Multiple Kernel k-Means with Incomplete Kernels (MKKIK).
The algorithm proposed in [18] learns the missing parts and performs multiple kernel k-means simultaneously.

Partial View Clustering (PVC).
The subspace method proposed in [12] learns a subspace where two views' instances of the same sample are similar.

Incomplete Multimodal Visual Data Grouping (IMG).
The subspace method proposed in [14] adds a graph Laplacian term to learn a latent global graph representation and the subspace simultaneously.

K-Means-Based Consensus Clustering (KCC).
The work in [20] proposed a unified framework for k-means-based consensus clustering that can handle cases with incomplete partitions. Although this work does not focus on incomplete multiview clustering specifically, if we use the clustering results of each of the views as input partitions, it can deal with incomplete multiview clustering.

Experimental Settings.
In our experiments, the number of clusters is set as the true number of classes. Kernels are centralized and scaled during the preprocessing procedure following the suggestion put forward in [21]. Incomplete multiview data is manually produced for the complete multiview datasets. If the incomplete samples ratio (ISR) is $\varepsilon$, then $\varepsilon \times n$ samples are randomly selected as incomplete. We keep the probability that a view is missing set at $\varepsilon_0 = 0.5$. A random vector $g = (g_1, \ldots, g_m) \in [0,1]^m$ is generated for each incomplete sample. The $p$th view of an incomplete sample exists only if $g_p > \varepsilon_0$. Because at least one view should always exist for a sample, random vectors are regenerated until at least one view is available for this sample. $\varepsilon$ is varied from 0.1 to 0.9 to produce different missing patterns. For each value of $\varepsilon$, 10 random missing patterns are generated and the average performance is reported. For the proposed method, the parameter $\lambda$ is searched for in $[10^{-5}, 10^{-4}, \ldots, 10^{4}, 10^{5}]/m$, where $m$ represents the number of views; dividing by $m$ avoids the scale difference caused by the number of views. For the relatively large dataset Flower 102, we search a smaller range: $[10^{-3}, 10^{-2}, \ldots, 10^{2}, 10^{3}]/m$. For PVC, we use the code provided by the authors, and the parameter is tuned from $[10^{-6}, 10^{-5}, \ldots, 1]$ following the suggestion in [12]. For IMG, the parameter shared with PVC is set to the value tuned for PVC, and the other two parameters are set as advised in [14]. We use normalized mutual information (NMI) as the clustering evaluation [3,12].

Experimental Results.
Figure 2 shows the results on 3 Sources, the real-world incomplete multiview dataset. BSV performs poorly because it only considers information from one view. Using the multiview information fusion, MKKM with zero-filling reaches a better NMI than BSV, while MKKIK outperforms MKKM owing to a more reasonable imputation. The proposed method achieves a significant NMI boost of about 30% compared with MKKIK. Our method fills the incomplete kernels to make the clustering result of each view consistent, while MKKIK does not consider the consistency. We suggest that there may be a strong underlying consistency between views on 3 Sources, so the proposed method outperforms MKKIK in part because this consistency is considered. Moreover, our method also outperforms KCC, which is a method that does consider the consistency; we suggest that this occurs because KCC does not have an imputation process. In KCC, the consensus clustering decision is learned from the remaining incomplete partitions. Figure 3 summarizes the results on the three artificial incomplete multiview datasets: Flower 17, Flower 102, and Digital. It can be observed that the proposed method consistently achieves the best NMI compared with the state-of-the-art methods across different ISRs. Moreover, the proposed method significantly outperforms the second-best method across different ISRs.
For example, the proposed method outperforms the second-best method by around 20% on Digital when the ISR is 0.1. It is also notable that when the ISR increases, the performance of all methods decreases, which confirms the degrading effect of incomplete views. In Figure 4, we compare our method with two additional methods, PVC and IMG, which are representative subspace methods that focus on two views. We report the results on the view pairs of Digital. The proposed method consistently exhibits better performance than all other methods on all view pairs. This shows that the proposed method can also perform better than the state-of-the-art subspace methods in a two-view situation.
To summarize, the proposed method demonstrates its superiority against the state-of-the-art methods on both synthetic and real-world multiview datasets. We suggest that the imputation in the proposed method considers both clustering performances in each view and the consistency between views, which contributes to the superiority of the proposed method.
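
The random missing mechanism described in the experimental settings can be sketched as follows. The function name, the rounding of the number of incomplete samples, and the parameter names are assumptions; the resampling loop follows the stated rule that every sample keeps at least one view.

```python
import numpy as np

def make_missing_pattern(n, m, isr, p_missing=0.5, rng=None):
    """Return an n x m zero-one matrix S with S[i, p] = 0 for missing instances."""
    rng = np.random.default_rng() if rng is None else rng
    S = np.ones((n, m), dtype=int)
    n_incomplete = int(round(isr * n))            # rounding is an assumption
    incomplete = rng.choice(n, size=n_incomplete, replace=False)
    for i in incomplete:
        while True:
            g = rng.random(m)                     # uniform vector in [0, 1)^m
            row = (g > p_missing).astype(int)     # view p exists if g_p > p_missing
            if row.sum() >= 1:                    # keep at least one view
                break
        S[i] = row
    return S

S = make_missing_pattern(n=100, m=3, isr=0.3, rng=np.random.default_rng(0))
```

Note that a sample selected as incomplete can still end up with all of its views present when every entry of its random vector exceeds the threshold, so the fraction of samples with a missing view is at most the ISR under this reading.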

Convergence Study.
As was proved in the previous section, the proposed algorithm is guaranteed to be convergent. Here we empirically validate the convergence property, as illustrated in Figure 5. Due to space limitations, we show the objective curve and NMI curve when the incomplete sample ratio is 0.9 for 3 Sources and Digital. The objective values decrease as the iteration number increases and converge within 30 iterations. Although the NMIs do not grow monotonically, they reach relatively large values by the time the number of iterations reaches 30.

Parameter Study.
Figure 6 illustrates how the parameter $\lambda$ influences clustering performance. On Digital, Flower 102, and Flower 17, we present the performance curves for three ISRs: 0.3, 0.5, and 0.7. On 3 Sources, performance is optimal when $\lambda = 10^{2}/m$. On Digital, the performance remains relatively stable as the parameter changes. On Flower 102, the performance maintains a relatively high level when $\lambda$ is greater than $10^{-2}/m$. On Flower 17, the performance is sensitive to the parameter when $\lambda$ is larger than $10/m$. Overall, across the four datasets, the performance tends to be better when $\lambda$ is larger. According to (7), when $\lambda$ is larger, the clustering results between views should have greater consistency. Thus, better performance when $\lambda$ is larger indicates relatively strong consistency between views on these datasets. It should also be emphasized that although the performance on Flower 17 is relatively sensitive to $\lambda$, Figure 3 indicates that the proposed method still outperforms other methods for worse choices of $\lambda$. When applying the proposed method on other datasets, we recommend a comparatively large value of $\lambda$ in cases when views share a substantial amount of common information.

Conclusion
In this paper, we have proposed a consensus kernel k-means clustering method for incomplete multiview data in which a consensus clustering decision and the missing parts of the incomplete kernels are learned. In this way, the imputation of incomplete kernels leads to better clustering of each view and maintains consistency between views, which benefits the final clustering decision. Comprehensive experiments validate the clustering performance improvement of the proposed method compared with state-of-the-art methods.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.