
Robust auto-weighted multi-view subspace clustering with common subspace representation matrix

  • Wenzhang Zhuge,

    Affiliation Department of Mathematics and System Science, National University of Defense Technology, Changsha, Hunan, China

  • Chenping Hou ,

    hcpnudt@hotmail.com

    Affiliation Department of Mathematics and System Science, National University of Defense Technology, Changsha, Hunan, China

  • Yuanyuan Jiao,

    Affiliation The College of Nine, National University of Defense Technology, Changsha, Hunan, China

  • Jia Yue,

    Affiliation Key Laboratory, Taiyuan Satellite Launch Center, Taiyuan, Shanxi, China

  • Hong Tao,

    Affiliation Department of Mathematics and System Science, National University of Defense Technology, Changsha, Hunan, China

  • Dongyun Yi

    Affiliation Department of Mathematics and System Science, National University of Defense Technology, Changsha, Hunan, China

Abstract

In many computer vision and machine learning applications, the data are distributed on certain low-dimensional subspaces. Subspace clustering is a powerful technique for finding the underlying subspaces and clustering data points correctly. However, traditional subspace clustering methods can only be applied to data from one source, and how to extend these methods so that they combine information from various data sources has become an active research area. Previous multi-view subspace methods learn multiple subspace representation matrices simultaneously, and the learning tasks for different views are treated equally. After obtaining the representation matrices, they stack up the learned matrices to form the common underlying subspace structure. However, for many problems, both the importance of the sources and the importance of the features within one source can vary, which makes the previous approaches ineffective. In this paper, we propose a novel method called Robust Auto-weighted Multi-view Subspace Clustering (RAMSC). In our method, the weights for both the sources and the features are learned automatically by utilizing a novel re-weighting trick and introducing a sparse norm. More importantly, the optimization variable of our method is a common representation matrix which directly reflects the common underlying subspace structure. A new efficient algorithm is derived to solve the formulated objective, with a rigorous theoretical proof of its convergence. Extensive experimental results on five benchmark multi-view datasets demonstrate that the proposed method consistently outperforms state-of-the-art methods.

1 Introduction

In many applications such as computer vision, data mining, pattern recognition and machine learning, it is often assumed that the data points are drawn from multiple low-dimensional subspaces, with each subspace corresponding to one category or class. Subspace clustering [1, 2] aims to explore the underlying subspaces and cluster the data accordingly.

Early subspace clustering methods can be roughly grouped into two categories: algebra based methods such as [3, 4], and statistics based methods such as [5, 6]. Recently, many methods [7–16] belonging to a new category, i.e., spectral clustering based methods [1], have been proposed and have achieved state-of-the-art performance. The core idea of spectral clustering based methods is to apply the self-representation property to compute affinities, i.e., to represent every data point by a linear combination of the other data points. However, these methods mostly focus on features from a single source rather than multiple ones.

In real applications, data are often collected from diverse domains or obtained from different feature extractors, so multi-view data are very common. For example, in computer vision, each image can be described by its color, texture, shape and so on. In web mining, each web page can be characterized by its content and its link information, which are two distinct descriptions or views. In multi-lingual information retrieval, a document can be represented in several different languages. Since these different features provide useful information from different views, and since single-view subspace clustering methods have shown good performance, it is crucial to integrate these heterogeneous features to create more accurate and robust multi-view subspace clustering methods.

More recently, a number of multi-view subspace clustering methods have been proposed [17–19]. Diversity-induced multi-view subspace clustering (DiMSC) was proposed in [17] to perform subspace clustering on the different views simultaneously with a diversity term on the multiple representation matrices. Multi-view subspace clustering (MVSC) was introduced in [18] to perform clustering on the subspace representation of each view simultaneously with a common cluster structure. Low-rank tensor constrained multi-view subspace clustering (LT-MSC), proposed in [19], performs subspace clustering on the different views simultaneously with a low-rank tensor constraint, where the tensor is constructed from the subspace representation matrices. After obtaining the subspace representation matrices, these methods use them to construct similarity matrices for the different views independently and stack these similarity matrices into a common one which represents the underlying common structure across views. However, these methods neglect the different importance of the views, and the performance of their unified similarities may suffer when a less informative view is present.

In this paper, we try to solve the problem of subspace clustering for multi-view data. A novel method, named Robust Auto-weighted Multi-view Subspace Clustering (RAMSC), is presented. Different from the previous approaches [17–19], which treat different views equally and obtain a representation matrix for each view, our method assigns a suitable weight to each view and aims to learn a common representation matrix across the different views that reflects the underlying common structure. Besides, the view weight factors are tuned automatically, and this process does not need any additional parameters. Moreover, by introducing a sparse norm, our method is robust to inaccurate features. We provide an effective algorithm to solve the proposed non-smooth minimization problem and prove that the algorithm converges. In the algorithm, a feature weight matrix is learned for each view, and we also propose a new way to construct the common similarity matrix by utilizing the view weight factors and the feature weight matrices. Compared with related state-of-the-art clustering methods, our proposed method consistently achieves better performance on five benchmark multi-view data sets.

The rest of this paper is organized as follows. Section 2 introduces the background and motivation of this paper. In Section 3, we propose our method RAMSC together with a solving algorithm. In Section 4, we present detailed analyses of the proposed algorithm, including its convergence behavior, computational complexity and parameter determination. Experimental results and conclusions are presented in Section 5 and Section 6, respectively.

2 Background and motivation

In this section, we first introduce some notation and then briefly review previous subspace clustering methods to show our research motivation.

2.1 Notations

Throughout this paper, matrices and vectors are written in boldface uppercase letters and boldface lowercase letters, respectively. For a vector m, its 2-norm is denoted by ||m||2, and m(v) denotes that m is derived from the v-th view. For a matrix M, we denote its i-th row, j-th column and ij-th element by mi:, m:j and mij, respectively. The trace of M is denoted by Tr(M), and M(v) denotes a matrix M derived from the v-th view representation. The r,p-norm of a matrix M is defined as [20, 21]
(1) $\|M\|_{r,p} = \Big(\sum_{i}\Big(\sum_{j}|m_{ij}|^{r}\Big)^{p/r}\Big)^{1/p}.$
When r ≥ 1 and p ≥ 1, the r,p-norm is a valid norm because it satisfies the three norm conditions.
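For concreteness, a small numpy helper computing this row-wise r,p-norm is sketched below; the function name and interface are ours and are not part of the paper.

```python
import numpy as np

def lrp_norm(M, r=2, p=1):
    """Row-wise l_{r,p} norm: (sum_i (sum_j |m_ij|^r)^(p/r))^(1/p)."""
    row_norms = np.sum(np.abs(M) ** r, axis=1) ** (1.0 / r)  # l_r norm of each row
    return np.sum(row_norms ** p) ** (1.0 / p)
```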

2.2 Single-view and multi-view subspace clustering

Suppose $X \in \mathbb{R}^{d \times n}$ is the data matrix with d-dimensional features and n data points. Subspace clustering methods based on spectral clustering mainly consist of the following two steps:

First, the self-representation property [7] is used to represent the data matrix X as
(2) $X = XZ + E,$
where $Z \in \mathbb{R}^{n \times n}$ is the self-representation matrix, with each z:i being the representation of sample x:i, and E is the error matrix. The nonzero elements of z:i correspond to points from the same subspace. Z can be obtained by solving
(3) $\min_{Z \in \mathcal{C},\, E}\ \|E\|_{l} + \lambda\, \Omega(Z) \quad \text{s.t.}\ X = XZ + E,$
where || ⋅ ||l can be any proper norm on the error matrix E, Ω(Z) and $\mathcal{C}$ are the regularizer and constraint set on Z, respectively, and λ > 0 is a balance parameter. The existing methods [7–13] are distinguished from each other by the different constraints or regularizers they employ on Z or E;

Second, the obtained subspace structure Z is used to construct a similarity matrix S which encodes the pairwise similarity between data points [22]:
(4) $S = \frac{|Z^{\top}| + |Z|}{2}.$
Afterwards, the spectral clustering algorithm [23] can be applied to the computed similarity matrix S to get the final clustering results.
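To illustrate the two-step pipeline, the sketch below uses a least-squares instance of problem (3) (in the spirit of LSR [11], chosen here only as a convenient stand-in for a generic regularizer), builds the similarity of Eq (4), and runs spectral clustering. The function and variable names are ours, and the closed-form solver is an assumption tied to this particular regularizer.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def single_view_subspace_clustering(X, n_clusters, lam=1.0):
    """X: d x n data matrix; returns cluster labels.
    Step 1: self-representation X ~ XZ with a Frobenius-norm regularizer (LSR-style stand-in).
    Step 2: similarity S = (|Z^T| + |Z|)/2 followed by spectral clustering."""
    n = X.shape[1]
    # Closed form of min_Z ||X - XZ||_F^2 + lam * ||Z||_F^2
    Z = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ X)
    S = (np.abs(Z.T) + np.abs(Z)) / 2.0
    return SpectralClustering(n_clusters=n_clusters,
                              affinity='precomputed').fit_predict(S)
```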

For multi-view data, suppose that V is the number of views and that X(1), X(2), …, X(V) denote the data matrices of the views, where $X^{(v)} \in \mathbb{R}^{d^{(v)} \times n}$ for v = 1, 2, …, V and d(v) is the dimensionality of the v-th view. Single-view subspace clustering methods cannot be applied to multi-view data directly to obtain a representation matrix. One naive strategy is to concatenate all the features together as a new view, and then employ single-view methods on the concatenated features. However, this strategy ignores the differences among the multiple views.

The previous multi-view subspace clustering methods consider that, for each single view, a subspace representation should be learned. They stack up these V tasks and focus on how to explore the relationships among the V representation matrices Z(v) so that they can be learned simultaneously. Two strategies common in multi-view learning are adopted to achieve this goal. The first is to explore complementary information from multiple views: DiMSC [17] explores the complementarity of the representations Z(v) by applying the Hilbert-Schmidt independence criterion as a diversity term, while LT-MSC [19] explores the complementary information by regarding the subspace representation matrices Z(v) as a tensor and equipping the tensor with a low-rank constraint. The second is to explore the consistency among multiple views: MVSC [18] explores the consistency of the representations Z(v) by performing subspace clustering on each modality separately and then unifying them with a common indicator matrix.

After obtaining a representation Z(v) for each view, all the above-mentioned multi-view subspace clustering methods construct a similarity matrix S by
(5) $S = \sum_{v=1}^{V} \frac{|Z^{(v)\top}| + |Z^{(v)}|}{2},$
and then apply the spectral clustering algorithm [23] to S to obtain the clustering results.
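A minimal sketch of this stacking strategy is given below, assuming the per-view representations Z(v) have already been computed (e.g., by a single-view method as above); whether the original Eq (5) carries an extra normalization factor is not recoverable here, so none is applied.

```python
import numpy as np

def stack_similarities(Z_views):
    """Combine per-view representations into one similarity matrix as in Eq (5)."""
    n = Z_views[0].shape[0]
    S = np.zeros((n, n))
    for Z_v in Z_views:
        S += (np.abs(Z_v.T) + np.abs(Z_v)) / 2.0  # per-view graph, added with equal weight
    return S
```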

Although these multi-view subspace clustering methods have achieved good performance, there are mainly two drawbacks of these methods which leave room to improve the clustering performance:

  1. These methods treat different views equally and neglect the different importance of different views. When they learn Z(v), each view plays an equally important role. When they construct the similarity matrix S, Eq (5) can be considered as $S = \sum_{v=1}^{V} S^{(v)}$, where S(v) = (|Z(v)⊤| + |Z(v)|)/2 is a graph similarity matrix constructed from the v-th view. This strategy may suffer when an unreliable similarity matrix is added to the sum.
  2. Eq (5) can also be considered as S = (|Z⊤| + |Z|)/2, where the aggregated Z can be considered as the underlying common structure across different views. The optimization variables of these methods are the representation matrices Z(v); however, the final clustering results are determined by the common structure Z. This brings the drawback that, although each Z(v) may have good properties because of the constraints or regularization terms, Z may not inherit these properties.

To address these two challenges, we introduce our proposed multi-view subspace clustering method in the next section.

3 Formulation and solution

In this section, we first introduce the formulation of our method, and then present an alternating algorithm to solve it.

3.1 Formulation

To overcome these two drawbacks, we propose a novel robust auto-weighted multi-view subspace clustering method. Our proposed RAMSC utilizes a reasonable way to set the view weight factors automatically and learns a common subspace representation Z which can be directly used to construct the common similarity matrix S across different views. Thus an important view can obtain a large weight, and the constraints or regularization terms can be placed on Z, which determines the final clustering results. The objective function of RAMSC is (6), where λ is a tradeoff factor, $\|\cdot\|_{2,p}^{p}$ is the sparsity-inducing norm with 0 ≤ p ≤ 1, and each Ω(v)(Z) is a smooth regularization term. Denote the representation error matrix of the v-th view as
(7) $E^{(v)} = X^{(v)} - X^{(v)}Z.$
The 2,p-norm (to the p-th power) of this matrix is defined as
(8) $\|E^{(v)}\|_{2,p}^{p} = \sum_{i=1}^{d^{(v)}} \|e^{(v)}_{i:}\|_{2}^{p},$
where $e^{(v)}_{i:}$ is the i-th row of E(v) and ||E(v)||2,p is the 2,p-norm as defined in Eq (1) with r = 2. Ω(v)(Z) aims to smooth the distribution of the common representation Z on the v-th view. These V smooth regularization terms Ω(v)(Z) enforce the common subspace representation matrix Z to exhibit the grouping effect. Analogous smooth regularization terms are also employed by [13, 17, 24]. Specifically, in our method, each regularization term Ω(v)(Z) for v = 1, 2, …, V is defined as
(9) $\Omega^{(v)}(Z) = \mathrm{Tr}\big(Z L^{(v)} Z^{\top}\big),$
where $W^{(v)} \in \mathbb{R}^{n \times n}$ is the weight matrix measuring the spatial closeness of the data points in the v-th view, L(v) = D(v) − W(v) is the Laplacian matrix, and the degree matrix D(v) is the diagonal matrix with $d^{(v)}_{ii} = \sum_{j} w^{(v)}_{ij}$. W(v) can be constructed in many different ways [25–29]. To show the robustness of our method, we construct 0-1 binary weighted k-nn graphs for each view, and k is set to 5 in all experiments.
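A sketch of this graph construction (0-1 binary 5-nn weights and the combinatorial Laplacian L(v) = D(v) − W(v)) is given below; the symmetrization rule is an assumption, since the text only states that 0-1 binary k-nn graphs are used.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def knn_laplacian(X_v, k=5):
    """X_v: d^(v) x n data matrix of one view.
    Returns (W, L) with W a 0-1 k-nn affinity matrix and L = D - W."""
    # kneighbors_graph expects samples in rows, so transpose to n x d^(v)
    W = kneighbors_graph(X_v.T, n_neighbors=k, mode='connectivity').toarray()
    W = np.maximum(W, W.T)      # symmetrize: i and j are linked if either selects the other
    D = np.diag(W.sum(axis=1))  # degree matrix
    return W, D - W
```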

Intuitively, there is no weight factor explicitly defined in Eq (6), and all the views seem to be treated equally. However, the following analysis shows that Eq (6) provides a reasonable way to learn the weight factor of each view. The Lagrange function of problem (6) can be written as (10). Taking the derivative of Eq (10) with respect to Z and setting the derivative to zero, we have (11), where α(v) is given by (12). Eq (11) cannot be solved directly because α(v) depends on the target variable Z. However, if α(v) is considered as the weight factor of the v-th view and its value is given or held stationary, Eq (11) can be considered as the optimality condition of the following problem with the α(v) calculated or given: (13). Solving problem (13) to obtain the common representation matrix Z seems more reasonable. This problem can be considered as a sum of two parts with a tradeoff factor λ. The first part is a linear combination of the subspace representation errors on each view; increasing α(v) tends to reduce the representation error on the v-th view. The second part smooths Z on a linear combination of Laplacian matrices with suitable weights α(v), i.e., $L = \sum_{v=1}^{V} \alpha^{(v)} L^{(v)}$. According to [30–32], the accuracy of L can be higher than that of each individual L(v) or of their unweighted sum $\sum_{v} L^{(v)}$.
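For concreteness, the block below sketches one common instantiation of this kind of auto-weighting (the parameter-free re-weighting trick of [37]). Whether Eqs (6), (12) and (13) take exactly this form is an assumption of the sketch; it is included only to make the coupling between α(v) and Z explicit.

```latex
% Assumed per-view cost combining representation error and smoothness:
\[
  f^{(v)}(Z) \;=\; \bigl\|X^{(v)} - X^{(v)}Z\bigr\|_{2,p}^{p}
             \;+\; \lambda\,\mathrm{Tr}\!\bigl(Z L^{(v)} Z^{\top}\bigr).
\]
% An auto-weighted objective of the form
\[
  \min_{Z}\ \sum_{v=1}^{V} \sqrt{f^{(v)}(Z)}
\]
% has the stationarity condition
\[
  \sum_{v=1}^{V} \alpha^{(v)}\,\frac{\partial f^{(v)}(Z)}{\partial Z} \;=\; 0,
  \qquad
  \alpha^{(v)} \;=\; \frac{1}{2\sqrt{f^{(v)}(Z)}},
\]
% so that, with the weights held fixed, Z minimizes the weighted sum
% \sum_v \alpha^{(v)} f^{(v)}(Z), and a view with a small cost automatically
% receives a large weight.
```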

Supposing that the common representation Z can be calculated from Eq (13), this Z can be used to update α(v) according to Eq (12). Learning α(v) in this way has the following reasonable explanations and merits:

  1. If the v-th view is good, then $\|E^{(v)}\|_{2,p}^{p}$ and $\mathrm{Tr}(Z L^{(v)} Z^{\top})$ should be small, and thus, according to Eq (12), the learned α(v) is large.
  2. The 2,p-norm of E(v) enforces the p-norm along the feature direction of the representation error matrix E(v) and the 2-norm along the data point direction. Thus, when 0 ≤ p ≤ 1, the effect of inaccurate features on the learning of α(v) is reduced by the p-norm.
  3. Unlike [31–35], which depend on an extra parameter to smooth the distribution of the view weights, learning α(v) by Eq (12) involves no parameter to tune and naturally avoids the trivial solution.

Although problem (13) has a more reasonable form for learning a good common Z, there are difficulties in solving it, which come from the following two aspects: (1) the 2,p-norm terms are non-smooth; (2) when α(v) is calculated by Eq (12), α(v) and Z are coupled with each other. In the next subsection, we propose an alternating algorithm to tackle them efficiently.

3.2 Optimization algorithm

To solve Eq (13), we consider the following problem to tackle the non-smooth norm: (14), where the diagonal matrix U(v) defined in (15) corresponds to the v-th view, and its i-th diagonal entry, given in (16), is a subgradient of the 2,p-norm term with respect to the i-th row of E(v). To avoid the situation in which a row of E(v) has zero norm, which makes this entry incomputable when p < 2, in practice we replace the 2,p-norm with a regularized 2,p-norm, defined in (17); as ϵ → 0, the regularized 2,p-norm of E(v) approaches the original one. The diagonal entries can then be regularized as in (18). This strategy avoids the bad situation of a zero denominator and guarantees that we can repeat the following alternating steps.
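Before detailing the three alternating steps, a small sketch of this re-weighting is given below; it uses the standard IRLS-style form for the regularized ℓ2,p-norm (cf. [20, 21]), so the constant p/2 and the placement of ϵ are assumptions of the sketch rather than a verbatim copy of Eqs (16) and (18).

```python
import numpy as np

def update_feature_weights(E_v, p=1.0, eps=1e-8):
    """Diagonal of U^(v) for the regularized l_{2,p}-norm of E^(v).
    Assumed IRLS-style reweighting: u_ii = (p/2) * (||e_i||_2^2 + eps)^((p-2)/2)."""
    row_sq = np.sum(E_v ** 2, axis=1)  # squared l_2 norm of each row of E^(v)
    u = 0.5 * p * (row_sq + eps) ** ((p - 2.0) / 2.0)
    return np.diag(u)
```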

  • The first step is to fix U(v) and α(v) and update the common subspace representation Z.
    Differentiating the objective function J with respect to Z and setting the derivative to zero yields Eq (19), with the coefficient matrices defined in Eq (20). Eq (19) is a standard Sylvester equation and, according to [36], it has a unique solution.
  • The second step is to fix α(v) and Z and update the feature weight matrix U(v) for each view.
    The representation error matrix E(v) of each view is calculated from the current Z, and then each diagonal element of U(v) is updated by Eq (16) or (18).
  • The third step is to fix Z and U(v) and update the view weight factor α(v) for each view by Eq (12).

By the above three steps, we alternately update Z, U(v) and α(v), and repeat the process iteratively. At this point, we can draw the following conclusions:

  1. The alternating optimization converges, and Z*, the converged value of Z, is at least a locally optimal solution of Eq (6). (We prove this conclusion in the next section.)
  2. The second conclusion concerns initialization. Since the procedure reaches a local optimum of Eq (6), a sensible initialization is important. We initialize all views with equal weight factors, as in previous approaches [37, 38], and, as in previous work [21, 39], we initialize U(v) = I(v), since every feature in each view has the same importance at the beginning.

After obtaining the common self-representation matrix Z*, the similarity matrix S1 can be defined as
(21) $S_1 = \frac{|Z^{*\top}| + |Z^{*}|}{2},$
and the spectral clustering algorithm can then be used to produce the final clustering results, as adopted by traditional single-view subspace clustering methods.

Some single-view subspace clustering methods also use other ways to construct the similarity matrix [13]. In this paper, to better exploit the grouping effect, we further utilize the learned view weight factors α(v) and feature weight matrices U(v) to define a new similarity matrix S2 as in Eq (22), where the new i-th data point, defined in Eq (23), concatenates the re-weighted features of all views, with the features of the v-th view re-weighted by α(v) and U(v). γ > 0 is utilized to control the similarity variance. The new similarity measure can be considered as the inner product of the new common representation vectors, normalized by the norms of the new features, which are weighted by the view weight factors α(v) and feature weight matrices U(v).

Based on the above analysis, we summarize the procedures of our method RAMSC in Algorithm 1.

Algorithm 1 Algorithm to solve RAMSC in Eq (6)

Input:

1. Data matrices for V views {X(1), ⋯, X(V)}, where $X^{(v)} \in \mathbb{R}^{d^{(v)} \times n}$,

2. The expected number of clusters c,

3. The parameters λ, p, k and γ.

Initialize:

1. Initialize the feature weight matrix U(v) = I(v) for each view, where I(v) is the d(v) × d(v) identity matrix.

2. Initialize the view weight factor α(v) for each view with equal values.

3. Build the 0-1 weighted k-nn graphs W(v) and compute the corresponding Laplacian matrices L(v) for each view.

while not converged do

 1. Compute the common representation Z by solving the Sylvester equation (19).

 2. Update the diagonal feature weight matrix U(v) for each view; its diagonal elements are updated by Eq (16) or (18).

 3. Update the view weight factor α(v) for each view by Eq (12).

end while

4. Compute the similarity matrix by either Eq (21) or Eq (22).

5. Use spectral clustering algorithm to obtain c clusters.

Output: Clustering result.
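The listing below is a compact numpy/scipy sketch of Algorithm 1 under the assumptions spelled out earlier (auto-weighting of the form sketched after problem (13), IRLS-style feature weights, and a Sylvester solve for Z with coefficient matrices derived from problem (14)); it is an illustrative reading of the algorithm, not the authors' MATLAB implementation provided in S1 Appendix. The Laplacians L_views can be built, for instance, with the knn_laplacian helper sketched in Section 3.1.

```python
import numpy as np
from scipy.linalg import solve_sylvester
from sklearn.cluster import SpectralClustering

def ramsc(X_views, L_views, n_clusters, lam=50.0, p=1.0, n_iter=30, eps=1e-8):
    """X_views: list of d^(v) x n data matrices; L_views: list of n x n graph Laplacians."""
    V, n = len(X_views), X_views[0].shape[1]
    U = [np.eye(X.shape[0]) for X in X_views]  # feature weight matrices U^(v) = I^(v)
    alpha = [1.0 / V] * V                      # equal view weight factors

    for _ in range(n_iter):
        # Step 1: update Z by solving the Sylvester equation A Z + Z B = C,
        # obtained by differentiating the weighted objective and setting it to zero.
        A = sum(a * X.T @ Uv @ X for a, X, Uv in zip(alpha, X_views, U))
        B = lam * sum(a * L for a, L in zip(alpha, L_views))
        Z = solve_sylvester(A, B, A)  # here C = A, since C = sum_v a^(v) X^(v)T U^(v) X^(v)

        # Step 2: update the diagonal feature weights U^(v) (assumed IRLS-style form).
        E = [X - X @ Z for X in X_views]
        U = [np.diag(0.5 * p * (np.sum(Ev ** 2, axis=1) + eps) ** ((p - 2.0) / 2.0))
             for Ev in E]

        # Step 3: update the view weight factors alpha^(v) (assumed auto-weighting form).
        cost = [np.sum((np.sum(Ev ** 2, axis=1) + eps) ** (p / 2.0))
                + lam * np.trace(Z @ L @ Z.T)
                for Ev, L in zip(E, L_views)]
        alpha = [1.0 / (2.0 * np.sqrt(c)) for c in cost]

    S1 = (np.abs(Z.T) + np.abs(Z)) / 2.0  # similarity matrix of Eq (21)
    return SpectralClustering(n_clusters=n_clusters,
                              affinity='precomputed').fit_predict(S1)
```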

4 Performance analysis

4.1 Convergence analysis

To prove that the proposed Algorithm 1 converges and reaches at least a locally optimal solution of Eq (6), we first introduce the following lemma [21].

Lemma 1 When 0 < p ≤ 2, for any positive numbers a and b, the following inequality holds (it follows from the concavity of the function t ↦ t^{p/2} on (0, ∞)):

(24) $a^{\frac{p}{2}} - \frac{p}{2}\,\frac{a}{b^{\frac{2-p}{2}}} \;\le\; b^{\frac{p}{2}} - \frac{p}{2}\,\frac{b}{b^{\frac{2-p}{2}}}.$

Theorem 1 Each updated Z in Alg. 1 will monotonically decrease the objective of the problem (13) in each iteration.

Proof: Denote by $\tilde{Z}$ the updated Z in an iteration and by $\tilde{E}^{(v)} = X^{(v)} - X^{(v)}\tilde{Z}$ the v-th representation error matrix calculated from $\tilde{Z}$. According to the update of Z in Alg. 1, $\tilde{Z}$ is the unique optimal solution of problem (14) when α(v) and U(v) are fixed, so (25) holds. Substituting the definition of the weight matrices U(v), this inequality can be rewritten as (26). Generally, the row norms involved are nonzero, and the regularized 2,p-norm can be used to guarantee this. According to Lemma 1, we can derive (27), and thus the following inequality holds: (28). Summing Eqs (26) and (28), we have (29). Thus the objective of problem (13) is decreased by the update $\tilde{Z}$ in each iteration.

Theorem 2 Each update of Z in Alg. 1 monotonically decreases the objective of problem (6) in each iteration, which makes the converged Z* a locally optimal solution.

Proof: Define the updated view weights $\tilde{\alpha}^{(v)}$ as in Eq (30). According to Eq (29), $\tilde{Z}$ gives the objective of Eq (13) a smaller value than Z does. Combining this with the view weight factors α(v), we can derive (31). According to Lemma 1 with p = 1, we also have (32). Summing Eqs (31) and (32), we arrive at (33). Thus the alternating optimization monotonically decreases the objective of problem (6) in each iteration until it converges. At convergence, Z* satisfies Eq (11), which is the KKT condition of problem (6). Therefore, Z* is at least a locally optimal solution of problem (6).

4.2 Computational complexity and parameter determination

As seen from the procedure of RAMSC in Algorithm 1, we solve the problem in an alternating way. The computational cost of each subproblem is as follows. (1) The Sylvester equation in Eq (19) can be solved by the Bartels-Stewart algorithm, whose computational complexity is $O(n^{3})$; (2) The problem in Eq (16) can be solved efficiently by computing the 2-norm of each row of E(v); (3) Updating the optimal weight of each view by Eq (12) only requires evaluating the per-view representation error and smoothness terms. In summary, the total computational complexity of RAMSC grows linearly with the number of iterations T.

Since parameter determination is still an open problem [40, 41], we determine the parameters of our method empirically, as in previous studies. As for p, it is designed to add sparsity to the representation error matrices E(v), which alleviates the effect of inaccurate features on the learning of α(v). Paper [43] is a timely and comprehensive survey of structured sparsity and a good entry point to the sparse learning field. Following it, we set p = 1, a setting that has proven effective in most applications [20, 42]. As for k, it is the number of neighbors used to construct the graphs W(v). Methods [13, 17] using similar regularization terms perform stably under different k, so we construct 5-nn graphs.

As for the parameter λ, it is vital to the final performance since it balances the self-representation accuracy and the smoothness of Z. Since there is no prior information about λ, we determine it by grid search in a heuristic way, as in previous studies [13, 17, 42]. Concretely, λ is tuned over 1, 2 and then 5 to 60 in steps of 5 to get the best λ. When Eq (22) is used to construct the similarity matrix, γ is searched from 0.1 to 2 in steps of 0.2 to get the best γ.

5 Experiments

In this section, the proposed RAMSC is evaluated on five widely used data sets, and numerical results on its convergence behavior and its parameter sensitivity are also reported.

5.1 Data set descriptions

To validate the effectiveness of our method, we use five multi-view benchmark datasets. They cover various kinds of data arising in real applications with different characteristics and are commonly used in multi-view learning: Microsoft Research Cambridge Volume 1 (MSRC-v1) [44], Caltech101 [45], NBA-NASCAR [46], Handwritten Dutch Digit Recognition (Digit) [47] and Web Knowledge Base (WebKB) [48]. The statistics of the five data sets are summarized in Table 1, and detailed information about them is given below.

  • The MSRC-v1 data set consists of 240 images divided into 8 classes. Following [49], we select 7 classes, composed of tree, building, airplane, cow, face, car and bicycle, each with 30 images. To distinguish the scenes, we extract 256 Local Binary Pattern (LBP), 100 Histogram of Oriented Gradient (HOG), 512 GIST, 1302 CENTRIST, 48 Color Moment (CMT) and 200 SIFT features.
  • The Caltech101-7 data set is drawn from Caltech101, which is composed of 8677 object images belonging to 101 categories. We select 7 widely used classes: Dollar-Bill, Faces, Garfield, Motorbikes, Snoopy, Stop-Sign and Windsor-Chair. Following [50], the resulting data set contains 441 images in total. To obtain different views, we extract 256 LBP, 100 Pyramid HOG (PHOG), 512 GIST, 32 Gabor texture, 200 SURF and 200 SIFT features.
  • The Digit data set contains 2,000 data points for the ten digit classes 0 to 9, with 200 data points per class. Six published feature sets can be used for multi-view clustering: 76 Fourier coefficients of the character shapes (FOU), 216 profile correlations (FAC), 64 Karhunen-Loève coefficients (KAR), 240 pixel averages in 2 × 3 windows (PIX), 47 Zernike moments (ZER) and 6 morphological (MOR) features.
  • The NBA-NASCAR data set is collected from the sports gallery of the Yahoo! website in 2008. Following [46], this data set consists of 420 NBA images and 420 NASCAR images. Each image has an attached short text describing it. To obtain different views, each image is normalized to 1024 gray-level features, and 296 TF-IDF features are extracted from each text.
  • The WebKB data set is a subset of web documents from four universities. It consists of 1051 pages classified into 2 classes: 230 Course pages and 821 Non-Course pages. Each page has 2 views: the Fulltext view contains 2949 features representing the textual content of the web page, and the Inlinks view consists of 334 features recording the anchor text on the hyperlinks pointing to the page.
Table 1. Details of the multiview datasets used in our experiments (view type (dimensionality)).

https://doi.org/10.1371/journal.pone.0176769.t001

5.2 Experimental setup

To evaluate the performance of our method, we compare it with its single-view counterpart on each view. Single-view methods on the concatenated features are also compared. Besides, we compare with other state-of-the-art methods, including robust multi-view K-means clustering (RMKMC) [33], pair-wise co-regularized multi-modal spectral clustering (PC-SPC) [30], centroid co-regularized multi-modal spectral clustering (CC-SPC) [30], multi-view subspace clustering (MVSC) [18] and diversity-induced multi-view subspace clustering (DiMSC) [17].

  • SPC: We employ the standard spectral clustering (SPC) [23] algorithm directly on each view, and report the results as baselines.
  • SMR: We first run smooth representation clustering (SMR) [13] on each view features to get the subspace representations, and then run spectral clustering on such representations.
  • SPC-CON and SMR-CON: We first concatenate all features together as a new single view, and then run SPC [23] and SMR [13] respectively on it.
  • RMKMC: The robust multi-view K-means clustering method obtains the common cluster indicators across multiple views by minimizing the linear combination of the relaxed K-means on each view with learned weight factors.
  • PC-SPC: This method enforces corresponding points in different modalities to have the same cluster membership via a pair-wise co-regularization term, which encourages the different views to agree with each other.
  • CC-SPC: This method is similar to PC-SPC, except that it uses a centroid-based co-regularization term, which encourages the different views to agree with a common centroid view.
  • MVSC: This method performs subspace clustering on each modality separately and then unifies them with a common indicator matrix.
  • DiMSC: This method learns subspace representations and employs the Hilbert-Schmidt Independence Criterion to enhance complementary information.

For a fair comparison, we download the source codes of the compared methods from the authors' websites and follow the experimental settings and parameter tuning steps in their papers to obtain their best parameters. For RAMSC, we construct 0-1 binary 5-nn graphs W(v) for each view and fix p to 1 in all experiments. Thus only one parameter, λ, needs to be tuned; we search for the best λ over 1, 2 and then 5 to 60 in steps of 5. RAMSC(S2) denotes the variant that uses Eq (22) to construct S2, where the best parameter γ is searched from 0.1 to 2 in steps of 0.2. The reported experimental results correspond to the best parameters.

Before clustering, we first normalize each view of the multi-view data so that all values lie in the range [−1, 1]. All experiments are repeated 50 times independently, and the mean and standard deviation of the results are reported.

Three standard clustering evaluation metrics are utilized to measure the multi-view clustering performance, that is, Clustering Accuracy (ACC), Normalized Mutual Information (NMI) and Purity.

5.3 Experimental results

The experimental results on the five datasets with the three metrics are shown in Tables 2, 3, 4, 5 and 6, from which we have the following observations.

  1. From Tables 2, 3, 4, 5 and 6, we conclude that our proposed method outperforms the competing methods on all the benchmark datasets. Although ACC, NMI and Purity are three different evaluation metrics, they all indicate the advantage of our method. The clustering results also show the effectiveness of constructing the similarity matrix by Eq (22): compared with Eq (21), it achieves better or at least comparable performance.
  2. From Tables 2, 3 and 4, it can be seen that some individual view features are more discriminative for clustering. As for the comparison between single-view methods and previous multi-view approaches, the previous multi-view clustering methods cannot always achieve better performance. This may be because previous methods characterize the structure of each view separately and combine them by simple addition, which allows inaccurate structures to affect the final clustering results. Our approach performs better than single-view methods in most cases because it assigns small weight factors to inaccurate views and learns a common self-representation matrix Z that can be used to construct a common similarity matrix S across different views.
  3. Tables 5 and 6 show the robustness of our method. On the NBA-NASCAR data set, none of the competing methods except RMKMC achieves reasonable performance. This is because RMKMC utilizes a weight factor for each view and a sparsity-inducing norm to eliminate the influence of outliers, while the other competing methods consider neither the outliers nor the sparsity of the input data. Compared with RMKMC, our method learns the view weight factors automatically without an additional parameter and uses the sparsity-inducing ℓ2,p-norm to eliminate the influence of inaccurate features. Our method performs better on the NBA-NASCAR data set and still achieves good performance on the WebKB data set, where all the other compared methods fail.
Table 2. Clustering results of different methods on MSRC-v1 data set. (mean(± std)).

(In the following five result tables, the two best results for each metric are shown in bold.)

https://doi.org/10.1371/journal.pone.0176769.t002

Table 3. Clustering results of different methods on Caltech101-7 data set. (mean(± std)).

https://doi.org/10.1371/journal.pone.0176769.t003

Table 4. Clustering results on the Digit data set. (mean(± std)).

https://doi.org/10.1371/journal.pone.0176769.t004

Table 5. Clustering results on NBA-NASCAR data set. (mean(± std)).

https://doi.org/10.1371/journal.pone.0176769.t005

Table 6. Clustering results on WebKB data set. (mean(± std)).

https://doi.org/10.1371/journal.pone.0176769.t006

5.4 Convergence behavior

In order to verify the convergence of Algorithm 1, we present numerical results on its convergence behavior on the MSRC-v1 and Caltech101-7 datasets.

The convergence curves are displayed in Fig 1. As shown in Fig 1, the objective values of Eq (6) are non-increasing during the iterations and converge to a fixed value. Moreover, our algorithm converges within 10 iterations, indicating a fast convergence speed.

Fig 1. Convergence behaviors of RAMSC with λ = 50 on two datasets.

(A) MSRC-v1; (B) Caltech101-7.

https://doi.org/10.1371/journal.pone.0176769.g001

5.5 Parameter determination

As for the parameter determination problem, we conduct experiments on two data sets, NBA-NASCAR and WebKB, for evaluation. Since we fix k = 5 and p = 1, only the balance parameter λ needs to be tuned when Eq (21) is used to construct S1. We vary it over {1, 2, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60}, and ACC and NMI are employed as the evaluation criteria. The results are shown in Fig 2.

Fig 2. ACC and NMI of RAMSC with different selection of parameter λ.

(A) NBA-NASCAR; (B) WebKB.

https://doi.org/10.1371/journal.pone.0176769.g002

When we use Eq (22) to construct S2, there is an additional parameter γ. To show the influence of λ and γ on RAMSC(S2), we vary λ from {2, 10, 20, 30, 40, 50, 60}, and γ is varied from {0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3, 1.5, 1.7, 1.9}. ACC is employed as the evaluation criterion.

As we can see from the results in Figs 2 and 3, the final clustering results of RAMSC and RAMSC(S2) are affected by different choices of λ and by different combinations of λ and γ, respectively. Besides, RAMSC has a different optimal λ on the two data sets, and RAMSC(S2) has a different optimal combination of λ and γ, because the two data sets have different characteristics.

Fig 3. ACC and NMI of RAMSC(S2) with different combinations of parameters λ and γ.

(A) NBA-NASCAR; (B) WebKB.

https://doi.org/10.1371/journal.pone.0176769.g003

6 Conclusion

In this paper, we have proposed a novel robust auto-weighted multi-view subspace clustering model, named RAMSC. This model naturally assigns a suitable weight to each view and learns a common representation matrix, which can be used to construct a similarity matrix directly. Moreover, by imposing a structured sparsity norm, our method is robust to inaccurate features, and the accompanying proof guarantees that the proposed algorithm converges to a locally optimal solution. Experimental results on five data sets show that our proposed method achieves higher accuracy than the state-of-the-art methods. However, several problems remain for future work:

  • A series of related methods needs to be developed and systematically compared. The core idea of our method is to learn view weights automatically and find a high-quality common subspace representation matrix. Based on this idea, we list three possible ways to develop new related methods. First, the smooth regularization terms of our method can be replaced by others; second, the sparsity norm on the error matrix could be replaced by other reasonable norms; third, our method has no constraint, and constraints on the common representation matrix or the error matrix could be added. According to specific applications, corresponding related methods can be proposed.
  • Another open problem lies in the selection of the parameters, especially the balance parameter λ, which is still unsolved for many learning algorithms. In this paper, we determine it empirically; additional theoretical analysis of this topic is also needed.

Supporting information

S1 Appendix. RAMSC.

A file containing the MATLAB code of RAMSC and the normalized datasets used in this paper.

https://doi.org/10.1371/journal.pone.0176769.s001

(ZIP)

Acknowledgments

Chenping Hou is the corresponding author of this paper.

Author Contributions

  1. Conceptualization: CH WZ.
  2. Data curation: YJ.
  3. Formal analysis: WZ CH HT.
  4. Investigation: JY.
  5. Methodology: WZ CH DY.
  6. Project administration: DY.
  7. Resources: YJ.
  8. Software: WZ JY.
  9. Supervision: DY.
  10. Writing – original draft: WZ.
  11. Writing – review & editing: HT YJ.

References

  1. 1. Vidal R. Subspace Clustering. IEEE Signal Processing Magazine. 2011;28(2):52–68.
  2. 2. Gu Q, Zhou J. Subspace maximum margin clustering. In: CIKM; 2009. p. 1337–1346.
  3. 3. Costeira J, Kanade T. A Multibody Factorization Method for Independently Moving Objects. International Journal of Computer Vision. 1998;29(3):159–179.
  4. 4. Vidal R, Ma Y, Sastry S. Generalized principal component analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2005;27(12):1945—1959. pmid:16355661
  5. 5. Fischler MA, Bolles RC. Random Sample Consensus: A Paradigm for Model Fitting with Applications To Image Analysis and Automated Cartography. Communications of the Acm. 1980;24(6):381–395.
  6. 6. Kuhn KW. The Hungarian Method for the assignment problem. In: Naval research logistics quarterly; 1955. p. 83–97. https://doi.org/10.1002/nav.3800020109
  7. 7. Elhamifar E, Vidal R. Sparse Subspace Clustering: Algorithm, Theory, and Applications. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2013;35(11):2765–81. pmid:24051734
  8. 8. Liu G, Lin Z, Yan S, Sun J, Yu Y, Ma Y. Robust Recovery of Subspace Structures by Low-Rank Representation. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2013;35(1):171–184. pmid:22487984
  9. 9. Wang S, Yuan X, Shen J, Yao T, Yan S. Efficient Subspace Segmentation via Quadratic Programming. In: AAAI; 2011.
  10. 10. Luo D, Nie F, Ding C, Huang H. Multi-Subspace Representation and Discovery. Machine Learning and Knowledge Discovery in Databases. 2011;6912(1):405–420.
  11. 11. Lu CY, Min H, Zhao ZQ, Zhu L, Huang DS, Yan S. Robust and Efficient Subspace Segmentation via Least Squares Regression. In: ECCV; 2012. p. 347–360.
  12. 12. Zhang Z, Zhao M, Chow TWS. Binary- and Multi-class Group Sparse Canonical Correlation Analysis for Feature Extraction and Classification. IEEE Transactions on Knowledge and Data Engineering. 2013;25(10):2192–2205.
  13. 13. Hu H, Lin Z, Feng J, Zhou J. Smooth Representation Clustering. In: CVPR; 2014. p. 3834–3841.
  14. 14. Nie F, Wang X, Huang H. Clustering and projected clustering with adaptive neighbors. In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2014. p. 977–986.
  15. 15. Gu Q, Zhou J. Learning the Shared Subspace for Multi-task Clustering and Transductive Transfer Classification. In: ICDM; 2009. p. 159–168.
  16. 16. Gui J, Sun Z, Jia W, Hu R, Lei Y, Ji S. Discriminant sparse neighborhood preserving embedding for face recognition. Pattern Recognition. 2012;45(8):2884–2893.
  17. 17. Cao X, Zhang C, Fu H, Liu S, Zhang H. Diversity-induced Multi-view Subspace Clustering. In: CVPR; 2015. p. 586–594.
  18. 18. Gao H, Nie F, Li X, Huang H. Multi-view Subspace Clustering. In: ICCV; 2015. p. 4238–4246.
  19. 19. Zhang C, Fu H, Liu S, Liu G, Cao X. Low-Rank Tensor Constrained Multiview Subspace Clustering. In: ICCV; 2015. p. 1582–1590.
  20. 20. Nie F, Huang H, Cai X, Ding CHQ. Efficient and Robust Feature Selection via Joint l2,1-Norms Minimization. In: NIPS; 2010. p. 1813–1821.
  21. 21. Tao H, Hou C, Nie F, Jiao Y, Yi D. Effective Discriminative Feature Selection With Nontrivial Solution. IEEE Transactions on Neural Networks and Learning Systems. 2015;27(4):3013–3017.
  22. 22. Zhang Z, Yan S, Zhao M. Pairwise sparsity preserving embedding for unsupervised subspace learning and classification. IEEE Transactions on Image Processing A Publication of the IEEE Signal Processing Society. 2013;22(12):4640–51. pmid:23955747
  23. 23. Ng AY, Jordan MI, Weiss Y. On Spectral Clustering: Analysis and an algorithm. Proceedings of Advances in Neural Information Processing Systems. 2002;14:849–856.
  24. 24. Wu B, Liu Z, Wang S, Hu BG, Ji Q. Multi-label learning with missing labels. In: ICPR; 2014. p. 1964–1968.
  25. 25. He X, Niyogi P. Locality Preserving Projections (LPP). Advances in Neural Information Processing Systems. 2005;45(1):186–197.
  26. 26. Nie F, Wang X, Jordan MI, Huang H. The Constrained Laplacian Rank Algorithm for Graph-Based Clustering. In: AAAI; 2016.
  27. 27. Zelnik-Manor L. Self-Tuning Spectral Clustering. Advances in Neural Information Processing Systems. 2004;17:1601–1608.
  28. 28. Zhang Z, Chow TW, Zhao M. M-Isomap: Orthogonal Constrained Marginal Isomap for Nonlinear Dimensionality Reduction. IEEE Transactions on Systems Man and Cybernetics Part B Cybernetics A Publication of the IEEE Systems Man and Cybernetics Society. 2012;43(1):180–191.
  29. 29. Gui J, Jia W, Zhu L, Wang SL, Huang DS. Locality preserving discriminant projections for face and palmprint recognition. Neurocomputing. 2010;73(13–15):2696–2707.
  30. 30. Kumar A, Rai P, Daumé H. Co-regularized Multi-view Spectral Clustering. In: NIPS; 2011. p. 1413–1421.
  31. 31. Li Y, Nie F, Huang H, Huang J. Large-Scale Multi-View Spectral Clustering via Bipartite Graph. In: AAAI; 2015. p. 2750–2756.
  32. 32. Xia T, Tao D, Mei T, Zhang Y. Multiview Spectral Embedding. IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics. 2010;40(6):1438–46. pmid:20172832
  33. 33. Cai X, Nie F, Huang H. Multi-View K-Means Clustering on Big Data. In: IJCAI; 2013.
  34. 34. Karasuyama M, Mamitsuka H. Multiple Graph Label Propagation by Sparse Integration. IEEE Transactions on Neural Networks and Learning Systems. 2012;24(12):1999–2012.
  35. 35. Shen H, Tao D, Ma D. Multiview locally linear embedding for effective medical image retrieval. Plos One. 2013;8(12):82409–82409. pmid:24349277
  36. 36. Bartels RH, Stewart GW. A solution of the matrix equation AX+XB = C. Communications of the ACM. 1972;15(9):820–826.
  37. 37. Nie F, Li J, Li X. Parameter-Free Auto-Weighted Multiple Graph Learning: A Framework for Multiview Clustering and Semi-supervised Classification. In: IJCAI; 2016.
  38. 38. Nie F, Cai G, Li X. Multi-view Clustering and Semi-supervised Classification with Adaptive Neighbours. In: AAAI; 2017.
  39. 39. Nie F, Huang H, Ding C. Low-rank matrix recovery via efficient schatten p-norm minimization. In: AAAI; 2012. p. 655–661.
  40. 40. Gui J, Sun Z, Cheng J, Ji S. How to Estimate the Regularization Parameter for Spectral Regression Discriminant Analysis and its Kernel Version? IEEE Transactions on Circuits and Systems for Video Technology. 2014;24(2):211–223.
  41. 41. Gu Q, Li Z, Han J. Learning a Kernel for Multi-Task Clustering. In: AAAI; 2011.
  42. 42. Hou C, Nie F, Li X, Yi D, Wu Y. Joint embedding learning and sparse regression: a framework for unsupervised feature selection. IEEE Transactions on Cybernetics. 2014;44(6):793–804. pmid:23893760
  43. 43. Gui J, Sun Z, Ji S, Tao D. Feature Selection Based on Structured Sparsity: A Comprehensive Study. IEEE Transactions on Neural Networks and Learning Systems. 2016; p. 1–18. pmid:27116754
  44. 44. Winn J, Jojic N. LOCUS: Learning Object Classes with Unsupervised Segmentation. In: ICCV; 2005. p. 756–763.
  45. 45. Li FF, Fergus R, Perona P. Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories. Computer Vision and Image Understanding. 2004;106(1):178.
  46. 46. Sun S. Multi-view Laplacian Support Vector Machines. Applied Intelligence. 2011;41(4):209–222.
  47. 47. Asuncion A, Newman DJ. UCI Machine Learning Repository Irvine. 2007.
  48. 48. Sindhwani V, Niyogi P, Belkin M. Beyond the point cloud: from transductive to semi-supervised learning. In: ICML; 2005. p.824–831.
  49. 49. Yong J, Grauman K. Foreground Focus: Unsupervised Learning from Partially Matching Images. International Journal of Computer Vision. 2009;85(2):143–166.
  50. 50. Dueck D, Frey BJ. Non-metric affinity propagation for unsupervised image categorization. In: ICCV; 2007. p. 1–8.