Weighted Nuclear Norm Minimization on Multimodality Clustering

Generally, multimodality data contain different potential information and can provide enhanced analytical results compared to monosource data. The way the data are combined plays a crucial role in multimodality data analysis and is worth investigating. Multimodality clustering, which seeks a partition of the data across multiple views, has attracted considerable attention; for example, robust multiview spectral clustering (RMSC) explicitly handles the possible noise in the transition probability matrices associated with different views. The spectral clustering algorithm embeds the input data into a low-dimensional representation by dividing the clustering problem into k subproblems, and the corresponding eigenvalue reflects the loss of each subproblem. The eigenvalues of the Laplacian matrix should therefore be treated differently, whereas RMSC regularizes every singular value equally when recovering the low-rank matrix. In this paper, we propose a multimodality clustering algorithm that recovers the low-rank matrix by weighted nuclear norm minimization. We also propose a method to evaluate the weight vector by learning a shared low-rank matrix. In our experiments, we test our method on several real-world datasets, and the results show that the proposed method performs better than the baselines.


Introduction
Clustering, the task of partitioning data points into multiple clusters, is a fundamental research problem in data mining and machine intelligence. A series of algorithms have been proposed over the past decades [1][2][3][4][5][6][7]. One representative method is spectral clustering, which has many applications [8][9][10][11]. With the development of information and communication technologies, data are produced in most areas, and it is relatively easy to capture several sets of features from a given subject. It is therefore necessary to design new pattern recognition methods that deal with multiple views of the same subjects. For example, in multilingual information retrieval, the same document can be represented in different languages, and each language can be regarded as a view. These individual views provide complementary information to each other, which can lead to improved performance on the learning task. In this context, multimodality clustering seeks better clustering performance by leveraging the information from multiple views.
Many multimodality clustering methods have been proposed in recent years. In general, clustering multiview data X = [X^(1), X^(2), ..., X^(m)] involves three steps [12]:
(1) Obtain a similarity matrix S^(i) from each view X^(i) (i = 1, 2, ..., m)
(2) Compute a projection of each similarity matrix into a space suitable for clustering
(3) Produce a clustering assignment (e.g., by K-means)
The main difference between multimodality clustering methods lies in the step where the information is collapsed to produce a single new representation. The first category (information merges in Step 1) merges S = [S^(1), S^(2), ..., S^(m)] into a new similarity matrix. The method presented in [13] is a Markov chain method for the generalized normalized cut on multimodality data; the method described in [14] uses the philosophy of co-regularization to make the clusterings in different views agree with each other; RMSC, described in [15], is a Markov-chain-based multimodality spectral clustering method via low-rank and sparse decomposition. The second category (information merges in Step 2) merges the information to generate a compatible projection for all views. In [16], the authors used canonical correlation analysis to maximize the correlation of subjects across the projected views. In the third step, spectral clustering produces the assignment by K-means, but this assignment is not stable because of the randomness of K-means, so the third category learns a stable assignment; for example, ensemble clustering methods [17, 18] are designed to find a stable assignment. The standard nuclear norm minimization regularizes each singular value equally to keep the loss function convex, while the singular values have different meanings and should be treated differently. Gu et al. [19] proposed a weighted nuclear norm method and applied it to image denoising.
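The generic three-step pipeline above can be sketched in a few lines of numpy. This is an illustrative single-view sketch only; all function names and the toy data are ours, not from [12].

```python
import numpy as np

def gaussian_similarity(X, sigma2=None):
    """Step 1: similarity matrix S_ij = exp(-||x_i - x_j||^2 / sigma^2)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    if sigma2 is None:
        sigma2 = sq[sq > 0].mean()          # average pairwise distance heuristic
    return np.exp(-sq / sigma2)

def spectral_embedding(S, k):
    """Step 2: embed via the bottom-k eigenvectors of the graph Laplacian."""
    D = np.diag(S.sum(axis=1))
    L = D - S
    vals, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    return vecs[:, :k]

def kmeans(H, k, iters=50):
    """Step 3: Lloyd's k-means with simple farthest-point initialisation."""
    idx = [0]
    for _ in range(k - 1):
        d = np.min([((H - H[i]) ** 2).sum(-1) for i in idx], axis=0)
        idx.append(int(np.argmax(d)))
    C = H[idx]
    for _ in range(iters):
        labels = np.argmin(((H[:, None] - C[None]) ** 2).sum(-1), axis=1)
        C = np.array([H[labels == j].mean(0) if (labels == j).any() else C[j]
                      for j in range(k)])
    return labels

# Two well-separated blobs; spectral clustering recovers them.
X = np.vstack([np.zeros((5, 2)), 10 * np.ones((5, 2))])
labels = kmeans(spectral_embedding(gaussian_similarity(X), 2), 2)
```

The multiview methods discussed below differ only in where the m views are merged into this single-view pipeline.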
The weight vector is evaluated from the singular values of image patches and the noise variance, but this strategy is not applicable to multimodality clustering.
In [15], we presented a Markov-chain-based multimodality spectral clustering method via low-rank and sparse decomposition (RMSC). In this paper, as shown in Figure 1, we extend our previous study by applying the weighted nuclear norm to multimodality clustering and propose a method to evaluate the weight vector. The difference between the two methods is that RMSC recovers the low-rank matrix P by solving a nuclear norm minimization (NNM) problem, while the proposed method recovers it by solving a weighted nuclear norm minimization (WNNM) problem. For the experiments, we use several real-world datasets to test our method. Experimental results show that the proposed method performs better than the baselines. This paper is organized as follows. Section 2 briefly describes the related work on which our method is based. Section 3 explains why WNNM suits multimodality clustering, defines our algorithm, and presents the optimization procedure. Section 4 presents the results of our method and other multimodality clustering methods. Section 5 outlines the main contributions of this work.

Related Work
For clarity, Table 1 summarizes the symbols used in this paper.

Spectral Clustering.
Finding good clusters has been a focus of considerable research in pattern recognition. Spectral clustering applies spectral graph theory [20], which gives the conditions under which a graph can be divided into several disconnected subgraphs. The method embeds the input data into a low-dimensional representation and then applies K-means.

Robust Multimodality Spectral Clustering via Low-Rank and Sparse Decomposition (RMSC).
Consider a set of multimodality data X = [X^(1), X^(2), ..., X^(m)] with X^(i) ∈ R^{d^(i)×n}, where m is the number of views, n is the number of data points, d^(i) is the feature dimension of the i-th view, and the j-th column X^(i)_{.j} of X^(i) represents the features of the j-th data point in the i-th view (j = 1, 2, ..., n; i = 1, 2, ..., m). The first step of RMSC uses a Gaussian kernel to define the similarity matrix, i.e., S_{ij} = exp(−‖x_i − x_j‖²_2 / σ²), where ‖·‖_2 denotes the ℓ_2 norm and σ² is a bandwidth parameter (e.g., one can set σ² to the average Euclidean distance over all pairs of data points). The second step constructs the transition matrix P by P = D^{−1}S, where D is a diagonal matrix with D_{ii} = Σ_{j=1}^{n} S_{ij}. Under the low-rank and sparse assumptions, RMSC formulates the transition matrix construction problem as

min_{P, E^(i)} rank(P) + λ Σ_{i=1}^{m} ‖E^(i)‖_0  s.t.  P^(i) = P + E^(i), P ≥ 0, P1 = 1,  (1)
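RMSC's first two steps for one view can be sketched directly from the formulas above: a Gaussian-kernel similarity matrix followed by the row-stochastic transition matrix P = D^{−1}S. The function name and toy data are illustrative, not from the paper's implementation.

```python
import numpy as np

def transition_matrix(X):
    """One view: S_ij = exp(-||x_i - x_j||^2 / sigma^2), then P = D^{-1} S."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    sigma2 = sq.mean()                      # average pairwise distance heuristic
    S = np.exp(-sq / sigma2)                # similarity matrix S
    D_inv = 1.0 / S.sum(axis=1)             # D is diagonal with D_ii = sum_j S_ij
    return D_inv[:, None] * S               # each row of P sums to 1

X = np.random.default_rng(0).normal(size=(6, 3))
P = transition_matrix(X)
```

Row-stochasticity is exactly the constraint pair P ≥ 0, P1 = 1 that appears in problem (1).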

Figure 1: Framework of robust multimodality spectral clustering via low-rank and sparse decomposition. The proposed method is similar to RMSC [15]; the difference is that RMSC recovers the low-rank matrix P by solving a nuclear norm minimization (NNM) problem, while the proposed method recovers it by solving a weighted nuclear norm minimization (WNNM) problem.
where the ℓ_0 norm ‖E^(i)‖_0 is the number of nonzero elements in E^(i), rank(P) denotes the rank of P, 1 is a vector of all ones, and λ is a trade-off parameter. Note that the constraints P ≥ 0 and P1 = 1 force P to be a transition probability matrix, i.e., each of its rows is a probability distribution.
As this problem is nonconvex, RMSC replaces rank(P) with the trace norm ‖P‖_* and ‖E^(i)‖_0 with the ℓ_1 norm ‖E^(i)‖_1, resulting in the following convex optimization problem:

min_{P, E^(i)} ‖P‖_* + λ Σ_{i=1}^{m} ‖E^(i)‖_1  s.t.  P^(i) = P + E^(i), P ≥ 0, P1 = 1.  (2)

The ℓ_1 norm ‖E^(i)‖_1 = Σ_{(i,j)} |E^(i)_{ij}| is well known to be a convex surrogate of ‖E‖_0. They then propose an optimization procedure that solves this problem via the augmented Lagrangian multiplier (ALM) scheme, which has shown a good balance between efficiency and accuracy in many matrix learning problems.
Let σ_i denote the i-th singular value, with Σ = diag(σ_1, σ_2, ...) in nonascending order. When updating Q, the subproblem is

min_Q (1/μ)‖Q‖_* + (1/2)‖Q − (P + Z/μ)‖²_F.

Let UΣV^T be the SVD of P + Z/μ; the solution is Q = U S_{1/μ}(Σ) V^T, where S_{1/μ} shrinks every singular value by 1/μ. In this optimization procedure, every singular value is increased or decreased by the same amount. RMSC thus treats all singular values equally, which may degrade the clustering performance.
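The singular value thresholding step above admits a compact numpy sketch (function name ours). Note how every singular value is shrunk by the same threshold, which is exactly the behaviour criticised here.

```python
import numpy as np

def svt(A, tau):
    """Singular value thresholding:
    argmin_Q tau * ||Q||_* + 0.5 * ||Q - A||_F^2 = U * max(Sigma - tau, 0) * V^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

# A diagonal matrix makes the uniform shrinkage visible:
A = np.diag([3.0, 1.0, 0.2])
Q = svt(A, 0.5)   # singular values 3, 1, 0.2 become 2.5, 0.5, 0
```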

Weighted Nuclear Norm.
Gu et al. [19] studied the weighted nuclear norm minimization (WNNM) problem, in which the singular values are assigned different weights. The weighted nuclear norm of a matrix X is defined as

‖X‖_{w,*} = Σ_i w_i σ_i(X),

where w_i ≥ 0 is the weight assigned to the i-th singular value σ_i(X). They analyzed the solutions of the WNNM problem under different weight conditions and, when applying WNNM to image denoising, proposed a method to evaluate the weight vector from many image patches. The differences between WNNM and our method are as follows: (1) we extend the weighted nuclear norm to multimodality clustering; (2) the weight vectors are evaluated differently: the former evaluates the weight vector from image patches, while our method evaluates it by matrix decomposition.
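The definition ‖X‖_{w,*} = Σ_i w_i σ_i(X) is a one-liner given the SVD; a small sketch (names ours) makes the contrast with the ordinary nuclear norm explicit.

```python
import numpy as np

def weighted_nuclear_norm(X, w):
    """||X||_{w,*} = sum_i w_i * sigma_i(X), with sigma_i in descending order."""
    s = np.linalg.svd(X, compute_uv=False)   # singular values, descending
    return float(np.dot(w, s))

X = np.diag([4.0, 2.0, 1.0])
# Equal weights recover the ordinary nuclear norm 4 + 2 + 1 = 7;
# unequal weights let us penalise some singular values more than others.
nn = weighted_nuclear_norm(X, np.ones(3))
```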

The Proposed Method: Weighted RMSC
In this section, we present how to apply the weighted nuclear norm to multimodality clustering.
As described in Section 2.2, RMSC treats each singular value equally when updating Q, while for spectral clustering, different eigenvalues of L have different meanings.
According to [23], the RatioCut objective can be written as

RatioCut(A_1, ..., A_k) = Σ_{i=1}^{k} h_i^T L h_i = Tr(H^T L H),

where Tr(·) denotes the trace of a matrix and h_i is the i-th column of H. The relaxed RatioCut problem is therefore

min_H Tr(H^T L H)  s.t.  H^T H = I.  (6)

According to the Rayleigh-Ritz theorem [24], this problem has a closed-form solution: H is constructed from the top k eigenvectors of L, i.e., those with the smallest eigenvalues. From equation (6), we can see that spectral clustering divides the problem into k subproblems. Each subproblem partitions the points into two clusters, and h_i (i = 2, ..., k) is the solution of the i-th subproblem (h_1 is an all-ones vector, which assigns all points to the same cluster; this partition is useless).
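The Rayleigh-Ritz argument can be checked numerically: with H built from the bottom-k eigenvectors of L, Tr(H^T L H) equals the sum of the k smallest eigenvalues. The toy graph below is ours, chosen only for illustration.

```python
import numpy as np

# Adjacency matrix of a small undirected graph.
S = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(S.sum(1)) - S                    # unnormalised graph Laplacian
vals, vecs = np.linalg.eigh(L)               # eigenvalues in ascending order

k = 2
H = vecs[:, :k]                              # bottom-k eigenvectors (H^T H = I)
loss = np.trace(H.T @ L @ H)                 # relaxed RatioCut objective
# loss equals vals[0] + vals[1]; the first eigenvalue is 0 (constant vector h_1).
```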

Security and Communication Networks
As H^T H = I and the columns of H are eigenvectors of L, the RatioCut objective can be rewritten as

Tr(H^T L H) = Σ_{i=1}^{k} λ_i.

So the loss of each subproblem is tied to the corresponding eigenvalue: a small eigenvalue reflects a small loss for its subproblem, so the smaller the eigenvalue λ is, the larger the weight the corresponding eigenvector should receive. It is also known that the random-walk Laplacian satisfies L = I − P. So the larger eigenvalues (or singular values, σ_i = λ_i) of P are more important than the smaller ones when updating P in RMSC: the larger the singular value, the less it should be shrunk. Therefore, the weight assigned to σ_i should be inversely proportional to σ_i. We let

w_i = c / (σ_i + ε),  (10)

where c > 0 is a constant and ε = 10^{−16} avoids division by zero.
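The weighting rule (10) is straightforward to sketch (function name ours; c and ε as in the text). Since the σ_i are in nonascending order, the resulting weights are nondecreasing, so the large, important singular values are shrunk the least.

```python
import numpy as np

def evaluate_weights(sigma, c=1e-5, eps=1e-16):
    """Equation (10): w_i = c / (sigma_i + eps), inversely proportional to sigma_i."""
    return c / (np.asarray(sigma, dtype=float) + eps)

w = evaluate_weights([3.0, 1.0, 0.5])   # descending sigmas -> ascending weights
```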
For multimodality clustering, we can construct m Laplacian matrices, leading to m groups of singular values σ_i that are not equal. Evaluating accurate σ_i is therefore challenging, which makes it difficult to determine the weight of each singular value. As we know, the output of RMSC is a shared low-rank matrix, and all the views share its singular values. So one way to evaluate the singular values is to use another multimodality clustering algorithm, such as RMSC, to obtain the shared Laplacian matrix and then evaluate the final Σ from the singular values of that shared matrix.
Following RMSC, under the low-rank and sparse assumptions, we formulate the transition matrix construction problem as

min_{P, E^(i)} ‖P‖_{w,*} + λ Σ_{i=1}^{m} ‖E^(i)‖_1  s.t.  P^(i) = P + E^(i), P ≥ 0, P1 = 1.  (11)

The optimization problem (11) is still challenging because the matrix P carries two constraints. We introduce an auxiliary variable Q, and problem (11) becomes

min_{P, Q, E^(i)} ‖Q‖_{w,*} + λ Σ_{i=1}^{m} ‖E^(i)‖_1  s.t.  P^(i) = P + E^(i), P ≥ 0, P1 = 1, P = Q.  (12)

The corresponding augmented Lagrangian function of (12) is

L(P, Q, E^(i)) = ‖Q‖_{w,*} + λ Σ_{i=1}^{m} ‖E^(i)‖_1 + 〈Z, P − Q〉 − Σ_{i=1}^{m} 〈Y^(i), P^(i) − P − E^(i)〉 + (μ/2)(‖P − Q‖²_F + Σ_{i=1}^{m} ‖P^(i) − P − E^(i)‖²_F),  (13)

where Z and Y^(i) are the Lagrange multipliers, 〈·, ·〉 denotes the matrix inner product (for two matrices A and B, 〈A, B〉 = Tr(A^T B)), and μ > 0 is an adaptive penalty parameter. The sketch of the proposed algorithm is shown in Algorithm 2. Next, we present the update rules for P, Q, and each E^(i).
When the other variables are fixed, the subproblem with respect to Q is

min_Q ‖Q‖_{w,*} + (μ/2)‖Q − (P + Z/μ)‖²_F.  (14)

More specifically, let UΣV^T be the SVD of P + Z/μ. We use RMSC to evaluate the final Σ and use it to compute W via equation (10). According to [19], the solution to (14) is

Q = U S_{w/μ}(Σ) V^T,

where S_{w/μ} shrinks the i-th singular value by w_i/μ. The subproblem with respect to E^(i) (i = 1, 2, ..., m) simplifies to

min_{E^(i)} λ‖E^(i)‖_1 + (μ/2)‖E^(i) − (P^(i) − P − Y^(i)/μ)‖²_F,

which has the closed-form solution E^(i) = S_{λ/μ}(P^(i) − P − (Y^(i)/μ)).
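Both closed-form solutions above are a few lines of numpy. The sketch below (names ours) shows the weighted singular value thresholding for Q, which is valid when the weights are nondescending as [19] requires, and the elementwise soft-thresholding operator S_{λ/μ} used for the E^(i) update.

```python
import numpy as np

def weighted_svt(A, w):
    """Q = U * max(Sigma - diag(w), 0) * V^T, with w nondescending."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - w, 0.0)) @ Vt

def soft_threshold(A, tau):
    """Elementwise S_tau(A) = sign(A) * max(|A| - tau, 0)."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

A = np.diag([3.0, 1.0])
Q = weighted_svt(A, np.array([0.1, 0.9]))    # large sigma gets the small weight
E = soft_threshold(np.array([2.0, -0.3]), 0.5)
```

Unlike the uniform shrinkage in RMSC, the large singular value here loses only 0.1 while the small one loses 0.9.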
With the other variables fixed, we update P by solving the corresponding quadratic subproblem under the constraints P ≥ 0, P1 = 1. As in RMSC, this problem decomposes into n independent subproblems, one per row. Each subproblem is a proximal operator problem with a probability simplex constraint, which can be solved efficiently by a projection algorithm.
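The row-wise building block of the P update is Euclidean projection onto the probability simplex. Below is the standard sorting-based projection (a generic sketch with our own names, not necessarily the paper's exact routine).

```python
import numpy as np

def project_simplex(v):
    """Return argmin_{p >= 0, sum(p) = 1} ||p - v||_2 (sorting-based algorithm)."""
    u = np.sort(v)[::-1]                     # sort entries in descending order
    css = np.cumsum(u)
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / k > 0)[0][-1]   # last index kept positive
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

row = np.array([0.6, 0.3, -0.2])             # one row of the unconstrained update
p = project_simplex(row)                     # a valid probability distribution
```

Applying this to each of the n rows independently enforces P ≥ 0 and P1 = 1.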

Experimental Setup
The proposed method was tested on several real-world datasets; their details are shown in Table 2.
In all the experiments, we use six metrics to measure clustering performance: F-score, precision, recall, normalized mutual information (NMI) [25], entropy, and the adjusted Rand index (Adj-RI) [26]. Higher values indicate better performance for all metrics except entropy.
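As one concrete example of the metrics listed above, the adjusted Rand index [26] can be computed from the contingency table in a few lines; the sketch below (names ours) is self-contained, though in practice standard library implementations are used.

```python
import numpy as np

def adjusted_rand_index(labels_true, labels_pred):
    """ARI = (Index - Expected) / (Max - Expected), from the contingency table."""
    a = np.asarray(labels_true)
    b = np.asarray(labels_pred)
    _, ai = np.unique(a, return_inverse=True)
    _, bi = np.unique(b, return_inverse=True)
    C = np.zeros((ai.max() + 1, bi.max() + 1))
    np.add.at(C, (ai, bi), 1)                # contingency table
    comb2 = lambda x: x * (x - 1) / 2.0      # "n choose 2", elementwise
    sum_ij = comb2(C).sum()
    sum_a = comb2(C.sum(axis=1)).sum()
    sum_b = comb2(C.sum(axis=0)).sum()
    expected = sum_a * sum_b / comb2(len(a))
    max_index = 0.5 * (sum_a + sum_b)
    return (sum_ij - expected) / (max_index - expected)

# A perfect clustering scores 1.0 even if the cluster IDs are permuted.
score = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
```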
When evaluating the weight vector, there is a constant parameter c; we set c = 0.00001 in all the experiments. Similarity matrices are constructed with Gaussian kernels. σ² is set to the median Euclidean distance between every pair of data points for all datasets except BBCSports (σ² = 100). λ is set to 0.005.

Experimental Results
We chose the following six multimodality clustering algorithms as baselines: (1) single view: performing spectral clustering on a single view; (2) feature concatenation: concatenating the features of all views and then performing spectral clustering; (3) kernel addition: combining the similarity matrices of all views by addition; (4) Markov chains: the Markov chain method for the generalized normalized cut, with chains defined on each view [13]; (5) co-regularized spectral clustering (Co-Reg): making use of the philosophy of co-regularization to make the clusterings in different views agree with each other [14]; (6) RMSC: robust multimodality spectral clustering via low-rank and sparse decomposition [15].
Following the settings in [14], we use a Gaussian kernel to construct the similarity matrix for each view when needed in all algorithms. Table 3 shows the results of the proposed method and the baselines on BBCSports. As can be seen, the proposed method outperforms the baselines on all six metrics; its results represent relative improvements of 4.82%, 2.88%, 2.10%, and 6.58% in F-score, precision, NMI, and Adj-RI, respectively, over the corresponding second-best baseline. Table 4 shows the results on UCI. Again, the proposed method outperforms the baselines on all six metrics, with relative improvements of 1.35%, 1.62%, 1.32%, and 1.647.59% in F-score, precision, NMI, and Adj-RI, respectively, over the corresponding second-best baseline. Table 5 shows the results on WebKb. The proposed method outperforms the baselines on most of the six metrics, with relative improvements of 0.32%, 2.46%, and 0.82% in F-score, NMI, and Adj-RI, respectively, over the corresponding second-best baseline; although its precision is lower than that of kernel addition, the difference is small. Table 6 shows the results on Reuters. The proposed method outperforms the baselines on all six metrics, with relative improvements of 1.89%, 3.56%, 5.85%, and 4.74% in F-score, precision, NMI, and Adj-RI, respectively, over the corresponding second-best baseline; although its recall is lower than that of kernel addition and feature concatenation, the difference is small.

ALGORITHM 2: Weighted nuclear norm on robust multimodality clustering. Input: λ, P^(i) ∈ R^{n×n} (i = 1, 2, ..., m).

Conclusion
With the development of information and communication technologies, it is necessary to design new pattern recognition methods to deal with multiple views of the same subjects, and handling multimodality problems is a challenging task. Inspired by previous work, we proposed a method that applies the weighted nuclear norm to RMSC and gave a method to evaluate the weight vector, which distinguishes different singular values. To solve the optimization problem, we designed a procedure based on ALM. To evaluate the proposed method, we applied it to four real-world datasets. Experimental results show that the proposed method performs better than the baselines. In the future, we will continue our studies in multimodality clustering, including evaluating the weight vector more accurately and clustering on large-scale datasets.

Data Availability
The BBCSports dataset used to support the findings of this study is available in the UCD repository (http://mlg.ucd.ie/datasets/bbc.html), and the other datasets are available in the UCI repository (http://archive.ics.uci.edu/ml/index.php).

Conflicts of Interest
The authors declare that they have no conflicts of interest.