Multi-view intrinsic low-rank representation for robust face recognition and clustering

In recent years, subspace-based multi-view face recognition has attracted increasing attention and many related methods have been proposed. However, most existing methods ignore the specific local structure of different views. This drawback can cause these methods' discriminating ability to degrade when many noisy samples exist in the data. To tackle this problem, a multi-view low-rank representation method is proposed, which exploits both the intrinsic relationships and the specific local structures of different views simultaneously. This is achieved through hierarchical Bayesian methods that constrain the low-rank representation of each view to match a linear combination of an intrinsic representation matrix and a specific representation matrix, thereby obtaining the common and specific characteristics of different views. The intrinsic representation matrix holds the consensus information between views, and the specific representation matrices capture the diversity among views. Furthermore, the model injects a clustering structure into the low-rank representation, allowing the clustering structure to be adjusted adaptively while the low-rank representation is optimized. Hence, the model can explicitly capture both the relationships between data points and the clustering structure. Extensive experiments on several datasets demonstrate the effectiveness of the proposed method compared to similar state-of-the-art methods in classification and clustering.


INTRODUCTION
views. In [5], Xia et al. attempted to separate noise from graphs through a low-rank transition probability matrix. In [6], Huang et al. performed face recognition through an embedding graph learned simultaneously from different views. In [7], Cao et al. proposed a constrained multi-view method, which applies pairwise constraints to facilitate finding a consistent graph. In [8], Li et al. attained high accuracy and low computational complexity by using a bipartite graph to approximate the similarity graphs of different views. Nonetheless, these methods, built on traditional spectral learning [9][10][11], do not utilize the self-expressiveness property of data and can therefore be susceptible to noise in the different views.
Recently, multi-view subspace methods based on Low-Rank Representation (LRR) [12] and Sparse Subspace Clustering (SSC) [13] have become increasingly popular, because both baseline methods exploit the self-expressiveness property of data, representing each data point as a linear combination of the others to learn a computationally efficient data representation. Even so, studies such as [14] suggested that LRR provides a more robust way to tackle the above drawback, since it captures the global structure of the data. Thus, many LRR-inspired methods have been proposed over the years. For instance, Structure-Constrained Low-Rank Representation [15] analyzed the structure of multiple disjoint subspaces, which is more general for computer vision data. Latent Low-Rank Representation (LatLRR) [16] was proposed for latent feature extraction in images using subspace segmentation. Robust multi-view subspace learning [17] used dual low-rank decompositions to uncover local manifold structures. Multi-Spectral Low-Rank Structured Dictionary Learning (MLSDL) [18] applied a low-rank structural incoherence term to reduce the discrepancy between different views. Low-Rank Graph Optimization for Multi-View Dimensionality Reduction (LRGO-MVDR) [19] used an adaptive non-negative weight vector to capture the complementary structure in different face images. Common Subspace based Low-Rank and Joint Sparse Representation (CSLRJSR) [20] found a common face representation fusing discriminative features of different views through a joint low-rank and sparse method. Despite these efforts, two main drawbacks limit the overall performance of subspace methods. First, because subspace methods assume that similar data points stay near each other, they often do not perform well in practical cases where the distribution of data points within a subspace is arbitrary due to excessive noise.
This weakness arises because two unrelated data points may remain connected, which is all the more likely given that these methods use a fixed shared representation. Second, most of these methods consider only the consistency of multi-view data, or enhance only its diversity as in [21], so the common structure and the view-specific structure can easily become entangled and degrade recognition accuracy.
Regarding the above limitations, we propose a novel method that integrates the clustering structure into low-rank representation to pursue an optimized multi-view intrinsic representation, which is subsequently used for face recognition. To achieve this, we first follow the intuition of hierarchical Bayes [22][23][24] by decomposing the low-rank representation learned for each view into two matrices: the intrinsic representation and the specific representation. The intrinsic representation matrix captures the common features of different views, while the specific representation matrix captures the specific features of each view through diversity regularization. The core of our idea is then to impose a spectral clustering structure on our model to guarantee a better representation, free of redundant view-specific information. In other words, an affinity matrix is introduced into our model that captures the clustering structure explicitly, so that the clustering structure is integrated into the low-rank representation; this allows our method to adaptively find a better intrinsic representation for face recognition while reducing the impact of noisy data. Therefore, our model can be used for face recognition and clustering simultaneously. Figure 1 shows the framework of our proposed method.
The main contributions of our work are: 1. We propose a novel method where a low-rank representation learned for each view is composed of two low-rank components: an intrinsic representation across views and a specific view representation in each view. The intrinsic representation captures the common structure across all views, while the specific representation captures the diversity among views. With these structures, we can discover the correlation and discrepancy in multi-view face data comprehensively.
2. Furthermore, different from the existing methods which utilize a fixed representation of multi-view face data, our method obtains an intrinsic representation of the data flexibly. We achieve this by integrating a clustering structure into the low-rank representation to allow the clustering structure to be adjusted adaptively while pursuing an ideal low-rank face representation that will suppress the effect of noisy data. As a result, our proposed model can be used for face recognition and clustering simultaneously.
3. We conduct experiments on face benchmark datasets, and the experimental results demonstrate the effectiveness of our proposed method.

RELATED WORK
Many studies have been conducted in search of an efficient technique that can explore the complementary information in multi-view face data to improve recognition accuracy. Preliminary methods such as feature concatenation [1][2][3] and spectral graph methods [4][5][6][7][8] are not computationally efficient, and they do not provide robustness against noise. Recently, much more attention has been paid to the subspace approach because it provides a mechanism to tackle the above problems. So far, LRR [12] and SSC [13] are the two most effective approaches in subspace segmentation, and both utilize the self-expressiveness property of data [25]. SSC, however, does not capture the global structure of the data, because it uses the L1 norm to find a sparse representation of each data point individually, which makes SSC susceptible to noise. LRR, on the other hand, provides an efficient tool to recover the actual manifold structure from corrupted data by using the data itself as the dictionary (self-expressiveness matrix) to seek a representation matrix with the lowest rank [26].

FIGURE 1
Framework of our proposed method. For the multi-view data matrices X_i, we obtain the intrinsic low-rank matrix Z_0 via the diversity constraint and the affinity matrix W via a k-nearest neighbor graph. Z_0 holds the intrinsic information shared across views, and Z_i preserves the differences between views. Then, through a dynamic interplay between the affinity matrix and the intrinsic low-rank matrix, the local manifold and the clustering structure are learned simultaneously.
To illustrate further, assume a set of data points X = [x_1, …, x_n] ∈ R^(d×n), where n is the number of data points and d is the dimensionality of each. LRR represents each data point as a linear combination of the others to capture the global structure of the data, that is, x_i = X p_i for i = 1, …, n, where P = [p_1, …, p_n] ∈ R^(n×n) is the self-expressiveness matrix, which can be found by solving

min_P rank(P)   s.t.   X = XP.   (1)

However, Equation (1) is hard to solve due to the discrete nature of the rank function. Instead, the nuclear norm provides a good convex surrogate:

min_P ||P||_*   s.t.   X = XP.   (2)

For a data point x_i belonging to one of the k classes, its connection to every other data point x_j carries the weight p_ij, which can be seen as the similarity between them. In other words, the degree of correlation between x_i and x_j is characterized by p_ij, with closer data points receiving larger weights. Therefore, P can correctly depict the local manifold structure of the data when similar data points reconstruct one another.
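To make the self-expressiveness property concrete, here is a minimal numpy sketch (an illustration, not part of the original formulation): for data drawn from a union of low-dimensional subspaces, a low-rank solution of X = XP is given by the pseudoinverse, and its rank equals the total subspace dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
# 40 points drawn from two 3-dimensional subspaces of R^30
d, n_per = 30, 20
X = np.hstack([rng.standard_normal((d, 3)) @ rng.standard_normal((3, n_per))
               for _ in range(2)])          # d x n data matrix, n = 40

# The minimum-rank solution of X = XP is P = pinv(X) @ X;
# its rank equals rank(X), the total subspace dimension.
P = np.linalg.pinv(X) @ X
print(np.linalg.matrix_rank(P))             # 6 = 3 + 3
print(np.allclose(X @ P, X))                # True: self-expressiveness holds
```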
Considering the robustness of LRR, many methods for multi-view face representation have recently been proposed based on its principles. In [18], Jing et al. learned common and separate dictionaries; the main idea is to use these dictionaries to explore the correlated and complementary information simultaneously, finding a robust face representation that enhances face recognition. In [28], Xu et al. proposed a structured low-rank matrix recovery method to address the large divergence in multi-view face data. This approach eliminates graph embedding by introducing a class label matrix to learn a discriminative unified matrix; yet it fails to consider the specific local structure of different views, so the problem may persist. In [17], Ding et al. used a dual low-rank decomposition to separate local structures from the global structure. However, the dual minimization of singular values may leave the face representation matrix with high rank, since the rank may not be approximated correctly in practice [29]. Alternatively, Zheng et al. [30], Brbic and Kopriva [27], and Wang et al. [20] combined the ideas of SSC and LRR to find a joint subspace representation that is robust against large-scale noise. In [31], Meng et al. proposed a Multi-View Low-Rank Preserving Embedding (MvLPE) method, which learns the complementary features of different views by maximizing the agreement between each individual view and a centroid view. Nonetheless, these methods equally ignore the specific local structure of different views, as does [28]. Therefore, inspired by [17], we utilize the L1 norm to capture the specific local structure of each view, so that the common face representation is learned cooperatively, guaranteeing a better solution that adaptively suppresses the effect of noisy data.

THE PROPOSED APPROACH
In this section, we present a novel multi-view learning algorithm, which can capture the relationship among data and unfold the clustering structure at the same time. The notations are summarized in Table 1.

Structure consistency
LRR is a promising method that builds a good affinity matrix via self-expression. However, it ignores important data locality. To better exploit the local manifold structure, we follow the intuition of hierarchical Bayes [22][23][24] and assume that each P_i can be decomposed into two parts: an intrinsic structure shared across views and a diversity structure specific to each view. We use the intrinsic matrix Z_0 to capture the general structure across different views and Z_i to capture the specific features of view i, modelled by the constraint P_i = Z_0 + Z_i. We envision that the parts unique to each view are relatively sparse, so we impose the l1 norm on Z_i, leading to the problem below:

min_{P_i, Z_0, Z_i, E_i}  Σ_i ( ||P_i||_* + ||E_i||_{2,1} ) + λ1 Σ_i ||Z_i||_1
s.t.  X_i = X_i P_i + E_i,  P_i = Z_0 + Z_i,   (3)

where E_i models the sample-specific noise in view i and λ1 balances the sparsity of the view-specific parts.

Diversity regularization
Multi-view face data contain view-specific complementary features and generally shared information at the same time. Accordingly, the specific feature matrices should differ from one another; likewise, Z_0 and Z_i should be different and complementary. In other words, each view should have little similarity to the others and only a weak connection with the common features. To achieve this, we introduce a diversity regularization formulated as

Θ(Z_u, Z_v) = Σ_{jk} |(Z_u)_{jk} (Z_v)_{jk}|,   (4)

where Z_u and Z_v are the specific representations of two different views. According to Equation (4), when Θ(Z_u, Z_v) is minimized, at each position at most one of (Z_u)_{jk} and (Z_v)_{jk} can be large while the other must be small. Therefore, we can use it to enforce the differences between views. We formulate the diversity regularization in our model as

Ω(Z_1, …, Z_m) = Σ_{v=1}^{m} Σ_{u≠v} Θ(Z_u, Z_v).   (5)
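A minimal numeric sketch of this behaviour, assuming the elementwise-product form of Θ described above: the penalty is large when two specific representations are active at the same positions and vanishes when their supports are disjoint.

```python
import numpy as np

def diversity(Zu, Zv):
    # Penalizes positions where both representations are simultaneously
    # large; driving this to zero forces complementary supports.
    return np.abs(Zu * Zv).sum()

Zu = np.array([[0.9, 0.0], [0.0, 0.8]])
Zv_overlap  = np.array([[0.7, 0.0], [0.0, 0.6]])  # same support -> large penalty
Zv_disjoint = np.array([[0.0, 0.7], [0.6, 0.0]])  # disjoint support -> zero
print(diversity(Zu, Zv_overlap))    # ≈ 1.11
print(diversity(Zu, Zv_disjoint))   # 0.0
```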

Integrating spectral clustering structure
Self-representation is a good way to capture the local manifold structure: the representation coefficient is larger when two samples are similar and smaller when they are different. However, in many real-world applications similar samples may belong to different classes, so self-representation alone can mislead the clustering task, since samples with different labels may be clustered together. Directly using the self-representation of data for clustering therefore does not always give ideal results, because the local manifold captured by self-representation is not an ideal clustering structure. Basically, an ideal graph should ensure that samples with the same label are strongly connected while samples from different classes are weakly connected or not connected at all. Therefore, to separate the samples into exactly k clusters, we need a clustering structure with k connected components; in this way, we avoid strong connections between samples with different labels. To this end, we impose a rank constraint on the Laplacian matrix derived from the affinity matrix so that the affinity matrix becomes an ideal clustering structure. This rank constraint guarantees that the affinity matrix has exactly k connected components, equal to the number of clusters. Meanwhile, the estimated affinity matrix dynamically approximates the local manifold captured by self-representation to exploit the structural information in the data. To solve the problem, we define an ideal affinity matrix W = [w_1, …, w_n]^T ∈ R^(n×n) that not only utilizes the self-representation structure but is also an ideal clustering structure. Theorem 1. If the affinity matrix W is nonnegative, then the multiplicity k of the eigenvalue zero of the Laplacian matrix L equals the number of connected components in the graph associated with W.
Given an affinity matrix W, Theorem 1 indicates that if rank(L) = n − k, then W is an ideal affinity matrix that separates the data points into exactly k clusters. Moreover, we want W to dynamically approximate Z_0, which captures the local manifold structure; thus, we impose an F-norm term to minimize the discrepancy between Z_0 and W. This yields the following formulation:

min  Σ_i ( ||P_i||_* + ||E_i||_{2,1} ) + λ1 Σ_i ||Z_i||_1 + λ2 Σ_{v} Σ_{u≠v} Θ(Z_u, Z_v) + ||Z_0 − W||_F^2
s.t.  X_i = X_i P_i + E_i,  P_i = Z_0 + Z_i,  w_i^T 1 = 1,  w_ij ≥ 0,  rank(L) = n − k.   (6)
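Theorem 1 can be verified numerically on a small block-diagonal affinity matrix (a toy check, not part of the method): a graph with two connected components has a Laplacian with a two-dimensional null space, so rank(L) = n − k.

```python
import numpy as np

# Block-diagonal affinity: two connected components of sizes 3 and 2.
W = np.zeros((5, 5))
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)

L = np.diag(W.sum(axis=1)) - W          # unnormalized graph Laplacian
eigvals = np.linalg.eigvalsh(L)
k = np.sum(eigvals < 1e-10)             # multiplicity of eigenvalue zero
print(k)                                # 2 connected components
print(np.linalg.matrix_rank(L))         # 3 = n - k
```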

Optimization
We propose an optimization method to approximately solve the objective function in Equation (6), applying the Augmented Lagrange Multiplier (ALM) method [32,33] to iteratively update all the variables. We denote by σ_i(L) the i-th smallest eigenvalue of L. Because L is positive semidefinite, σ_i(L) ≥ 0, and the rank constraint rank(L) = n − k is equivalent to Σ_{i=1}^{k} σ_i(L) = 0. The rank constraint in Equation (6) can therefore be removed in favour of a penalty term with a sufficiently large weight λ3, and problem (6) can be rewritten as

min  J(P_i, Z_i, Z_0, E_i, W) + 2 λ3 Σ_{i=1}^{k} σ_i(L),   (7)

where J denotes the objective of Equation (6). When λ3 is large enough, the term Σ_{i=1}^{k} σ_i(L) becomes 0 and the constraint rank(L) = n − k is satisfied. According to Ky Fan's theorem [19], Σ_{i=1}^{k} σ_i(L) = min_{F ∈ R^(n×k), F^T F = I} Tr(F^T L F), so (7) can be rewritten as

min  J(P_i, Z_i, Z_0, E_i, W) + 2 λ3 Tr(F^T L F)   s.t.   F ∈ R^(n×k),  F^T F = I.   (8)

We introduce auxiliary variables U_i = P_i to make the objective function separable and easy to solve. Then, we use the Augmented Lagrange Multiplier (ALM) method [34] to obtain the augmented Lagrangian function, in which each equality constraint is augmented with a Lagrange multiplier and a quadratic penalty weighted by μ.
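Ky Fan's theorem invoked here can be checked numerically: for a positive semidefinite matrix, the sum of its k smallest eigenvalues equals the minimum of Tr(F^T L F) over orthonormal F, attained at the corresponding eigenvectors. A small numpy sketch (illustration only, not part of the algorithm):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
L = A @ A.T                              # symmetric PSD, stands in for the Laplacian

k = 2
eigvals, eigvecs = np.linalg.eigh(L)     # eigenvalues in ascending order
F = eigvecs[:, :k]                       # eigenvectors of the k smallest eigenvalues
# Ky Fan: sum of the k smallest eigenvalues equals min Tr(F^T L F), F^T F = I
print(np.isclose(np.trace(F.T @ L @ F), eigvals[:k].sum()))   # True

Q, _ = np.linalg.qr(rng.standard_normal((6, k)))  # arbitrary orthonormal F
print(np.trace(Q.T @ L @ Q) >= eigvals[:k].sum()) # True: eigenvectors minimize
```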
Step 1: Solving U. We update the variable U by fixing the other variables and dropping the irrelevant terms, which leaves the subproblem in Equation (10) with respect to U. We solve it using Singular Value Thresholding [35] to obtain the closed-form solution, where S_τ[·] is the shrinkage-thresholding operator defined element-wise by S_τ[x] = sign(x) · max(|x| − τ, 0).

Step 2: Solving E. Similar to the U subproblem, the optimization problem with respect to E is an l2,1-norm proximal step. Based on Lemma 3.3 in [36], its optimal solution shrinks each column of the input matrix toward zero.

Step 3: Solving P_i. Fixing the other variables, the subproblem for P_i is a smooth quadratic; setting its derivative with respect to P_i to zero yields a closed-form solution.

Step 4: Solving Z_i. According to the above subproblems, the subproblem for Z_i reduces to an element-wise proximal problem solved by the shrinkage operator. Defining z as an element of Z_i, the solution in Equation (17) is obtained element-wise as z = H_{λ1/μ}(−a/μ), where H_τ(·) denotes the shrinkage operator.
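The two proximal operators used in Steps 1, 2, and 4 — singular value thresholding for the nuclear norm and column-wise shrinkage for the l2,1 norm — can be sketched as follows (a minimal illustration, not the full update loop):

```python
import numpy as np

def soft_threshold(A, tau):
    # Elementwise shrinkage operator S_tau[x] = sign(x) * max(|x| - tau, 0)
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def svt(A, tau):
    # Singular Value Thresholding: closed-form minimizer of
    # tau * ||U||_* + 0.5 * ||U - A||_F^2   (Step 1)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(soft_threshold(s, tau)) @ Vt

def l21_shrink(A, tau):
    # Column-wise shrinkage: closed-form minimizer of
    # tau * ||E||_{2,1} + 0.5 * ||E - A||_F^2   (Step 2)
    norms = np.linalg.norm(A, axis=0)
    scale = np.maximum(norms - tau, 0.0) / np.maximum(norms, 1e-12)
    return A * scale

A = np.diag([3.0, 1.0, 0.2])
print(np.linalg.matrix_rank(svt(A, 0.5)))  # 2: the smallest singular value is zeroed
```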
Step 5: Solving Z_0. By fixing the other variables and removing the irrelevant terms, we rewrite the subproblem for Z_0; setting its derivative with respect to Z_0 to zero gives the closed-form solution. Step 6: Solving W. We update W by solving the problem in Equation (20). Since Equation (20) is independent for different i, we can solve the corresponding problem (21) for each row w_i.
Denoting t_ij = ||f_i − f_j||_2^2 and letting t_i be the vector whose j-th element is t_ij, Equation (21) can be written in the form of Equation (22) [37,38]. The Lagrangian function of Equation (22) is given in Equation (23), where η and β_i are the Lagrange multipliers. Taking the derivative of Equation (23) with respect to w_i and setting it to zero gives Equation (24), and according to the KKT conditions the optimal solution w_i is given by Equation (25). We use the Euclidean projection onto the simplex [39] to solve Equation (25). Finally, the whole algorithm is summarized in Algorithm 1.
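The simplex projection used in this last step can be sketched with the standard sorting-based routine of [39] (a generic implementation, not necessarily the authors' exact one): it finds the closest point to a given vector among all nonnegative vectors summing to one.

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection of v onto {w : w >= 0, sum(w) = 1}
    u = np.sort(v)[::-1]                 # sort in descending order
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

w = project_simplex(np.array([0.5, 1.2, -0.3]))
print(w)              # nonnegative, sums to 1
print(w.sum())        # 1.0
```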

Model complexity
In this subsection, we analyze the complexity of our proposed model. The first step is to solve the problem in Equation (10), which requires a singular value decomposition of an n × n matrix for each view and therefore costs O(n^3) per view; this dominates the per-iteration complexity of Algorithm 1.

EXPERIMENTS
In this section, we first describe the three face datasets, the six comparable state-of-the-art methods, and the general experimental settings. We then present experimental results and analysis for face recognition and clustering. The final clustering labels can be obtained directly from the affinity matrix W returned by Algorithm 1, because each connected component corresponds to one cluster and can be identified with a connected-component algorithm.
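For instance, the label read-out can be sketched with scipy's connected-components routine on a toy block-structured affinity (an illustration, not the authors' implementation):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# A learned affinity with k = 2 connected components (block structure)
W = np.array([[0.0, 0.8, 0.6, 0.0, 0.0],
              [0.8, 0.0, 0.7, 0.0, 0.0],
              [0.6, 0.7, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0, 0.9],
              [0.0, 0.0, 0.0, 0.9, 0.0]])

# Each connected component of the graph of W is one cluster
n_clusters, labels = connected_components(csr_matrix(W), directed=False)
print(n_clusters)   # 2
print(labels)       # [0 0 0 1 1]
```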

Datasets
• ORL 1 : This face dataset contains 400 images of 40 distinct subjects; for each subject, images were taken at different times, with varying lighting and facial expressions. For all three datasets, we extract three types of features to represent different views: Intensity, LBP, and Gabor. See Table 2 for a summary of the datasets, while Figure 2 gives example images from the ORL and Extended Yale B datasets.

Compared methods
We compare our method with six comparable state-of-the-art methods: Diversity-Induced Multi-View Subspace Clustering (DiMSC) [21], Multi-View Low-Rank Sparse Subspace Clustering (MLRSSC) [27], Co-regularized multi-view spectral clustering (CoReg) [4], Nuclear Norm Based Matrix Regression (NMR) [26], Iterative Re-Constrained Group Sparse Face Recognition (IRGSC) [40], and RRC [41]. A brief description of each method follows. For the methods that mainly focus on representation learning (NMR, IRGSC, and RRC), we apply k-means to their optimal data representation to obtain clustering labels.
• DiMSC: This method utilizes the Hilbert Schmidt Independence Criterion as a diversity term to enhance the diversity between views.
• MLRSSC: This method combines the ideas of SSC and LRR to find a joint subspace representation.
• CoReg: This method maximizes the agreement between different views by regularizing the eigenvectors of view-specific graph Laplacian matrices.
• NMR: This method uses the minimal nuclear norm of the representation error image as a criterion to find an ideal face representation.
• IRGSC: This method uses a group sparse representation classification (GSRC) approach where the weighted features and groups are captured cooperatively so as to encode more discriminative and structural information.
• RRC: This method uses a unified sparse weight learning approach to allow random noise and occlusions in images to be resolved simultaneously.

Experimental settings
Without loss of generality, we downloaded the source code from the authors' websites and followed the experimental settings and parameter-tuning steps described in the corresponding papers. We ran each method ten times and report the mean performance. The tuned parameters range over {0.001, 0.01, 0.1, 1, 10, 100, 1000}, and the best parameters are selected with a grid-search strategy. In our experiments, the parameter μ is the Lagrangian penalty variable.
Referring to [42], we initialize μ to 0.01 and set its maximum value to 10^8. The parameter ρ is set to 1.1. According to the parameter analysis in Section 4.6, the parameters of the proposed method are set as follows: λ1 = 0.01, λ2 = 10, λ3 = 1. We initialize all matrices to zero except E_i, which is initialized as a sparse matrix. Besides, we performed the experiments for the NMR, IRGSC, and RRC methods separately on each view, and the result for the best-performing view is reported. In the face recognition experiments, we used 60% of the data points as the training set (X) and 40% as the testing set (X_test). Then, for each of DiMSC and MLRSSC, we obtained representation matrices for X and X_test, respectively, while the intrinsic low-rank representation matrix Z_0 of X and Z_0,test of X_test in our experiments are obtained by solving Equation (6). Finally, we used the support vector machine (SVM) algorithm [43] to measure the classification performance of our method, DiMSC, and MLRSSC. Note that the representation-learning and classification steps are integrated into the NMR, IRGSC, and RRC algorithms because they are designed mainly for face recognition.

Face recognition
In this section, we evaluate the recognition accuracy of the different state-of-the-art methods. First, we examine the recognition effect of various embedding dimensionalities in Section 4.4.1. Then, in Section 4.4.2, we investigate the robustness of each method to noise and occlusion. Similar experiments were performed on the AR dataset. That is, we explored different embedding dimensionalities on all three datasets to evaluate the face recognition accuracy of the different methods; the performance of each method is shown in Figure 3 and Tables 3-5, in which the bold numbers denote the best results. We see clearly that all methods improve significantly as the dimensionality increases on all three datasets. For example, in Table 3, our method has an accuracy of 82.32% and 91.28% with dimensions of 25 and 50, respectively. Moreover, by projecting all data points from 50 to 200 dimensions, the performance improves significantly by 5.04%, which is better than RRC, DiMSC, and NMR, which achieve 4.7%, 4.5%, and 4.3%, respectively. Furthermore, we observe that the single-view methods RRC, NMR, and IRGSC perform relatively better than DiMSC and MLRSSC, which are multi-view methods. The reason is easy to establish, since RRC, NMR, and IRGSC are designed solely for face recognition.

Robustness to Noise and Occlusion
To investigate the robustness of different methods to noise and occlusion, we use two separate experimental settings: first, randomly injecting different levels of pixel corruption into the testing data of ORL, and second, randomly injecting different levels of block occlusion into that of Extended Yale B. Tables 6 and 7 display the face recognition accuracy of the different methods, in which the bold numbers denote the best results. We can see that our method significantly outperforms the others. From Table 7, we observe that our method outperforms NMR and RRC by only 0.58% and 0.62%, respectively, with 20% random block occlusion. However, when the level increases to 70%, our method shows more robustness than NMR and RRC, with differences of 6.4% and 22.46%, respectively. Besides, one may equally notice in Table 6 that our method performs better than DiMSC by only 0.59% with 0% pixel corruption; yet our method outperforms DiMSC more significantly as the level of pixel corruption increases. Overall, Tables 6 and 7, with the aid of Figure 4, show clearly that our proposed method is generally more robust to noise and occlusion than the compared state-of-the-art methods. This further buttresses the effectiveness of our intrinsic low-rank representation.

Face clustering
We utilize six evaluation metrics to verify the effectiveness of the proposed method: clustering accuracy (ACC), normalized mutual information (NMI), adjusted Rand index (AR), F-score, precision, and recall. These metrics compare the obtained label of each sample with the ground-truth label provided by the dataset; a larger value indicates better clustering performance. Specifically, we conducted experiments on three widely used face datasets to evaluate the clustering performance of the different methods. The results are shown in Tables 8-10, in which the bold numbers denote the best results. Table 8 shows the clustering results on the ORL dataset: our method outperforms most of the compared methods by over 4% in ACC, 5% in NMI, and 4% in AR. In particular, the performance improvements over DiMSC on the ORL dataset in terms of ACC, NMI, and F-score are 1.64%, 1.22%, and 0.32%, respectively; the smaller margin can be attributed to the large variation of illumination in this dataset. Nevertheless, our approach still achieves significant improvements over the other methods. This quantitative result demonstrates the superiority of our approach, which better captures the general low-rank structure of the data space and can therefore obtain a better graph from it. For instance, face images belonging to the same subject are remarkably similar to one another, which implies a significant low-rank property; thus, the proposed method, under the low-rank assumption and with the integrated ideal global graph, guarantees better results. Table 9 shows the clustering results on the Extended Yale B dataset, where our method again outperforms all the alternative methods; for instance, it outperforms MLRSSC significantly, by over 4% in terms of accuracy and recall.
In fact, on this dataset, the results show not only the influence of local connectivity in constructing the similarity graph but also the superiority of the unified optimization. Table 10 shows the clustering results on the AR dataset, on which some compared methods achieve promising performance. For example, RRC, a single-view method, achieves better ACC than CoReg and MLRSSC, which seek complementary information in multi-view face data through a shared affinity matrix over all views. Furthermore, DiMSC also performs better than CoReg and MLRSSC, which is not unusual because DiMSC enhances the diversity among different views. Notwithstanding, our method improves on DiMSC by over 2%, 3%, and 10% in terms of ACC, NMI, and F-score, respectively, which demonstrates the importance of combining the general low-rank structure with the clustering structure.
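For reference, the ACC metric used above is conventionally computed by finding the best one-to-one mapping between predicted cluster labels and ground-truth labels via the Hungarian algorithm; a minimal sketch using scipy's `linear_sum_assignment`:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_acc(y_true, y_pred):
    # Count co-occurrences of (predicted, true) labels, then find the
    # label permutation that maximizes the number of matched samples.
    k = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1
    row, col = linear_sum_assignment(-cost)   # negate to maximize matches
    return cost[row, col].sum() / len(y_true)

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([1, 1, 0, 0, 2, 2])         # clusters permuted but correct
print(clustering_acc(y_true, y_pred))         # 1.0
```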

Ablation experiments
We also conducted ablation experiments on single-view and multi-view inputs to demonstrate the performance of the proposed method. In this experiment, we use the first view of each dataset as the only input data matrix. The results are shown in Table 11, in which the bold numbers denote the best results. From Table 11, we observe that the multi-view results outperform the corresponding single-view results on all datasets. In particular, ACC on the multi-view AR dataset improves on its single-view counterpart by over 4%. This further buttresses the effectiveness of our intrinsic low-rank representation.

Parameter analysis
We analyze the sensitivity of the parameters λ1, λ2, and λ3 in Figures 5 and 6; the base of the logarithm is 10. In Figure 5, we show the NMI and ACC results on the Extended Yale B dataset when λ3 is fixed to 1. Figure 6 shows the results on the ORL dataset when λ2 is fixed to 1. These figures show that our method is not sensitive to λ2 when λ1 and λ3 are fixed within a reasonable range.

FIGURE 5
The results of NMI and ACC for different λ1 and λ2 with λ3 fixed on the Extended Yale B dataset

FIGURE 6
The results of NMI and ACC for different λ1 and λ3 with λ2 fixed on the ORL dataset

Convergence analysis
The Augmented Lagrange Multiplier method was used to iteratively update all the matrices in our optimization problem, and the convexity of each subproblem guarantees the effectiveness of our proposed method to some degree. Figure 7 shows the objective function values on the ORL and AR datasets. It can be seen that on both real-world datasets the algorithm converges quickly, within 20 iterations.

FIGURE 7
The convergence curves on the AR and ORL datasets

CONCLUSION
In this paper, we proposed a novel multi-view low-rank representation method. Our method follows hierarchical Bayesian intuition and learns intrinsic and specific representations for each view via a consistent structure and diversity regularization. The intrinsic representation holds the consistent information shared between views, while the specific representations unfold the diversity information across views. Meanwhile, our model combines the clustering structure with the low-rank local manifold to learn both the relationships in the data and the clustering structure. Furthermore, by adjusting the clustering structure adaptively during optimization, our method achieves better performance on face recognition and clustering simultaneously. Experiments on several face datasets, in comparison with state-of-the-art algorithms, show the effectiveness of our method. In the future, we will try to extend this method to handle incomplete face data.