Data representation using robust nonnegative matrix factorization for edge computing

Abstract: As a popular data representation technique, nonnegative matrix factorization (NMF) has been widely applied in edge computing, information retrieval, and pattern recognition. Although NMF can learn parts-based data representations, existing NMF-based algorithms fail to integrate the local and global structures of the data to steer matrix factorization. Meanwhile, semi-supervised variants ignore the important role that instances from different classes play in learning the representation. To address these issues, we propose a novel semi-supervised NMF approach via joint graph regularization and constraint propagation for edge computing, called robust constrained nonnegative matrix factorization (RCNMF), which learns robust discriminative representations by leveraging the power of both L2,1-norm NMF and constraint propagation. Specifically, RCNMF explicitly exploits the global and local structures of the data to pull the latent representations of instances from the same class closer together and push those of instances from different classes farther apart. Furthermore, RCNMF adopts an L2,1-norm cost function to cope with noise and outliers. Moreover, an L2,1-norm constraint on the factor matrix is used to make the new representation sparse in rows. Finally, we devise an optimization algorithm to solve the proposed framework, whose convergence has been proven both theoretically and empirically. Experiments show that the proposed RCNMF is superior to other state-of-the-art algorithms.


Introduction
With the development of automatic driving technology, edge computing (EC) has attracted more and more attention. Compared with cloud computing, which exploits distributed computing to decompose many data processing tasks over the network, EC operates at the edge, close to the data source, and thus reduces data transfer over the network. To recognize a face image, for example, cloud computing first uploads the image to a server, where the recognition task is completed; EC computes the recognition result directly after obtaining the image. Obviously, EC requires a variety of techniques to complete different tasks. Generally, the data EC processes are high-dimensional with complex structure [1]. Furthermore, in multimedia mining, pattern recognition, and bio-informatics [2][3][4], one is often faced with high-dimensional data. Directly dealing with such high-dimensional data incurs massive time and memory costs for learning tasks. In fact, not all features of the data are discriminative and important, since many of them are redundant or noisy [4][5][6]. Important and meaningful features always lie on (or near) a low-dimensional space [7,8]. This motivates new techniques for processing such data. Dimensionality reduction (DR) aims to find a low-dimensional representation of high-dimensional data that preserves the desired structure inherent in the original data [9][10][11]. It has proved to be an effective way to decrease the dimensionality of data [12][13][14][15][16][17][18]. Popular representative DR methods include Linear Discriminant Analysis (LDA) [19,20], Principal Component Analysis (PCA) [10], Locality Preserving Projections (LPP) [21], LLE [12], ISOMAP [13], and Laplacian Eigenmaps [22]. The basis vectors and coefficients obtained by these approaches are not constrained to be nonnegative and thus usually contain negative values.
Data in real applications, such as texts, images, audios, and videos, are naturally nonnegative. For such data, basis vectors and coefficients with negative values lack clear physical meaning and interpretability [23].
As a relatively recent DR algorithm, Nonnegative Matrix Factorization (NMF) has been proposed to handle nonnegative data. NMF decomposes a given nonnegative matrix into two nonnegative factor matrices whose product well approximates the given matrix [24]. It constrains both factors to be nonnegative; that is, all elements of the two factor matrices must be greater than or equal to zero. NMF is a parts-based DR method since it allows only additive, not subtractive, combinations of the original data [25]. It has been used successfully in many fields, such as face recognition [24,26], document clustering [27,28], image processing [29][30][31], and molecular pattern discovery [32,33]. Accordingly, many improved NMF-based algorithms have been put forward. Ding et al. [34] presented Convex-NMF and Semi-NMF to widen NMF's applicable range by relaxing the data matrix to hold both positive and negative values. Graph regularized NMF (GNMF) [35] promotes the discriminating ability of ordinary NMF by defining an affinity graph to encode the geometrical structure of the data. Li et al. [36] incorporated distant repulsion and basis redundancy elimination into the cost function of GNMF and thus developed a structure-preserving NMF approach. Zhang et al. [37] proposed manifold regularized low-rank matrix decomposition by extending GNMF to the nonlinear space. Structure constrained NMF [38] applies intra-sample structures to guide the matrix decomposition process. Cichocki et al. [39] proposed a general framework for NMF based on Csiszár's divergences, which improved the efficiency and robustness of NMF. Cichocki et al. [40] extended NMF with orthogonality and sparsity constraints. Fevotte et al. [41] proposed a novel NMF based on the β-divergence cost function (β-NMF). Devarajan et al. [42] presented a unified method for NMF based on the generalized linear model. However, most of these methods remain sensitive to noise and outliers; it is therefore necessary and important to investigate a novel algorithm that improves the robustness of NMF.
To address the above problems, we put forward a novel NMF algorithm for data representation, called robust constrained nonnegative matrix factorization (RCNMF), which discovers the intrinsic geometric and discriminating structure of the data in a semi-supervised scenario. In the proposed RCNMF model, pairwise constraints are propagated to unlabeled instances to encode global and local structures. Then, we construct a specific intrinsic graph to depict the between-cluster separability and the within-cluster compactness in the latent representation space. Additionally, we introduce the L2,1-norm cost function to enhance the robustness of the new model. Hence, RCNMF can be naturally applied to practical learning tasks. The main contributions of the proposed algorithm are worth highlighting here: 1) Our algorithm explicitly exploits the global and local structures of the original data space via constraint propagation and merges them into our model to guide matrix decomposition. The proposed RCNMF approach makes the latent representations of instances from the same class closer and those of instances from different classes farther apart. Thus, RCNMF has more discriminating ability than other semi-supervised methods. 2) Different from CNMF, NMFCC and CPSNMF, which are not robust to outliers and noise, the new model can address noise and outliers since RCNMF exploits the L2,1-norm loss for the nonnegative matrix decomposition. Moreover, the new model imposes the L2,1-norm on the coding matrix to make the new representation sparse in rows.
3) The proposed RCNMF takes advantage of both L2,1-norm NMF and constraint propagation, and characterizes their combination as an optimization problem. We explore an efficient iterative algorithm to solve this optimization problem, with theoretical analyses and experimental verification. In addition, it is easy to observe that several algorithms, such as RNMF and CPSNMF, are special cases of the proposed algorithm; thus, our method is a general framework. 4) Our method propagates cannot-link and must-link constraints to unlabeled samples and thus obtains more supervised information. Moreover, the new model can adapt to both class-label and pairwise-constraint scenarios, which naturally leads to various applications.
The rest of this paper is structured as follows. We briefly review related work in Section 2. We elaborate the proposed RCNMF model and the corresponding optimization in Section 3. We provide convergence proof and computational complexity of RCNMF in Section 4. Extensive experiments on clustering are reported in Section 5. Finally, we give some conclusions in Section 6.

Related work
Given an original nonnegative data matrix X = [x_1, . . . , x_N] ∈ ℝ^(M×N) with N instances, where each x_i ∈ ℝ^M is an instance vector, NMF seeks two nonnegative matrix factors U ∈ ℝ^(M×K) and V ∈ ℝ^(N×K) such that X ≈ UV^T. NMF makes use of the squared Euclidean distance to find the optimal U and V [24,59]:

min_{U,V} ‖X − UV^T‖_F^2,  s.t. U ≥ 0, V ≥ 0,   (2)

where ‖·‖_F is the Frobenius norm of a matrix. The problem (2) is not convex for optimizing the two factors U and V simultaneously, but is convex when one variable is fixed and the other is solved for. Thus, Lee and Seung [24] presented two iterative update rules to optimize the problem (2), formulated as:

u_ik ← u_ik (XV)_ik / (UV^TV)_ik,   (3)
v_jk ← v_jk (X^TU)_jk / (VU^TU)_jk.   (4)
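The multiplicative updates (3)-(4) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation; the small `eps` guarding against division by zero is an added assumption:

```python
import numpy as np

def nmf(X, k, n_iter=200, eps=1e-10, seed=0):
    """Plain NMF via Lee-Seung multiplicative updates,
    minimizing ||X - U V^T||_F^2 with U, V >= 0."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    U = rng.random((M, k))
    V = rng.random((N, k))
    for _ in range(n_iter):
        U *= (X @ V) / (U @ V.T @ V + eps)    # update rule (3)
        V *= (X.T @ U) / (V @ U.T @ U + eps)  # update rule (4)
    return U, V
```

Both factors stay nonnegative because each step multiplies by a nonnegative ratio, and the Frobenius error is non-increasing under these rules.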

Manifold nonnegative matrix factorization (MNMF)
Cai et al. [35] proposed graph regularized NMF (GNMF), which explicitly takes into account the local invariance of the data and builds a nearest-neighbor graph to encode the geometrical structure of the original data space. GNMF represents this graph structure by a Laplacian matrix, which is added to the loss function of NMF as a regularization term. Thus, GNMF optimizes the following problem:

min_{U,V} ‖X − UV^T‖_F^2 + λ Tr(V^T L V),  s.t. U ≥ 0, V ≥ 0,   (5)

where λ ≥ 0 is a regularization parameter and Tr(·) denotes the matrix trace. L = D − W is the Laplacian matrix of the neighbor graph, W denotes the connection weight matrix of nearby instances, and D is a diagonal matrix whose entries are d_ii = Σ_j w_ij.
Two alternate rules are exploited to iteratively solve the optimization problem (5). The updating formulas are, respectively:

u_ik ← u_ik (XV)_ik / (UV^TV)_ik,   (6)
v_jk ← v_jk (X^TU + λWV)_jk / (VU^TU + λDV)_jk.   (7)

Zhang et al. [37] extended GNMF to the nonlinear space to characterize the nonlinear structure of the data, replacing L with a more general Laplacian-like matrix, e.g., one formulated from the weight matrix I − W used in LLE [12]. Huang et al. [56] integrated GNMF and L2,1-NMF [47] into a joint framework and proposed a robust manifold NMF method (RMNMF). Wu et al. [59] also presented a robust manifold NMF algorithm similar to RMNMF; the objective function in [59] adds the constraint U ≥ 0 and drops the orthogonality constraint in comparison with RMNMF. We refer to the above four methods collectively as Manifold Nonnegative Matrix Factorization (MNMF), since they all construct a nearest-neighbor graph to depict the local information of the input space.
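The GNMF update (7) differs from plain NMF only in the graph terms λWV and λDV. A compact sketch, using a simple binary k-NN affinity graph (graph construction details such as heat-kernel weights vary in [35]; this form is an assumption for illustration):

```python
import numpy as np

def knn_graph(X, k=5):
    """Symmetric 0/1 k-nearest-neighbor affinity over the columns of X."""
    D2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)  # pairwise sq. distances
    N = X.shape[1]
    W = np.zeros((N, N))
    for i in range(N):
        W[i, np.argsort(D2[i])[1:k + 1]] = 1.0               # skip self at index 0
    return np.maximum(W, W.T)                                # symmetrize

def gnmf(X, k, lam=0.1, n_iter=200, eps=1e-10, seed=0):
    """GNMF: minimize ||X - U V^T||_F^2 + lam * Tr(V^T (D - W) V)."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    W = knn_graph(X)
    D = np.diag(W.sum(axis=1))
    U, V = rng.random((M, k)), rng.random((N, k))
    for _ in range(n_iter):
        U *= (X @ V) / (U @ V.T @ V + eps)
        V *= (X.T @ U + lam * W @ V) / (V @ U.T @ U + lam * D @ V + eps)
    return U, V
```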

NMF-based constrained clustering (NMFCC)
Zhang et al. [47] put forward an NMF-based constrained clustering method, called NMFCC. NMFCC enforces the similarity between two instances in a must-link constraint to approach one and the similarity between two instances in a cannot-link constraint to approach zero, and encodes these requirements in a similarity matrix. It uses the square of the class indicator matrix to compute the similarity matrix and treats the resulting term as a regularizer for clustering. NMFCC minimizes the problem in Eq (10), where ∘ refers to the element-wise product between two matrices, B denotes the constraint matrix whose entry b_ij is 1 if (x_i, x_j) ∈ M (the must-link set) or i = j and 0 otherwise, and A is the corresponding coefficient matrix. After the Lagrange function for Eq (10) is constructed and its derivatives with respect to U and V are set to zero, NMFCC obtains the update rules of Eqs (11) and (12), in which an auxiliary matrix ϒ collects the constraint-related terms built from the Hadamard products of B and A.

Constrained nonnegative matrix factorization (CNMF)
Liu et al. [31] put forward a new semi-supervised NMF method, called CNMF. Different from the methods above, CNMF treats a small number of class labels as hard constraints to ensure that instances with the same label are embedded at the same point in the low-dimensional space. Given l labeled instances and N − l unlabeled instances, CNMF first constructs the indicator matrix C ∈ ℝ^(l×c), where c_ij = 1 if x_i belongs to class j and c_ij = 0 otherwise. The matrix A of label constraints is then formulated in block form as

A = [ C  0 ; 0  I_{N−l} ],

where I_{N−l} is the (N − l)×(N − l) identity matrix. CNMF introduces an auxiliary matrix Z to incorporate the label constraint information and redefines the coefficient matrix V as V = AZ. Hence, CNMF solves the following problem:

min_{U,Z} ‖X − U(AZ)^T‖_F^2,  s.t. U ≥ 0, Z ≥ 0,

whose multiplicative solutions update U and Z alternately.
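The block structure of CNMF's constraint matrix A is easy to build explicitly. The helper below is hypothetical (the class ordering and argument names are our own), but it shows why two labeled instances of the same class, sharing a row pattern of A, receive identical encodings under V = AZ:

```python
import numpy as np

def label_constraint_matrix(labels, n_total):
    """A = [[C, 0], [0, I]]: C is the l x c indicator of the first l
    labeled instances; I covers the n_total - l unlabeled ones."""
    l = len(labels)
    classes = sorted(set(labels))
    c = len(classes)
    A = np.zeros((n_total, c + n_total - l))
    for i, y in enumerate(labels):
        A[i, classes.index(y)] = 1.0     # indicator block C
    A[l:, c:] = np.eye(n_total - l)      # identity block for unlabeled
    return A
```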

Discriminative nonnegative matrix factorization (DNMF)
Babaee et al. [60] presented a label-constrained NMF method, called Discriminative Nonnegative Matrix Factorization (DNMF). Similar to CNMF, DNMF first constructs an indicator matrix Q ∈ ℝ^(c×N), where q_ji = 1 if x_i belongs to class j and q_ji = 0 otherwise; different from CNMF, unlabeled instances are assigned all-zero columns in Q, i.e., Q = [q_1, . . . , q_l, 0, . . . , 0] ∈ ℝ^(c×N). DNMF then couples the label information to the encoding through a matrix A, which is calculated as part of the NMF optimization, via a label-fitting penalty added to the NMF objective. DNMF uses iterative multiplicative update rules to obtain the corresponding U, V, and A.

Robust structured nonnegative matrix factorization (RSNMF)
A semi-supervised robust structured NMF algorithm (RSNMF) [8] was proposed to arrive at a well-separated low-dimensional space. The key idea of RSNMF is to use the block-diagonal structure of the labeled instances to increase the discriminating ability. Specifically, RSNMF introduces an indicator matrix S = [S̄, 0] ∈ ℝ^(c×N), where 0 ∈ ℝ^(c×(N−l)) is a zero matrix for the unlabeled instances and, given l labeled instances, S̄ ∈ ℝ^(c×l) encodes their block-diagonal label structure. RSNMF then designs multiplicative updating rules for the two matrix factors, involving a diagonal reweighting matrix derived from the residual norms and a Laplacian-type matrix of the form D − W.

Robust constrained NMF
As mentioned above, NMF and its variants are unsupervised algorithms, and fail to exploit supervised information to guide the decomposition process and to improve the discriminating ability. In fact, utilizing a small amount of supervised information has always been an important issue in many fields of computer vision and machine learning [15,16,18]. Generally, supervised information takes various forms, the two most common being pairwise constraints and class labels. In this paper, we exploit cannot-link and must-link constraints to conduct matrix decomposition. A must-link constraint specifies that two instances have the same cluster label; a cannot-link constraint indicates that two instances have different cluster labels. In many applications, it is more practical to obtain pairwise constraints than class labels, because users can easily indicate whether two samples belong to the same cluster [16,31]. Besides, one can use the class labels of samples to derive pairwise constraints, but not vice versa. Consequently, pairwise constraints are a weaker form of supervised information than class labels.
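The label-to-constraint direction noted above is mechanical; a small helper (names are illustrative) makes it concrete:

```python
from itertools import combinations

def constraints_from_labels(labeled):
    """labeled: list of (index, label) pairs for the labeled instances.
    Returns must-link and cannot-link index pairs derived from the labels."""
    must, cannot = [], []
    for (i, yi), (j, yj) in combinations(labeled, 2):
        (must if yi == yj else cannot).append((i, j))
    return must, cannot
```

The reverse direction is impossible in general: a set of pairwise constraints fixes which instances go together, but not which named class they carry.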
Inspired by recent research on NMF and semi-supervised learning [23,61], in this paper we propose a novel robust constrained nonnegative matrix factorization (RCNMF), which explicitly combines constraint propagation and L2,1-norm NMF in a new way. We propagate cannot-link and must-link constraints over the whole data set, and the resulting constraint information is added to the L2,1-norm NMF as a regularizer. Next, we describe the RCNMF model in detail.

L2,1 norm NMF
For any matrix A, its j-th column and i-th row are denoted by a_j and a^i, respectively. Tr(B) denotes the trace of B if B is square, A^T is the transpose of A, I denotes an identity matrix, and 1 is a column vector whose entries are all 1. We define the L2,1 norm of a matrix B as [62]:

‖B‖_{2,1} = Σ_i ‖b^i‖_2 = Σ_i sqrt(Σ_j b_ij^2).   (24)

One can use NMF with the L2,1 norm to group the data matrix X into k clusters as:

min_{U,V} ‖X − UV^T‖_{2,1},  s.t. U ≥ 0, V ≥ 0,   (25)

where U denotes the basis matrix and V is the encoding matrix. Because of the constraints on V, the optimization problem (25) is difficult to solve. A commonly used remedy is to relax the constraint to orthogonality, i.e., V^TV = I. Thus, the problem (25) is rewritten as:

min_{U,V} ‖X − UV^T‖_{2,1},  s.t. U ≥ 0, V ≥ 0, V^TV = I.   (26)
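The L2,1 norm of Eq (24) sums the Euclidean norms of the rows, which is what makes it favor row-sparse solutions; in NumPy:

```python
import numpy as np

def l21_norm(B):
    """||B||_{2,1} = sum_i ||b^i||_2  (sum of the row norms of B)."""
    return np.sqrt((np.asarray(B) ** 2).sum(axis=1)).sum()
```

Unlike the squared Frobenius norm, an outlier row contributes only linearly (through its norm, not its squared norm), which is the source of the robustness claimed above.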

Constraint propagation
For the data set X = [x_1, . . . , x_N] ∈ ℝ^(M×N), we describe the initial must-link constraints as M = {(x_i, x_j) | l_i = l_j} and the initial cannot-link constraints as C = {(x_i, x_j) | l_i ≠ l_j}, where l_i denotes the cluster label of instance x_i. To propagate both cannot-link and must-link constraints over the whole data set, Lu et al. [61] used "+1" and "-1" to distinguish the two types of constraints. The propagation operation aims to determine the constraint weight between two unconstrained instances, which essentially groups instances into classes marked with +1 or with -1. Thus, the initial pairwise constraint matrix Z = [z_ij] ∈ ℝ^(N×N) is defined as:

z_ij = +1 if (x_i, x_j) ∈ M,  z_ij = -1 if (x_i, x_j) ∈ C,  z_ij = 0 otherwise.   (27)

The matrix of propagated pairwise constraints is defined as F* = [f*_ij] ∈ ℝ^(N×N), where |f*_ij| ≤ 1. The matrix F* represents a set of pairwise constraints with associated clustering weights: f*_ij > 0 means (x_i, x_j) ∈ M, while f*_ij < 0 means (x_i, x_j) ∈ C. Constraint propagation in [61] first sets up a nearest-neighbor graph with affinity weights W ∈ ℝ^(N×N), and then spreads the entries of Z along this graph to obtain F*. According to the above analysis, the weight matrix W̃ learned from F* has several distinct advantages: 1) The matrix W̃ is symmetric and nonnegative, with w̃_ij ∈ [0,1]; thus, it can well describe the relationship between instances. 2) W̃ is derived from W: w̃_ij is increased relative to w_ij when f*_ij ≥ 0, and decreased when f*_ij < 0.

3) The adjustment is symmetric in magnitude for positive and negative f*_ij, which indicates that cannot-link and must-link constraints play an equally significant role in computing the relationship between instances. By using the constraint propagation approach, local and global structures are taken into account, so that more pairwise constraints are obtained. Clearly, the propagated pairwise constraint matrix encodes that nearby instances have the same cluster label, and that instances with the same structure have the same cluster label. One can then use the pairwise constraint information to build a new weight matrix in which instances from the same class have larger weight values and those from different classes have smaller weight values.
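The whole pipeline can be sketched in code. The closed form F* = (1-η)² (I - ηS)⁻¹ Z (I - ηS)⁻¹, with S the symmetrically normalized affinity, and the weight-adjustment rule are both stated here as assumptions reconstructed from the description of [61]; the exact formulas in the paper may differ:

```python
import numpy as np

def initial_constraints(must, cannot, n):
    """Eq (27): +1 for must-links, -1 for cannot-links, 0 elsewhere."""
    Z = np.zeros((n, n))
    for i, j in must:
        Z[i, j] = Z[j, i] = 1.0
    for i, j in cannot:
        Z[i, j] = Z[j, i] = -1.0
    return Z

def propagate(Z, W, eta=0.2):
    """Spread Z over the affinity graph W (assumed closed form from [61])."""
    d = W.sum(axis=1)
    S = W / (np.sqrt(np.outer(d, d)) + 1e-12)   # normalized affinity
    P = np.linalg.inv(np.eye(len(W)) - eta * S)
    return (1 - eta) ** 2 * (P @ Z @ P)

def adjusted_weights(F, W):
    """Fold F* into W: raise weights for must-links, shrink for cannot-links."""
    return np.where(F >= 0,
                    1.0 - (1.0 - F) * (1.0 - W),   # increased when f* >= 0
                    (1.0 + F) * W)                 # decreased when f* < 0
```

With F = 0 the adjusted weight reduces to W itself, and the two branches meet continuously at F = 0, matching properties 1)-3) above.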

Robust constrained nonnegative matrix factorization
We apply constraint propagation to guide the matrix decomposition process in L2,1-norm NMF. For dimensionality reduction or classification, there are two kinds of consistency. One is that latent representations should be locally consistent; that is, nearby representations are supposed to have the same label. Specifically, if two representations v_i and v_j are close to each other, they belong to the same class. A k-nearest-neighbor graph is constructed to capture the similarity between nearby instances. The other is that latent representations should be globally consistent, i.e., instances with the same structure have the same label. We introduce constraint propagation to find an appropriate propagated pairwise constraint matrix F* that meets both properties. In summary, we construct a new weight graph to model the relationship between latent representations with the weight matrix W̃ of Eq (29). If two instances x_i and x_j have the same label, they should be close to each other in the input data space; f*_ij then has a large positive value, so w̃_ij also has a relatively large value in light of Eq (29). When minimizing the graph penalty, the distance between the latent representations v_i and v_j should be small, so v_i and v_j are neighbors in the low-dimensional space. On the other hand, if the two instances x_i and x_j belong to different classes, they should be kept apart in the original data space; f*_ij has a large negative value and thus w̃_ij has a relatively small value, which indicates that v_i and v_j are far from each other. Therefore, minimizing Eq (30) makes the distance between data points in a must-link constraint as small as possible and the distance between data points in a cannot-link constraint as large as possible. Besides, the problem (30) has the same formulation as spectral dimensionality reduction [21,22], which plays a significant role in semi-supervised manifold learning algorithms and spectral clustering.
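The regularizer in Eq (30) rests on the standard spectral identity ½ Σ_ij ‖v_i − v_j‖² w̃_ij = Tr(V^T L̃ V) with L̃ = D̃ − W̃; a quick numerical check of that identity with a generic symmetric weight matrix:

```python
import numpy as np

def graph_penalty(V, W):
    """Tr(V^T L V) with L = D - W, for a symmetric weight matrix W."""
    L = np.diag(W.sum(axis=1)) - W
    return np.trace(V.T @ L @ V)
```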
Combining Eqs (25) and (30), we propose a general NMF framework, called robust constrained nonnegative matrix factorization (RCNMF). RCNMF optimizes the following problem:

min_{U,V} ‖X − UV^T‖_{2,1} + α Tr(V^T L̃ V) + β‖U‖_{2,1},  s.t. U ≥ 0, V ≥ 0, V^TV = I,   (31)

where α and β are two trade-off parameters and L̃ is the Laplacian of the propagated weight matrix W̃. We impose the L2,1 norm on the basis matrix U as a regularization term to select discriminative features, since without it U is easily dominated by noisy features.

Optimization algorithm
Like other NMF-based methods, the objective function of model (31) is not convex in the two factors U and V jointly, but is convex when one variable is fixed and the other is solved for. To this end, we introduce an iterative scheme that optimizes the two factors alternately. We rewrite the objective function of RCNMF with the graph term expanded, where D̃ is a diagonal matrix with d̃_ii = Σ_j w̃_ij and L̃ = D̃ − W̃. Let Ψ = [ψ_ik] and Φ = [φ_jk] be the Lagrangian multipliers for the nonnegativity of the two matrix factors U and V, respectively, and let λ > 0 be a parameter that adjusts the weight of the orthogonality condition; the corresponding Lagrange function ℒ then follows in the standard way. Using ‖A‖_F^2 = Tr(AA^T), the partial derivatives of ℒ with respect to U and V are computed, where E and H are diagonal matrices whose diagonal elements are

e_jj = 1 / (2‖x_j − Uv_j‖_2),   h_ii = 1 / (2‖u^i‖_2).

Applying the Karush-Kuhn-Tucker (KKT) conditions ψ_ik u_ik = 0 and φ_jk v_jk = 0, we obtain two equations for U and V, which yield the following multiplicative updating rules:

u_ik ← u_ik (XEV)_ik / (UV^TEV + βHU)_ik,   (40)
v_jk ← v_jk (EX^TU + αW̃V + λV)_jk / (EVU^TU + αD̃V + λVV^TV)_jk.   (41)

Based on the above analysis, we can update U and V iteratively with Eqs (40) and (41) until the objective value of Eq (31) remains unchanged; the detailed optimization procedure is described in Algorithm 1. We can also combine the KL-divergence cost function with the structure-learning term and the regularizers on the two factors, yielding a novel KL-divergence objective. When taking its partial derivatives with respect to U and V, we note that if two instances x_i and x_j are adjacent, their low-dimensional representations v_i and v_j are close to each other, so the corresponding ratio is close to 1; we adopt this approximation to simplify the gradient. Since L̃ = D̃ − W̃, one can verify that Σ_i (d̃_ij − w̃_ij) v_i equals the j-th row of L̃V. Applying the KKT conditions again then yields two updating rules. We call the proposed algorithm based on the KL-divergence cost function RCNMF-KL.
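The alternating scheme can be sketched as follows. This is a hedged reading of the update rules (40)-(41): the reweighting matrices E and H follow the standard L2,1 derivation, the graph and orthogonality terms enter V's update in the usual multiplicative way, and λ, α, β are as in Eq (31); treat the exact rule forms as reconstructions, not the authors' verbatim equations:

```python
import numpy as np

def rcnmf(X, Wt, k, alpha=0.1, beta=0.1, lam=1.0, n_iter=100, eps=1e-10, seed=0):
    """Sketch of RCNMF: L2,1 reconstruction loss + alpha*Tr(V^T L V)
    over the propagated weights Wt + beta*||U||_{2,1}, with a soft
    orthogonality penalty on V weighted by lam."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    Dg = np.diag(Wt.sum(axis=1))                 # degree matrix of Wt
    U, V = rng.random((M, k)), rng.random((N, k))
    for _ in range(n_iter):
        R = X - U @ V.T
        E = np.diag(1.0 / (2.0 * np.sqrt((R ** 2).sum(axis=0)) + eps))  # column reweights
        H = np.diag(1.0 / (2.0 * np.sqrt((U ** 2).sum(axis=1)) + eps))  # row reweights of U
        U *= (X @ E @ V) / (U @ (V.T @ E @ V) + beta * H @ U + eps)
        V *= (E @ X.T @ U + alpha * Wt @ V + lam * V) / \
             (E @ V @ (U.T @ U) + alpha * Dg @ V + lam * V @ V.T @ V + eps)
    return U, V
```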

Proof of convergence
To establish the optimal solution, we theoretically analyze the convergence of the proposed RCNMF. As with other NMF-based methods, we provide a proof that the objective of model (31) decreases monotonically under the alternate update rules of Eqs (40) and (41). Proof.

F(V^(t+1)) ≤ G(V^(t+1), V^t) ≤ G(V^t, V^t) = F(V^t).
We first demonstrate that the cost function of Eq (31) is non-increasing under the update rule of Eq (41) while U is fixed, using an appropriate auxiliary function. The cost function of RCNMF in Eq (31) can be rewritten in weighted form with e_jj = 1/‖x_j − (UV^T)_j‖_2 and h_ii = 1/‖u^i‖_2. With the help of (54), the auxiliary function regarding V is described in the following Lemma 2.
Proof. It is easy to verify that Eq (40) is the solution to a weighted least-squares subproblem. Thus, in the t-th iteration, when V is fixed, U^(t+1) is the minimizer of Tr((X − UV^T)E(X − UV^T)^T) plus the regularization term on U, so the objective does not increase in U. According to the lemmas in [6], the L2,1-norm terms then also decrease, which completes the proof of Lemma 3. Combining these results, and based on Eqs (59) and (68), Theorem 1 is proved. Thus, the cost function in Eq (31) is non-increasing under the update rules of Eqs (40) and (41).
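The non-increasing property can also be checked empirically. The snippet below records the objective per iteration for the plain Lee-Seung updates, used here as a stand-in since the same auxiliary-function machinery underlies the proof above, and asserts monotonicity:

```python
import numpy as np

def objective_trace(X, k, n_iter=60, eps=1e-10, seed=0):
    """Record ||X - U V^T||_F^2 after each multiplicative update pair."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    U, V = rng.random((M, k)), rng.random((N, k))
    errs = []
    for _ in range(n_iter):
        U *= (X @ V) / (U @ V.T @ V + eps)
        V *= (X.T @ U) / (V @ U.T @ U + eps)
        errs.append(float(np.linalg.norm(X - U @ V.T) ** 2))
    return errs
```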

Complexity analysis
In this section, we analyze the computational complexity of the proposed method. As is common, big-O notation is used to express the complexity of an algorithm.
According to the update rules of Eqs (40) and (41), the dominant cost of each iteration comes from the matrix multiplications involving X, U, and V, so each iteration of RCNMF costs on the order of O(MNK) for the factorization terms plus O(N²K) for the graph regularization term.

Experiments
In this section, we present the experimental results and analysis of the proposed RCNMF. Following [8,23,31,35], we verify the performance of our new method on clustering tasks. In our experiments, we apply six publicly available data sets to compare the performance of RCNMF and other state-of-the-art methods. These data sets come from different application scenarios: three face image data sets (UMIST, YaleB, and ORL), two digit image data sets (USPS and MNIST), and one spoken letter data set (Isolet). Negative elements in Isolet and USPS are set to zero. The ORL data set contains 400 grayscale images of 40 subjects. We resize those images to 28 × 23, so the feature size of each image is 644. The statistics of the six data sets are shown in Table 1.

Experimental setting
To evaluate the superiority of our RCNMF for data representation, we compare it with six representative NMF-based methods, including both unsupervised and semi-supervised ones. The compared methods are described below. 1) GNMF [35]: GNMF encodes the local geometrical information of the data into the NMF objective function in an unsupervised learning scenario and is regarded as the baseline for comparison here. 2) MNMFL21 [59]: it utilizes the L2,1 norm to measure the quality of the matrix factorization and uses the geometrical structure of the data to enforce local invariance; it is a sparse version of NMF. 3) CNMF [31]: it takes class labels as additional hard constraints to merge instances with the same class label in the low-dimensional space, so that the parts-based representation obtained by this method has the same label as the input data. 4) NMFCC [47]: it constructs a cost function to penalize violations of cannot-link and must-link constraints, added to NMF as a regularization term. 5) CPSNMF [23]: it applies pairwise constraints to preserve the geometrical structure of the input space, which promotes the performance of the original NMF; it effectively extends GNMF to semi-supervised scenarios via constraint propagation. 6) RSNMF [8]: it uses labeled instances to set up a block-diagonal structure for learning the image representation, so that instances from the same class are projected into the same low-dimensional subspace.

Evaluation metric
After the data representations are learned, we comprehensively verify the performance of the methods in terms of two popular metrics widely applied in both DR and clustering approaches, i.e., normalized mutual information (NMI) and accuracy (ACC). For an approach, the higher the NMI and ACC are, the better the clustering performance is. Given two random variables Y and Z, NMI [8,23,63] is defined as

NMI(Y, Z) = MI(Y, Z) / max(H(Y), H(Z)),

where Y and Z represent the real data labels and the clustering indicators provided by the algorithm, respectively, MI(Y, Z) is their mutual information, and H(·) denotes entropy. Specifically, once the clustering indicators are obtained, NMI can be computed from the counts n_i, the number of instances in cluster C_i (1 ≤ i ≤ k), n̂_j, the number of instances in the j-th class (1 ≤ j ≤ k), and n_{i,j}, the number of instances in the intersection of the j-th class and cluster C_i.
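For concreteness, the NMI above can be computed directly from two label vectors. A plain implementation follows; note that library versions may use a different normalization (e.g., an arithmetic rather than max normalization):

```python
import numpy as np

def nmi(y, z):
    """NMI(Y, Z) = MI(Y, Z) / max(H(Y), H(Z)) for two label vectors."""
    y, z = np.asarray(y), np.asarray(z)
    n = len(y)
    mi = 0.0
    for a in np.unique(y):
        for b in np.unique(z):
            nij = np.sum((y == a) & (z == b))
            if nij:
                mi += nij / n * np.log(n * nij / (np.sum(y == a) * np.sum(z == b)))
    entropy = lambda v: -sum(np.sum(v == c) / n * np.log(np.sum(v == c) / n)
                             for c in np.unique(v))
    m = max(entropy(y), entropy(z))
    return mi / m if m > 0 else 1.0
```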
The other metric is ACC, which measures the percentage of correctly clustered instances. If the real label and the clustering indicator of instance x_i are expressed as y_i and z_i respectively, ACC [8,23] is formulated as

ACC = ( Σ_{i=1}^{N} δ(y_i, map(z_i)) ) / N,

where δ(a, b) = 1 if a = b and δ(a, b) = 0 otherwise, and map(·) is a permutation mapping function that maps each clustering indicator z_i to a real data label.
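ACC requires the permutation mapping map(·); for the small cluster counts used here, a brute-force search over permutations suffices (in practice the Hungarian algorithm, e.g. SciPy's `linear_sum_assignment`, would be used):

```python
import numpy as np
from itertools import permutations

def clustering_acc(y, z):
    """Best-permutation clustering accuracy between true labels y and
    cluster indicators z (brute force; exponential in cluster count)."""
    y, z = np.asarray(y), np.asarray(z)
    true_classes, clusters = np.unique(y), np.unique(z)
    best = 0.0
    for perm in permutations(true_classes, len(clusters)):
        mapping = dict(zip(clusters, perm))   # map(.) candidate
        best = max(best, float(np.mean([mapping[c] == t for c, t in zip(z, y)])))
    return best
```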

Parameter description
CNMF does not require any parameters. MNMFL21 and GNMF have the graph regularization parameter α and the nearest-neighbor size k. CPSNMF has a propagation parameter η in addition to these two parameters. NMFCC has two regularization parameters α and β. RCNMF has two regularization parameters besides k and η. For CPSNMF, MNMFL21, RCNMF, and GNMF, we set the neighborhood size k = 5 for all the data sets. Following CPSNMF, the propagation parameter η is set to 0.2. We choose the graph regularization parameter α within {10^-3, 10^-2, 10^-1, 1, 10, 10^2, 10^3} for all the compared algorithms. For RCNMF, to enforce orthogonality, we set λ = 10^5 in the experiments. RSNMF has three essential parameters: the regularization parameter α, the norm parameter q, and the dimensionality parameter m. Following RSNMF, q and m are set to 0.5 and 2, respectively. The values of the various parameters are shown in Table 2.

Results and analysis
To obtain a comprehensive comparison, we conduct two types of experiments: one evaluates performance with different cluster numbers, and the other with different numbers of pairwise constraints. We describe the experiments as follows. 1) Since the typical semi-supervised NMF methods exploit supervised information to guide matrix factorization, for a fair comparison we use ground-truth labels to generate pairwise constraints when seeking the two matrix factors. Specifically, we randomly select 10% of all the given instances as labeled ones, which meets the requirements of CNMF and RSNMF. These labeled data are employed to generate pairwise constraints for NMFCC, CPSNMF, and our RCNMF. 2) For the first type of experiment, we randomly choose 10% labeled instances for every category in each data set except ORL. For ORL, we choose 20% labeled instances following [8,31], because each class would have only one labeled image if 10% were chosen. The chosen labeled instances are used to generate the corresponding cannot-link and must-link constraints. We randomly select different cluster numbers k (1 ≤ k ≤ K) from each data set to check the performance of the seven compared algorithms. For the second type of experiment, we take the number of ground-truth classes as the cluster number k for every data set. Furthermore, we randomly choose different numbers of labeled instances for each category to check the effectiveness of the compared algorithms.
3) For a fair comparison, we exploit a random strategy to initialize the two matrix factors U and V. Following [8,23,31,47,59], we use the classical K-means algorithm to cluster the learned coefficient matrix in the low-dimensional data space. 4) For a given cluster number k, labeled instances, and pairwise constraints, we perform 20 independent runs and report the average clustering results, which are summarized in Table 3. Figures 1 and 2 illustrate the clustering results of ACC and NMI with different cluster numbers on the UMIST, YaleB, ORL, USPS, MNIST, and Isolet data sets, respectively. Figures 3 and 4 show the clustering results of ACC and NMI with different numbers of labeled instances on the UMIST, YaleB, USPS, Isolet, and MNIST data sets, respectively. In the second type of experiment, we do not compare the clustering performance of the seven algorithms on ORL with respect to different numbers of labeled instances, since there are only 10 instances in each category of ORL. From the experimental results, a few interesting observations can be made. 1) The proposed RCNMF algorithm provides better performance on the six data sets and is superior to the other algorithms. This is because RCNMF learns a compact and discriminating representation. Besides, RCNMF is the only method that achieves high performance on all the data sets used in this paper. Obviously, the local and global structures of the data obtained via constraint propagation play a significant role in formulating the intrinsic data representation. Therefore, RCNMF can effectively apply the pairwise constraints to promote the performance of NMF-based algorithms.
2) RCNMF, RSNMF, CPSNMF, NMFCC and CNMF are consistently superior to the baseline method on almost all data sets. This indicates that semi-supervised NMF, learning from a combination of labeled and unlabeled instances, can guide matrix decomposition better than its unsupervised peers. Interestingly, these semi-supervised algorithms differ significantly in clustering performance. For example, on the MNIST data set, the proposed RCNMF method achieves the highest clustering performance of 68.45% by ACC, while CNMF achieves just 55.23%. Between the two unsupervised methods, MNMFL21, as a sparse version of GNMF, always achieves better performance than GNMF.
3) The results show that RCNMF outperforms CPSNMF, although both use constraint propagation to discover discriminative representations. There are two reasons for this. One is that the L2,1-norm objective function of RCNMF is robust to noise and outliers; an L2,1-norm loss can substantially improve the performance of NMF, as confirmed by other NMF-based methods [8,56,58,59]. The other is that the L2,1-norm regularization term on the basis matrix U in RCNMF guarantees that U is sparse in rows and thus selects representative features. This sparse formulation brings encouraging performance improvements. Additionally, the additional orthonormal constraint on the factor V in the objective function makes it easy to obtain the optimal solution for RCNMF. 4) CPSNMF outperforms the other three semi-supervised methods on most data sets. RSNMF embeds instances from the same class into the same subspace by exploiting the block-diagonal structure of data. NMFCC adds a regularization term to NMF that forces cannot-link entries to approximate 0 and must-link entries to approximate 1. CNMF forces instances with the same label to have the same representation in the low-dimensional space. However, RSNMF, NMFCC and CNMF ignore the important role of the local structure of the data space in steering matrix decomposition. Besides, these three methods do not map instances from different classes sufficiently far apart in the low-dimensional representation space. We observe that NMFCC and CNMF perform relatively poorly among the five semi-supervised NMF methods. In addition to the above shortcomings, both NMFCC and CNMF are sensitive to noise and outliers. 5) Another interesting observation is that the semi-supervised NMF methods achieve relatively low clustering accuracy when only a few labeled instances are chosen in the experiment.
For example, on the UMIST, YaleB and MNIST data sets, MNMFL21 is comparable with CPSNMF, RSNMF, NMFCC and CNMF. A few labeled instances cannot represent the data distribution well when the distribution is complex; in fact, two or three labeled instances per class are not enough to characterize a data representation. However, as the number of available labeled instances increases, the performance of the semi-supervised methods improves steadily. The proposed method brings encouraging performance improvements over the other semi-supervised and unsupervised NMF methods even when only a small number of labeled instances are available. The reason is that RCNMF efficiently uses the supervised information and sufficiently maps instances from different clusters far away from each other in the low-dimensional space.
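The evaluation protocol in items 3) and 4) above, namely clustering the learned coefficient matrix with K-means and scoring against the ground truth, can be sketched as follows. This is an illustrative, numpy-only sketch: the deterministic farthest-point seeding and the toy two-cluster coefficient matrix are assumptions for reproducibility, not details taken from the paper.

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Lloyd's K-means on the rows of X with greedy farthest-point seeding
    (deterministic; the paper uses classical K-means with random restarts)."""
    idx = [0]
    for _ in range(1, k):
        d = np.min(((X[:, None] - X[idx]) ** 2).sum(-1), axis=1)
        idx.append(int(d.argmax()))
    centers = X[idx].astype(float)
    for _ in range(n_iter):
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        new = np.array([X[labels == j].mean(0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels

def nmi(a, b):
    """Normalized mutual information between two label assignments."""
    a, b = np.asarray(a), np.asarray(b)
    n = len(a)
    C = np.array([[np.sum((a == i) & (b == j)) for j in np.unique(b)]
                  for i in np.unique(a)])
    pij = C / n
    pi = pij.sum(1, keepdims=True)
    pj = pij.sum(0, keepdims=True)
    nz = pij > 0
    mi = (pij[nz] * np.log(pij[nz] / (pi @ pj)[nz])).sum()
    h = lambda p: -(p[p > 0] * np.log(p[p > 0])).sum()
    return mi / max(np.sqrt(h(pi) * h(pj)), 1e-12)

# toy "coefficient matrix": two well-separated groups of instances
rng = np.random.default_rng(0)
V = np.vstack([rng.normal(0, 0.05, (20, 2)), rng.normal(5, 0.05, (20, 2))])
truth = np.array([0] * 20 + [1] * 20)
pred = kmeans(V, 2)
score = nmi(pred, truth)   # 1.0 for a clustering perfect up to label permutation
```

Because NMI is invariant to label permutation, no Hungarian matching is needed for it, unlike for ACC.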

Robustness investigation
To investigate the robustness of these methods, we add salt & pepper noise with different densities to four of the data sets used in this paper. Note that data with a noise density of 0 are clean data. The experimental results are shown in Figures 5 and 6. The performance of all algorithms decreases when the data are contaminated by noise, and the higher the noise density, the larger the performance degradation. Although the performance of the proposed RCNMF degrades when decomposing noisy data, it still outperforms the other compared algorithms. RCNMF-KL also obtains relatively satisfactory results, but compared with RCNMF its performance decreases more as the noise density increases. MNMFL21 and RSNMF are superior to the remaining algorithms. This is due to the fact that the L2,1 norm-based cost function is robust to noise and outliers.
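A minimal sketch of this setup, under the standard definition of salt & pepper noise (a random fraction of entries forced to the extreme values), together with a small demonstration of why an L2,1 loss is less sensitive to corrupted instances than a squared Frobenius loss. The residual matrix and corruption magnitudes below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def salt_pepper(X, density, rng):
    """Set a random fraction `density` of entries to the minimum (pepper)
    or maximum (salt) value of X, mimicking impulse noise."""
    Xn = X.copy()
    mask = rng.random(X.shape) < density
    salt = rng.random(X.shape) < 0.5
    Xn[mask & salt] = X.max()
    Xn[mask & ~salt] = X.min()
    return Xn

def l21_norm(E):
    """||E||_{2,1}: sum of the L2 norms of the rows of E."""
    return np.sqrt((E ** 2).sum(axis=1)).sum()

rng = np.random.default_rng(0)

# corrupt a handful of instances (rows) of a hypothetical residual matrix
# and compare how much each loss inflates relative to the clean residual
E = rng.normal(0.0, 0.1, (50, 20))       # nominal reconstruction residuals
E_out = E.copy()
E_out[:5] += 5.0                         # five grossly corrupted instances

fro_growth = np.linalg.norm(E_out) ** 2 / np.linalg.norm(E) ** 2
l21_growth = l21_norm(E_out) / l21_norm(E)
# the squared Frobenius loss inflates far more than the L2,1 loss,
# so outliers dominate the former during factorization
```

The squared loss penalizes a corrupted row quadratically in its magnitude, whereas the L2,1 loss penalizes it only linearly, which is the intuition behind the robustness observed here.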

Parameter selection
The RCNMF method has three necessary parameters: the number of nearest neighbors p and two trade-off parameters α and β. The proposed algorithm degenerates into the L2,1-norm NMF (L2,1-NMF) [58] when α = 0 and β = 0, and becomes the L2,1-norm version of CPSNMF when β = 0. Figure 7 demonstrates how the average accuracy of RCNMF varies with the trade-off parameters α and β, respectively. Different parameters have different effects on the performance of the algorithm; when these two parameters are in the range [0.1, 100], RCNMF achieves consistently good performance. As we can see, different values of p also have a great influence on RCNMF. Graph-based methods generally construct a p-nearest-neighbor graph to depict the local structure of the data space. These methods rely on the assumption that nearby instances share the same class label [9,15,[21][22][23]35,55,56]. Obviously, this assumption is likely to fail as p grows, which is why the performance of RCNMF drops with increasing p. In fact, graph-based methods generally suffer from this problem, as reported in [23,35]. In general, the value of p is set to an integer from 3 to 9.
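For concreteness, the p-nearest-neighbor graph that underlies this discussion, and the unnormalized graph Laplacian typically used as the regularizer in graph-regularized NMF variants, can be sketched as follows. This is a generic construction under the usual 0/1-weight, Euclidean-distance assumptions; the paper may use a different weighting scheme.

```python
import numpy as np

def knn_graph(X, p):
    """0/1 adjacency of the symmetrized p-nearest-neighbor graph
    (Euclidean distances between rows of X)."""
    n = len(X)
    D = ((X[:, None] - X[None]) ** 2).sum(-1)
    np.fill_diagonal(D, np.inf)               # exclude self-matches
    W = np.zeros((n, n))
    nn = np.argsort(D, axis=1)[:, :p]
    W[np.repeat(np.arange(n), p), nn.ravel()] = 1.0
    return np.maximum(W, W.T)                 # i~j if either is a neighbor of the other

def laplacian(W):
    """Unnormalized graph Laplacian L = D - W."""
    return np.diag(W.sum(axis=1)) - W
```

As p grows, edges increasingly connect instances from different classes, so the Laplacian regularizer starts pulling dissimilar instances together, consistent with the performance drop described above.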

Convergence study
In Section 4, the convergence of the proposed method was proven and its computational complexity was analyzed. Here, we inspect the convergence speed of the proposed algorithm. Figure 9 shows the convergence curves of RCNMF on two data sets. In each figure, the x-axis represents the iteration number and the y-axis the value of the cost function. We can observe that the updating rules of the proposed RCNMF method converge relatively fast. Generally, our algorithm converges within 300 iterations on these data sets.
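The kind of convergence curve plotted in Figure 9 can be reproduced in miniature with the classical Lee-Seung multiplicative updates for the plain Frobenius NMF objective. Note this is a simplified stand-in: the actual RCNMF update rules additionally involve the L2,1 terms and the constraint-propagation graph, which are not reproduced here.

```python
import numpy as np

def nmf_mu(X, k, n_iter=200, seed=0, eps=1e-10):
    """Lee-Seung multiplicative updates for min ||X - U V^T||_F^2, U, V >= 0.
    Returns the factors and the per-iteration objective values."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    U, V = rng.random((n, k)), rng.random((m, k))
    hist = []
    for _ in range(n_iter):
        U *= (X @ V) / (U @ (V.T @ V) + eps)      # eps avoids division by zero
        V *= (X.T @ U) / (V @ (U.T @ U) + eps)
        hist.append(float(np.linalg.norm(X - U @ V.T) ** 2))
    return U, V, hist

rng = np.random.default_rng(1)
X = rng.random((60, 40))
U, V, hist = nmf_mu(X, 5)
# hist is non-increasing and flattens out within a few hundred iterations,
# matching the qualitative shape of the curves described above
```

Plotting iteration number against the recorded objective values yields exactly the kind of monotone, rapidly flattening curve discussed in this section.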

Conclusions
We have proposed a novel robust semi-supervised nonnegative matrix factorization algorithm, called RCNMF. RCNMF models the local and global structures of data via constraint propagation so that the latent representations of instances from the same class are mapped closer together, while those of instances from different classes are mapped farther apart. The proposed method introduces the L2,1-norm into the objective function and is thus robust to noise and outliers. Furthermore, an L2,1-norm constraint on the factor matrix is added to the loss function as a regularizer, which ensures that the new representation is sparse in rows. Experimental results on six real-world data sets show that the proposed framework is superior to other state-of-the-art algorithms.