Dual Space Latent Representation Learning for Image Representation

Abstract: Semi-supervised non-negative matrix factorization (NMF) has achieved successful results owing to its strong ability to recognize images from only a small quantity of labeled information. However, problems remain to be solved: the interconnection information is not fully explored, and the inevitable mixed noise in the data deteriorates the performance of these methods. To circumvent these problems, we propose a novel semi-supervised method named DLRGNMF. Firstly, the dual latent space is characterized by the affinity matrix to explicitly reflect the interrelationship between data instances and feature variables, which exploits the global interconnection information in dual space and reduces the adverse impacts caused by noise and redundant information. Secondly, we embed the manifold regularization mechanism in the dual graph to steadily retain the local manifold structure of the dual space. Moreover, sparsity and biorthogonal conditions are integrated to constrain the matrix factorization, which greatly improves the algorithm's accuracy and robustness. Lastly, an effective alternating iterative updating method is proposed, and the model is optimized. Empirical evaluation on nine benchmark datasets demonstrates that DLRGNMF is more effective than competitive methods.


Introduction
Dimensionality reduction for image representation is a fundamental task in machine learning. Dimensionality reduction can not only shorten computation time and reduce storage space, but also find out latent and discriminative features by removing the noise and irrelevant features. Over the past decade, many dimensionality reduction methods have been presented such as locally linear embedding (LLE) [1], non-negative matrix factorization (NMF) [2,3], feature selection (FS) [4], and principal component analysis (PCA) [5]. In particular, due to its excellent low-dimensional learning ability, NMF has attracted a lot of attention and been successfully applied to image analysis [6], text classification [7], face recognition [8], and biometrics [9].
NMF expresses high-dimensional data via low-dimensional representations [10]. Since NMF constrains these low-dimensional matrices to be non-negative, the part-of-whole interpretation is guaranteed and is consistent with human perception [11]. However, there are some limitations to NMF. Firstly, NMF ignores local data structure [12], since data with a non-Gaussian distribution do not satisfy the constraint condition that the samples are only restricted to the number of classes in NMF. Secondly, recent studies show that sparse representation can improve the robustness and recognition ability of an algorithm to some extent, whereas NMF is not able to generate sparse representations. Furthermore, NMF is an unsupervised method that fails to exploit partial label information, even though, in the real world, a small amount of label information is easily available. Lastly, NMF assumes that data instances are independently distributed, which means the interconnection of data instances in the real world is ignored. Nonetheless, whether data instances come from homologous or heterogeneous sources, there will be some interconnection between them [13], as well as between feature variables. Each data instance exhibits a certain degree of correlation with others: stronger correlation may occur among samples in the same class, and weaker correlation among samples in different classes. For example, as shown in Figure 1, Figure 1a-c represent three kinds of faces belonging to three different classes in the ORL dataset [14]. At first glance, there exists a common object, i.e., eyeglasses, in Figure 1a,b, whereas there is no common object in Figure 1b,c. Despite these groups of images belonging to different classes, the eyeglasses feature exhibits a moderate correlation within the two categories in Figure 1a,b. However, the correlation in the eyeglasses feature between Figure 1b and Figure 1c is low.
By leveraging NMF to learn these interconnected relationships as prior knowledge and performing matrix factorization, the resulting low-dimensional representation will produce satisfactory outcomes within the constraints of this information. Furthermore, due to the diversity of samples, as in Figure 1a, the eyes are unable to act as a dominant feature for classification because the glasses play an important role to some extent. Hence, the interconnection of the eyes and the glasses can successfully distinguish the faces in Figure 1a,c.
To solve the above problems in NMF, various NMF variants have been proposed. To establish local data structure, i.e., the first limitation of NMF, Huang et al. [15] and Cai et al. [16] presented new NMF variants with a neighborhood graph to enable a compact representation of similar data points in the data space consistent with coefficient vectors. Different from the above two methods, Liu et al. [17] introduced the local coordinate constraint, which keeps the basis vectors close to the samples. To sufficiently exploit the local manifold structure in data space and feature space, Shang et al. [18] constructed a dual graph model by encoding geometric information. Since the sparsity of NMF is insufficient, Hoyer [19] introduced sparsity of the coefficient matrix into NMF to explore parts-based representations and explicitly control the degree of sparseness. On the basis of the method in [19], Meng et al. [20] obtained a sparse basis matrix by imposing the ℓ2,1-norm while considering the local manifold structure in the dual space. To utilize partial label information, some researchers [21][22][23] constructed a stronger constraint and embedded it into NMF to perform semi-supervised NMF. Different from previous semi-supervised NMF algorithms that shared the same coordinates for data points with the same labels, an additional constraint in the new representation was imposed to distribute the samples of the same class on the same axis, which could improve the robustness of clustering [24]. Integrating the merits of dual graph regularization, sparse regression, and semi-supervised learning, Meng et al. [25] imposed biorthogonal constraints to perform semi-supervised non-negative matrix factorization. Thus, a unique basis vector corresponded to each image, which efficiently improved the discrimination ability of clusters and the exclusion between different classes. In fact, the interconnection information in NMF has not been adequately investigated. Recently, latent representation has been applied in the data space [26,27] or feature space [28] due to its good performance in exploiting interconnection information. However, how to employ a fraction of label information to associate a small number of data instances and feature variables in latent representation space remains a challenging problem.
In summary, there are still some problems to be solved in NMF. The NMF-related methods are listed with their characteristics in Table 1 to facilitate comparison of how they overcome the limitations of NMF.
Inspired by latent representation, we propose a novel method named semi-supervised NMF via dual space latent representation learning and dual graph regularization (DLRGNMF) to alleviate the above issues. The main contributions of our work are as follows: (1) The dual latent representation mechanism is embedded into the semi-supervised NMF framework by the affinity matrix to explicitly exploit global interconnection information in dual space, which can reduce the adverse impacts caused by noise. (2) To steadily describe local data structures, the dual graph is introduced into latent representation learning to further investigate the coherent information structure in dual space. (3) To achieve a sparse representation of the matrix factorization, the ℓ2,1-norm is imposed on the basis matrix U in the proposed framework, which can simplify the measurement process and improve the clustering performance.
In conclusion, our proposed method can overcome the limitations of NMF to some extent. In practice, our proposed method can be applied to image analysis, text classification, attributed community detection, face recognition, and recommender systems. This paper is organized as follows: we review the related studies in Section 2; Section 3 illustrates the DLRGNMF algorithm, gives the iterative update rules, and provides a detailed convergence analysis; Section 4 evaluates DLRGNMF in terms of clustering analysis and ablation experiments; lastly, we summarize the conclusion in Section 5.

Related Work
In this section, we briefly review the algorithms related to our work.

CNMF
Studies show that, in semi-supervised algorithms, a small fraction of label information can improve the accuracy of learning [37,38]. As an extension of NMF, CNMF [21] incorporates the label information to improve the discriminating power, attaching label information to NMF as hard constraints. The objective function of CNMF is formulated as follows:

min_{U≥0, A≥0} ‖X − U(ZA)^T‖_F^2,

where Z is a label constraint matrix and A is a label auxiliary matrix. Through the constraint imposed by Z, samples with the same label are required to share the same coordinates in the mapping space. Liu et al. [21] gave proof of the update iteration rules for CNMF and of its convergence. CNMF integrates the merits of NMF and a semi-supervised mechanism to improve its discriminating power.
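For concreteness, the hard label constraint can be materialized as a block matrix: the first l rows are class indicators for the labeled samples, and the remaining rows keep free coordinates for the unlabeled ones. The sketch below follows this construction; the helper name and argument layout are our own.

```python
import numpy as np

def label_constraint_matrix(labels, n, c):
    """Build a CNMF-style label constraint matrix Z of shape (n, c + n - l).

    `labels` holds class indices (0..c-1) for the first l samples; the
    remaining n - l samples are unlabeled. Labeled samples map to their
    class indicator; unlabeled samples keep independent coordinates.
    """
    l = len(labels)
    Z = np.zeros((n, c + n - l))
    for i, y in enumerate(labels):
        Z[i, y] = 1.0            # labeled sample i -> indicator of its class
    Z[l:, c:] = np.eye(n - l)    # unlabeled samples keep free coordinates
    return Z
```

With V = ZA, any two samples carrying the same label are forced onto identical rows of V, which is exactly the hard-constraint effect described above.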

DNMF
Motivated by manifold learning theory, Shang et al. proposed DNMF [18], which constructs neighborhood graphs from the observed data to capture the structure of the data and feature manifolds. The objective function of DNMF is formulated as follows:

min_{U≥0, V≥0} ‖X − UV^T‖_F^2 + λ Tr(V^T L_V V) + µ Tr(U^T L_U U),

where λ and µ are graph regularization parameters. DNMF integrates the virtues of dual graph regularization into the matrix factorization to further enhance its learning ability.

SODNMF
Inspired by CNMF and DNMF, Meng et al. [25] designed SODNMF. The geometric manifold information can be effectively modeled by adding dual graph regularization constraints to NMF. Furthermore, a sparse constraint is also embedded to guarantee sparsity. The objective function of SODNMF is expressed as

min ‖X − P(CA)^T‖_F^2 + α( Tr(P^T L_P P) + Tr(A^T C^T L_S C A) ) + θ‖P‖_{2,1},

where α is the regularization parameter and θ is the sparse constraint parameter. The two terms weighted by α are dual graph regularization terms, and the term controlled by θ is a sparse constraint.

LRLMR
Traditional unsupervised feature selection methods ignore the interrelationship between data instances. For this reason, Tang et al. presented unsupervised feature selection via latent representation learning and manifold regularization (LRLMR) [26], integrating latent representation learning with local manifold information to perform feature selection, which can explore the global interrelationship of the data space by constructing an affinity matrix. LRLMR formulates the objective function as follows:

min_{W, V≥0} ‖X^T W − V‖_F^2 + α‖W‖_{2,1} + β‖S − VV^T‖_F^2 + γ Tr(V^T L V),

where S is the affinity matrix, α is the sparse constraint parameter, β is the latent representation learning parameter, and γ is the manifold regularization parameter. LRLMR treats the latent representation matrix V as a pseudo-label matrix to yield the clustering indicator and further reduce noise.

Latent Representation
Latent representations of data have yielded promising results and attracted considerable attention in machine learning tasks. In practice, the purpose of latent representations is to establish link information, as in DMvNM [31], DSSNMF [32], ADGCF FS [29], symmetric NMF [39], etc. Generally, latent representation constructs an affinity matrix P ∈ R^(n×n) to describe the interconnection information between samples and decomposes P through a symmetric NMF:

min_{V≥0} ‖P − VV^T‖_F^2,

where V ∈ R^(n×k) is the low-dimensional representation obtained by mapping to the new representation space, while k and n denote the number of latent factors and the number of samples, respectively. The affinity matrix P indicates the global interconnection information between samples. However, interconnection information exists not only between data instances but also between features [28]. Figure 2 illustrates Pearson correlation coefficients between data instances and between feature variables of the Soybean dataset to demonstrate the interconnection relationship. It can be clearly observed in Figure 2a,b that the absolute values of the Pearson correlation coefficients are higher than 0.3 in most cases, especially in Figure 2a, which implies that there is strongly correlated interconnection information both between data instances and between feature variables; that is, interconnection information exists in dual space.
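The symmetric factorization above can be sketched with a standard multiplicative update (the β = 1/2 rule of Ding et al. for symmetric NMF); this is a minimal illustration of the technique, not the solver used in the paper.

```python
import numpy as np

def symnmf(P, k, n_iter=300, seed=0, eps=1e-10):
    """Minimize ||P - V V^T||_F^2 over V >= 0 with the multiplicative
    update V <- V * (0.5 + 0.5 * (P V) / (V V^T V))."""
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    V = rng.random((n, k))
    for _ in range(n_iter):
        PV = P @ V
        VVtV = V @ (V.T @ V)
        V *= 0.5 + 0.5 * PV / np.maximum(VVtV, eps)  # stays non-negative
    return V
```

Each row of the returned V is the latent representation of one sample; the update never produces negative entries, so non-negativity is preserved without projection.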

Proposed Method
Inspired by the theory in the literature [28,39], we propose a novel model based on dual space latent representation learning, named DLRGNMF, whose pipeline is visualized in Figure 3.



Dual Space Latent Representation Learning and Dual Graph Regularization
Conventional non-negative matrix factorization algorithms always assume that the data are independently and uniformly distributed by default. However, this kind of distribution is not ideal in actual applications. Since noise is generated by various factors, the data instances derived from homologous or heterogeneous sources are often interdependent [13,40]. Therefore, it is important to exploit the intrinsic data structure and feature structure through link information. Some dimensionality reduction algorithms [26,27,41] implemented this idea by virtue of latent representation learning and achieved excellent results. Inspired by these algorithms, we embed dual space latent representation learning into the semi-supervised learning of NMF by mapping the original data to learn the dual latent space of the matrix factorization. Thus, the interconnection information in the dual space is mined simultaneously to reduce the influence of noise and to improve the robustness of the matrix factorization.
To perform matrix factorization in the learned dual latent space, we construct the objective function of the dual space latent representation learning as follows:

min_{U≥0, A≥0} ‖M − (CA)(CA)^T‖_F^2 + ‖N − UU^T‖_F^2, (6)

where A ∈ R^((n+c−l)×c) is the label auxiliary matrix, U ∈ R^(m×c) is the basis matrix, C ∈ R^(n×(n+c−l)) is the label constraint matrix [21], n is the number of samples, l is the number of labeled data points, and c is the number of categories. The first term learns the latent representation of the data space through an affinity matrix M ∈ R^(n×n), which reflects the inherent information between instances. The unlabeled dataset is defined by X = [x_1, x_2, . . . , x_n] ∈ R^(m×n), where m is the feature dimension of each sample. The affinity matrix M is defined as

M_ij = exp(−‖x_i − x_j‖^2 / (2σ^2)).

The second term in Equation (6) is constructed from an affinity matrix N ∈ R^(m×m) to learn the feature latent space and investigate the inherent information between features. The affinity matrix N is defined as

N_ij = exp(−‖y_i − y_j‖^2 / (2σ^2)),

where x_i, y_i, and σ denote the i-th sample, the i-th feature of the data matrix X, and a Gaussian bandwidth parameter, respectively, with 0 < M_ij ≤ 1 and 0 < N_ij ≤ 1. M_ij and N_ij represent the interrelation between the i-th and j-th data instances and between the i-th and j-th feature vectors, respectively. Therefore, M and N describe the global interconnection information in the dual space, and this information is learned in the manner of symmetric NMF; the corresponding low-dimensional representation can thus be obtained by matrix factorization under the constraints of this information.
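A minimal sketch of the Gaussian affinity construction for both spaces follows; the 2σ² scaling inside the exponent and the random placeholder data are our assumptions.

```python
import numpy as np

def gaussian_affinity(Z, sigma=1.0):
    """Affinity A_ij = exp(-||z_i - z_j||^2 / (2 sigma^2)) over the rows of Z."""
    sq = np.sum(Z ** 2, axis=1)
    # squared pairwise distances, clipped to avoid tiny negative round-off
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

# X is m x n (features x samples), as in the paper:
X = np.random.default_rng(0).random((4, 6))  # m = 4 features, n = 6 samples
M = gaussian_affinity(X.T)   # data-space affinity,    n x n
N = gaussian_affinity(X)     # feature-space affinity, m x m
```

Both matrices are symmetric with unit diagonal and entries in (0, 1], matching the stated bounds 0 < M_ij ≤ 1 and 0 < N_ij ≤ 1.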
With the development of graph theory and spectral theory, graph regularization has been successfully applied to NMF. In Equation (6), latent representation learning maps the raw data into the dual latent space; on this basis, we construct a Laplacian regularization term for the dual graph [18], which is represented as follows:

Tr((CA)^T L_M (CA)) + Tr(U^T L_N U), (9)

where L_M ∈ R^(n×n) is the Laplacian matrix of the data space, L_M = D_M − S_M with D_M satisfying (D_M)_ii = ∑_j (S_M)_ij, and S_M is the similarity matrix constructed from the k-neighborhood graph M_k(x_i) as follows:

(S_M)_ij = 1 if x_i ∈ M_k(x_j) or x_j ∈ M_k(x_i), and (S_M)_ij = 0 otherwise.

Similarly, the Laplacian matrix of the feature space is denoted as L_N ∈ R^(m×m), where L_N = D_N − S_N, D_N satisfies (D_N)_ii = ∑_j (S_N)_ij, and S_N is the similarity matrix constructed as follows:

(S_N)_ij = 1 if y_i ∈ N_k(y_j) or y_j ∈ N_k(y_i), and (S_N)_ij = 0 otherwise,

where N_k(y_i) is a k-neighborhood graph.
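The k-neighborhood similarity graph and its Laplacian L = D − S can be sketched as follows; the binary 0/1 weighting follows the definition above, while the helper name is ours.

```python
import numpy as np

def knn_laplacian(Z, k=2):
    """0/1 k-neighborhood similarity S and Laplacian L = D - S over the rows of Z."""
    n = Z.shape[0]
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)  # pairwise squared dists
    S = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]  # k nearest neighbors, excluding i itself
        S[i, nbrs] = 1.0
    S = np.maximum(S, S.T)  # symmetrize: i ~ j if either is a neighbor of the other
    D = np.diag(S.sum(axis=1))
    return S, D - S
```

Applied to the sample matrix this yields S_M and L_M; applied to the feature matrix (the transposed data) it yields S_N and L_N.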

Objective Function
First, we construct the objective function for the semi-supervised NMF as follows [25]:

min_{U, A} ‖X − U(CA)^T‖_F^2, s.t. U^T U = I, (CA)^T (CA) = I.

We incorporate biorthogonal constraints in the semi-supervised NMF, which provides the ability to discriminate basis and coefficient vectors simultaneously and thus improves clustering performance. Since biorthogonal constraints are strong constraints, a diagonal scaling matrix R can be utilized to avoid generating unreliable solutions. To simplify the computation, the ℓ2,1-norm is imposed on the basis matrix U, and the novel objective function is expressed as follows:

min_{U, R, A} ‖X − UR(CA)^T‖_F^2 + θ‖U‖_{2,1}, s.t. U^T U = I, (CA)^T (CA) = I, (13)

where θ > 0 is the sparse constraint parameter, which controls the sparsity of the model. DLRGNMF integrates dual space latent representation learning and dual graph regularization into a semi-supervised NMF framework. Combining Equations (6), (9), and (13), the final objective function of DLRGNMF becomes

min ‖X − UR(CA)^T‖_F^2 + θ‖U‖_{2,1} + γ( ‖M − (CA)(CA)^T‖_F^2 + ‖N − UU^T‖_F^2 ) + α( Tr((CA)^T L_M (CA)) + Tr(U^T L_N U) ), s.t. U^T U = I, (CA)^T (CA) = I, U, R, A ≥ 0, (14)

where α and γ are the dual graph regularization parameter and the dual space parameter, respectively, with α, γ, θ > 0 balancing the weights of the corresponding terms.

Optimization
In this subsection, the optimization principle of DLRGNMF is illustrated. Since the objective function in Equation (14) is jointly nonconvex in U, R, C, and A, it is challenging to solve directly. However, the objective function is convex with respect to each single variable, so it can be optimized by an alternating iterative method, decomposing the entire optimization problem into four subproblems. In the optimization, the parameter β > 0 is the biorthogonal parameter, which penalizes the biorthogonal terms; the Lagrange function in Equation (15) is constructed accordingly, where Q ∈ R^(m×m) is a diagonal matrix whose i-th diagonal element is

q_ii = 1 / (2‖u^i‖_2), (16)

with u^i the i-th row of U. To avoid overflow, we introduce a small constant ε, and Equation (16) can be represented as

q_ii = 1 / (2 sqrt(‖u^i‖_2^2 + ε)). (17)

To optimize U, R, C, and A, the partial derivatives of the Lagrange function in Equation (15) are computed; then, according to the KKT conditions [42], the iterative update equations for U, R, C, and A are obtained as Equations (22)-(25). With the above analysis, the procedure of DLRGNMF is summarized in Table 2. Table 2. DLRGNMF algorithm steps.
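The smoothed reweighting of the ℓ2,1 term can be sketched directly; the helper name and the example values are ours.

```python
import numpy as np

def l21_weight_matrix(U, eps=1e-12):
    """Diagonal Q with q_ii = 1 / (2 * sqrt(||u_i||^2 + eps)), where u_i is
    the i-th row of U; eps keeps the weight finite for all-zero rows."""
    row_norms_sq = np.sum(U ** 2, axis=1)
    return np.diag(1.0 / (2.0 * np.sqrt(row_norms_sq + eps)))
```

Recomputing Q from the current U at each sweep (step 10 of the algorithm below) is what makes the ℓ2,1 penalty tractable inside multiplicative updates: rows of U with small norm receive large weights and are driven toward zero, producing the desired row sparsity.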

Algorithm 1 The optimization process of DLRGNMF
1: Input: Data matrix X ∈ R^(m×n), class number of samples c, neighborhood size k, balance parameters α, β, θ, and γ, the maximum number of iterations NIter, and the ratio of training samples per.
2: Initialization: Normalize the data matrix X; generate U, R, and A; pick up per% of the original data as the label information to construct matrix C; set iteration time t = 0.
3: Construct the dual space latent representation learning;
4: Construct the dual-graph regularized model;
5: while not converged do
6:     Update U using Equation (22);
7:     Update R using Equation (23);
8:     Update A using Equation (24);
9:     Update C using Equation (25);
10:    Update Q using Equation (17);
11:    Update t by t = t + 1, t ≤ NIter;
12: end while
13: Output: The label constraint matrix C, the basis matrix U, the diagonal scaling matrix R, and the label auxiliary matrix A.

Convergence Analysis
In this section, we prove the convergence of DLRGNMF by demonstrating that under the update rules in Equations (22)-(25), the objective function in Equation (14) is monotonically decreasing.
First, we need to introduce a theorem [43], which theoretically guarantees the convergence of DLRGNMF.
F is a nonincreasing function under the update rule x^(t+1) = arg min_x G(x, x^(t)), where G(x, x') is an auxiliary function of F(x), i.e., G(x, x') ≥ F(x) and G(x, x) = F(x).
To prove that the objective function is monotonic with respect to U, we retain only the terms containing U and denote the resulting function by F(U). We can then obtain the first-order and second-order partial derivatives of F(U) with respect to U, which leads to Lemma 1.
where G(U_ij, U_ij^(t)) is the auxiliary function of F_ij.
Proof. The Taylor series expansion of F_ij(U_ij) is compared with G(U_ij, U_ij^(t)) term by term. Next, we prove that, under the iterative update rule in Equation (22), F_ij is monotonically decreasing.
It can be seen from the updating rule of U that F_ij monotonically decreases under updating Equation (22). The proofs for the updating rules of R, A, and C are similar to that of U; thus, we can obtain updating Equations (23)-(25). Therefore, we can conclude that DLRGNMF is convergent.

Experiments
In this section, the effectiveness of DLRGNMF is demonstrated by comparing it with nine state-of-the-art algorithms on nine public datasets. Using the low-dimensional representations, we apply k-means to obtain clustering results. The experiments were implemented with MATLAB R2018b on a Windows machine with a 3.10 GHz i5-11300H CPU and 16 GB of main memory.

Results on the Synthetic Dataset
To demonstrate the clustering effectiveness and noise robustness of DLRGNMF, we perform clustering experiments on a synthetic dataset and calculate clustering accuracy (ACC) [35] for the compared algorithms. The synthetic dataset includes three categories, each consisting of 300 data instances with seven feature dimensions: the first two dimensions are generated by Gaussian distributions, as shown in Figure 4a, and the remaining five are noise dimensions generated uniformly at random in the range (0, 5). We compare DLRGNMF with CNMF [21], GNMF [16], DNMF [18], DSNMF [20], NMFAN [30], EWRNMF [33], and SODNMF [25], as described in detail in Section 4.2.
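A sketch of how such a synthetic dataset can be generated; the class centers and the Gaussian spread are our assumptions, since the paper does not list them.

```python
import numpy as np

def make_synthetic(seed=0):
    """3 classes x 300 samples: 2 informative Gaussian dimensions plus
    5 uniform-noise dimensions in [0, 5). Class means are assumed."""
    rng = np.random.default_rng(seed)
    means = [(0.0, 0.0), (4.0, 0.0), (2.0, 4.0)]  # assumed class centers
    blocks, labels = [], []
    for c, mu in enumerate(means):
        informative = rng.normal(mu, 0.5, size=(300, 2))  # Gaussian cluster
        noise = rng.uniform(0.0, 5.0, size=(300, 5))      # noise dimensions
        blocks.append(np.hstack([informative, noise]))
        labels += [c] * 300
    return np.vstack(blocks), np.array(labels)
```

Only the first two dimensions carry class structure, so any method that cannot suppress the five noise dimensions will be penalized, which is the point of this experiment.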
Under the same experimental environment, DLRGNMF was first compared with the other algorithms in terms of dimension reduction on the synthetic dataset, and the clustering results are illustrated in Figure 4b-i. Figure 4 demonstrates that DLRGNMF, in Figure 4i, achieves the best clustering result, because the samples of each category are accurately assigned to the corresponding clusters, while the clustering results of the other methods contain errors. The reason may be that, for the comparison algorithms with a graph model, the graph model may not be reliable when the data noise is high. In other words, errors introduced while constructing the neighborhood graph propagate into inaccurate clustering results.
Moreover, the first two dimensions in the synthetic dataset are generated by Gaussian distributions, which implies that there are tighter interconnections between the samples and features in these dimensions than in the others. DLRGNMF investigates the global interconnection information of the data through latent representation learning in dual space to reduce noise interference and further preserve the local manifold structure of the samples. Consequently, the clustering result of DLRGNMF is significantly improved: it tends to alleviate the influence of noisy dimensions, enhance the discriminability of features, and promote clustering precision.

Results on Public Datasets Benchmarks
In this section, the experiments are evaluated in terms of clustering accuracy (ACC) and normalized mutual information (NMI) on nine public datasets [32].
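As a reference for how ACC is computed, the sketch below matches predicted cluster labels to ground-truth classes by searching over label permutations (a simplification of the usual Hungarian matching, workable for a small number of classes); cluster ids are assumed to lie in 0..c−1.

```python
import numpy as np
from itertools import permutations

def clustering_acc(y_true, y_pred):
    """Clustering accuracy: best agreement over all permutations of the
    predicted cluster labels. Assumes both label sets use ids 0..c-1."""
    classes = np.unique(y_true)
    best = 0.0
    for perm in permutations(classes):
        mapping = dict(zip(classes, perm))          # cluster id -> class id
        remapped = np.array([mapping[p] for p in y_pred])
        best = max(best, float(np.mean(remapped == y_true)))
    return best
```

Because k-means assigns arbitrary cluster ids, raw agreement is meaningless without this relabeling step; NMI avoids the issue entirely by measuring mutual information between the two partitions.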


Experimental Settings
DLRGNMF and the other comparison algorithms are clustered by k-means, and the average and standard deviation of the clustering results over 20 runs are taken as the final result. For parameter settings, the neighborhood size k was set to 5, the maximum number of iterations was set to 100, and the ratio of training samples per was set to 0.1. For the other parameters of the comparison algorithms, the ranges were set consistently with the corresponding literature. For DLRGNMF, we tuned the balance parameter α in the range {10^0, 10^1, . . . , 10^6}, γ in the range {10^−3, 10^−2, . . . , 10^2, 10^3}, β in the range {10^−8, 10^−5, 10^−3, 10^−1, 10^0}, θ in the range {10^0, 10^3, 10^8, 10^18, 10^28}, and the Gaussian bandwidth σ to 1.

Performance
The comparative clustering results are presented in Tables 4 and 5, with the best clustering results in bold, and the computation time on the real-world datasets is listed in Table 6. To improve the comprehensibility of Tables 4-6, we conducted a visual analysis of the results, as demonstrated in Figures 5-7. From these tables and figures, DLRGNMF outperforms the compared algorithms in ACC and NMI on all datasets, which fully demonstrates that it can effectively extract features and has excellent low-dimensional learning ability. Detailed conclusions can be drawn as follows.
They outperformed LRLMR, KMM, and PCA on most datasets except for Lung_dis and ORL. The reason is that the graph models are constructed to retain the local manifold structure, which can achieve promising clustering results. (2) CNMF is superior to GNMF on the ORL, Yale32, and Yale64 datasets, which reflects the merits of semi-supervised NMF in improving clustering accuracy with less label information. (3) LRLMR achieves better performance than PCA, KMM, and CNMF on most test datasets since LRLMR exploits interconnection information by latent representation learning, which can further enhance the discriminative of the model. (4) NMFAN and EWRNMF are two adaptive methods that construct an adaptive graph and adaptive weights, respectively. Unsatisfactory results are achieved on most datasets because they fail to fully explore interconnection information and label information. Therefore, it learns more information about interconnection and achieves the best clustering results. (6) In terms of running time, DLRGNMF is a little slower than some compared methods on some datasets because latent representation learning inevitably entails more operations, whereas it outperforms SODNMF on Jaffe50, UMIST, and warpPIE10P datasets due to rapid convergence.     (g) PIE10P (h) Yale32 (i) Yale64

Intuitive Presentation
In Figure 8, the low-dimensional representations learned by DLRGNMF on the nine benchmark datasets are processed with t-SNE to visualize the clustering results in the 2D plane. We can observe that the features extracted by DLRGNMF for COIL20, JAFFE50, Soybean, and warpPIE10P yield samples whose intraclass and interclass distances differ significantly, which implies clear clustering boundaries and higher ACC values. However, the number of features in ORL, Yale32, and Yale64 is ten times the number of samples per class; thus, the low-dimensional representations may have fuzzy boundaries between some classes, which leads to inferior clustering results compared with the other datasets. Overall, the low-dimensional representations extracted by DLRGNMF clearly indicate the spatial differences between classes, enhancing the discrimination between samples of different classes.
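The 2D visualization step can be sketched with scikit-learn's t-SNE. The perplexity value here is an illustrative assumption (the paper does not report its t-SNE settings); `H` is the learned low-dimensional representation and `y` its class labels.

```python
from sklearn.manifold import TSNE

def embed_2d(H, seed=0):
    """Project a learned low-dimensional representation H to 2D with t-SNE."""
    tsne = TSNE(n_components=2, init="pca", perplexity=15, random_state=seed)
    return tsne.fit_transform(H)

# Usage: scatter the embedding colored by class label, e.g.
#   import matplotlib.pyplot as plt
#   Z = embed_2d(H)
#   plt.scatter(Z[:, 0], Z[:, 1], c=y, s=8, cmap="tab10")
```

Note that perplexity must be smaller than the number of samples, so it may need lowering for very small datasets.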

Ablation Study
To investigate the influence of each item on the clustering performance of DLRGNMF, ablation experiments were conducted on nine datasets, and the comparison results are presented in Figure 9. DLRGNMF without dual graph regularization term is denoted as DLRG-1, the model without sparse constraints is denoted as DLRG-2, and the model without latent representation learning is denoted as DLRG-3.

As shown in Figure 9, DLRGNMF achieves the best performance. DLRG-2 performs nearly as well as DLRGNMF, while DLRG-1 achieves the worst performance. This demonstrates that the dual graph regularization term plays the most important role in the model, because it preserves both the local manifold structure and the global inherent information, complementing latent representation learning. By comparison, removing latent representation learning degrades the performance of DLRG-3 more than the performance of DLRG-2 is degraded by removing the sparse constraints. Consequently, the graph regularization term makes the most significant contribution to the clustering performance, while latent representation learning also contributes a degree of improvement.

To validate the sparsity, we calculated the sparseness of the basis vectors learned by DLRGNMF and DLRG-2 on the nine datasets according to Equation (27) from [49]; the experimental results are shown in Figure 10. Generally, a higher sparseness value indicates better sparsity. The sparseness of DLRGNMF is higher than that of DLRG-2 on all test datasets, which implies that DLRGNMF yields a better sparse representation.
Figure 11 shows the convergence curves of DLRGNMF on the nine datasets. The curves converge efficiently and stably on all datasets, requiring fewer than five iterations on most of them, which further demonstrates that DLRGNMF is convergent and verifies Section 3.4.
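Equation (27) of [49] is not reproduced in this section; assuming it is the widely used Hoyer sparseness measure (an assumption to be checked against the cited reference), the computation could be sketched as follows.

```python
import numpy as np

def hoyer_sparseness(x):
    """Hoyer's sparseness in [0, 1]: 0 for a uniform vector, 1 for a one-hot
    vector, via (sqrt(n) - ||x||_1 / ||x||_2) / (sqrt(n) - 1)."""
    x = np.asarray(x, dtype=float).ravel()
    n = x.size
    l1, l2 = np.abs(x).sum(), np.sqrt((x ** 2).sum())
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1)

def basis_sparseness(W):
    """Average sparseness over the columns (basis vectors) of W."""
    return float(np.mean([hoyer_sparseness(W[:, j]) for j in range(W.shape[1])]))
```

Averaging over the columns of the basis matrix gives a single scalar per method and dataset, which matches the per-dataset comparison shown in Figure 10.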

Parameter Sensitivity Experiment
From the ablation experiments, it can be concluded that the dual graph regularization term and the latent representation learning term in dual space contribute significantly to the clustering performance. Hence, we conducted sensitivity experiments on the two corresponding parameters, α and γ. The search ranges of α and γ were set as {10^0, 10^1, …, 10^6} and {10^−3, 10^−2, …, 10^2, 10^3}, respectively, while the other parameters were fixed at β = 10^−3 and θ = 10^3. Figures 12 and 13 illustrate the ACC and NMI under different combinations of α and γ. These figures show that ACC is positively correlated with NMI, and that the clustering performance increases with α on the COIL20, JAFFE50, ORL, UMIST, warpPIE10P, Yale32, and Yale64 datasets. Hence, α should not be set too small; otherwise, poor clustering results tend to be obtained. The suitable ranges of α and γ are [10^3, 10^6] and [10^1, 10^3], respectively.
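The sensitivity scan amounts to evaluating every (α, γ) pair on a log-spaced grid with β and θ fixed. The sketch below uses a hypothetical `evaluate_fn` hook standing in for "train DLRGNMF with these weights and return ACC"; the grid values are the ones stated above.

```python
import numpy as np

ALPHAS = [10.0 ** p for p in range(0, 7)]   # {10^0, ..., 10^6}
GAMMAS = [10.0 ** p for p in range(-3, 4)]  # {10^-3, ..., 10^3}

def sensitivity_grid(evaluate_fn):
    """Fill an ACC grid over all (alpha, gamma) pairs.

    evaluate_fn(alpha, gamma) -> ACC is a hypothetical hook that would
    train the model with the given weights (beta, theta held fixed) and
    return the clustering accuracy; here it is supplied by the caller.
    """
    grid = np.zeros((len(ALPHAS), len(GAMMAS)))
    for i, a in enumerate(ALPHAS):
        for j, g in enumerate(GAMMAS):
            grid[i, j] = evaluate_fn(a, g)
    return grid
```

The resulting 7 × 7 grid is what a 3D bar or surface plot such as Figures 12 and 13 would visualize directly.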

Noise Test
To further verify the robustness of DLRGNMF, a noise test was conducted. Blocks of three sizes, 8 × 8, 12 × 12, and 16 × 16, were chosen from images in the Yale32 and JAFFE50 datasets as noise. These blocks were superimposed at random positions on the original images, as shown in Figure 14b-d and Figure 14f-h, and the comparison results on the noisy Yale32 and JAFFE50 datasets are shown in Tables 7 and 8, where the best results are in bold. It can be observed that DLRGNMF achieves better clustering performance than the compared methods under the different noise conditions. Consequently, the features learned by DLRGNMF through latent representation learning are more robust to noise.
Table 7. ACC and NMI (mean ± SD %) on the noisy Yale32 dataset.
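The occlusion-noise procedure can be sketched as below. This is a simplified stand-in: it fills the block with random values rather than with a patch copied from another image, and the placement rule is an assumption.

```python
import numpy as np

def add_block_noise(img, block=8, seed=0):
    """Overwrite one randomly placed block x block patch with random noise."""
    rng = np.random.default_rng(seed)
    noisy = img.copy()
    h, w = img.shape
    r = rng.integers(0, h - block + 1)  # top-left corner, kept inside image
    c = rng.integers(0, w - block + 1)
    noisy[r:r + block, c:c + block] = rng.uniform(img.min(), img.max(),
                                                  size=(block, block))
    return noisy

# For 32 x 32 Yale32 images: add_block_noise(img, block=8), block=12, block=16.
```

Applying this to every image before factorization reproduces the corrupted variants of the datasets on which Tables 7 and 8 are reported.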

Conclusions
In this paper, a novel DLRGNMF algorithm was proposed. Because data instances in the real world are interrelated, dual space latent representation learning is embedded into semi-supervised NMF, which fully exploits the global interconnection information of dual space by constructing a dual latent space. In the mapped dual latent space, the local manifold information of the raw data in dual space is retained by dual graph regularization. Through latent representation learning and manifold regularization, DLRGNMF reduces redundant information and further improves the low-dimensional learning ability of matrix factorization. Extensive experiments on different datasets verified its superiority and noise reduction capability.
A limitation of DLRGNMF is that several parameters need to be tuned. In future work, it is desirable to incorporate a parameter-free regularization term to constrain NMF, which would significantly reduce the time cost of the model. Meanwhile, we would like to explore an efficient optimization method that optimizes all variables simultaneously.
Author Contributions: Y.H., software, data curation, and writing-original draft preparation; Z.M., conceptualization, methodology, writing-review and editing, and validation; H.L., visualization and investigation; J.W., supervision and writing-review and editing. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement:
The data and code that support the findings of this study are available from the corresponding author (Z.M) upon reasonable request.

Conflicts of Interest:
The authors declare no conflict of interest.