Discriminative Label Relaxed Regression with Adaptive Graph Learning

The traditional label relaxation regression (LRR) algorithm fits the original data directly and ignores the local structure of the data. Graph-regularized label relaxation regression does take local geometric information into account, but its performance depends heavily on how the graph is constructed. Traditional graph construction has two defects. First, it is strongly influenced by the choice of parameter values. Second, the weight matrix is built from the original data, which usually contain a lot of noise. As a result, the constructed graph is often not optimal, which affects the subsequent work. Therefore, a discriminative label relaxation regression algorithm based on adaptive graph (DLRR_AG) is proposed for feature extraction. DLRR_AG combines manifold learning with label relaxation regression by constructing an adaptive weight graph, which helps overcome the problem of label overfitting. Extensive experiments show that the proposed method is effective and feasible.


Introduction
Information technology has developed rapidly and has become a hot topic in recent years. A great deal of information can be obtained from data, but the dimension of the data keeps growing [1]. On the one hand, a higher data dimension makes the description of data samples more comprehensive and provides more bases for further analysis and processing [2]. On the other hand, a larger number of features brings more redundant features, which not only places heavy demands on the hardware and software used for data processing but also directly affects the reliability and effectiveness of the analysis and processing results [3]. To analyze and process the data effectively, the original data are mapped (or transformed) to a low-dimensional space so that the features that best reflect the intrinsic nature of the sample data can be obtained. This process is called feature extraction or data dimensionality reduction [4]. The core task of feature extraction or data dimensionality reduction is to find the features that are most effective for the final data analysis and processing and to retain as much useful information as possible during the data transformation. Feature extraction is a key problem in pattern recognition that directly affects the design and performance of the classifier [5][6][7]. It not only reduces the dimension of the data but also retains the useful information in the data, and it is widely used in the processing and analysis of complex data. In recent years, feature extraction methods based on manifold learning have made remarkable achievements in nonlinear data analysis and have been widely used in nonlinear data processing and analysis.
According to current manifold learning methods, the nonlinear characteristics and manifold structures of the sample data usually lie in a low-dimensional space [8][9][10][11], and traditional nonlinear subspace methods have difficulty describing and extracting this information effectively [12][13][14]. Manifold learning methods obtain an embedding that maps the high-dimensional feature manifold of the samples to its low-dimensional manifold structure and then extract the nonlinear information in the sample data [15][16][17][18].
The discriminative Fisher embedding dictionary learning (DFEDL) algorithm simultaneously establishes Fisher embedding models on the learned atoms and coefficients [19,20]. In 2000, Tenenbaum from Stanford University published the first manifold learning method based on isometric mapping (ISOMAP) [21] in Science. ISOMAP is a globally optimized method that preserves the manifold structure of the data by maintaining the geodesic distances between samples. Roweis proposed the locally linear embedding (LLE) method [22]. By constructing reconstruction weights between each sample and its neighboring samples, this method better preserves the manifold features among neighboring samples when embedding into a low-dimensional space [23][24][25][26]. Building on the basic research of Roweis and Tenenbaum on manifold learning, researchers proposed further feature extraction methods, including Laplacian eigenmap (LE) [27], local learning projection (LLP) [28], collaborative representation-based classification (CRC) [29], and linear regression classification (LRC) [30].
Because of its validity in data analysis and the completeness of its statistical theory, least-squares regression is widely used as a basic tool in many machine learning problems, including discriminant analysis, clustering, multiview learning, multilabel classification, and semisupervised learning. Sun et al. [31] proposed a least-squares regression model based on generalized eigenvalue decomposition, and Suzanna et al. [32] proposed a weighted least-squares regression model. When the multilabel classification problem is solved with least-squares regression, data points that belong to different categories should be pushed farther apart. For example, to increase the distance between data points of different types, Leski [33] proposed a quadratic approximation least-squares regression model based on misclassification error. However, this model only considers two-class problems. In multilabel classification, as well as in multilabel feature selection, the distances among data points from different classes are also expected to be as large as possible [34,35]. Therefore, multiple least-squares regression models of the kind proposed by Leski [33] can be used simultaneously, but the time cost of the algorithm is then relatively high. The traditional label relaxation regression algorithm fits the original data directly without considering the local structure information of the data [36][37][38]. The graph-regularized label relaxation regression algorithm considers local geometric information, but its performance largely depends on graph construction [39,40]. Traditional graph construction has two defects: first, it is strongly influenced by the parameter values; second, the weight matrix is built from the original data, which usually contain a lot of noise [41].
This often makes the constructed graph suboptimal, which affects the subsequent work. To guarantee the global optima of the latent representation and the graphs of all views, we integrate graph completion and common representation learning into a joint optimization framework [42][43][44]. Therefore, we propose to combine manifold learning with label relaxation regression and to construct the weight graph adaptively. At the same time, local discriminant information is added on top of the original linear discriminant analysis so that the learned projection captures local discriminant information and becomes more discriminative. A label relaxation regression algorithm based on adaptive graph is proposed for image classification and feature extraction. The main innovations of this paper are as follows:
(1) The adaptive graph construction captures the local structure information of the data.
(2) Overfitting is avoided by introducing the adaptive graph into the objective function of label relaxation regression.
(3) To take full advantage of discriminant information, the global discriminant information based on LDA is considered.
The rest of this paper is arranged as follows: Section 2 briefly reviews DLSR and the structured optimal graph. In Section 3, discriminative label relaxed regression with adaptive graph learning (DLRR_AG) is described in detail, and the convergence of the proposed algorithm is proved. Section 4 provides extensive experiments to verify the performance of the proposed algorithm. The last section gives the conclusion.
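Discriminative Least-Squares Regression (DLSR).
Suppose $N$ training samples $\{(x_i, y_i)\}_{i=1}^{N}$ are given that fall into $c$ ($c \ge 2$) classes, where $y_i \in \{1, 2, \ldots, c\}$ is the class label of $x_i$ and $x_i$ is a data point in $\mathbb{R}^m$. Let $X = [x_1, x_2, \ldots, x_N]^T \in \mathbb{R}^{N \times m}$ collect the samples and let $Y \in \mathbb{R}^{N \times c}$ be the class indicator matrix. The following linear equation is expected to hold:

$$XW + e_N t^T \approx Y, \quad (1)$$

where $W \in \mathbb{R}^{m \times c}$ is a transformation matrix, $t$ is a translation vector in $\mathbb{R}^c$, and $e_N = [1, 1, \ldots, 1]^T \in \mathbb{R}^N$ is a vector of all ones. Each column of $Y$ defines a binary regression problem in which the target for the $j$th class is "+1" and the target for the rest is "0." These binary outputs can be dragged far apart along two opposite directions. That is, with a positive slack variable $\varepsilon$, we hope the output will become $1 + \varepsilon$ for a sample grouped into "1" and $-\varepsilon$ for a sample grouped into "0." This treatment enlarges the distance between data points of different classes after mapping.

Computational Intelligence and Neuroscience

Let $B \in \mathbb{R}^{N \times c}$ be a constant matrix whose element in the $i$th row and $j$th column is defined as

$$B_{ij} = \begin{cases} +1, & \text{if } y_i = j, \\ -1, & \text{otherwise.} \end{cases} \quad (2)$$

Each element in $B$ corresponds to the direction of the drag. Performing the above $\varepsilon$-dragging on each element of $Y$, the matrix $M \in \mathbb{R}^{N \times c}$ records these $\{\varepsilon\}$. The residual can be obtained as follows:

$$XW + e_N t^T - (Y + B \otimes M) \approx 0, \quad (3)$$

where $\otimes$ is the Hadamard matrix product operator. The objective function of DLSR can be obtained as follows:

$$\min_{W, t, M} \left\| XW + e_N t^T - Y - B \otimes M \right\|_F^2 + \lambda \|W\|_F^2 \quad \text{s.t. } M \ge 0, \quad (4)$$

where $\lambda$ is a positive regularization parameter. By solving the optimization problem of equation (4), the optimal $W$ and $t$ can be obtained:

$$W = \left( X^T H X + \lambda I_m \right)^{-1} X^T H (Y + B \otimes M), \qquad t = \frac{(Y + B \otimes M)^T e_N - W^T X^T e_N}{N}, \quad (5)$$

where $I_m$ is an $m \times m$ identity matrix and $H = I_N - \frac{1}{N} e_N e_N^T$.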
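The alternating scheme reviewed above can be sketched in a few lines of NumPy. This is a minimal illustration that assumes the standard DLSR formulation (closed-form update of W and t, followed by the element-wise update of M); the function name `dlsr_fit`, the zero-based labels, the default values of `lam` and `n_iter`, and the toy data are all illustrative assumptions, not taken from the paper.

```python
import numpy as np

def dlsr_fit(X, y, n_classes, lam=0.1, n_iter=20):
    """Alternate between the closed-form (W, t) update and the M update.

    X: (N, m) array whose rows are samples.
    y: (N,) integer labels in 0..n_classes-1 (zero-based here, unlike the
       1..c convention used in the text).
    """
    N, m = X.shape
    Y = np.zeros((N, n_classes))
    Y[np.arange(N), y] = 1.0                 # class indicator matrix
    B = np.where(Y == 1.0, 1.0, -1.0)        # dragging directions (+1 / -1)
    M = np.zeros((N, n_classes))             # nonnegative slack matrix
    e = np.ones(N)
    H = np.eye(N) - np.outer(e, e) / N       # centering matrix
    # (X^T H X + lam I)^{-1} X^T H does not depend on M, so compute it once.
    A = np.linalg.solve(X.T @ H @ X + lam * np.eye(m), X.T @ H)
    for _ in range(n_iter):
        R = Y + B * M                        # relaxed regression targets
        W = A @ R
        t = (R.T @ e - W.T @ X.T @ e) / N
        P = X @ W + t - Y                    # t is broadcast over the rows
        M = np.maximum(B * P, 0.0)           # element-wise slack update
    return W, t, M

# Hypothetical toy data: two well-separated groups in the plane.
X_toy = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [5., 6.], [6., 5.]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
W, t, M = dlsr_fit(X_toy, y_toy, n_classes=2)
pred = np.argmax(X_toy @ W + t, axis=1)
```

A new sample would be assigned to the class with the largest regression output, as in the last line of the sketch.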

Structured Optimal Graph.
It is well known that data in a high-dimensional space are usually embedded in a low-dimensional manifold [40]. Preserving local manifold structure information is a key success factor for graph-based methods. The local manifold structure is captured by the similarity matrix, which determines the ultimate performance of graph-based methods.
Suppose that there are $N$ training samples from $c$ classes, collected in the sample set $X$, where $N$ is the number of samples used for classifier training (all $N$ samples are also used to determine $S$), $m$ is the dimension of the observed data, and $x_i \in \mathbb{R}^{m \times 1}$ ($i = 1, 2, \ldots, N$) is the $i$th sample of $X$. Any sample $x_i$ can be connected to every other sample $x_j$ with probability $s_{ij}$, where $s_{ij}$ is an element of the similarity matrix $S \in \mathbb{R}^{N \times N}$. The closer two samples are, the greater their connection probability (or similarity), and vice versa. Therefore, the similarity $s_{ij}$ between $x_i$ and $x_j$ decreases as their distance grows. The similarity $s_{ij}$ can be obtained by solving the following problem:

$$\min_{s_i} \sum_{j=1}^{N} \left( \|x_i - x_j\|_2^2 \, s_{ij} + \alpha s_{ij}^2 \right) \quad \text{s.t. } s_i^T \mathbf{1} = 1, \ 0 \le s_{ij} \le 1, \quad (6)$$

where $s_i \in \mathbb{R}^{N \times 1}$ is a vector whose $j$th element is $s_{ij}$ in the similarity matrix $S$ and $\alpha$ is a regularization parameter. The second term in equation (6) mainly serves to avoid trivial solutions. Ideally, the learned graph partitions the samples into exactly $c$ groups; that is, the similarity matrix $S$ has exactly $c$ connected components. In fact, the similarity matrix $S$ obtained from equation (6) does not meet this requirement in most cases. The problem can be solved as follows.
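Spectral analysis provides the following important identity:

$$\sum_{i,j=1}^{N} \|f_i - f_j\|_2^2 \, s_{ij} = 2 \, \mathrm{Tr}\left(F^T L_S F\right), \quad (7)$$

where $F \in \mathbb{R}^{N \times c}$ has $f_i$ as its $i$th row, $L_S = D - (S^T + S)/2$ is the Laplacian matrix, and $D$ is a diagonal matrix whose $i$th diagonal entry is $\sum_j (s_{ij} + s_{ji})/2$. If the rank of the Laplacian matrix $L_S$ equals $N - c$, namely, $\mathrm{rank}(L_S) = N - c$, the obtained similarity matrix $S$ contains exactly $c$ connected components [45]. Adding this constraint to equation (6), the problem is written as

$$\min_{S} \sum_{i,j=1}^{N} \left( \|x_i - x_j\|_2^2 \, s_{ij} + \alpha s_{ij}^2 \right) \quad \text{s.t. } \forall i, \ s_i^T \mathbf{1} = 1, \ 0 \le s_{ij} \le 1, \ \mathrm{rank}(L_S) = N - c. \quad (8)$$

To solve equation (8), let $\sigma_i(L_S)$ denote the $i$th smallest eigenvalue of the Laplacian matrix $L_S$. It is well known that the eigenvalues of a positive semidefinite matrix are nonnegative. The Laplacian matrix $L_S$ is positive semidefinite, so $\sigma_i(L_S) \ge 0$ [45]. The rank constraint is therefore equivalent to $\sum_{i=1}^{c} \sigma_i(L_S) = 0$, which can be enforced by penalizing this sum with a sufficiently large $\lambda$. According to Ky Fan's Theorem [46], we have

$$\sum_{i=1}^{c} \sigma_i(L_S) = \min_{F \in \mathbb{R}^{N \times c}, \ F^T F = I} \mathrm{Tr}\left(F^T L_S F\right). \quad (9)$$

Therefore, based on equation (9), equation (8) can be rewritten as

$$\min_{S, F} \sum_{i,j=1}^{N} \left( \|x_i - x_j\|_2^2 \, s_{ij} + \alpha s_{ij}^2 \right) + 2 \lambda \, \mathrm{Tr}\left(F^T L_S F\right) \quad \text{s.t. } \forall i, \ s_i^T \mathbf{1} = 1, \ 0 \le s_{ij} \le 1, \ F \in \mathbb{R}^{N \times c}, \ F^T F = I. \quad (10)$$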
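The two facts used above (the trace identity for the symmetrized Laplacian and the link between zero eigenvalues of $L_S$ and connected components) can be checked numerically. The sketch below is illustrative only; the helper names and the small block-structured similarity matrix are assumptions made for the example.

```python
import numpy as np

def graph_laplacian(S):
    """Laplacian L_S = D - (S^T + S) / 2 of a (possibly asymmetric) similarity matrix."""
    S_sym = (S + S.T) / 2.0
    degree = S_sym.sum(axis=1)
    return np.diag(degree) - S_sym

def count_connected_components(S, tol=1e-8):
    """Number of (numerically) zero eigenvalues of L_S, i.e. N - rank(L_S)."""
    eigenvalues = np.linalg.eigvalsh(graph_laplacian(S))
    return int(np.sum(eigenvalues < tol))

def smoothness(S, F):
    """Left-hand side of the identity: sum_ij ||f_i - f_j||^2 * s_ij."""
    diffs = F[:, None, :] - F[None, :, :]
    squared_distances = np.sum(diffs ** 2, axis=2)
    return float(np.sum(squared_distances * S))

# Hypothetical graph with two disconnected pairs of samples.
S_blocks = np.array([[0., 1., 0., 0.],
                     [1., 0., 0., 0.],
                     [0., 0., 0., 1.],
                     [0., 0., 1., 0.]])
n_components = count_connected_components(S_blocks)
rank_L = np.linalg.matrix_rank(graph_laplacian(S_blocks))
```

Here the graph has N = 4 nodes and two components, so the Laplacian has rank N - 2.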

Discriminative Label Relaxation Regression Algorithm Based on Adaptive Graph (DLRR_AG)
In this section, the motivation of DLRR_AG is first introduced. Then, the optimal solution of DLRR_AG is given.

The Motivation of DLRR_AG.
Traditional label relaxation regression algorithms fit the original data directly, which often results in overfitting. In general, to overcome overfitting, a regularization term is added to the objective function, and a regularization factor balances the objective and the regularization term. In addition, preserving local manifold structure information plays a very important role in improving classification or clustering. There are two classical nearest-neighbor graphs: the k-nearest-neighbor graph and the ε-nearest-neighbor graph. A large share of graph-based algorithms rely on these two methods to preserve local manifold structure information. However, two problems often occur when they are used. First, the performance of graph learning is strongly affected by the parameter k or ε: different parameter values can give very different results, and the optimal value is not easy to determine, so many time-consuming experiments are needed to find it. Second, the traditional graph construction method requires two separate steps: the weight matrix of the neighborhood graph is constructed first, and then the relaxation regression is carried out. Once the weight matrix has been generated from the raw observed data, it no longer changes, so there is no flexibility. Such a precomputed weight matrix is impractical in real applications because the original observed data often contain many errors, which destroy the local manifold structure. To solve this problem, we propose a relaxation regression algorithm with an adaptive graph. The objective function of the algorithm is given in equation (11). To take advantage of intraclass discriminative information, discriminant information based on LDA is introduced into the objective function.
Equation (11) can then be rewritten as equation (12), where $S_w$ and $S_b$ are the within-class scatter matrix and the between-class scatter matrix, respectively. Equation (12) is not jointly convex in all of its variables, so it is difficult to obtain the global optimal solution. Therefore, a local optimal solution is sought through iteration. Because the objective function contains four different variables ($S$, $F$, $M$, and $W$), equation (12) cannot be solved directly. We propose an iterative algorithm that updates one variable at a time while the others are fixed.
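For readers who want to see the two LDA quantities concretely, the following sketch computes the within-class and between-class scatter matrices under their common unnormalized definitions. This is an assumption: the exact definitions and normalization used in equation (12) are not recoverable from the text, and the function name and toy data are illustrative.

```python
import numpy as np

def scatter_matrices(X, y):
    """Within-class (S_w) and between-class (S_b) scatter of the rows of X.

    Assumes the usual unnormalized LDA definitions:
      S_w = sum_c sum_{i in c} (x_i - mu_c)(x_i - mu_c)^T
      S_b = sum_c N_c (mu_c - mu)(mu_c - mu)^T
    """
    n_features = X.shape[1]
    overall_mean = X.mean(axis=0)
    S_w = np.zeros((n_features, n_features))
    S_b = np.zeros((n_features, n_features))
    for label in np.unique(y):
        X_c = X[y == label]
        class_mean = X_c.mean(axis=0)
        centered = X_c - class_mean
        S_w += centered.T @ centered
        diff = (class_mean - overall_mean)[:, None]
        S_b += X_c.shape[0] * (diff @ diff.T)
    return S_w, S_b

# Hypothetical data: 12 samples, 3 features, 3 classes.
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((12, 3))
y_demo = np.repeat(np.arange(3), 4)
S_w, S_b = scatter_matrices(X_demo, y_demo)
```

With these definitions, the two matrices add up to the total scatter of the data around its overall mean, which is a convenient sanity check.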

Optimization of DLRR_AG.
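When $W$, $M$, and $F$ are fixed, updating $S$ reduces equation (12) to the subproblem in equation (13). According to equation (7), equation (13) can be rewritten as equation (14). Because the similarity vector of each data point is independent of the others, the problem can be solved separately for each sample, which gives equation (15).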

Let $m_{ij} = \|W^T x_i - W^T x_j\|_2^2$, $n_{ij} = \|f_i - f_j\|_2^2$, and $d_{ij} = m_{ij} + \lambda n_{ij}$; equation (15) can then be rewritten as equation (16).

The optimal solution of $F$ is formed by the eigenvectors of $L_S$ that correspond to its $c$ smallest eigenvalues.
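The two updates in this step can be illustrated as follows. The sketch assumes that the per-sample problem has the usual adaptive-neighbors form, minimizing $\sum_j (d_{ij} s_{ij} + \alpha s_{ij}^2)$ over the probability simplex, whose solution is the Euclidean projection of $-d_i / (2\alpha)$ onto the simplex. That form, the exclusion of self-similarity, and all function names are assumptions, since the exact equations (15) and (16) are not recoverable from the text.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {s : s >= 0, sum(s) = 1} (sorting method)."""
    n = v.size
    u = np.sort(v)[::-1]                      # entries in descending order
    shifted = np.cumsum(u) - 1.0
    k = np.arange(1, n + 1)
    active = u - shifted / k > 0              # always true for the first entry
    rho = k[active][-1]
    theta = shifted[rho - 1] / rho
    return np.maximum(v - theta, 0.0)

def update_similarity(D, alpha):
    """Row-wise update of S from distances D, with s_ii fixed to 0 (assumed)."""
    N = D.shape[0]
    S = np.zeros((N, N))
    for i in range(N):
        others = np.arange(N) != i
        S[i, others] = project_to_simplex(-D[i, others] / (2.0 * alpha))
    return S

def update_F(L_S, c):
    """Eigenvectors of L_S for its c smallest eigenvalues (eigh sorts ascending)."""
    _, eigenvectors = np.linalg.eigh(L_S)
    return eigenvectors[:, :c]

# Hypothetical one-dimensional samples: two close points and one far away.
x = np.array([0.0, 1.0, 10.0])
D_demo = (x[:, None] - x[None, :]) ** 2
S_demo = update_similarity(D_demo, alpha=1.0)
S_sym = (S_demo + S_demo.T) / 2.0
L_demo = np.diag(S_sym.sum(axis=1)) - S_sym
F_demo = update_F(L_demo, c=2)
```

A larger `alpha` spreads each row of S over more neighbors, while a small `alpha` concentrates the weight on the nearest sample.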

Fixing S, W, and F to Update M.
When all variables except $M$ are fixed, equation (12) can be transformed into equation (18). Let us now consider the optimization in terms of $M \in \mathbb{R}^{N \times c}$. Given $W$ and $t$, let $P = XW - Y$ record the regression error of the $N$ data points; the optimization problem then becomes

$$\min_{M} \|P - B \otimes M\|_F^2 \quad \text{s.t. } M \ge 0. \quad (19)$$

Because the squared Frobenius norm of a matrix decouples element by element, equation (19) can be decoupled equivalently into $N \times c$ subproblems. For the element $M_{ij}$ in the $i$th row and $j$th column of the matrix, we have

$$\min_{M_{ij}} \left( P_{ij} - B_{ij} M_{ij} \right)^2 \quad \text{s.t. } M_{ij} \ge 0, \quad (20)$$

where $P_{ij}$ and $B_{ij}$ are the elements in the $i$th row and $j$th column of $P$ and $B$, respectively.
Since $B_{ij}^2 = 1$, the optimization problem of equation (20) can be rewritten as

$$\min_{M_{ij}} \left( B_{ij} P_{ij} - M_{ij} \right)^2 \quad \text{s.t. } M_{ij} \ge 0. \quad (21)$$

Obviously, the optimal solution of equation (21) is

$$M_{ij} = \max\left( B_{ij} P_{ij}, 0 \right). \quad (22)$$

With the other variables fixed, the subproblem of equation (12) in $M$ is a constrained convex optimization problem. According to the properties of convex optimization, its local optimal solution is also its global optimal solution. Next, the iterative method is used to solve equation (12).
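The element-wise rule for $M$ described above takes one line of NumPy. The sketch assumes, as in the DLSR ε-dragging technique, that every entry of B is +1 or -1; the function name and the small matrices are illustrative.

```python
import numpy as np

def update_M(P, B):
    """Closed-form slack update: M_ij = max(B_ij * P_ij, 0).

    P is the regression error matrix and B holds the dragging
    directions (+1 for the true class, -1 otherwise).
    """
    return np.maximum(B * P, 0.0)

# Hypothetical 2-sample, 2-class example.
P_demo = np.array([[0.3, -0.2],
                   [0.1, -0.4]])
B_demo = np.array([[1.0, -1.0],
                   [-1.0, 1.0]])
M_demo = update_M(P_demo, B_demo)
```

An entry of M is positive only when the regression output has already moved past its target in the dragging direction; otherwise the slack stays at zero.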

Fixing S, M, and F to Update W.
Given $M$, equation (12) is an unconstrained convex optimization problem in $W$. Taking the derivative with respect to $W$ and setting it to zero yields the optimal $W$.
When all variables except $W$ are fixed, equation (12) can be transformed into equation (23). Let $N = Y + B \otimes M$; then the problem in equation (23) can be transformed into equation (24), where $L = D - S$ is the graph Laplacian and $D$ is a diagonal matrix whose diagonal elements are $D_{ii} = \sum_j S_{ij}$.

Experiments
In this section, we perform extensive experiments on six public image databases to validate the proposed DLRR_AG algorithm. The results indicate that DLRR_AG achieves high classification accuracy. For comparison, DLRR_AG is evaluated against the following typical feature extraction methods:
(i) Collaborative representation-based classification (CRC) [29]: CRC combines machine learning and compressed sensing and shows good classification performance on face image data. Considering the training samples from a specific class and the query set as two linear subspaces, the classwise prototypes most correlated with the query set are learned, resulting in a condensed gallery set.
(ii) Linear regression classification (LRC) [30]: LRC has attracted a great amount of attention owing to its promising performance in face recognition. However, its performance declines dramatically when training samples per class are limited, particularly when only a single training sample is available for a specific person.
(iii) Flexible manifold embedding (FME) [45]: FME is a semisupervised manifold learning framework with good applicability. It can effectively utilize label information from labeled data as well as the manifold structure of both labeled and unlabeled data.
(iv) Joint global and local structure discriminant analysis (JGLDA) [46]: for linear dimension reduction, it preserves the local intrinsic structure, which characterizes the geometric properties of similarity and diversity of the data by two quadratic functions.
(v) Flexible linear regression classification (FLRC) [47]: the inferences are based on the least-squares estimators of the model, which have been shown to be coherent with the interval arithmetic defining the model and to have good statistical properties.
(vi) Discriminative least-squares regression (DLSR) [34]: DLSR embeds class label information into the LSR formulation so that the distances between classes can be enlarged. To implement this idea, a technique called ε-dragging is introduced to force the regression targets of different classes to move in opposite directions.

Figure 1 shows sample images of one subject.

Experiments on YALE Database.
In this experiment, the first 2 to 6 images of each subject are used as the training set, and the rest are used as the test set. To evaluate the algorithms more objectively and to reduce random effects during implementation, all methods are repeated 10 times. The recognition rates of all algorithms are shown in Table 1.
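The "first k images per subject" protocol used in this and the following experiments can be sketched as below. The function names and the toy label vector are illustrative assumptions; the sketch only shows the index bookkeeping and the recognition-rate computation, not the authors' evaluation code.

```python
import numpy as np

def first_k_split(labels, k):
    """Indices of the first k images of every subject (train) and the rest (test)."""
    labels = np.asarray(labels)
    train_idx, test_idx = [], []
    for subject in np.unique(labels):
        idx = np.flatnonzero(labels == subject)   # keeps the original image order
        train_idx.extend(idx[:k])
        test_idx.extend(idx[k:])
    return np.array(train_idx, dtype=int), np.array(test_idx, dtype=int)

def recognition_rate(y_true, y_pred):
    """Fraction of test images assigned to the correct subject."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# Hypothetical database: 3 subjects with 5 images each, stored subject by subject.
labels = np.repeat(np.arange(3), 5)
train_idx, test_idx = first_k_split(labels, k=2)
```

Running the split for each k in the stated range and reporting `recognition_rate` on the test indices reproduces the structure of the tables in this section.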
Two observations can be drawn from Table 1. First, the recognition performance of the proposed DLRR_AG method is better than that of the DLSR method irrespective of the number of training samples. Second, our method is superior to all other methods, except when the number of training samples is 4.

Experiments on ORL Database.
The ORL dataset includes 400 face images of 40 different subjects, with 10 face images per subject. For some subjects, the images were taken at different times and under different lighting; the images vary in facial expression and facial details. Figure 2 shows the sample images of one subject.
In this experiment, the training set was the first 3, 4, 5, and 6 images of each person, and the test set was the remaining images. All algorithms are repeated 10 times. The recognition rates of each algorithm are shown in Table 2.
Table 2 clearly shows that the proposed DLRR_AG is superior to CRC, LRC, FLRC, FME, LRR, JGLDA, and DLSR.

Experiments on Georgia Tech Database.
The Georgia Tech face database contains photos of 50 people taken during two or three sessions at Georgia Tech. In the database, each individual is represented by 15 color JPEG images with a cluttered background and a resolution of 640 by 480 pixels. The faces in these pictures may be frontal, tilted, or both. The images include different expressions, illuminations, and scales. Each image is manually cropped to 60 by 50 pixels. All images are converted to grayscale in the experiment. Figure 3 shows the sample images of one subject.
In this experiment, the training set was the first 4 to 8 images of each person, and the test set was the remaining images. All algorithms are repeated 10 times. The recognition rates of each algorithm are shown in Table 3.
As can be seen from Table 3, DLRR_AG performs well compared with all other algorithms on the Georgia Tech database. In particular, the performance of DLRR_AG is much higher than that of CRC.

Experiments on CMU PIE Database.
The CMU PIE database contains 41,368 facial images of 68 people with different expressions, decorations, and poses. For each subject, the expression and pose are held fixed while the illumination is varied to obtain 14 face images, which are then cropped to 32 × 32 pixels. Figure 4 shows multiple images of one subject.
In this experiment, the first 4, 5, 6, 7, and 8 face images of each subject are selected as the training set, and the rest are taken as the test set. The algorithms are repeated 10 times. The recognition rates of all algorithms are shown in Table 4.
Table 4 shows that the proposed DLRR_AG obtains the best recognition performance among the compared algorithms regardless of the training sample size.

Experiments on AR Database.
The AR face database [44] contains color face images of 120 people, and the total number of face images exceeds 4,000. The 120 subjects were photographed in two sessions, 14 days apart, with different facial expressions, lighting conditions, and occlusions, and each person produced 26 images. In our experiment, seven of the 26 facial images of each person were selected from each session, giving 14 face images per person. Each image is manually cropped to 50 by 40 pixels. All images are converted to grayscale. Figure 5 shows the sample images of one subject.
In this experiment, we used the 14 unoccluded face images of each person from the first and second sessions. The images from the first session (3 to 7 per person) were used as training images, and the face images from the second session were used as test images. The algorithms are repeated 10 times. The experimental results are shown in Table 5.
As can be seen from Table 5, the classification results of CRC, LRC, FLRC, FME, LRR, and JGLDA are poorer than those of the proposed DLRR_AG algorithm. In other words, the proposed method achieves the best recognition performance.

Experiments on UMIST Database.
The UMIST face database contains a total of 575 facial images of 20 people. The images cover a range of poses for each of the 20 people, who are a mix of race, gender, and physical appearance. The number of views differs between subjects and ranges from 19 to 48 per subject. Each face image is 56 by 48 pixels. Figure 6 shows example images of one person.
In this experiment, the first 1 to 5 face images of each person are selected as the training set, while the other images are used as test images. The algorithm is repeated 10 times for each test. The experimental results are shown in Table 6.
As can be seen from Table 6, when the training sample size is 3, the recognition rate of DLRR_AG algorithm is slightly lower than that of DLSR algorithm. However, the recognition rate of the proposed DLRR_AG algorithm is higher than other algorithms when the training sample size is not 3.

Conclusion
This paper presents a discriminative label relaxation regression algorithm based on adaptive graph (DLRR_AG), which effectively alleviates the overfitting problem caused by label relaxation by capturing the local structure information of the original observed data.
The main innovations of this paper are as follows. (1) The adaptive neighborhood graph captures the essential local structural information of the original data well. (2) Label relaxation, manifold learning, and discriminant analysis are integrated into a unified framework. Extensive experiments on six public image databases show that the proposed method is superior to other related methods.

Data Availability
The image databases used in this paper are publicly available for scientific research.

Conflicts of Interest
The authors declare that they have no conflicts of interest.