Low-rank nonnegative sparse representation and local preservation-based matrix regression for supervised image feature selection

Matrix regression has attracted much attention because it can directly select meaningful features from matrix data. However, most existing matrix regression methods do not consider the global and local structure of the matrix data simultaneously. To this end, we propose a low-rank nonnegative sparse representation and local preserving matrix regression (LNSRLP-MR) model for image feature selection. Here, the loss function is defined by the left and right regression matrices. To capture the global structure and discriminative information of the training images and to reduce the effect of heterogeneous data and noise, we impose a low-rank constraint on the self-representation error matrix and a nonnegative sparse constraint on the coefficient vector. The graph matrix is learned adaptively through the representation coefficients, so that accurate local structure information in the samples can be revealed. Feature selection is performed with the obtained row-sparse transformation matrix. An optimization procedure and its performance analysis are also presented. Experimental results on several image datasets show that, compared with state-of-the-art methods, the average classification accuracy of the proposed method is improved by at least 1.2% and up to 3.3%. For images with noise or occlusion, the accuracy is improved significantly, by up to 4%, which indicates that the method is highly robust.


INTRODUCTION
The linear regression (LR) model and its variants have attracted broad interest from researchers in pattern recognition over the past decades, since they perform well for the task of classifier training [1][2][3]. The two most typical LR models are ridge regression and the Lasso, where the former is LR with L2-norm regularization while the latter is LR with L1-norm regularization and serves as a model for sparse representation. Wright et al. [4] presented a sparse representation-based classification (SRC) method. Later, Yang et al. [5] provided some theoretical analysis of the effectiveness of SRC and argued that it is the L1 constraint rather than the L0 constraint that makes SRC effective. Zhang et al. [6] analysed the working mechanism of SRC, claimed that the collaborative representation strategy plays a more important role than the L1-norm-based sparsity constraint, and proposed collaborative representation-based classification (CRC). Qin [7] relaxed the L1-norm regularized optimization problem to a non-linear optimization problem with an L1-norm approximation via a smoothing function, which can be solved by existing powerful non-linear optimization methods. Nie et al. [8] presented a robust linear regression model based on sparse constraints and designed a robust feature selection (RFS) algorithm. Cai et al. [9] defined an L2,1-norm-based loss function with an explicit L2,0-norm equality constraint for feature selection. Xiang et al. [10] also provided a framework of discriminative least squares regression for feature selection. Damotharasamy [11] proposed a sparse representation-based human appearance model, which uses weighted gradient directions that are insensitive to illumination variation to improve its robustness under different occlusion conditions.
Liao et al. [12] proposed a fast extended sparse-weighted representation classifier that considers the different importance of atoms, uses the primal augmented Lagrangian method together with principal component analysis, and combines a logarithmic-weighted sum feature descriptor for face recognition. However, the data in many real applications are easily corrupted by illumination changes, corruptions or noise. As a result, the recognition performance of the above methods may degrade in clustering or classification tasks with noise or corruption. Fortunately, methods that are robust to noisy or corrupted data have attracted extensive attention [13][14][15][16]. Among them, low-rank representation (LRR) is one of the most classical. LRR aims to find the reconstruction coefficients of the data by imposing a low-rank constraint on the coefficient matrix. The main advantage of LRR is that it can robustly capture the global discriminative structure of the data. Therefore, it can correct possible errors (including noise, outliers and corruptions) and simultaneously classify all samples into their most likely classes. On the other hand, it is well known that manifold learning can preserve the intrinsic geometric structure of data in the transformed space, which discloses the local structure of the data and helps to improve performance [17][18][19][20][21][22]. Lu et al. [23] proposed low-rank preserving projections (LRPP) for image classification, where an L2,1-norm-based sparse constraint and a nuclear-norm-based low-rank constraint are imposed on the noise matrix and the weight matrix, respectively. Thus, LRPP can keep the global structure of the data, and the learned low-rank weight matrix can lower the disturbance of noise in the data.
Liu and Yan [24] proposed to construct the dictionary using both observed and unobserved (hidden) data, and showed that the hidden effects can be approximately recovered by solving a nuclear norm minimization problem.
To use label information to improve discriminative ability, Yan et al. [25] proposed the margin Fisher analysis method to simultaneously preserve both the intrinsic geometric structure and the discriminant structure of the data. Other similar methods include locality linear discriminant analysis (LLDA) [26,27], locality sensitive discriminant analysis (LSDA) [28], and local discriminant embedding (LDE) [29]. However, in all these methods, an affinity graph with n nodes is first constructed to characterize the affinity between samples, where the edge weights can be computed by different strategies [30] and the graph matrix remains fixed while the transformation matrix is learned. Since LRR better captures the global structure of the data and is more robust to noise and outliers, the low-rank representation coefficients of the data can be used to construct an affinity graph under the assumption that the noise in the data is sparse. Zhang et al. [31] exploited LRR to construct the graph of data relationships and embedded the high-dimensional data into the transformation space to establish the low-rank preserving embedding (LRPE) model, which can capture the local geometric information and the discriminative structure of the original samples.
However, the coefficient matrix obtained by LRR inevitably contains some negative entries, which is physically unsuitable because it reconstructs training data by applying both additions and subtractions to all training data in real-world applications [32]. To enhance the interpretability of the representation coefficients, one expects the coefficients of training data from the same class as the given data (i.e., homogeneous data) to be positive, while the coefficients of training data from different classes (i.e., heterogeneous data) should be as close to zero as possible. To this end, a nonnegative constraint is imposed on the representation coefficients to obtain nonnegative coefficients. There are two advantages to doing so: first, it makes the representation coefficients in LRR physically explainable; second, it makes the coefficients more discriminative.
Obviously, the above methods are vector-based. When the inputs are matrix data, traditional vector-based methods first convert them into vectors, which yields high-dimensional data and destroys the spatial information. Therefore, for inputs in matrix format, Hou et al. [33] provided a sparse matrix regression model for 2D supervised feature selection, where the loss function is defined as the relationship between the matrix data and the class labels by deploying left and right regression matrices. Chen et al. [34] proposed a robust graph regularized sparse matrix regression method for 2D supervised feature selection, where an intraclass compactness graph based on manifold learning serves as the regularization term and the L2,1-norm serves as the loss function of the matrix regression model. Yuan et al. [35] presented a joint sparse matrix regression and nonnegative spectral analysis (JSMRNS) model for unsupervised image feature selection. Chen et al. [36] also proposed a dynamic graph regularization and label relaxation-based SMR (DGRLR-SMR) method, where label relaxation is performed by relaxing the strict binary label matrix into a slack variable matrix via a nonnegative label relaxation matrix, and the graph weight matrix is obtained adaptively.
Motivated by the above analysis, we propose a low-rank nonnegative sparse representation and local preserving matrix regression (LNSRLP-MR) model for supervised image feature selection. First, we take matrix data as input directly and define the loss function as the sum of squared differences between the labels and the regression outputs of the matrix data, obtained by deploying left and right regression matrices. Second, in the self-representation of the training images, a nonnegative sparse constraint is imposed on the representation coefficients and a low-rank constraint on the representation error matrix, which captures discriminative information and the global structure of the data and thus reduces the effect of noise mixed into the data. Furthermore, since the non-negative coefficients describe, to a certain extent, the positive correlation between image data, we also learn the graph matrix adaptively through these coefficients rather than using a fixed graph matrix, thereby revealing accurate local structure information within the samples. Finally, an iterative alternating optimization procedure is devised for solving the LNSRLP-MR model, and the obtained row-sparse transformation matrix is employed to implement feature selection on image data. Extensive experiments on several image datasets show that our method is superior to some existing state-of-the-art algorithms. The main contributions of this paper are as follows.
• The proposed model takes images as inputs and defines a novel loss function for the matrix regression model.
• We provide a non-negative sparse self-representation of each training image by imposing nonnegative and L1-norm constraints on the representation coefficients, which makes the learned coefficients sparse and discriminative. In addition, a low-rank constraint on the error matrix is used to capture the corresponding low-rank structure information in the image data.
• The model uses the non-negative representation coefficients to adaptively learn a graph matrix, which not only captures the global and accurate local structure information of the image data but also provides strong robustness to noisy and occluded images.
• An iterative optimization algorithm is designed to solve the proposed model, and its convergence and complexity are analysed. Furthermore, the effectiveness and efficiency of the algorithm are verified by experiments on several image datasets.
The remainder of this paper is organized as follows. In Section 2, we introduce some notations and review related work, including matrix-based sparse representation (MSR), matrix-based low-rank sparse representation (MLRSR), and matrix regression (MR). In Section 3, we establish the low-rank non-negative sparse self-representation and local preservation-based matrix regression (LNSRLP-MR) model and devise an algorithm to solve it. In Section 4, we report experiments on multiclass classification and give convergence and theoretical complexity analyses of the algorithm. Conclusions are drawn in Section 5.

NOTATIONS AND MATRIX REGRESSION MODELS
In this section, we first introduce some notations and review the related work involved in our method. This related material will facilitate understanding of our proposed algorithm, presented later in this paper.
Suppose that {X_i ∈ R^{m×n} : i = 1, 2, …, N} is a set of given training data, where each X_i is an m × n matrix, N is the number of matrix samples, and these N samples come from c classes. The associated class labels are {y_1, y_2, …, y_N} ⊂ R^c, where y_i = [y_{1i}, y_{2i}, …, y_{ci}]^T ∈ {0, 1}^c is the class indicator vector of X_i; that is, y_{ki} = 1 if and only if X_i belongs to the k-th class, and y_{ki} = 0 otherwise. Let x_i = vec(X_i) ∈ R^{mn} denote the vector counterpart of the matrix sample X_i, obtained by collecting its columns, i = 1, 2, …, N. We also denote the transpose and the trace of a matrix A by A^T and Tr(A), respectively.
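As a concrete illustration of this notation, the following Python sketch (with hypothetical toy sizes, not taken from the paper's experiments) builds the vectorized samples x_i = vec(X_i) in column-major order and the one-hot class indicator vectors y_i:

```python
import numpy as np

# Hypothetical toy setup: N = 4 training images of size m x n from c = 2 classes.
m, n, N, c = 3, 2, 4, 2
rng = np.random.default_rng(0)
X = [rng.standard_normal((m, n)) for _ in range(N)]
labels = np.array([0, 0, 1, 1])  # class index of each X_i

# vec(X_i): collect the columns of X_i into a single length-mn vector.
x = [Xi.flatten(order="F") for Xi in X]

# Class indicator vectors: y_ki = 1 iff X_i belongs to the k-th class.
Y = np.zeros((c, N))
Y[labels, np.arange(N)] = 1.0
```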

Matrix-based sparse representation (MSR)
For a testing image Y, if all training images are used to approximately represent it, then it can be expressed by the following linear function:

Y = Σ_{i=1}^{N} s_i X_i,   (1)

where s_i is the coefficient of the i-th image X_i, which denotes how much the training matrix X_i contributes to Y, and s = [s_1, s_2, …, s_N]^T ∈ R^N. Since image data may contain noise in most cases, the original model (1) should be revised to the modified form:

Y = Σ_{i=1}^{N} s_i X_i + E,   (2)

where E ∈ R^{m×n} refers to the representation noise or error matrix. Generally, we expect to represent Y with as few training images as possible while minimizing the representation error. Mathematically, this can be achieved by the following optimization problem:

min_{s,E} ‖E‖_F^2 + λ‖s‖_0   s.t.   Y = Σ_{i=1}^{N} s_i X_i + E,   (3)

where ‖·‖_F denotes the Frobenius norm of a matrix and ‖·‖_0 counts the number of nonzero elements in a vector, which is also viewed as a measure of sparsity. However, problem (3) is a discrete optimization problem and is non-deterministic polynomial-time hard (NP-hard), so there is no efficient algorithm for it [37,38]. It has been proved that, given sufficient sparsity, the solution of L1-norm minimization is equivalent to that of L0-norm minimization with overwhelming probability. Therefore, problem (3) can be relaxed as:

min_{s,E} ‖E‖_F^2 + λ‖s‖_1   s.t.   Y = Σ_{i=1}^{N} s_i X_i + E,   (4)

where ‖·‖_1 is the L1-norm of a vector. This convex optimization problem can be solved in polynomial time.
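Once each training image is vectorized, problem (4) reduces to a standard Lasso in the coefficient vector s. The sketch below is a hypothetical minimal solver (iterative soft thresholding, not the solver used in the paper) for the unconstrained form min_s ‖Y − Σ_i s_i X_i‖_F^2 + λ‖s‖_1:

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1 (elementwise soft thresholding).
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def msr_ista(Y, Xs, lam=0.1, n_iter=200):
    """Solve min_s ||Y - sum_i s_i X_i||_F^2 + lam * ||s||_1 by ISTA.

    Vectorizing each training image turns the matrix-based sparse
    representation into a Lasso problem in s.
    """
    A = np.column_stack([Xi.flatten(order="F") for Xi in Xs])  # mn x N
    y = Y.flatten(order="F")
    s = np.zeros(A.shape[1])
    L = 2.0 * np.linalg.norm(A, 2) ** 2  # Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ s - y)
        s = soft_threshold(s - grad / L, lam / L)
    return s
```

With a small λ and a dictionary whose vectorized images are orthonormal, the recovered s concentrates on the images actually used to build Y.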

Matrix-based low-rank sparse representation (MLRSR)
If the testing image Y is contaminated with contiguous occlusions, the error matrix E ∈ R^{m×n} in Equation (2) should be approximately low-rank [39]. Therefore, the matrix-based sparse representation vector s can be obtained by solving the following low-rank-based optimization problem:

min_{s,E} rank(E) + λ‖s‖_1   s.t.   Y = Σ_{i=1}^{N} s_i X_i + E,   (5)

where rank(·) denotes the rank of its argument. Since problem (5) incurs a combinatorial optimization due to rank(·), it is also NP-hard. According to the properties of the matrix rank and the definition of the nuclear norm, problem (5) can be relaxed into the following convex optimization problem by replacing the rank with the nuclear norm:

min_{s,E} ‖E‖_* + λ‖s‖_1   s.t.   Y = Σ_{i=1}^{N} s_i X_i + E,   (6)

where ‖E‖_* is the nuclear norm of E, that is, the sum of the nonzero singular values of E. Although model (6) can capture the low-rank structural information of the error image, and the obtained sparse coefficients have certain discriminative power after the contiguous occlusions in the images are removed [38], these representation coefficients inevitably contain some negative entries, which are not physically explainable and weaken the discrimination performance.
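The nuclear norm term in (6) is typically handled through its proximal operator, singular value thresholding. The following is a minimal sketch of this standard subroutine (not the paper's full solver):

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: the proximal operator of tau * ||.||_*.

    Shrinks every singular value of M by tau (clipping at zero), which is
    the standard update for a nuclear-norm-regularized error matrix E.
    """
    U, sig, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(sig - tau, 0.0)) @ Vt
```

For example, svt(3 * np.eye(2), 1.0) shrinks both singular values from 3 to 2, and a large enough threshold drives the matrix to exactly zero, producing the low-rank behaviour the relaxation relies on.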

Matrix regression (MR)
When the inputs are images, that is, matrix-format data, the loss function of the traditional vector-based regression model can be replaced with a matrix regression loss function. To this end, we can rewrite the objective function of the traditional regularized least squares regression method as follows:

min_{W_k, b_k} Σ_{i=1}^{N} Σ_{k=1}^{c} (Tr(W_k^T X_i) + b_k − y_{ki})^2 + λ‖W‖_F^2,   (7)

where W_k ∈ R^{m×n}, W = [vec(W_1), …, vec(W_c)] ∈ R^{mn×c}, and b = [b_1, …, b_c]^T is the bias. The first term in Equation (7) is the matrix regression loss function. Inspired by the basic idea of sparse regression for feature selection in [8], the sparse matrix regression (SMR) model can be obtained by replacing the Frobenius norm in the second term of (7) with the L2,1-norm:

min_{W_k, b_k} Σ_{i=1}^{N} Σ_{k=1}^{c} (Tr(W_k^T X_i) + b_k − y_{ki})^2 + λ‖W‖_{2,1}.   (8)

This model performs feature selection by imposing the L2,1-norm constraint on the transformation matrix W.
In addition, based on the idea of bilinear regression, Hou et al. [33] also proposed a sparse matrix regression model (model (9)) for two-dimensional feature selection, where U_k ∈ R^{m×d} and V_k ∈ R^{n×d} are the k-th left and right regression matrices, and d is the number of regression vectors (i.e. the regression rank), which is used to balance the learning capacity and generalization ability of the regression model.
If we let W_k = U_k V_k^T, then W_k is a regression projection matrix formed by the left projection matrix U_k and the right projection matrix V_k for the k-th class, and model (9) becomes model (8).
Although models (8) and (9) are both feature selection problems based on least squares regression, model (8) only needs to learn the regression matrix W_k instead of learning the left and right regression matrices U_k ∈ R^{m×d} and V_k ∈ R^{n×d} for the k-th class, which reduces the computational complexity of the model. After obtaining the regression matrices W_k (k = 1, 2, …, c), we can select features from the image matrix through the row sparsity of the matrix W.
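The link between the bilinear and linear formulations rests on the cyclic property of the trace: the bilinear prediction with U_k and V_k equals a linear prediction with the rank-d matrix W_k = U_k V_k^T. A quick numerical check with hypothetical sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, d = 5, 4, 2
Xi = rng.standard_normal((m, n))   # one training image
U = rng.standard_normal((m, d))    # left regression matrix U_k
V = rng.standard_normal((n, d))    # right regression matrix V_k

# Tr(U^T X V) = Tr(V U^T X) = Tr((U V^T)^T X), so the bilinear model is a
# linear model whose regression matrix W = U V^T has rank at most d.
W = U @ V.T
pred_bilinear = np.trace(U.T @ Xi @ V)
pred_linear = np.trace(W.T @ Xi)
```

The parameter d thus caps the rank of the effective regression matrix, which is how it controls model capacity.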

Motivation and our model
Since the representation coefficients learned by the matrix-based low-rank sparse representation (MLRSR) model inevitably contain some negative entries, we impose a nonnegative constraint on the representation coefficients in the self-representation of the training images to enhance their interpretability and make them more discriminative. On the other hand, the self-representation coefficients of a training image also describe its correlation with other training samples from the same class, which indicates that these coefficients reveal, to a certain extent, the local relationships between samples. Therefore, we can construct model (10) to jointly learn the non-negative sparse representation coefficients together with the low-rank error matrices and local preserving projections, where the regularization parameters are positive and E_i ∈ R^{m×n} is the self-representation noise or error matrix of the training image X_i. The third term in Equation (10) adaptively preserves the local structure of all training images in the transformation space by learning the representation coefficients. However, both the regression matrix W_k and the representation coefficients s_i in model (10) are unknown, so it may output unreliable models. By combining the objective functions in Equations (8) and (10), the following low-rank nonnegative sparse representation and local preserving matrix regression (LNSRLP-MR) model (11) is obtained, where α, β, λ and γ are positive parameters.
In the LNSRLP-MR model, the first and second terms characterize the sparse matrix regression, from which we can learn the regression matrix and get the corresponding row sparse transformation matrix, thus performing feature selection of matrix data. The third term can capture low-rank structural information of the error image of each training image in self-representation. The fourth and fifth terms can iteratively update the non-negative sparse representation coefficients and the regression matrix, which can capture more discriminative information and adaptively learn the final accurate graph matrix. Therefore, the proposed model can reduce the effect of noises and enhance its classification performance.

Solution
In the LNSRLP-MR model (11), there are essentially three types of variables: the regression matrices and biases {W_k, b_k}, k = 1, …, c, the non-negative representation coefficients s_i, and the error matrices E_i. Obviously, the objective function is not jointly convex and is non-smooth in these variables, so it is difficult to find the optimal solution for all of them simultaneously. However, the objective function is convex with respect to each variable individually. Therefore, we devise an alternating iteration algorithm to solve the problem. First, we introduce an auxiliary variable F to make the objective function separable. By the augmented Lagrange multiplier method, problem (11) is equivalent to the nonnegatively constrained optimization problem (13), where Lagrangian multiplier matrices and vectors are introduced for the constraints and the penalty parameter is positive. This optimization problem can be solved by the alternating direction method of multipliers (ADMM) [40]; that is, it can be minimized with respect to one variable while the others are fixed. The detailed iterative procedure for solving problem (13) is provided in the Appendix. This process is repeated until a stable solution is reached. We summarize the whole procedure of the proposed LNSRLP-MR algorithm in Algorithm 1.
The theoretical analysis of the convergence and complexity of the algorithm can be found in Subsection 4.7. Once the final transformation matrix W is obtained by the above algorithm, we use the L2-norm of the row vectors of W to evaluate the importance of each feature in the image data and select features accordingly. Concretely, we first rank the L2-norms of all row vectors of W in descending order. Then, we either select the first several rows with the largest L2-norms, or set a threshold and select the rows whose L2-norm exceeds it. In this way, we keep the selected rows of W and set the other rows to zero vectors to generate a new matrix W̃ ∈ R^{mn×c}. Finally, we implement feature selection for a given image X by selecting from x = vec(X) ∈ R^{mn} the features corresponding to the non-zero rows of W̃.
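The row-norm ranking described above can be sketched as follows; this is a minimal illustration with a hypothetical W, and `select_features` is not a name from the paper:

```python
import numpy as np

def select_features(W, k):
    """Rank features by the L2-norm of the rows of W and keep the top k.

    Returns the indices of the selected features (rows of W) in
    descending order of importance.
    """
    row_norms = np.linalg.norm(W, axis=1)
    return np.argsort(row_norms)[::-1][:k]

# Hypothetical usage: select 2 of 4 features from a 4 x 3 matrix W.
W = np.array([[0.0, 0.1, 0.0],
              [2.0, 1.0, 0.5],
              [0.2, 0.0, 0.1],
              [1.5, 1.5, 1.0]])
idx = select_features(W, 2)
```

The thresholding variant mentioned in the text would instead keep every row whose norm exceeds a chosen cutoff.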

EXPERIMENTS AND PERFORMANCE EVALUATION
In this section, to systematically evaluate the performance of the proposed LNSRLP-MR method, we conduct experiments on seven publicly available datasets. First, we visualize the affinity graphs obtained by the proposed algorithm on image datasets. Then, we evaluate the recognition performance of LNSRLP-MR on six clean datasets with different selected numbers of features. Finally, we test the robustness to random pixel corruptions and contiguous occlusions.

Datasets
In the subsequent experiments, we employ seven well-known image datasets to verify the performance of LNSRLP-MR, including ORL [41], BIO [42], YALEB [43], WARPAR10P [44], COIL20 [45], USPS [46], and MNIST [47]. The first four are face image datasets, the fifth is an object image dataset, and the last two are handwritten digit datasets. Brief information about the seven datasets is as follows. The COIL20 dataset contains 20 objects, and each object has 72 grey images taken from different viewing directions; the size of each image is 32 × 32. The USPS handwritten digit dataset contains 9298 handwritten digit images in total, split into 7291 training images and 2007 test images; the size of each image is 16 × 16.
The MNIST dataset is a dataset of handwritten digits; the size of each image is 28 × 28. Some samples from these datasets are shown in Figure 1.

Competitive algorithm
We compare the LNSRLP-MR algorithm with seven related state-of-the-art algorithms, including LPP [19], LATLRR [24], LRPP [23], LRPE [31], SMR [33], JSMRNS [35] and DGRLR-SMR [36]. These algorithms fall into three categories: (1) algorithms based on locality preserving projections; (2) algorithms based on low-rank representation; and (3) algorithms based on sparse matrix regression. For a fair comparison, we run these algorithms directly, compare their performance, and tune their parameters as carefully as possible. In all the experiments, each dataset is randomly divided into a training set and a testing set; this split is repeated ten times to compute the average of each evaluation metric. The effectiveness of the proposed method is verified by these average metric values.

Parameter settings and evaluation metrics
To the best of our knowledge, there are no efficient schemes to adaptively determine regularization parameters, and choosing the most suitable parameters for different tasks remains a challenging issue. Here, for all the algorithms, we use a grid-search strategy over the range {10^-4, 10^-3, …, 10^3, 10^4} to tune all parameters, except that the parameter λ in our model (11) is tuned over the candidate range {0.05, 0.2, …, 1.1, 1.25}. Different parameter combinations may be used for different datasets. In addition, for the LPP algorithm, the neighbourhood size is set to five. For algorithms based on locality preserving projections, we run each method ten times over a wide range of projected dimensions and select the optimal dimension as the corresponding subspace. The number of regression vectors in the SMR, JSMRNS and DGRLR-SMR algorithms is set according to the methods provided in the corresponding references.
To verify the quality of the selected features, we employ two evaluation metrics. The first is classification accuracy (ACC), defined as the proportion of correctly classified samples among all testing samples. The second is the redundancy rate (RED) [48], a popular evaluation metric for feature selection. The RED value is the average Pearson correlation over all pairs of selected features. A large RED value indicates that many selected features are correlated, so high redundancy is expected in the set of selected features. Therefore, the smaller the RED value, the better the feature selection performance.
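A minimal sketch of the RED computation follows; this is our illustrative reading of the metric ([48] gives the exact definition, and whether absolute correlations are used may differ between implementations):

```python
import numpy as np

def redundancy_rate(F):
    """Redundancy rate: mean absolute Pearson correlation over all
    unordered pairs of selected features.

    F has one row per sample and one column per selected feature.
    Smaller values indicate less redundant feature subsets.
    """
    s = F.shape[1]
    C = np.corrcoef(F, rowvar=False)   # s x s feature correlation matrix
    iu = np.triu_indices(s, k=1)       # indices of all unordered pairs
    return np.abs(C[iu]).mean()
```

For instance, a feature subset in which every selected feature is a copy (or sign flip) of another attains the worst possible RED value of 1.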

Affinity graph on image dataset
In this subsection, we verify the effectiveness of the affinity graph obtained by our algorithm on image datasets, taking the COIL20 and Yale B datasets as examples. For simplicity, we use the cropped images of the first ten classes of each dataset as training sets to obtain the nonnegative representation coefficient matrices (i.e. affinity matrices), and then visualize these affinity matrices as undirected graphs using the method presented in [13]. As shown in Figure 2, our method learns the similarity relationships between training image samples very well. Meanwhile, the affinity matrix learned by our method is relatively dense for samples of the same class but very sparse for samples of different classes, which shows that the LNSRLP-MR algorithm captures the local geometric structure between samples well and improves the discriminability of the representation coefficients.
In addition, we observe that although most of the non-zero elements in the learned affinity matrices on the two datasets are distributed near the diagonal, that is, the affinity matrix is approximately block diagonal, a few non-zero representation coefficients lie far from these diagonal blocks. These elements correspond to a few heterogeneous samples (i.e. samples of different classes) and are very close to zero, which means that such heterogeneous samples play a relatively small role in the self-representation of the training samples. In fact, a perfect block diagonal affinity matrix, in which the coefficients of all heterogeneous samples are exactly zero, is unattainable; we can only reduce the influence of heterogeneous samples as much as possible.

Comparisons of classification accuracy
In this experiment, we randomly choose five images per class from each dataset as training samples, use the remaining images as testing samples, and run all algorithms on the training set. Using the nearest neighbour classifier, we obtain the final classification accuracies on the testing images with the selected features, where the number of selected features is given in Subsection 4.1. To avoid bias, this process is repeated 10 times and the average classification accuracy of each method is calculated. The results are shown in Figure 3, from which it can be seen that the classification accuracy of the LNSRLP-MR method is better than that of all other methods on all datasets. In addition, as the number of selected features increases, the classification accuracy of all methods increases monotonically on four of the datasets, but not on the Yale B and WARPAR10P datasets. On these two datasets, the classification accuracy of the LNSRLP-MR algorithm does not always grow with the number of selected features. The most likely reason is that these two datasets are corrupted by different degrees of illumination and occlusion, respectively: when more features are selected from the images, features corrupted by noise may be selected, weakening the classification ability. Therefore, we also consider the influence of the number of training samples per class on the classification accuracy of the different algorithms for these two datasets. The results are reported in Table 1. It is obvious that, compared with the other methods, the LNSRLP-MR method achieves consistently good performance on the WARPAR10P and Yale B datasets. Moreover, the classification accuracy increases with the number of training samples per class.
To some extent, this shows that increasing the number of training samples per class helps improve the classification ability of the proposed algorithm on datasets that are heavily corrupted by illumination or occlusion. The reasons why the LNSRLP-MR method outperforms the other methods in classification accuracy on all datasets are as follows. First, it takes matrix data as input, which retains the spatial information of the elements in the matrix data. Second, the low-rank constraint in the LNSRLP-MR model captures the low-rank structural information of the error image of each training image in the self-representation, which reveals the global structure of the matrix samples. Third, the LNSRLP-MR method combines the self-representation of the training images with the local geometric relationships between training samples to adaptively learn the representation coefficients and obtain a more accurate local structure, whereas the other methods apply fixed graph weights, which cannot accurately characterize the local geometric relationships between samples. Furthermore, the nonnegative and sparse constraints on the self-representation coefficients in the model enhance the interpretability and discriminability of the representation coefficients. Therefore, simultaneously pursuing sample-wise low-rank error matrices and sparse nonnegative self-representation coefficients is very helpful for classification.
To confirm the feature selection ability of the proposed algorithm, we also calculate the redundancy rate (RED) values of the features selected by the different methods. For each method and each number of selected features s, the average of ten independent RED values is shown in Figure 4. Here, we only report the results on the ORL, Yale B, and COIL20 datasets, and the smallest values are boldfaced. Since a smaller RED value indicates better feature selection performance, and the LNSRLP-MR method yields the least feature redundancy in most cases compared with the other algorithms, it has strong feature selection ability.

Robustness to noises and occlusions
In this subsection, we compare LNSRLP-MR with the seven other algorithms on noisy datasets and inspect the influence of different types of noise on the classification accuracy. First, we consider the classification performance of the proposed method on noisy ORL, MNIST and COIL20 datasets with random pixel corruption. For the ORL and MNIST datasets, we add salt-and-pepper noise with noise densities of 0.1 and 0.15 to each image, respectively. For the COIL20 dataset, we employ a subset of 10 images per class corrupted by Gaussian noise N(0, 0.01) and N(0, 0.02). Table 2 shows some of the original images and the corrupted images at the different noise densities. The classification accuracy of all methods under the two salt-and-pepper and Gaussian noise densities is shown in Figure 5. It is obvious that the LNSRLP-MR algorithm outperforms all other methods and achieves the highest recognition accuracy. In addition, we also find (Figure 5(e,f)) that on the COIL20 dataset with Gaussian noise, the classification accuracy of LPP decreases as the number of selected features increases, while LRPE and LRPP have lower classification accuracy. These observations show that the classification performance of these methods is easily affected by different levels of discrete noise.
Next, we test the robustness of LNSRLP-MR to different levels of contiguous occlusion on the BIO and AR datasets, respectively. For the BIO dataset, the images are artificially corrupted by a randomly located square block of different occlusion sizes (6 × 6 and 9 × 9), shown in Table 2. The AR dataset contains 126 subjects and more than 4000 colour face images; some images in each class have different facial expressions, illuminations, and occlusions (sunglasses and scarf). To investigate the influence of realistic occlusion on the classification accuracy of the different algorithms, we construct the training and testing sets as follows. We randomly select three images without occlusion and three images with sunglasses from each class to form training set 1, and three images without occlusion and three images with the scarf to form training set 2. Testing set 1 consists of seven images without occlusion and three images with sunglasses, and testing set 2 consists of seven images without occlusion and three images with the scarf. Figure 6 shows some face images with and without realistic occlusions from the AR dataset used in the experiment. Figure 7 shows the classification accuracies of the eight methods on the BIO and AR datasets with block and realistic occlusions, respectively.
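The synthetic block occlusions used on the BIO dataset can be sketched as follows: a randomly located square block is overwritten with a constant value. The placement rule and the fill value are assumptions consistent with the description above.

```python
import numpy as np

def random_block_occlusion(img, block, rng, value=0.0):
    """Overwrite a `block` x `block` square at a uniformly random location
    with a constant `value`, mimicking the synthetic occlusions on BIO."""
    out = img.copy()
    h, w = img.shape
    r = rng.integers(0, h - block + 1)   # top-left row of the block
    c = rng.integers(0, w - block + 1)   # top-left column of the block
    out[r:r + block, c:c + block] = value
    return out

rng = np.random.default_rng(0)
img = np.ones((28, 28))
occ6 = random_block_occlusion(img, 6, rng)   # 6 x 6 occlusion, as in Table 2
occ9 = random_block_occlusion(img, 9, rng)   # 9 x 9 occlusion
```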
It can be seen from Figure 7 that the classification accuracy of the LNSRLP-MR algorithm is clearly better than that of the other methods for different contiguous occlusions (i.e. square blocks of different sizes, sunglasses and scarf), and it is relatively stable in most cases. We also notice that when the square block occlusion is larger (as in Figure 7(b)), the recognition accuracy of all algorithms decreases as the number of selected features increases. This indicates that as the occlusion block grows, fewer important features remain available for selection, which leads to a decrease in classification accuracy.
The above experimental results indicate that our LNSRLP-MR algorithm has better recognition performance on datasets with both random pixel corruptions and contiguous occlusions, that is, it is more robust.

Reconstruction by selected features
To further verify the effect and ability of feature selection, we show some reconstructed images of an image with a scarf from the AR dataset, generated from different numbers of features selected by our LNSRLP-MR algorithm, and compare them with the reconstructed images obtained by the other algorithms. Figure 8 shows some reconstructed images obtained by the different algorithms. From these reconstructed images, we make the following observations. For images with scarf occlusion, LNSRLP-MR more effectively selects discriminative important features (such as eyes and nose) and discards scarf features. Although SMR and JSMRNS can choose these important features, they are also more sensitive to scarf features and select them to some extent. In addition, when the number of selected features is relatively small, LNSRLP-MR selects more features with discriminant information, while JSMRNS loses more eye features and DGRLR loses more nose features. When half of the features are selected (as shown in the 6th column of Figure 8), compared with the other algorithms, LNSRLP-MR not only effectively retains most face features, but also eliminates redundant features (i.e. scarf features) to some extent. Therefore, the LNSRLP-MR method is more effective for image recognition with occlusion and is robust to occlusion.

The convergence and complexity of the algorithm
In this subsection, we first give a theoretical analysis of the convergence and complexity of the proposed algorithm, and then illustrate the convergence behaviour of the objective function in model (11) through experiments. Note that Algorithm 1 mainly consists of four parts. Step 2 is obtained by the SVT operator, whose convergence has been proved in [49]. The convergence of Step 3 is proven in [50]. The other parts all involve convex and smooth optimization problems, so they all converge [36]. With five variables, the proposed optimization procedure converges, and thus a local (or even a global) minimizer can be obtained.
Next, we give the computational complexity analysis of LNSRLP-MR. The major computation of the proposed algorithm consists of four steps at the (t+1)-th iteration. The first step solves problem (A.1) via Equations (A.4) and (A.5), with computational complexity O(mnN² + (mn)²N + (mn)³ + mnNc). The second step computes the matrix X_i − Xs_i + Λ_i/μ (i = 1, 2, …, N) and its SVD UΣVᵀ, with complexity at most O(max{m³, n³}N). The third step solves the N sub-problems (A.10) by Algorithm 1 with O((mn + N)NMT₁), where T₁ is the number of inner loops. The fourth step solves the N sub-problems (A.15) with O(cmn²N²). Therefore, the computational complexity of one iteration is up to O(((mn)²(mn + N) + max{m³, n³}N + (mn + N)NMT₁ + cmn²N)T₂), where T₂ is the maximum number of iterations required by Algorithm 2.
Finally, we further compare the convergence behaviour of our algorithm with that of some other methods on different image datasets. Here, we only discuss the convergence curves of the objective function of three matrix regression models on the WARPAR10P and COIL20 datasets, because the convergence behaviour of the algorithm is similar on the other datasets. Figure 9 shows the variation of the objective function value of the proposed method on these datasets. It can be seen that after a few iterations (about five), the objective function values of all three methods tend to a fixed value, and the proposed LNSRLP-MR algorithm converges faster. Based on this observation, in all our experiments the number of iterations of the outer loop is set to 5 or 10.
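The fixed outer-loop budget above corresponds to running a generic stopping rule with a small iteration cap. A minimal sketch, with `step` and `obj` as hypothetical stand-ins for one LNSRLP-MR update sweep and the objective of model (11):

```python
def run_until_converged(step, obj, x0, tol=1e-4, max_iter=10):
    """Outer loop: repeat one update sweep (`step`) and stop when the
    relative change of the objective (`obj`) drops below `tol` or the
    iteration cap `max_iter` is hit."""
    x, prev = x0, obj(x0)
    history = [prev]                     # objective trace, as in Figure 9
    for _ in range(max_iter):
        x = step(x)
        cur = obj(x)
        history.append(cur)
        if abs(prev - cur) / max(abs(prev), 1e-12) < tol:
            break
        prev = cur
    return x, history

# Toy stand-in: halving x on f(x) = x^2 decreases the objective geometrically.
x, hist = run_until_converged(lambda x: 0.5 * x, lambda x: x * x, 4.0)
```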

Parametric analysis
In this subsection, we mainly analyse the influence of the parameters in model (11) on classification performance. There are four regularization parameters in our model, namely α, β, λ and γ. The parameter α controls the row sparsity of the transformation matrix W, the parameter β reflects the ability of the error matrix to capture low-rank structural information, and λ and γ guide the model to jointly learn the sparse representation matrix and the affinity matrix. Therefore, the parameters α, β, λ, and γ balance the trade-off among matrix regression, global/local structure learning and feature selection.
To the best of our knowledge, it is still a challenging task to search for the appropriate combination of parameters for a specific dataset. Therefore, we run the proposed algorithm on the WARPAR10P dataset with different combinations of parameters via 2D grid searches and empirically choose the best one. There are six pairwise combinations of the four parameters: α vs β, α vs λ, α vs γ, β vs λ, β vs γ and λ vs γ. Figure 10 shows the variation of the average classification accuracy of the LNSRLP-MR algorithm for the different parameter combinations on the WARPAR10P dataset.
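A 2D grid search over one parameter pair can be sketched as follows; `evaluate` is a placeholder for training LNSRLP-MR with the given pair and returning average classification accuracy, and the logarithmic grid and the toy objective are our own illustrative choices.

```python
import itertools
import numpy as np

def grid_search_2d(evaluate, grid_a, grid_b):
    """Exhaustive search over one parameter pair (e.g. alpha vs beta),
    keeping the combination with the best score."""
    best, best_pair = -np.inf, None
    for a, b in itertools.product(grid_a, grid_b):
        acc = evaluate(a, b)
        if acc > best:
            best, best_pair = acc, (a, b)
    return best_pair, best

# Logarithmic grid {1e-3, ..., 1e2}; toy score peaking at a=1.0, b=0.1.
grid = [10.0 ** p for p in range(-3, 3)]
pair, acc = grid_search_2d(
    lambda a, b: -((np.log10(a) - 0.0) ** 2 + (np.log10(b) + 1.0) ** 2),
    grid, grid)
```

With four parameters, searching all six pairs (as in Figure 10) keeps the cost at six 2D grids instead of one intractable 4D grid.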
It can be seen from Figure 10(a-c) that the higher the value of α, the higher the classification accuracy of LNSRLP-MR. This may be because a larger value of α emphasizes the sparsity of the transformation matrix, which filters out abnormal features and improves the classification performance of the algorithm. In addition, Figure 10(e,f) shows that larger values of γ weaken the classification accuracy. For the other parameters, the classification accuracy of the proposed algorithm fluctuates with the value of each parameter. In short, the different influences of the various parameters on recognition performance show that the global discriminative information and accurate local geometric information in image samples play an important role in classification. Therefore, in practical applications we need to run the algorithm with different combinations of parameters and select the best values.

CONCLUSION
In this paper, we use the learned nonnegative coefficients in the self-representation of image data to adaptively construct an affinity graph, and then propose a new sparse matrix regression model with a low-rank representation error or noise term. In this model, the L 2,1 -norm is imposed on the transformation matrix to make it row sparse, so that it can be used to select features from the given image data. The nuclear norm is imposed on the noise matrix to make it low-rank, so as to lessen the influence of noise on the performance of the algorithm. In addition, the model uses the L 1 -norm to encourage the learned self-representation coefficients to be sparse and discriminative. An effective iterative optimization algorithm is designed to solve the model. Experiments on six well-known datasets show that the classification accuracy of the proposed method is superior to that of other existing state-of-the-art feature selection methods. Moreover, the proposed algorithm also shows robustness to noise (such as random pixel corruptions and Gaussian noise) and contiguous occlusions (such as block occlusions and realistic occlusion).
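As a sketch of how a row-sparse transformation matrix yields a feature subset: features are ranked by the l2 norm of the corresponding row of W and the top s are kept. This is the standard practice for L 2,1 -regularized models; we assume the paper's ranking rule matches.

```python
import numpy as np

def select_features(W, s):
    """Rank features by the l2 norm of the corresponding row of the learned
    transformation matrix W (mn x c) and keep the indices of the top s.
    Rows driven to zero by the L2,1 penalty are discarded first."""
    row_norms = np.linalg.norm(W, axis=1)
    return np.argsort(-row_norms)[:s]

W = np.array([[0.0, 0.0],    # feature 0: zero row, discarded
              [3.0, 4.0],    # feature 1: row norm 5
              [1.0, 0.0]])   # feature 2: row norm 1
sel = select_features(W, 2)  # keeps features 1 and 2
```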

APPENDIX A1
In this appendix, we give the solution process of problem (13) using the alternating direction method of multipliers (ADMM). The whole solution process alternates among the following five steps.
Step 1. Updating {W k , b k } c k=1 with the other variables fixed. After omitting some irrelevant terms, problem (13) becomes an unconstrained optimization problem in W and b, where W = [vec(W 1 ), …, vec(W c )] ∈ R^(mn×c), 1 = [1, 1, …, 1]ᵀ ∈ R^c, X = [vec(X 1 ), …, vec(X N )] ∈ R^(mn×N), L = D − G is the Laplacian matrix, and D is the diagonal matrix with diagonal elements D ii = Σ_{j=1}^N g ij. Setting the derivatives with respect to W and b to zero yields the closed-form solutions (A.4) and (A.5), where H = I − (1/N)11ᵀ is the centering matrix. Reshaping the k-th column of W yields the k-th regression matrix W k , k = 1, 2, …, c.
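A heavily hedged sketch of a Step-1-style closed-form solve: the regularizers alpha * I and lam * X L Xᵀ below are illustrative stand-ins for the paper's exact terms in (A.4)-(A.5), and the shapes are toy values.

```python
import numpy as np

def solve_regression(X, Y, L, alpha, lam):
    """Ridge-like closed form for the stacked matrix W (mn x c), with the
    centering matrix H = I - (1/N) 11^T absorbing the bias b. The specific
    regularizers here are assumptions, not the paper's exact (A.4)-(A.5)."""
    mn, N = X.shape
    H = np.eye(N) - np.ones((N, N)) / N
    A = X @ H @ X.T + alpha * np.eye(mn) + lam * (X @ L @ X.T)
    W = np.linalg.solve(A, X @ H @ Y.T)      # normal-equations solve for W
    b = (Y - W.T @ X).mean(axis=1)           # bias from the centering condition
    return W, b

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 20))             # mn = 8 vectorized pixels, N = 20
Y = rng.standard_normal((3, 20))             # c = 3 label indicators
L = np.zeros((20, 20))                       # trivial graph for the demo
W, b = solve_regression(X, Y, L, alpha=0.1, lam=0.5)
```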
Step 2. Updating {E i } N i=1 with the other variables fixed. By removing some irrelevant terms, problem (13) decomposes into N independent sub-problems (A.6), which can be solved iteratively by LADMAP [51] and the singular value thresholding (SVT) operator. Suppose the truncated singular value decomposition of the matrix X i − Xs i + Λ i /μ is UΣVᵀ, where U ∈ R^(m×r) and V ∈ R^(n×r) are matrices with orthonormal columns, Σ = diag(σ 1 , σ 2 , …, σ r ), σ 1 , σ 2 , …, σ r are the positive singular values of this matrix, and r is its rank. Then, it follows from Theorem 2.1 in [49] that the solution to problem (A.6) is E i = U S τ (Σ) Vᵀ, where S τ (y) is the soft-thresholding operator:

S τ (y) = sign(y) max{|y| − τ, 0}. (A.8)

Step 3. Updating {s i } N i=1 with the other variables fixed. Problem (13) can also be decomposed into N independent optimization problems (A.10) over s i by removing some irrelevant terms, which can be solved by the iterative soft-thresholding (IST) method [50]. The IST method uses a second-order approximation of the second term in (A.10) and works by generating a sequence of iterates {s i ^(t)}.

Step 4. Updating {f i } N i=1 with the other variables fixed. The solution f i of problem (A.14) is obtained from the solution of problem (A.15).
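The two operators used in Steps 2 and 3 can be sketched directly. The SVT sketch below applies the soft-thresholding operator of (A.8) to the singular values, which is the closed-form minimizer of tau * ||E||_* + 0.5 * ||E − M||_F² (Theorem 2.1 in [49]); the function names are our own.

```python
import numpy as np

def soft_threshold(y, tau):
    """Elementwise soft-thresholding S_tau(y) = sign(y) * max(|y| - tau, 0),
    the operator in (A.8)."""
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

def svt(M, tau):
    """Singular value thresholding: soft-threshold the singular values of M,
    as used for the E_i update in Step 2."""
    U, sig, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(soft_threshold(sig, tau)) @ Vt

M = np.diag([3.0, 1.0, 0.2])
E = svt(M, 0.5)              # singular values shrink to 2.5, 0.5, 0.0
```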
Step 5. Updating the multipliers with the other variables fixed. Finally, the Lagrange multiplier matrix and vector in the function (13) are updated as in (A.17) and (A.18), where μ^(t) is the value of the penalty parameter at the t-th iteration, μ_max = max{μ^(0), μ^(1), …, μ^(t)}, and ρ > 0.
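These updates take the standard ADMM form, sketched below. The residual argument stands for the constraint violation (e.g. E_i − X_i + X s_i) at the current iterate, and the capped geometric penalty growth is the usual LADMAP-style rule; the paper's exact formulas in (A.17)-(A.18) may differ in detail.

```python
import numpy as np

def update_multiplier(Lam, residual, mu):
    """Standard ADMM dual ascent: Lambda <- Lambda + mu * residual."""
    return Lam + mu * residual

def update_penalty(mu, rho, mu_max):
    """Geometric penalty growth capped at mu_max: mu <- min(rho * mu, mu_max)."""
    return min(rho * mu, mu_max)

Lam = update_multiplier(np.zeros((2, 2)), np.ones((2, 2)), 0.5)
mu = 1e-2
for _ in range(6):
    mu = update_penalty(mu, rho=10.0, mu_max=1e3)   # grows until capped at 1e3
```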
Up to now, we have solved the proposed LNSRLP-MR model (13) in detail using the ADMM method. This procedure is repeated until the convergence condition is satisfied.