Gene Feature Extraction Based on Nonnegative Dual Graph Regularized Latent Low-Rank Representation

Aiming at the high redundancy and heavy noise of gene expression profiles, a new feature extraction model based on nonnegative dual graph regularized latent low-rank representation (NNDGLLRR) is presented on the basis of latent low-rank representation (Lat-LRR). By introducing a dual graph manifold regularized constraint, NNDGLLRR preserves the internal spatial structure of the original data while segmenting the subspaces, which improves the final clustering accuracy. The nonnegative constraints give the solution a degree of sparsity and enhance the robustness of the algorithm. Unlike Lat-LRR, a new solution scheme is adopted to reduce the computational complexity. Experimental results show that the proposed algorithm extracts features effectively from highly redundant, noisy gene expression profiles and achieves better clustering accuracy than LRR and Lat-LRR.


Introduction
With the accelerated pace of modern life, the high incidence of cancer has brought great challenges to human health. How to detect, prevent, and treat cancer effectively has become an international hotspot of medical research. A gene expression profile is cDNA sequence data specific to a cell, which describes the cell's current physiological function and state. Research shows that tumor cells and normal cells can be identified effectively by analyzing and processing the original gene expression data. However, owing to the diversity and specificity of cells, gene expression profiles are huge and complex, and traditional methods of data analysis and processing cannot cope with such extremely large-scale data.
Gene expression profiles are large, and there are interrelationships between the samples; the internal spatial structure of the data may be destroyed by linear transformations. In this paper, a feature extraction model based on NNDGLLRR is proposed on the basis of Lat-LRR. Its low-rank sparse constraint removes redundant components of the gene expression profile and suppresses noise. The nonnegative constraints give the computation a certain degree of sparsity, agree with the practical meaning of the data, and enhance the robustness of the algorithm. A manifold regularized constraint is also introduced so that the extracted features describe the spatial structure of the original data more completely.

LRR.
LRR combines low-rank matrix decomposition with sparse decomposition. In recent years, it has been widely used in subspace clustering. LRR assumes that the original data come from different subspaces and performs feature extraction by seeking the lowest-rank representation of the original data; the low-rank representation coefficients reflect the structural information of the original data in its spatial distribution. Let the original data be $X = [x_1, x_2, x_3, \ldots, x_n] \in \mathbb{R}^{m \times n}$, where each column is a sample; LRR generally uses the data itself as the dictionary. The model can then be written as
\[
\min_{Z,E} \|Z\|_* + \lambda\|E\|_1 \quad \text{s.t.} \quad X = XZ + E,
\]
where $\|\cdot\|_*$ is the nuclear norm and $\lambda > 0$ weights the sparse noise term. The LRR matrix $Z = [z_1, z_2, z_3, \ldots, z_n] \in \mathbb{R}^{n \times n}$ holds the linear representation coefficients of the samples under the data dictionary $X$. The original data usually contain a lot of noise, and the sparse constraint on $E$ keeps the algorithm robust. Ref. [24] gives the specific solution process of LRR.
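The low-rank assumption behind this model can be illustrated with a small NumPy sketch. The data below are an illustration, not the paper's gene data: samples drawn from a union of two low-dimensional subspaces produce a data matrix whose rank is far below its size, which is exactly the structure the lowest-rank representation exploits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 40 samples drawn from two independent
# 3-dimensional subspaces of R^50; columns are samples, as in the text.
U1 = rng.standard_normal((50, 3))
U2 = rng.standard_normal((50, 3))
X = np.hstack([U1 @ rng.standard_normal((3, 20)),
               U2 @ rng.standard_normal((3, 20))])

# rank(X) <= 3 + 3 = 6, far below min(50, 40): this low-rank structure
# is what LRR exploits when it seeks the lowest-rank Z with X = XZ + E.
print(np.linalg.matrix_rank(X))  # 6
```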
Let $Z = J$; we construct the following augmented Lagrangian function:
\[
\mathcal{L} = \|J\|_* + \lambda\|E\|_1 + \operatorname{tr}\!\big(\Lambda^{T}(X - XZ - E)\big) + \operatorname{tr}\!\big(\Pi^{T}(Z - J)\big) + \frac{\mu}{2}\big(\|X - XZ - E\|_F^2 + \|Z - J\|_F^2\big),
\]
where $\Lambda$ and $\Pi$ are Lagrangian multipliers and $\mu > 0$ is a penalty parameter. The specific update algorithm is as follows.
Keep $Z = Z_k$, $\Pi = \Pi_k$; update $J$:
\[
J_{k+1} = \arg\min_{J} \frac{1}{\mu_k}\|J\|_* + \frac{1}{2}\Big\|J - \Big(Z_k + \frac{\Pi_k}{\mu_k}\Big)\Big\|_F^2.
\]
Keep $J = J_{k+1}$, $\Lambda = \Lambda_k$, and $\Pi = \Pi_k$; update $Z$:
\[
Z_{k+1} = (I + X^{T}X)^{-1}\Big(X^{T}X - X^{T}E_k + J_{k+1} + \frac{X^{T}\Lambda_k - \Pi_k}{\mu_k}\Big).
\]
Keep $Z = Z_{k+1}$, $\Lambda = \Lambda_k$; update $E$:
\[
E_{k+1} = \arg\min_{E} \frac{\lambda}{\mu_k}\|E\|_1 + \frac{1}{2}\Big\|E - \Big(X - XZ_{k+1} + \frac{\Lambda_k}{\mu_k}\Big)\Big\|_F^2.
\]

Lat-LRR.
LRR rests on two conditions: the original data $X$ must contain enough samples, and $X$ must contain enough uncontaminated data. These two conditions are almost impossible to satisfy for gene data. On the one hand, the number of gene samples available for research is small because of the high price of gene sequencing. On the other hand, noise is inevitably introduced during sequencing by the process itself, electromagnetic interference from instruments, and other factors. To overcome these limitations of LRR, [25] proposed Lat-LRR, which expresses the observed data $X$ as a linear combination of principal features $XZ$ and latent features $LX$ for feature extraction. Considering the heavy noise of gene expression profiles, we add a sparsity constraint to obtain the following Lat-LRR model:
\[
\min_{Z,L,E} \|Z\|_* + \|L\|_* + \lambda\|E\|_1 \quad \text{s.t.} \quad X = XZ + LX + E.
\]
The solution of Lat-LRR is given in [26]; the alternating direction method (ADM) is adopted to solve the model. Let $Z = J_1$, $L = J_2$; we construct the following augmented Lagrangian function:
\[
\begin{aligned}
\mathcal{L} ={} & \|J_1\|_* + \|J_2\|_* + \lambda\|E\|_1 + \operatorname{tr}\!\big(\Lambda^{T}(Z - J_1)\big) + \operatorname{tr}\!\big(\Pi^{T}(L - J_2)\big) \\
& + \operatorname{tr}\!\big(\Delta^{T}(X - XZ - LX - E)\big) + \frac{\mu}{2}\big(\|Z - J_1\|_F^2 + \|L - J_2\|_F^2 + \|X - XZ - LX - E\|_F^2\big).
\end{aligned}
\]
Keep $Z = Z_k$ and $\Lambda = \Lambda_k$; update $J_1$:
\[
J_{1,k+1} = \arg\min_{J_1} \frac{1}{\mu_k}\|J_1\|_* + \frac{1}{2}\Big\|J_1 - \Big(Z_k + \frac{\Lambda_k}{\mu_k}\Big)\Big\|_F^2.
\]
Keep $L = L_k$, $\Pi = \Pi_k$; update $J_2$:
\[
J_{2,k+1} = \arg\min_{J_2} \frac{1}{\mu_k}\|J_2\|_* + \frac{1}{2}\Big\|J_2 - \Big(L_k + \frac{\Pi_k}{\mu_k}\Big)\Big\|_F^2.
\]
Keep $J_1 = J_{1,k+1}$, $L = L_k$, $E = E_k$, $\Lambda = \Lambda_k$, and $\Delta = \Delta_k$; update $Z$:
\[
Z_{k+1} = (I + X^{T}X)^{-1}\Big(X^{T}(X - L_kX - E_k) + J_{1,k+1} + \frac{X^{T}\Delta_k - \Lambda_k}{\mu_k}\Big).
\]
Keep $Z = Z_{k+1}$, $E = E_k$, $\Pi = \Pi_k$, and $\Delta = \Delta_k$; update $L$:
\[
L_{k+1} = \Big((X - XZ_{k+1} - E_k)X^{T} + J_{2,k+1} + \frac{\Delta_k X^{T} - \Pi_k}{\mu_k}\Big)(I + XX^{T})^{-1}.
\]
Keep $Z = Z_{k+1}$, $L = L_{k+1}$; update $E$:
\[
E_{k+1} = \arg\min_{E} \frac{\lambda}{\mu_k}\|E\|_1 + \frac{1}{2}\Big\|E - \Big(X - XZ_{k+1} - L_{k+1}X + \frac{\Delta_k}{\mu_k}\Big)\Big\|_F^2.
\]
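Updates of this shape can be sketched compactly in NumPy. The code below is an illustrative implementation of the simpler LRR iteration (not the authors' code); the penalty schedule ($\mu$, $\rho$), tolerance, and toy data are assumed values, and the nuclear-norm and $\ell_1$ subproblems are solved by singular value thresholding and soft thresholding, respectively.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def soft(M, tau):
    """Entrywise soft threshold: proximal operator of the l1 norm."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def lrr(X, lam=0.1, mu=1e-2, rho=1.5, tol=1e-6, max_iter=500):
    """Inexact-ALM sketch of min ||Z||_* + lam*||E||_1  s.t.  X = XZ + E."""
    m, n = X.shape
    Z = np.zeros((n, n)); J = np.zeros((n, n)); E = np.zeros((m, n))
    Lam = np.zeros((m, n)); Pi = np.zeros((n, n))     # Lagrangian multipliers
    XtX = X.T @ X
    I = np.eye(n)
    for _ in range(max_iter):
        J = svt(Z + Pi / mu, 1.0 / mu)                                    # J-update
        Z = np.linalg.solve(I + XtX,
                            XtX - X.T @ E + J + (X.T @ Lam - Pi) / mu)    # Z-update
        E = soft(X - X @ Z + Lam / mu, lam / mu)                          # E-update
        Lam += mu * (X - X @ Z - E)                                       # multipliers
        Pi += mu * (Z - J)
        mu = min(mu * rho, 1e6)
        if max(np.abs(X - X @ Z - E).max(), np.abs(Z - J).max()) < tol:
            break
    return Z, E

# Clean low-rank toy data: XZ should nearly reproduce X.
rng = np.random.default_rng(1)
X = rng.standard_normal((30, 5)) @ rng.standard_normal((5, 40))
Z, E = lrr(X)
print(np.linalg.norm(X - X @ Z - E) / np.linalg.norm(X))  # small constraint residual
```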

NNDGLLRR.
Lat-LRR overcomes the over-strict dictionary conditions of LRR; however, its ability to recover the subspaces is limited, and its solution involves many auxiliary variables and hence many singular value decompositions (SVD) and matrix inversions, which affects the performance of the algorithm. Ref. [27] proposed a feature extraction method combining a manifold constraint with nonnegative matrix factorization (NMF): while NMF reduces the dimensionality, the manifold regularized constraint maintains the internal spatial structure of the data, and good experimental results were obtained. Refs. [28, 29] proposed image clustering methods combining a manifold regularized constraint with Lat-LRR. Like image data, a gene expression profile is a numerical matrix with high redundancy and heavy noise. Considering this characteristic, we construct a new NNDGLLRR model on the basis of the original model.
The NNDGLLRR model is
\[
\min_{Z,L,E} \|Z\|_* + \|L\|_* + \lambda\|E\|_1 + \frac{\alpha}{2}\operatorname{tr}\!\big(Z S_1 Z^{T}\big) + \frac{\beta}{2}\operatorname{tr}\!\big(L^{T} S_2 L\big)
\quad \text{s.t.} \quad X = XZ + LX + E,\; Z \ge 0,\; L \ge 0,
\]
where $\alpha$, $\beta$, and $\lambda$ are nonnegative constants; the model degenerates to a nonnegative latent low-rank representation (NNLLRR) when $\alpha$ and $\beta$ are equal to zero, so the model above takes a more general form. The dual regularized constraint preserves the internal spatial structure of the original data, and the sparse and nonnegative constraints maintain and enhance the robustness of the algorithm. $S_1 = D_1 - W_1$ and $S_2 = D_2 - W_2$ are graph Laplacian matrices, where $W_1$ and $W_2$ are weight matrices. There are many ways to construct $W$; here we use the Gaussian heat-kernel weight:
\[
W_{1,ij} = \exp\!\Big(\!-\frac{\|x_i - x_j\|^2}{\sigma}\Big), \qquad
W_{2,ij} = \exp\!\Big(\!-\frac{\|x^i - x^j\|^2}{\sigma}\Big),
\]
where $\sigma$ is a constant; $x_i$ and $x_j$ denote the $i$th and $j$th columns of $X$ (the $i$th and $j$th samples); $x^i$ and $x^j$ denote the $i$th and $j$th rows of $X$; and $D_{ii} = \sum_j W_{ij}$. ADM is used to solve the model, and the following augmented Lagrangian function is constructed:
\[
\begin{aligned}
\mathcal{L} ={} & \|Z\|_* + \|L\|_* + \lambda\|E\|_1 + \frac{\alpha}{2}\operatorname{tr}\!\big(Z S_1 Z^{T}\big) + \frac{\beta}{2}\operatorname{tr}\!\big(L^{T} S_2 L\big) \\
& + \operatorname{tr}\!\big(\Lambda^{T}(X - XZ - LX - E)\big) + \frac{\mu}{2}\|X - XZ - LX - E\|_F^2,
\end{aligned}
\]
where $\Lambda$ is a Lagrangian multiplier and $\mu$ is a constant with $\mu > 0$. Data in real life are generally nonnegative, and the nonnegative constraints give the computation a certain degree of sparsity and enhance the robustness of the algorithm. To keep the extracted features nonnegative, we define the projection operator $[M]_{+}$, which sets every negative entry of $M$ to zero. The solution of the model is divided into three subproblems: first, solving for the variable $Z$; second, solving for the variable $L$; and, third, solving for the variable $E$.
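The dual-graph construction above can be sketched as follows. This is an illustrative NumPy implementation under the definitions in the text; the value of $\sigma$ and the toy data are assumptions (a fully connected graph is built, without the k-nearest-neighbor truncation some implementations add).

```python
import numpy as np

def heat_kernel_laplacian(V, sigma=1.0):
    """Graph Laplacian S = D - W with Gaussian heat-kernel weights
    W_ij = exp(-||v_i - v_j||^2 / sigma), where the v_i are the rows of V."""
    sq = np.sum(V**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * V @ V.T   # pairwise squared distances
    W = np.exp(-np.maximum(d2, 0.0) / sigma)
    np.fill_diagonal(W, 0.0)                          # no self-loops
    D = np.diag(W.sum(axis=1))                        # degree matrix D_ii = sum_j W_ij
    return D - W

# Dual graphs for X (rows = features, columns = samples); sigma assumed:
rng = np.random.default_rng(2)
X = rng.standard_normal((8, 5))
S1 = heat_kernel_laplacian(X.T)   # sample graph (columns of X), regularizes Z
S2 = heat_kernel_laplacian(X)     # feature graph (rows of X), regularizes L
print(S1.shape, S2.shape)         # (5, 5) (8, 8)
```

By construction each Laplacian is symmetric with zero row sums, which is what makes the trace regularizers penalize differences between strongly connected nodes.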
(1) Solving the First Subproblem. Update $Z$: applying a second-order Taylor expansion to the smooth terms of the augmented Lagrangian at $Z_k$, the approximate solution of $Z$ is
\[
Z_{k+1} = D_{1/\eta_Z}\!\Big(Z_k - \frac{1}{\eta_Z}\nabla_Z h(Z_k)\Big),
\]
where $h$ collects the smooth terms of the augmented Lagrangian and $\eta_Z$ is a proximal constant no smaller than the Lipschitz constant of $\nabla_Z h$. The nonnegative constraint on $Z$ is then enforced by the projection
\[
Z_{k+1} \leftarrow [Z_{k+1}]_{+}.
\]
Ref. [30] gives the solution of $D_\tau(\cdot)$: if $U\Omega V^{T}$ is the singular value decomposition (SVD) of the argument, the singular value thresholding (SVT) operator is
\[
D_\tau(M) = U\,S_\tau(\Omega)\,V^{T}, \qquad S_\tau(\Omega)_{ii} = \max(\Omega_{ii} - \tau,\, 0).
\]
(2) Solving the Second Subproblem. Similarly, update $L$:
\[
L_{k+1} = D_{1/\eta_L}\!\Big(L_k - \frac{1}{\eta_L}\nabla_L h(L_k)\Big),
\]
with the nonnegative constraint enforced by $L_{k+1} \leftarrow [L_{k+1}]_{+}$, where $\eta_L$ bounds the Lipschitz constant of $\partial^2 h / \partial L^2$, e.g., $\eta_L = \mu\|X\|_2^2 + \beta\|S_2\|_2$.
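The two operators used in these subproblems, the SVT operator $D_\tau(\cdot)$ and the nonnegative projection $[\cdot]_+$, can be sketched directly from their definitions; the test matrix below is an assumed example.

```python
import numpy as np

def svt(M, tau):
    """D_tau(M) = U * S_tau(Omega) * V^T: shrink each singular value by tau."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def nonneg(M):
    """Projection onto the nonnegative orthant, [M]_+ = max(M, 0)."""
    return np.maximum(M, 0.0)

# Shrinking by tau = 1 maps singular values (3.0, 1.5, 0.2) -> (2.0, 0.5, 0.0),
# so the result has lower rank than the input.
M = np.diag([3.0, 1.5, 0.2])
R = svt(M, 1.0)
print(np.linalg.svd(R, compute_uv=False))  # close to [2.0, 0.5, 0.0]
```

The rank reduction visible here is the reason SVT appears wherever a nuclear-norm term is minimized.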
(3) Solving the Third Subproblem. Update $E$:
\[
E_{k+1} = \Theta_{\lambda/\mu}\!\Big(X - XZ_{k+1} - L_{k+1}X + \frac{\Lambda_k}{\mu}\Big),
\]
where $\Theta_\tau(\cdot)$ is the soft-threshold (ST) operator, defined entrywise as
\[
\Theta_\tau(x) = \operatorname{sign}(x)\max(|x| - \tau,\, 0).
\]
The iterative process of each variable of NNDGLLRR is given above. The concrete updating process is shown in Algorithm 1.
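The ST operator $\Theta_\tau(\cdot)$ follows directly from its definition; the input matrix below is an assumed example showing how small entries are zeroed out, which is how the sparse noise term $E$ is obtained.

```python
import numpy as np

def soft_threshold(M, tau):
    """Theta_tau(M): entrywise sign(m) * max(|m| - tau, 0)."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

# Entries with magnitude below tau = 0.5 become exactly zero;
# larger entries are shrunk toward zero by tau.
M = np.array([[ 2.0, -0.3],
              [-1.2,  0.1]])
print(soft_threshold(M, 0.5))  # [[1.5, 0.0], [-0.7, 0.0]]
```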

Sparse Representation Classifier (SRC).
Sparse representation has been a hotspot in pattern recognition in recent years. SRC has been applied successfully to image classification with relatively ideal experimental results [31]. Like image data, a gene expression profile is composed of a series of highly redundant, noisy gene samples. In this paper, the latent features extracted by NNDGLLRR are taken as the data dictionary $A$ to construct the following SRC model:
\[
\hat{c} = \arg\min_{c} \|c\|_1 \quad \text{s.t.} \quad \|y - Ac\|_2 \le \varepsilon.
\]
From the sparse code $\hat{c}$, the classification result for an unknown gene sample $y$ is the class whose coefficients give the smallest reconstruction residual:
\[
\operatorname{identity}(y) = \arg\min_{i} \|y - A\,\delta_i(\hat{c})\|_2,
\]
where $\delta_i(\hat{c})$ keeps only the coefficients associated with class $i$. The detailed flow of the SRC is shown in Algorithm 2.
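The two SRC steps, sparse coding followed by residual comparison, can be sketched as below. This is an illustrative implementation, not the authors' code: the $\ell_1$ problem is solved by a simple ISTA proximal-gradient loop (an unconstrained Lasso form of the model), and the dictionary, labels, and regularization weight are all assumed toy values.

```python
import numpy as np

def ista_l1(A, y, lam=0.05, n_iter=500):
    """Minimize 0.5*||y - Ac||^2 + lam*||c||_1 by ISTA (proximal gradient)."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / Lipschitz constant of the gradient
    c = np.zeros(A.shape[1])
    for _ in range(n_iter):
        g = c - step * (A.T @ (A @ c - y))   # gradient step on the quadratic term
        c = np.sign(g) * np.maximum(np.abs(g) - step * lam, 0.0)  # soft threshold
    return c

def src_classify(A, labels, y):
    """Assign y to the class whose atoms give the smallest reconstruction residual."""
    c = ista_l1(A, y)
    labels = np.asarray(labels)
    best, best_res = None, np.inf
    for cls in np.unique(labels):
        c_cls = np.where(labels == cls, c, 0.0)   # delta_i(c): keep class-i coefficients
        res = np.linalg.norm(y - A @ c_cls)
        if res < best_res:
            best, best_res = cls, res
    return best

# Toy dictionary (assumed data): class-0 atoms live in the first 10 coordinates,
# class-1 atoms in the last 10, so the classes span different subspaces.
rng = np.random.default_rng(3)
A0 = np.vstack([rng.standard_normal((10, 5)), np.zeros((10, 5))])
A1 = np.vstack([np.zeros((10, 5)), rng.standard_normal((10, 5))])
A = np.hstack([A0, A1])
A /= np.linalg.norm(A, axis=0)                 # unit-norm dictionary atoms
labels = [0] * 5 + [1] * 5
y = A[:, 2] + 0.05 * rng.standard_normal(20)   # noisy sample from class 0
print(src_classify(A, labels, y))              # -> 0
```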

Algorithm Flow.
To sum up, the algorithm consists of two parts: NNDGLLRR extracts latent features from the original gene expression profile, and SRC classifies the latent features. The overall flow is shown in Algorithm 3.

Selecting the Test Data.
To test the feature extraction performance of the algorithm, we used the diffuse large B-cell lymphoma [32] (DLBCL), mixed lineage leukemia [33] (MLL), lung cancer [34] (LC), and acute lymphoblastic leukemia [35] (ALL) gene data sets; the sample information of each data set is shown in Table 1.

Accuracy Test.
K-means and the sparse representation classifier (SRC) are simple and common classifiers. To compare their clustering results, both classifiers were used to classify the original gene expression profiles; the results are shown in Table 2. It is not difficult to see that the classification accuracy of SRC is significantly higher than that of K-means, owing to the small number of gene expression samples. To verify the effectiveness of the algorithm for feature extraction, the features extracted by LRR, Lat-LRR, and NNDGLLRR were classified by SRC; these results are also shown in Table 2. Table 2 shows that LRR, Lat-LRR, and NNDGLLRR can all perform feature extraction effectively, but the effect of NNDGLLRR is better than that of Lat-LRR. The category and number of samples, as well as the dimension of the gene expression profile, affect the final recognition accuracy.

The Influence of Graph Regularized Coefficients.
Generally, we set $\alpha = \beta$. To verify the influence of the graph regularized coefficients on feature extraction, we compared the classification accuracy under different values of $\alpha$ and $\beta$; the test results are shown in Figure 1.
[Algorithm 1: Solving the NNDGLLRR model with ALM. Input: $X \in \mathbb{R}^{m \times n}$, $\alpha > 0$, $\beta > 0$, $\lambda > 0$; output: $L^{*}$.]
Through the test results on MLL and LC, we find that the manifold regularized constraint has an obvious optimization effect on feature extraction from the gene expression profile when the values of $\alpha$ and $\beta$ are appropriate, significantly improving the recognition accuracy. However, $\alpha$ and $\beta$ should be neither too large nor too small, and the optimal graph regularized coefficients may differ between test data sets.

The Influence of Sparse Representation Coefficients.
During gene sequencing, the resulting gene expression profile usually contains heavy noise. To verify the effect of the sparse constraint on feature extraction, we tested the classification accuracy of LRR, Lat-LRR, and NNDGLLRR under different sparse constraint coefficients $\lambda$. The test results are shown in Figure 2. Figure 2 shows that the sparse constraint coefficient has a considerable effect on the final feature extraction results. When the value of $\lambda$ is appropriate, Lat-LRR and NNDGLLRR outperform LRR on feature extraction. In general, NNDGLLRR performs better than Lat-LRR, which again demonstrates the validity of the manifold constraint.

Complexity Analysis.
$Z \in \mathbb{R}^{n \times n}$, $L \in \mathbb{R}^{m \times m}$, and $E \in \mathbb{R}^{m \times n}$, and we denote the lowest ranks of $Z$ and $L$ obtained by the algorithm as $r_1$ and $r_2$. The complexity of one SVT operation for $Z$ and $L$ is then about $O(r_1 n^2)$ and $O(r_2 m^2)$, and the complexity of one ST operation for $E$ is about $O(mn)$. The complexity of constructing the Laplacian matrices for $Z$ and $L$ is about $O(n^2)$ and $O(m^2)$, and the complexity of one nonnegative projection for $Z$ and $L$ is about $O(n^2)$ and $O(m^2)$. If the number of iterations of the algorithm is $t$, the overall complexities of LRR, Lat-LRR, and NNDGLLRR are as shown in Table 3.
Generally, $m \gg n$ for gene expression profiles (far more genes than samples). Table 3 shows that LRR has the lowest computational complexity, but its feature extraction performance is weaker than that of Lat-LRR and NNDGLLRR, making it difficult to meet actual demands. Lat-LRR extracts features acceptably, but its ability to partition the subspaces is limited, and its operation is slow because of the many introduced variables. The variable update algorithm of NNDGLLRR not only reduces the amount of computation but also achieves satisfactory feature extraction results.

Conclusion
Aiming at the high redundancy and heavy noise of gene expression profiles, a feature extraction model, NNDGLLRR, is proposed in this paper. In the experiments, we extracted features from different gene expression profiles with LRR, Lat-LRR, and NNDGLLRR and classified the extracted features with SRC. The results show that the feature extraction performance of NNDGLLRR is better than that of LRR and slightly better than that of Lat-LRR, which verifies the comparative advantages of NNDGLLRR. At the same time, compared with Lat-LRR, the overall complexity of NNDGLLRR is reduced through the improved variable update algorithm. Tests on different gene expression data sets produced comparatively ideal experimental results, which demonstrates the validity of the dual graph regularized constraint. In summary, the proposed nonnegative low-rank sparse constraint and dual graph regularized constraint are reasonable, and NNDGLLRR adapts well to different gene expression profiles with high redundancy and heavy noise.

Conflicts of Interest
The authors declare that they have no conflicts of interest.