Category-Oriented Self-Learning Graph Embedding for Efficient Image Compact Representation


Liangchen Hu, Zhenlei Dai, Lei Tian, and Wensheng Zhang

Abstract—As one of the ways to acquire an efficient image compact representation, graph embedding (GE) based manifold learning has been widely developed over the last two decades. Good graph embedding depends on constructing graphs that capture intra-class compactness and inter-class separability, which are crucial indicators of a model's effectiveness in generating discriminative features. Unsupervised approaches are intended to reveal the data structure from a local or global perspective, but the resulting compact representation often has poor inter-class margins due to the lack of label information. Moreover, supervised techniques only consider enhancing the adjacency affinity within each class while excluding the affinity between different classes, and therefore cannot fully capture the marginal structure between the distributions of different classes. To overcome these issues, we propose a learning framework, Category-Oriented Self-Learning Graph Embedding (COSLGE), which achieves a flexible low-dimensional compact representation by imposing an adaptive graph learning process across the entire data set while examining the inter-class separability of the low-dimensional embedding through a jointly learned linear classifier. Besides, our framework can easily be extended to the semi-supervised setting. Extensive experiments on several widely-used benchmark databases demonstrate the effectiveness of the proposed method compared with some state-of-the-art approaches.

I. INTRODUCTION
CAPTURING an efficient image compact representation by taking full advantage of the valid information in the given data is essential for machine learning and data mining applications. The compact representation we refer to here operates at two levels: one is to embed samples from the high-dimensional observation space into a low-dimensional compact space, that is, dimensionality reduction for efficient feature compression; the other is to aggregate samples with high mutual affinity or similarity into a compact domain. Both help to reveal the latent structure of the data distribution and facilitate excellent classification and generalization performance.
Serving the above two purposes, we consider a data-driven graph G(Y, S) on an r-dimensional manifold embedded in a d-dimensional space (r < d) [1], where Y ∈ R^{r×n} is the data matrix composed of n embedded data samples and S ∈ R^{n×n} represents the affinity matrix of interconnected sample pairs. Now, by embedding the graph into an r-dimensional Euclidean space for locality preserving, we arrive at the following problem

min_Y Σ_{i,j} ∥Y_i − Y_j∥²₂ S_ij,  s.t.  Y H Y^T = m I,   (1)

which is known as Laplacian Eigenmaps (LE) [2], an efficient approach to nonlinear dimensionality reduction, where H is a constraint matrix and m is a constant. Nevertheless, LE faces the major problem of out-of-sample extension. Behind many out-of-sample attempts [3], a series of improved algorithms based on the linear assumption Y = P^T X has been the focus of research, among which Locality Preserving Projections (LPP) [4], [5] is the pioneering work:

min_P Σ_{i,j} ∥P^T X_i − P^T X_j∥²₂ S_ij,  s.t.  P^T X H X^T P = m I.   (2)

The developed LPP-based algorithms fall into three categories: unsupervised, supervised, and semi-supervised. In unsupervised methods [6], [24], [18], [30], [21], the lack of label information leads to insignificant inter-class margins in the low-dimensional space, which hurts subsequent clustering and classification tasks. By contrast, supervised or semi-supervised methods [7], [8], [9], [10], [11] can yield strikingly sharper margins between different classes. Unfortunately, as indicated in Fig. 1 (top), these methods only preserve the adjacency among highly similar samples by constructing intra-class graphs and ignore the adjacency relationships between inter-class boundaries, which is not conducive to the generalization of the model. It is therefore critical to induce large inter-class margins under the guidance of label information while constructing the graph Laplacians over the entire data distribution. A feasible framework is illustrated in Fig. 1 (bottom).
In these LPP-based approaches, the success of achieving a superior compact representation depends on the construction of affinity graphs, which may be partially or fully connected, with partial connections better suited to mining the structural information of the data distribution. The thorny issue, however, is that conventional ways of constructing the affinity graph rely on pre-definition in the original ambient space, including label-based 0-1 affinity [8], local geodesic distance [12], sparse representation [14], [15], and a few other attempts [16], [17], [10]. Since the affinity of samples in the ambient space cannot be strictly inherited through linear dimensionality reduction [19], learning the affinity of samples jointly with the projection is beneficial for revealing the latent structure of the data manifold. To address this problem, a variety of more or less successful methods have been devised over the past few years, which can be roughly classified into four categories: regularization penalty [18], [19], [20], [21], power weighting [25], [26], [27], norm-induced self-weighting [28], and non-negative self-representation [30], [32], [31]. These graph learning strategies are revisited in detail in the related work.

As mentioned earlier, all of the supervised algorithms discussed above are based on the general discriminant criterion of imposing intra-class compactness and inter-class separability, as illustrated in Fig. 1 (top). However, ignoring the marginal adjacency structure is not conducive to the generalization of the learned subspace. As illustrated in Fig. 1 (bottom), we propose a framework that couples global graph learning with marginal information preservation and category-oriented separability examination, thus resolving the above issue. In what follows, we summarize the main contributions of our framework.
1) We propose a novel supervised graph embedding framework that jointly performs global adaptive graph learning and inter-class margin examination. Instead of focusing on intra-class compactness as previous supervised strategies do, we consider local compactness over the entire data domain, boosting the generalization performance of the learned subspace while classifying with a jointly learned classifier.
2) Employing relative entropy as regularization, we develop a graph learning approach that mutually reinforces subspace learning and that enjoys several desirable properties compared with existing graph learning techniques, such as affinity retentivity, flexible nonlinearity, numerical stability, and approximate sparsity.
3) We provide an effective iterative optimization algorithm that optimizes all available variables except the data matrix, ensuring the flexibility of learning the low-dimensional compact representation. Besides, the algorithm can easily be extended to semi-supervised scenarios.
The rest of the paper is organized as follows. Section II reviews related work on dynamic graph structure learning. Section III details the proposed framework, derives an effective iterative algorithm for it, and discusses its computational complexity along with some necessary remarks. Section IV presents the semi-supervised extension. The experimental results are described in Section V, and Section VI concludes the paper with a summary and future work.

A. Notations
In our writing, all vectors and matrices are in bold italics, and all scalars are in simple italics. Given a data matrix X ∈ R^{d×n} of n samples belonging to c classes, n_i is the number of samples in class i, X_ij is the entry in the i-th row and j-th column of X, and X_i ∈ R^{d×1} refers to a d-dimensional sample. l_i ∈ {1, 2, ..., c} is the class label of sample X_i. Moreover, X^T, X^{−1} and Tr(X) represent the transpose, inverse and trace of X, respectively. The Frobenius norm, L₂ norm, and L_{2,p} norm are defined as ∥X∥_F = (Σ_{i,j} X²_ij)^{1/2}, ∥x∥₂ = (Σ_i x²_i)^{1/2}, and ∥X∥_{2,p} = (Σ_i ∥X_{i,:}∥^p₂)^{1/p}, respectively, where X_{i,:} is the i-th row vector of the matrix X.

B. Dynamic Graph Structure Learning
As mentioned earlier, the key to achieving a compact low-dimensional representation by solving problem (2) lies in the construction of a suitable affinity graph, which may be fully or locally connected. Unfortunately, among the numerous spectral-graph-based approaches [9], [8], [7], [12], [10], [11], affinity graphs are often constructed independently of the learning process. Instead of preconstructing the affinity graph, several adaptive weighting approaches for graph learning have emerged in recent years. Here, we present a brief review of four representative strategies.
1) Regularization Penalty: Regularization is a commonly used technique in machine learning that effectively avoids trivial solutions. Graph learning with a regularization penalty can be written as

min_{S_i 1 = 1, S_ij ≥ 0} Σ_j ∥P^T X_i − P^T X_j∥²₂ S_ij + η R(S_i),   (3)

where S_i is the row vector whose j-th entry is S_ij, 1 is the column vector of all ones, and η controls the evenness of the entries in S_i; a smaller η makes S_i more uneven. Taking the case R(S_i) = ∥S_i − A_i∥²₂, where A_i is the i-th row of a predefined affinity matrix A [20], [21], the equivalent subproblem extracted from problem (3) is

min_{S_i 1 = 1, S_ij ≥ 0} ∥S_i − R_i∥²₂,   (4)

where R_ij = A_ij − ∥P^T X_i − P^T X_j∥²₂ / (2η) is the j-th entry of the row vector R_i. Problem (4) can then be effectively solved by referring to [22], [18], [23].
Generally speaking, the solutions obtained by solving (4) are not sparse, which is not conducive to revealing the local manifold structure of the data.
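Subproblems of the form (4) reduce to a Euclidean projection onto the probability simplex, for which a standard sort-based O(n log n) routine exists. The following is a minimal illustrative sketch (the function name is ours, not from the referenced solvers [22], [18], [23]):

```python
import numpy as np

def project_simplex(r):
    """Euclidean projection of r onto {s : s >= 0, sum(s) = 1}.

    Standard sort-based routine for per-row subproblems of the
    form min ||S_i - R_i||^2 subject to the simplex constraints."""
    r = np.asarray(r, dtype=float)
    u = np.sort(r)[::-1]               # entries in descending order
    css = np.cumsum(u)
    # largest index k with u_k + (1 - css_k) / (k + 1) > 0
    k = np.nonzero(u + (1.0 - css) / np.arange(1, len(r) + 1) > 0)[0][-1]
    tau = (1.0 - css[k]) / (k + 1)     # shared shift
    return np.maximum(r + tau, 0.0)    # clip negatives to zero
```

A row already on the simplex is returned unchanged, and any other row is shifted and clipped so that its entries are non-negative and sum to one.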
2) Power Weighting: An alternative approach is to transform the linear problem in the affinity S_ij into the following power nonlinear problem

min_{S_i 1 = 1, S_ij ≥ 0} Σ_j ∥P^T X_i − P^T X_j∥²₂ S^v_ij,   (5)

where v = 2 is taken in [25], [26] and is generalized in [27] to any v > 1. When P is fixed, the general solution of (5) can be derived as

S_ij = (1 / ∥P^T X_i − P^T X_j∥²₂)^{1/(v−1)} / Σ_k (1 / ∥P^T X_i − P^T X_k∥²₂)^{1/(v−1)}.   (6)

The power weighting strategy requires no additional parameters and has a closed-form solution. But when v > 1, any ∥P^T X_i − P^T X_j∥²₂ that approaches 0 falls in the denominator, resulting in the absolute dominance of that S_ij in S_i, which may lead to the construction of a morbid graph.
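The closed-form rule (6) can be sketched in a few lines; the small `eps` guard below is our own addition to illustrate the near-zero-distance instability the text warns about, not part of the cited methods:

```python
import numpy as np

def power_weights(dists_sq, v=2.0, eps=1e-12):
    """One row of (6): S_ij proportional to (1 / d_ij^2)^(1/(v-1)).

    dists_sq: squared projected distances from sample i to the others.
    eps guards the near-zero distances that can make one S_ij dominate."""
    w = np.power(1.0 / (dists_sq + eps), 1.0 / (v - 1.0))
    return w / w.sum()   # normalize onto the simplex
```

With `v = 2` and squared distances [1, 4], the weights are proportional to [1, 1/4], i.e. [0.8, 0.2]; as a distance tends to 0, its weight tends to 1 and all others vanish, which is exactly the degeneracy described above.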
3) Norm-Induced Self-Weighting: Alternatively, we can directly employ the non-squared L₂ norm to measure the distance between two projected samples [28]:

min_{P,S} Σ_{i,j} ∥P^T X_i − P^T X_j∥₂ S_ij,  s.t.  ∥S_i∥₀ = K,   (7)

where ∥S_i∥₀ = K [29] means that each sample has only K nearest neighbors and S_ij = 1/∥P^T X_i − P^T X_j∥₂ [28] means that larger weights are assigned to pairs of samples with smaller distances. Solving problem (7) is equivalent to computing the optimal KNN graph under the non-squared L₂ norm, constructed as

S_ij = 1/∥P^T X_i − P^T X_j∥₂ if X_j is among the K nearest neighbors of X_i, and S_ij = 0 otherwise.   (8)

However, the same hidden danger of S_ij that occurs in (6) may affect the performance of the model.

4) Non-Negative Self-Representation: Besides the above methods relying solely on distance, the affinity on the graph can also be derived from the non-negative contributions of other samples to the reconstruction of a given sample, also known as non-negative self-representation of the data. Here we mention two representative approaches [30], [31]. In [30], fixing the other unrelated variables, we arrive at a low-rank representation problem with X = BS + E, where E is a sparse additive error matrix. Problem (9) strives to achieve a symmetric graph in which the affinity of a sample pair is (S_ij + S_ji)/2, derived from the low-rank representation coefficients of the original data under a dictionary B. Another approach, proposed in [31], expresses the sample relationships on the graph through non-negative self-representation of the projected data, with a locality penalty E ⊙ S, where ⊙ is the entry-wise multiplication and E here refers to the locality adaptor matrix. Similar work can be found in [32]. The above two methods still construct S in a linear way and require more parameters to be tuned, but the performance of mining the latent subspace is improved by preserving the structural representations between samples.
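For concreteness, the self-weighted KNN graph of (8) can be sketched as follows; this is our own illustrative construction (names and the tiny distance floor are assumptions), not code from [28] or [29]:

```python
import numpy as np

def inverse_distance_knn_graph(Y, K):
    """Sketch of the graph in (8): each row keeps its K nearest
    neighbours (non-squared L2 distance) with weight 1/distance.

    Y: (r, n) projected data, columns are samples. Returns (n, n) S."""
    n = Y.shape[1]
    D = np.linalg.norm(Y[:, :, None] - Y[:, None, :], axis=0)  # pairwise L2
    S = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(D[i])
        nbrs = [j for j in order if j != i][:K]    # K nearest, excluding self
        S[i, nbrs] = 1.0 / np.maximum(D[i, nbrs], 1e-12)
    return S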
Remark 1. From the above several ways of constructing graphs, we summarize several properties that a high-quality graph should have.

1) Retentivity: The structure information in the high-dimensional space should be preserved to a certain extent in the low-dimensional embedding;
2) Nonlinearity: The nonlinear manifold distribution of the data forces us to measure the relationships between samples in a nonlinear way;
3) Stability: The affinity between samples must not tend to infinity as the distance between samples varies;
4) Sparsity: The affinity of sample pairs in the graph should be sparse to better reveal the local manifold structure.

III. METHODOLOGY
In this section, we formulate our motivations step by step and derive an efficient iterative optimization algorithm. Then, the computational complexity and some notable properties of the proposed algorithm are discussed.

A. Framework
Assisted by the separability examination of a classifier, we aim to reveal the low-dimensional manifold structure of the data, in which the marginal distribution information is preserved simultaneously with the intra-class structure information. Here, we present the overall framework expressing our motives:

min_{P,S,f} Ξ(X, P, S) + δΩ(P) + γ(Π(f(Y), T) + αΓ(f)),   (11)

where f is a classifier and T is a target matrix containing label information. Ξ(X, P, S) performs self-learning graph embedding across the entire data domain, and Ω(P) is a regularization function for P with δ as its regularization parameter. Π(f(Y), T) checks the separability of projected samples of different classes during graph embedding learning. Γ(f) is a term that prevents the classifier f from overfitting. γ is a trade-off coefficient and α is a regularization parameter.

B. Self-Learning Graph Embedding
As indicated in Section II-B, a variety of dynamic graph learning strategies exist under diverse assumptions. For flexibility, we favor the "Regularization Penalty" strategy. Innovatively, we regularize graph learning in terms of relative entropy:

Ξ(X, P, S) = Σ_{i,j} ∥P^T X_i − P^T X_j∥²₂ S_ij + σ Σ_i D_KL(S_i∥A_i),  s.t.  S_i 1 = 1, S_ij ≥ 0, ∥S_i∥₀ = K,   (12)

where S_ij refers to the probability of the j-th sample being the nearest neighbor of the i-th sample, and ∥S_i∥₀ = K is the scale employed to constrain local graph construction. The relative entropy D_KL(S_i∥A_i) in (12), also known as the Kullback-Leibler divergence [33], is defined as

D_KL(S_i∥A_i) = Σ_j S_ij ln(S_ij / A_ij),   (13)

which is an asymmetric measure of the difference between the two probability distributions S_i and A_i. Specially, we define 0 ln 0 = 0.
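The row-wise divergence (13), together with the 0 ln 0 = 0 convention, can be computed as in the following sketch (our own helper for illustration):

```python
import numpy as np

def kl_rows(S, A, eps=1e-300):
    """D_KL(S_i || A_i) per row as in (13), with the convention 0 ln 0 = 0.

    Both S and A are assumed row-stochastic (rows sum to 1)."""
    S = np.asarray(S, dtype=float)
    A = np.asarray(A, dtype=float)
    # where S_ij = 0, force the ratio to 1 so the log contributes nothing
    ratio = np.where(S > 0, S / np.maximum(A, eps), 1.0)
    return np.sum(np.where(S > 0, S * np.log(ratio), 0.0), axis=1)
```

For example, with S_i = [0.5, 0.5, 0] and A_i = [0.5, 0.25, 0.25], the divergence is 0.5 ln 1 + 0.5 ln 2 + 0 = 0.5 ln 2, and the zero entry is handled without producing NaNs.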
Besides, regularization of P ensures the mining of a meaningful latent manifold structure, and many strategies are available. However, samples collected in high-dimensional space are often contaminated with noise and redundancy. In view of this, the L_{2,0} norm would be the best choice, but the difficulty of solving such problems makes us prefer the L_{2,p} norm with 0 < p ≤ 1 [36], namely,

Ω(P) = ∥P∥^p_{2,p}.   (14)

C. Inter-class Separability Examination
Although solving problem (12) with the regularization term (14) can capture the overall structural information, the resulting P lacks discriminant power because label information is not utilized. As a result, we feed the obtained low-dimensional representation Y into a classifier f to examine its classification performance. Then, we arrive at

Π(f(Y), T) = Σ_i ψ_i(f(Y_i), T_i),   (15)

where Π is the total classification loss and ψ_i is the loss measure for sample Y_i.
Since generating soft labels for training samples has no practical significance, the classifier allows continuous variable prediction. Here, for simplicity we take the linear predictor f(Y_i) = W^T Y_i, where W is the learned combination-coefficient matrix. And we expect the predicted results to correspond to the targets of those samples, namely

W^T Y_i ≈ T_i,   (16)

where T is flexible and satisfies the marginal constraint T_{i,l_i} − max_{j≠l_i} T_{ij} ≥ 1. Taking the squared L₂ norm for measuring the approximation in (16), we arrive at

Π(f(Y), T) = ∥X^T P W − T∥²_F.   (17)

Simply taking Γ(W) = ∥W∥²_F and integrating it with (12), (14), (15) and (17) into a joint learning framework, our model becomes

min_{P,S,W,T} Σ_{i,j} ∥P^T X_i − P^T X_j∥²₂ S_ij + σ Σ_i D_KL(S_i∥A_i) + δ∥P∥^p_{2,p} + γ(∥X^T P W − T∥²_F + α∥W∥²_F).   (18)

With some mathematical tricks, problem (18) can be further written in matrix form as

min_{P,S,W,T} Tr(P^T X L_s X^T P) + σ D_KL(S∥A) + δ∥P∥^p_{2,p} + γ(∥X^T P W − T∥²_F + α∥W∥²_F),   (19)

in which the Laplacian matrix L_s is defined as L_s = D_s − (S + S^T)/2, where D_s is the diagonal matrix with (D_s)_ii = Σ_j (S_ij + S_ji)/2.
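With the graph terms and the targets momentarily fixed, the classifier part of (18) is a standard ridge regression in W. A minimal sketch of that closed form (our own illustrative helper, not the paper's full update, which appears later inside the ADMM loop):

```python
import numpy as np

def ridge_classifier(Y, T, alpha):
    """Closed-form W for min ||Y^T W - T||_F^2 + alpha ||W||_F^2.

    Y: (r, n) projected samples (columns); T: (n, c) target matrix.
    Setting the gradient to zero gives (Y Y^T + alpha I) W = Y T."""
    r = Y.shape[0]
    return np.linalg.solve(Y @ Y.T + alpha * np.eye(r), Y @ T)
```

This is the subproblem the joint framework repeatedly revisits: as P reshapes Y = P^T X and the retargeting step reshapes T, the same regularized least-squares structure is solved for W.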

D. Optimization
The optimization problem established in (19) involves several interacting variables, making it difficult to derive analytic results directly. As an executable strategy, we utilize the alternating direction method of multipliers (ADMM) [41], updating one variable at a time while fixing the others.
To make (19) separable, we introduce the auxiliary variables Z and U with two additional constraints Z = UW and U = P. Problem (19) can then be expressed loosely as

min_{Z,T,P,S,W,U} Tr(P^T X L_s X^T P) + σ D_KL(S∥A) + δ∥P∥^p_{2,p} + γ(∥X^T Z − T∥²_F + α∥W∥²_F),  s.t.  Z = UW, U = P,   (20)

together with the simplex constraints on S. Consequently, we derive the augmented Lagrangian function of problem (20):

L = Tr(P^T X L_s X^T P) + σ D_KL(S∥A) + δ∥P∥^p_{2,p} + γ(∥X^T Z − T∥²_F + α∥W∥²_F) + Tr(C₁^T(Z − UW)) + Tr(C₂^T(U − P)) + (μ/2)(∥Z − UW∥²_F + ∥U − P∥²_F),   (21)

where C₁ and C₂ are Lagrange multipliers and μ is a positive penalty parameter. With some simple mathematical reasoning, Eq. (21) can be reformulated as

L = Tr(P^T X L_s X^T P) + σ D_KL(S∥A) + δ∥P∥^p_{2,p} + γ(∥X^T Z − T∥²_F + α∥W∥²_F) + (μ/2)(∥Z − UW + C₁/μ∥²_F + ∥U − P + C₂/μ∥²_F) + const.   (22)

By alternately optimizing each variable of (22), we ultimately reach the optimal solution for all variables. The detailed steps are given below.
Update Z: With the variables S, W, T and U fixed, (22) reduces, after discarding irrelevant terms, to

min_Z γ∥X^T Z − T∥²_F + (μ/2)∥Z − UW + C₁/μ∥²_F.   (23)

Taking the derivative of Eq. (23) w.r.t. Z and setting it to zero, we have

Z = (2γXX^T + μI)^{−1}(2γXT + μUW − C₁).   (24)

Update T: By fixing P and W, and simply denoting the regression result X^T P W as V ∈ R^{n×c}, Eq. (22) degenerates into the retargeting problem

min_T ∥V − T∥²_F,  s.t.  T_{i,l_i} − max_{j≠l_i} T_{ij} ≥ 1, ∀i.   (25)

Then, n subproblems with mutually independent constraints can be extracted from problem (25), the i-th of which is

min_{T_i} ∥V_i − T_i∥²₂,  s.t.  T_{i,l_i} − max_{j≠l_i} T_{ij} ≥ 1.   (26)

Obviously, T_i depends only on V_i. As in [34], we redefine the i-th target vector T_i through a scalar offset △_i as in Eq. (27), where τ_j is an indicator and τ_j ≤ 0 means that class j and class l_i satisfy the margin constraint. Using the new formulation in Eq. (27), we can rewrite optimization problem (26) as an unconstrained problem in △_i, Eq. (28). Taking the derivative of Eq. (28) w.r.t. △_i and setting it to zero, the optimal △_i is obtained as in Eq. (30), where

Θ(τ_j) = 1 if f′(τ_j) > 0, and Θ(τ_j) = 0 otherwise.

Then, the optimal target matrix T can be derived from Eq. (27).
Update W: Given Z, U and the other variables, the subproblem for updating W is

min_W γα∥W∥²_F + (μ/2)∥Z − UW + C₁/μ∥²_F.   (31)

Similarly, taking the derivative of Eq. (31) w.r.t. W and setting it to zero, we have

W = (μU^T U + 2γαI)^{−1} U^T(μZ + C₁).   (32)

Update U: Analogously, U can be solved by minimizing

(μ/2)(∥Z − UW + C₁/μ∥²_F + ∥U − P + C₂/μ∥²_F).   (33)

Taking ∂L(U)/∂U = 0, we have the closed-form solution

U = ((μZ + C₁)W^T + μP − C₂)(μWW^T + μI)^{−1}.   (34)

Update P: When the variables Z, T, S, W and U are fixed, the subproblem w.r.t. P can be rewritten as

min_P Tr(P^T X L_s X^T P) + δ∥P∥^p_{2,p} + (μ/2)∥U − P + C₂/μ∥²_F.   (35)

Then, by setting ∂L(P)/∂P = 0, we have

(2X L_s X^T + 2δD + μI)P = μU + C₂,   (36)

where D ∈ R^{d×d} is a diagonal matrix with diagonal elements D_ii = p / (2∥P_{i,:}∥^{2−p}₂), in which P_{i,:} is the i-th row vector of P. We then express the optimal P as

P = (2X L_s X^T + 2δD + μI)^{−1}(μU + C₂).   (37)

Incontestably, D depends on P, making the optimal P hard to obtain, but a suboptimal choice can be made through an iterative solution strategy.
Lemma 1 ([36], [37]). When 0 < p ≤ 1, the following inequality holds for any positive real numbers a and b:

a^{p/2} − (p/2) · a / b^{(2−p)/2} ≤ b^{p/2} − (p/2) · b / b^{(2−p)/2}.

Theorem 1. Assuming that P_t and D_t exist at the t-th iteration, and P_{t+1} is further solved by Eq. (37), then the inequality L(P_t) ≥ L(P_{t+1}) holds.
Thus, given the P_t generated by the previous iteration, we compute D_t and obtain the suboptimal solution P_{t+1} of (35) through (37).
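The iterative reweighting step above can be sketched as follows; the exact constant in D_ii varies across L_{2,p} papers, so this helper is an assumption consistent with the convention used in (36), and the small floor on the row norms is our own numerical guard:

```python
import numpy as np

def l2p_weights(P, p=0.5, eps=1e-8):
    """Diagonal reweighting matrix D for the L_{2,p} regulariser:
    D_ii = p / (2 * ||P_i||_2^(2-p)), computed from the previous P_t.

    Re-solving the smooth surrogate with D fixed, as in Theorem 1,
    does not increase the objective."""
    row_norms = np.maximum(np.linalg.norm(P, axis=1), eps)  # guard zero rows
    return np.diag(p / (2.0 * row_norms ** (2.0 - p)))
```

Rows of P with small norm receive large weights, which pushes them further toward zero in the next iteration and produces the row-sparse projection the regularizer is designed for.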
Update S: Similarly, by removing all terms that do not involve the variable S, the objective function reduces to

min_{S_i 1=1, S_ij≥0, ∥S_i∥₀=K} Σ_{i,j} ∥P^T X_i − P^T X_j∥²₂ S_ij + σ Σ_{i,j} S_ij ln(S_ij / A_ij).   (44)

We decompose problem (44) into n independent subproblems, each of which is equivalent to solving

min_{S_i 1=1, S_ij≥0, ∥S_i∥₀=K} Σ_j Q_ij S_ij + σ S_ij ln(S_ij / A_ij),   (45)

where Q_ij = ∥P^T X_i − P^T X_j∥²₂. Obviously, the constraint ∥S_i∥₀ = K can be implemented by restricting the support of S_i to the K smallest entries of Q_i. Denoting the subscripts of these K entries by {ρ₁, ρ₂, ..., ρ_K}, (45) can be re-expressed as

min_{Σ_j S_{iρ_j} = 1, S_{iρ_j} ≥ 0} Σ_j Q_{iρ_j} S_{iρ_j} + σ S_{iρ_j} ln(S_{iρ_j} / A_{iρ_j}).   (46)

And the Lagrangian function of problem (46) is

Σ_j [Q_{iρ_j} S_{iρ_j} + σ S_{iρ_j} ln(S_{iρ_j} / A_{iρ_j})] + μ(Σ_j S_{iρ_j} − 1),   (47)

where μ is the Lagrangian multiplier. Taking the derivative w.r.t. S_{iρ_j} and setting it to zero gives

S_{iρ_j} = A_{iρ_j} exp(−Q_{iρ_j}/σ) exp(−1 − μ/σ).   (48)

Substituting (48) into the constraint Σ_j S_{iρ_j} = 1 yields

exp(−1 − μ/σ) = 1 / Σ_k A_{iρ_k} exp(−Q_{iρ_k}/σ).   (49)

Then, we substitute (49) into (48) to produce the final result

S_{iρ_j} = A_{iρ_j} exp(−Q_{iρ_j}/σ) / Σ_k A_{iρ_k} exp(−Q_{iρ_k}/σ).   (50)

Update C₁, C₂ and μ: The Lagrange multipliers C₁ and C₂ and the penalty parameter μ are respectively updated as

C₁ ← C₁ + μ(Z − UW),  C₂ ← C₂ + μ(U − P),  μ ← min(ρμ, μ_max),

where μ_max and ρ are both constants. The detailed procedure for solving problem (19) is summarized in Algorithm 1.
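The closed-form row update (50) is an A-weighted softmax restricted to the K nearest projected neighbours, and can be sketched as follows (an illustrative helper; in practice the sample itself would be excluded from its own neighbour set beforehand):

```python
import numpy as np

def update_affinity_row(Q_i, A_i, sigma, K):
    """Row update implementing (50): among the K entries with smallest
    Q_ij, set S_ij proportional to A_ij * exp(-Q_ij / sigma); all other
    entries are zero.

    Q_i: squared projected distances for row i; A_i: predefined affinities."""
    S_i = np.zeros(len(Q_i))
    nbrs = np.argsort(Q_i)[:K]                    # K smallest distances
    w = A_i[nbrs] * np.exp(-Q_i[nbrs] / sigma)    # retentivity * nonlinearity
    S_i[nbrs] = w / w.sum()                       # simplex normalisation
    return S_i
```

The factor A_ij carries the "Retentivity" from the predefined graph, the exponential supplies bounded "Nonlinearity" and "Stability", and the K-support plus the rapid decay of exp(−Q/σ) yield the approximate "Sparsity" discussed later.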

E. Computational complexity
Here, we analyze each step of Algorithm 1 and derive the total computational complexity. Initializing the probability matrix S₀ requires the pairwise distances over all samples.

[Algorithm 1 (summary). Input: data matrix X, labels, predefined affinity matrix A, dimension r, parameters α, γ, δ and σ. 1: Initialize P₀ ∈ R^{d×r} and W₀ ∈ R^{r×c} randomly. 2: Initialize S₀ by solving (12) with the relative-entropy term Σ_{i,j} S_ij ln(S_ij / A_ij).]

F. Discussion
Here we discuss the following interesting points concerning our proposal and its theoretical derivations.
1) In Eq. (50), the predefined affinity A_{iρ_j} acts as a penalty weight, thus establishing "Retentivity" from the high-dimensional space to the low-dimensional space. Then, exp(−Q_{iρ_j}/σ) in (50) gives the distribution of affinity flexible "Nonlinearity". And exp(−Q_{iρ_j}/σ) → 1 as Q_{iρ_j} → 0 ensures numerical "Stability". As illustrated in Fig. 2, a Q_{i,ρ_j} that is not on the same order of magnitude as Q_{i,ρ₁} may drive S_{i,ρ_j} toward 0, which yields a certain "Sparsity".
2) Essentially, our proposal yields two projection matrices, P ∈ R^{d×r} and PW ∈ R^{d×c}, both of which can be used for subsequent tasks. P preserves the adjacency structure information of the entire data distribution and can project data into a subspace of any dimension, while PW can only project data into a c-dimensional subspace. Besides, for PW, assuming r < c, the resulting PW has a discriminative low-rank property [35].
3) The sparsity of the L_{2,p} norm enables the joint learning of feature selection and feature extraction, making the input features of the joint classification learning as pure as possible. Moreover, during the joint classification learning, the targets can adaptively respond to changes in the low-dimensional embedding and learn more discriminative features.

IV. SEMI-SUPERVISED EXTENSION OF COSLGE
Assuming that the data matrix X = [X_L, X_U], where X_L ∈ R^{d×l} is labeled and X_U ∈ R^{d×u} is unlabeled (l + u = n), we can easily extend our model to a semi-supervised version:

min_{P,S,W,T} Tr(P^T X L_s X^T P) + σ D_KL(S∥A) + δ∥P∥^p_{2,p} + γ(∥X_L^T P W − T∥²_F + α∥W∥²_F).   (54)

In (54), local adaptive graph learning remains consistent with (19) across the entire data field, while the separability examination is applied only to the labeled samples. Labeled data X_L can be projected into a desired subspace by means of the separability examination, while unlabeled data X_U are drawn closer to the labeled data through their underlying structure, distribution patterns, and adjacent labeled samples. Problem (54) can also be solved using Algorithm 1, with W learned only on the labeled projected data P^T X_L.

V. EXPERIMENTS
In this section, we analyze the proposed method comprehensively through experiments, including comparisons with other methods on several public benchmark databases, ablation experiments validating the effectiveness of each component of our model, and studies of parameter sensitivity and algorithm convergence. Six distinctive and widely-used databases are employed to demonstrate the validity of our algorithm: GTdb, UMIST, YaleB, PIE, USPS, and COIL20, four of which evaluate face recognition performance, and the other two handwritten-digit and object recognition performance, respectively. Some typical sample images are illustrated in Fig. 3. More details are given below.

A. Databases
GTdb¹ -- The Georgia Tech face database contains images of 50 people, each with 15 sets of images. The pictures show frontal and/or tilted faces with different facial expressions, lighting conditions and scales. We resize each cropped image to 60×40 pixels and train directly on the grayscale pixel features.
UMIST [43] -The UMIST face database consists of 575 images of 20 subjects, each changing a series of views from the side to the front.Each subject has no less than 20 images and a maximum of 38 images, each of which is resized to 32 × 32 pixels in our experiments.
YaleB [44] -The cropped Extended Yale B face database consists of 2414 frontal facial images of 38 subjects, each photographed under about 64 different lighting conditions.We resize all the images to 32 × 32 pixels, with 256 gray levels per pixel.
PIE [45] -- The CMU PIE database includes 68 persons and a total of 41368 face images taken by multiple synchronized cameras and flashes in different poses, lighting conditions and expressions. Our study focuses on five near-frontal image sets (C05, C07, C09, C27, C29), with approximately 170 images per subject for a total of 11554 images. All images are grayscale and cropped to 32×32 pixels.
USPS [46] -The USPS handwritten digital database we refer to here is a popular subset of the original database that contains 9298 images of 10 Arabic numerals, each resized to 16×16 pixels, meaning that each sample is a 256 dimensional vector.
COIL20 [47] -- The Columbia Object Image Library database is composed of 20 objects with a total of 1440 images. As the objects rotate on a turntable, images of each object are taken at five-degree intervals, giving 72 images per object. Each image is adjusted to 32×32 pixels.

B. Competitors
Training a Deep Neural Network (DNN) requires a large number of data points, which is resource-intensive and unsuitable for small-scale datasets [48]. As a result, we focus mainly on several state-of-the-art graph-based methods for comparison, including supervised LPP (SLPP) [5], LPP-based local Fisher discriminant analysis (LFDA) [7], marginal Fisher analysis (MFA) [8], nonnegative sparse graph (NNSG) based linear regression [42], robust latent subspace learning (RLatSL) [32], joint graph optimization and projection learning (JGOPL) [31], simultaneously learning neighborship and projection (SLNP) [19], and adaptive local linear discriminant analysis (ALLDA) [27]. Except for JGOPL and SLNP, which we reimplemented from the published papers, the codes of the other methods were released by their original authors. To be fair, for the unsupervised JGOPL, we extend it to the supervised scenario by restricting its graph learning to within-class sample pairs, where X^k_i is the i-th sample belonging to the k-th class and S^k_ij is the affinity between X^k_i and X^k_j. Besides, some information about these methods is described in Table I. Among them, there are 3 predefined-graph methods, 3 dynamic intra-class graph learning methods and 2 global dynamic graph learning methods. NNSG builds a non-negative self-representation graph for the learned targets, while ours builds a global graph of inter-sample relationships in the projected subspace.

¹http://www.anefian.com/research/facereco.htm

C. Experimental Setup
All experimental images are flattened into vector samples and normalized. n_s samples of each category in each database are randomly selected for training, and the remaining samples form the test set. To be specific, in the following experimental results, the value of n_s is denoted as "# number". Considering that some methods involve inverse problems, PCA is performed to preserve 98% of the energy of each dataset in all our experiments. Since the regression-based algorithms can only reduce to c dimensions, we set the reduced dimension of every algorithm to c for simplicity and convenience.
In Table I, some parameters need to be set in advance for the compared methods. For the neighborhood size used in graph construction, we uniformly set it to 20% of the samples in each class; self-representation graph learning determines the neighborhood size adaptively. We then simply adopt the hyperparameter settings suggested in the original papers. After extracting compact features with the various methods, we classify them with the nearest neighbor classifier (1-NN) and repeat 10 random trials, evaluating performance with average recognition accuracies and standard deviations.
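The evaluation protocol just described (random per-class splits, 1-NN on the extracted features, repeated trials) can be sketched as follows; the function and its defaults are our own illustration of the protocol, not released experiment code:

```python
import numpy as np

def evaluate_1nn(features, labels, n_s, trials=10, seed=0):
    """Protocol sketch: n_s random training samples per class, 1-NN on
    the rest, repeated `trials` times; returns (mean, std) accuracy.

    features: (n, d) extracted compact features; labels: (n,) class ids."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(trials):
        train_idx, test_idx = [], []
        for c in np.unique(labels):
            idx = rng.permutation(np.flatnonzero(labels == c))
            train_idx += list(idx[:n_s])
            test_idx += list(idx[n_s:])
        Xtr, Xte = features[train_idx], features[test_idx]
        ytr, yte = labels[train_idx], labels[test_idx]
        d = np.linalg.norm(Xte[:, None, :] - Xtr[None, :, :], axis=2)
        pred = ytr[np.argmin(d, axis=1)]          # nearest training sample
        accs.append(np.mean(pred == yte))
    return float(np.mean(accs)), float(np.std(accs))
```

In the actual experiments, `features` would be the compact representation produced by each method under comparison, and `n_s` is the "# number" reported with each result.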

D. Experimental Results
1) Classification Performance: In this section, we investigate the superiority of our method over the others in classification performance. The databases involve face, object and handwritten-digit recognition, which helps reflect the universality of the proposed method. In the supervised version, our COSLGE learns two different projections simultaneously, namely P ∈ R^{d×r} and PW ∈ R^{d×c}. Thus, to compare the classification performance of these two projections, we denote the corresponding methods by COSLGE_P and COSLGE_PW. Instead of evaluating our semi-supervised version S-COSLGE separately, the training data of COSLGE are taken as the labeled samples of S-COSLGE and the test data as the unlabeled samples for training. The final experimental results are shown in Table II. We then discuss the following points.
1) Our approach surpasses the other approaches at almost every split on all of these databases, especially on GTdb, whose distribution is more complex. The two learned projections are both valid and have similar classification performance. Because adjacency graph learning spans both the training and test sets, the results of S-COSLGE are usually better than those of the supervised COSLGE.
2) Some graph optimization algorithms cannot even exceed the predefined-graph algorithms, potentially for two reasons: either the algorithm's ability to extract discriminative information is insufficient, or its rule assumption for graph optimization is unreasonable. Both LFDA and MFA attempt to construct local penalty graphs, which helps enhance the model's ability to extract discriminant information.
3) JGOPL, SLNP, and ALLDA all require the inverse of the global scatter matrix S_t, so the null space of S_t must be removed in advance or a regularization perturbation must be imposed on it. Our approach, by contrast, requires neither of these precautions nor any special preprocessing of the data. Moreover, RLatSL and NNSG can only project samples into a c-dimensional space, whereas our method COSLGE_P can project samples into a space of any dimension.
4) A stable and efficient model needs to yield stronger compactness within highly similar sample domains while inducing a certain separability between different classes. We therefore prefer constructing the graph globally rather than within classes. Moreover, the tendency to preserve marginal adjacency information helps enhance the generalization ability of the model, and in our model the joint classifier promotes the generation of separability.
2) Visualization Results: In this section, we present some visual results to support the validity of our proposed COSLGE.Due to space limitations, we cannot illustrate the results on all databases, but only on the PIE and YaleB databases.
With the increase of dimension, the inter-sample affinity based on the Euclidean metric gradually fails.² Measuring the affinity between similar samples in the original space becomes extremely hard, as illustrated on the left of Fig. 4. An alternative strategy is to project these samples into a low-dimensional space so that similar samples cluster in a compact domain, which is exactly what our model pursues. By contrast, all the non-zero affinities learned on the right side of Fig. 4 have a distinct meaning and are almost all concentrated within the same class. "Almost" here means that some marginal information is retained during the learning process. On the right side of Fig. 4, we mark two sample pairs, {1630, 121} and {1290, 2239}, that are not in the same class yet have non-zero affinities, where the numbers refer to positions in the training set. The two sample pairs are highly similar in facial features, face orientation, and lighting conditions, which confirms the importance of preserving marginal information, as illustrated in the framework of Fig. 1. Besides, to investigate the validity of the separability examination in our model more intuitively, we provide t-SNE visualizations of the original and projected sample data. In Fig. 5, it is obvious that data samples belonging to different classes are significantly separated from each other after projection, while the samples in the original space are mixed together. As for the classification targets, we also visualize them in Fig. 6; they differ significantly from the 0-1 targets, and the following ablation experiments verify their superiority over strict 0-1 targets.

²https://www.visiondummy.com/2014/04/curse-dimensionality-affect-classification/

3) Ablation Study: In this section, we investigate the performance of our model after removing the relevant modules. Above all, we denote our model without optimizing the target matrix T as COSLGE_noT, our model without A as COSLGE_noA, and our model optimizing S only in the original space as COSLGE_noS. The average experimental results over the two splits across all databases are shown in Table III. Although COSLGE_noA outperforms the others on GTdb and UMIST, the worst-performing COSLGE_noS shows that S is more effective at capturing nearest-neighbor relationships in the projected subspace. Aside from that, COSLGE_P performs best in all other situations, which also indicates the indispensability of the three modules.
4) Sensitivity to Reduced Dimensions: In the above experimental setup, for fairness, we assumed that all methods project onto a c-dimensional subspace. In this section, we set up a dedicated group of experiments to observe the behavior of the proposed method with respect to dimension changes. At 10-dimension intervals between 10 and 100 dimensions, we repeat 10 randomized trials across all databases, and the average results are illustrated in Fig. 7. Except for a slight underperformance at 10 dimensions, there is almost no significant difference across the remaining dimensions, indicating that our algorithm is relatively insensitive to the choice of reduced dimension.
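As a sketch of this protocol (not the authors' released code), the sweep over reduced dimensions with averaged randomized trials can be written as follows, where `run_trial` is a hypothetical stand-in for one train/evaluate split of COSLGE at a given target dimension:

```python
import numpy as np

def sweep_dimensions(run_trial, dims=range(10, 101, 10), trials=10, seed=0):
    """For each target dimension, average accuracy over `trials`
    randomized splits; `run_trial(dim, rng)` is assumed to draw a random
    split, project to `dim` dimensions, and return an accuracy."""
    rng = np.random.default_rng(seed)
    return {d: float(np.mean([run_trial(d, rng) for _ in range(trials)]))
            for d in dims}
```

Plotting the returned dictionary (dimension versus mean accuracy) reproduces the layout of Fig. 7.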

5) Parameter Analysis:
In our model, three hyperparameters α, δ, and γ need to be tuned before use, where α and δ are regularization parameters. We investigate the parameter sensitivity of the proposed method by observing the classification accuracy under different parameter combinations. Given the properties of the regularization parameters, we recommend setting α and δ to the same value. Specifically, the two regularization parameters [α, δ] can be selected from the set {a, b, c, d} = {[0.5, 0.5], [0.1, 0.1], [0.05, 0.05], [0.01, 0.01]}, and the trade-off coefficient γ can be selected from the set {0.5, 1, 5, 10}. The classification performance is shown in Fig. 8. In general, our model exhibits strong stability over parameter combinations, except on the GTdb and UMIST databases, which have smaller sample sizes. The performance on GTdb reflects that the model is more affected by regularization.
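The recommended search space above amounts to a small 4 × 4 grid. A minimal sketch of the selection loop, with `train_and_score` as a hypothetical stand-in for fitting COSLGE and returning validation accuracy:

```python
import itertools

REG_VALUES = [0.5, 0.1, 0.05, 0.01]  # shared value for alpha and delta
GAMMA_VALUES = [0.5, 1, 5, 10]       # trade-off coefficient gamma

def grid_search(train_and_score):
    """Return (best_accuracy, (reg, gamma)) over the recommended grid,
    tying alpha = delta as suggested in the text."""
    best = (-1.0, None)
    for reg, gamma in itertools.product(REG_VALUES, GAMMA_VALUES):
        acc = train_and_score(alpha=reg, delta=reg, gamma=gamma)
        if acc > best[0]:
            best = (acc, (reg, gamma))
    return best
```

Tying α = δ cuts the grid from 64 to 16 combinations, which matches the stability observed in Fig. 8.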
6) Convergence Analysis: Here, we demonstrate the convergence of the proposed COSLGE. In Fig. 9, we plot the changes of the three convergence criteria of our algorithm against the number of iterations on the YaleB and PIE databases. In the early stage, the convergence curves may fluctuate under the influence of the iteration step size, but they eventually converge. Of the three, ∥Z − UW∥∞ and ∥U − PW∥∞ converge the fastest, within 20 iterations. The convergence behavior on the other databases is omitted due to space limitations.
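The stopping rule behind these curves can be sketched as follows; the matrix shapes are assumed compatible with the products Z ≈ UW and U ≈ PW, and the tolerance is illustrative rather than the value used in the experiments:

```python
import numpy as np

def inf_norm(M):
    """Entrywise infinity norm: max_ij |M_ij|."""
    return float(np.abs(M).max())

def converged(Z, U, W, P, tol=1e-4):
    """Check the two fastest criteria from Fig. 9,
    ||Z - UW||_inf and ||U - PW||_inf, against `tol`."""
    return inf_norm(Z - U @ W) < tol and inf_norm(U - P @ W) < tol
```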

VI. CONCLUSION
In this paper, we propose a compact self-learning representation framework that combines adjacency graph learning with a joint separability examination. Different from existing supervised self-learning graph embedding methods, we preserve both intra-class and inter-class adjacency information for better generalization performance. The obtained low-dimensional embedded samples are fed into the joint classifier to test their inter-class separability, which benefits subsequent classification tasks. Extensive experiments on six widely used databases demonstrate the strong generalization, stability, and superiority of the proposed method. Besides, our approach can easily be extended to a semi-supervised version and performs well there. In future work, we will consider improving the robustness of our model and its handling of large data sets.

Fig. 4. Two affinity matrices generated from classes 5 to 10 on the PIE database with 40 training samples per class. Initial affinity matrix in the original space (left); final affinity matrix in the projected space (right).

Fig. 5. t-SNE visualization of all original samples (left) and all embedded samples (right) in the YaleB database.

Fig. 7. Classification performance (%) versus reduced dimensions on six different databases with 40, 30, 9, 12, 30, and 40 training samples per class, respectively. The numbers in the legend are the reduced dimensions.
²d). In Step 6, computing Z requires O(nd² + ndc + d³ + d²c + drc) operations. In Step 7, the calculation of T is less time-consuming, taking approximately O(nc) operations. In Steps 8 and 9, the updates of W and U are both of order O(dr² + r³ + r²c + drc). Computing P in Step 10 requires O(n²d + nd² + d²r) operations. In Step 11, the probabilistic matrix S requires time of order O(n²r). By contrast, the complexity of Step 12 is negligible.
Since {r, c} ≤ d and the size of n is uncertain, the total computational complexity is at most O(t(nd² + d³ + n²d)), where t is the number of iterations before the algorithm converges.
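As a quick sanity check of this bound (a sketch with all constant factors dropped, not part of the algorithm itself), the per-iteration step costs listed above can be summed and compared against the dominant envelope:

```python
def per_iteration_cost(n, d, r, c):
    """Sum of the per-step costs listed above, constant factors dropped."""
    return (n*d**2 + n*d*c + d**3 + d**2*c + d*r*c   # Step 6: Z
            + n*c                                    # Step 7: T
            + 2 * (d*r**2 + r**3 + r**2*c + d*r*c)   # Steps 8-9: W and U
            + n**2*d + n*d**2 + d**2*r               # Step 10: P
            + n**2*r)                                # Step 11: S

def dominant_bound(n, d):
    """The O(nd^2 + d^3 + n^2*d) per-iteration envelope."""
    return n*d**2 + d**3 + n**2*d
```

For any r, c ≤ d the summed cost stays within a small constant multiple of the envelope, consistent with the stated total of O(t(nd² + d³ + n²d)).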

TABLE I: ATTRIBUTES OF ALL METHODS INVOLVED IN COMPARISONS.

TABLE II: MEAN CLASSIFICATION ACCURACIES ± STANDARD DEVIATIONS (%) OF DIFFERENT METHODS ON VARIOUS SPLITS OF SIX DATABASES. " means the number of training samples per class, and bold indicates the best and the second best.

TABLE III: PERFORMANCE EVALUATION OF OUR MODEL AFTER REMOVING THE RELEVANT MODULES.