Introduction

Technological advances in data acquisition and storage have produced a large number of high-dimensional and ultra-high-dimensional datasets in various biomedical applications1, such as electroencephalogram (EEG), gene expression, and sonar imaging. Although such data are certainly desirable for modeling biological processes2, they can be redundant and lead to the curse of dimensionality or excessive computational cost when used for model construction and pattern recognition. This issue motivates the use of dimension reduction technologies such as principal component analysis (PCA), linear discriminant analysis (LDA), and locality preserving projections (LPP). In essence, these linear or nonlinear projection strategies can be unified in the framework of graph embedding (GE), which constructs a graph based on the affinity information between each pair of samples and represents each vertex of the graph with a low-dimensional vector that preserves the similarities between vertex pairs3. Since the establishment of this framework, there have been a variety of successful applications in biomedical signal analysis4,5,6.

However, a great challenge in data analysis, particularly in biomedical research, is the presence of inevitable artifacts in signal recordings7,8, which are one of the major factors accounting for reduced signal quality. Artifacts can be caused by various factors, such as clinical image artifacts from illumination variations and dust particles, measurement errors in biochemistry8, independent scatterers caused by biological tissues that are smaller than the acoustic wavelength9, and artifacts in EEG recordings due to blinks and eye movements, a large number of spontaneous brain activities, or spikes. Artifacts are usually several orders of magnitude larger than the signal of interest and cannot be described by the standard Gaussian distribution. Thus, when conventional L2 norm-based approaches such as maximum likelihood (ML) and least squares are used to estimate the noise variance or to extract features, the results are inevitably biased by the components that express the outliers9,10,11. To resist the noise influence, Cai et al.12 redefined kernel discriminant analysis with spectral graph analysis (SRKDA) and estimated the projections in a more efficient and robust way. Although a series of improvements13,14,15 have been proposed to widen the applications of GE extensions in recent years, few of them have focused on artifact restriction.

The easiest way to suppress the influence of artifacts for further analysis is to reject the contaminated samples. However, this is not favorable because it may lead to the loss of other useful information. A more favorable way is to detect the artifacts or outliers and then use approaches such as regression or blind source separation (BSS) for artifact removal. Although some methods have been proposed to find outliers, it is difficult to determine the extent to which the peculiarities should be treated as outliers8. Even if the artifacts or outliers are located, eliminating the artifact components with BSS methods such as independent component analysis (ICA) is not trivial and may result in the loss or distortion of useful information in the extracted features16. To avoid the loss of data while restricting the outlier influence during feature extraction, one of the most efficient ways is to extend the conventional L2 norm-based models to the Lp (p ≤ 1) norm space. For example, Kwak et al. replaced the L2 norm structure of conventional PCA with the L1 norm to restrict the outlier influence in dimensionality reduction17. Wang et al. proposed replacing the covariance matrix with an L1 norm structure in the Rayleigh quotient expression to improve the robustness of common spatial patterns (CSP) toward outliers for a motor imagery-based brain computer interface (BCI)11. In our previous conference work18, Zhou developed spectral regression (SR) in the L1 norm space to obtain robust parameters under outlier conditions while neglecting the possible outlier effects on the estimation of the response vectors in graph embedding. Similar to Zhou's work, Nie et al. measured the distance between any pair of projected vertices in the L1 norm space19. To facilitate the solution, they still constrained the denominator of the Laplacian embedding in the L2 norm space; accordingly, it may also be influenced by outliers. In this work, we propose a novel graph-embedding framework based on an L1 norm maximization strategy to restrict the outlier influence in dimensionality reduction, which is usually encountered in biological signals.

Results

In recent years, graph embedding and its extensions have gained increasing attention and contributed to a series of related reports. Among these, discriminant analysis-based extensions, such as RLDA, HLDA and NDA, and spectral regression-based extensions, such as SRDA, SRKDA and locality preserving projection and its extensions, are commonly mentioned and have been successfully applied to both engineering technology and neural science research12,20,21,22,23,24,25. These methods have their own strengths when dealing with different data problems, such as noise and heteroscedasticity, in dimension reduction3,26. In addition to these GE extensions, other L1 norm-based linear classifiers, such as Lasso SRDA, L1 SR and L1 LDA, have also been proposed in recent years to resist outlier influence. Thus, in this section, we investigate the performance difference between our proposed L1 GE algorithm and these popular classifiers in classification tasks. Classification accuracy is adopted as the performance index for evaluation. All of our experiments were performed on a Core i3 3.30-GHz Windows 7 machine with 8 GB of memory.

We mainly investigate the performance difference between L1 GE and 10 other popular classifiers (7 GE extensions and 3 L1 norm-based linear classifiers), i.e., LDA27, Regularized LDA (RLDA)28, Heteroscedastic LDA (HLDA)29, Non-parametric LDA (NDA)30, Spectral regression discriminant analysis (SRDA)31, LPP32, SRKDA12, Lasso SRDA31, L1 LDA33 and L1 SR18, for dimension reduction and feature extraction. In both the simulation and real-dataset experiments, we simply set the regularization parameters for RLDA and the SRDAs (i.e., SRDA, SRKDA and Lasso SRDA) to 1.0 and 0.7, respectively, according to31. As the type of kernel was not the focus of this work, we used the Gaussian kernel for SRKDA as reported in12. The kernel width parameter σ was tuned to achieve the best testing performance for SVM, and the same σ was then used for SRKDA. The weight estimation for LPP follows32. Three neighbors were used in NDA, as reported in a previous study30.

Simulation studies

In this section, two classes drawn from 2-D Gaussian distributions with different means are adopted34. Class 1 is drawn from a Gaussian distribution with mean (3.00, 3.00) and variance (0.5, 0.5), and class 2 is drawn from a Gaussian distribution with mean (1.85, 1.85) and variance (0.5, 0.5). Theoretically, the optimal hyperplane (i.e., decision boundary) for differentiating the two classes should be oriented along 135 degrees. If outliers are introduced, the corresponding classification hyperplane may be biased. During the simulation, the training set consisted of 220 samples (110 per class), and the testing set contained the same label distribution, i.e., 110 samples per class.
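
A minimal sketch of this data generation, assuming numpy and treating the stated variances as per-dimension variances of an isotropic Gaussian (the function name is our own), could look as follows:

```python
import numpy as np

def make_two_class_data(n_per_class=110, seed=0):
    """Draw the two 2-D Gaussian classes described above.

    Class 1: mean (3.00, 3.00), per-dimension variance 0.5.
    Class 2: mean (1.85, 1.85), per-dimension variance 0.5.
    Returns data X (n_samples x 2) and labels y in {1, 2}.
    """
    rng = np.random.default_rng(seed)
    std = np.sqrt(0.5)                      # variance 0.5 per dimension
    x1 = rng.normal(loc=3.00, scale=std, size=(n_per_class, 2))
    x2 = rng.normal(loc=1.85, scale=std, size=(n_per_class, 2))
    X = np.vstack([x1, x2])
    y = np.hstack([np.ones(n_per_class), 2 * np.ones(n_per_class)])
    return X, y

X_train, y_train = make_two_class_data()        # 220 training samples
X_test, y_test = make_two_class_data(seed=1)    # 110 samples per class
```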

Effect of outlier occurrence rate

This simulation generates datasets contaminated by different numbers of outliers. The outliers are generated from a Gaussian distribution with mean (13.65, 13.65) and variance (0.5, 0.5). The number of outliers is set to 0%, 3% and 8% of the sample number. In each outlier condition, the outliers are evenly assigned to the two classes, and the procedure is repeated 100 times. The mean accuracies are reported on the left side of Table 1. The corresponding mean effect of the outlier number on the hyperplane is shown visually in Fig. 1 for all the linear classifiers. As SRKDA constructs the hyperplane in the kernel space, which is essentially different from the original data space, its hyperplane is not presented in Fig. 1 for the different outlier occurrence rates.
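
The outlier injection at a given occurrence rate might be sketched as follows (a hedged illustration; the helper name and the tiling used for the even class assignment are our own choices):

```python
import numpy as np

def add_outliers(X, y, rate, mean=13.65, var=0.5, seed=0):
    """Append outliers drawn from N((mean, mean), var * I).

    `rate` is the outlier occurrence rate (e.g. 0.03 or 0.08) relative to
    the number of clean samples; the outliers are split evenly between
    the two classes.
    """
    rng = np.random.default_rng(seed)
    n_out = int(round(rate * len(X)))
    if n_out == 0:
        return X, y
    pts = rng.normal(loc=mean, scale=np.sqrt(var), size=(n_out, X.shape[1]))
    labels = np.tile([1, 2], n_out // 2 + 1)[:n_out]   # alternate class labels
    return np.vstack([X, pts]), np.hstack([y, labels])
```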

Table 1 Classification accuracy and the projection direction in different outlier conditions.
Figure 1

The effect of outliers on decision boundaries. (a) LDA. (b) RLDA. (c) HLDA. (d) NDA. (e) L1 GE. (f) SRDA. (g) LPP. (h) L1 LDA. (i) Lasso SRDA. (j) L1 SR.

Effect of outlier strength

This simulation generates datasets contaminated by outliers of different strengths. The outliers are generated from a Gaussian distribution with fixed variance (0.5, 0.5) and a mean varied from (6.00, 6.00) to (8.60, 8.60), so that the outlier strength (\(6.00\times \sqrt{2}\) ~ \(8.60\times \sqrt{2}\)) can be adjusted via the mean, aiming to reveal the effect of the outlier strength on the hyperplane. The number of outliers is fixed at 5% of the total sample number. The procedure is repeated 100 times, and the mean accuracies are reported on the right side of Table 1. The mean effect of the outlier strength on the hyperplane is also given in Fig. 1 for all the classifiers. As in the first simulation study, we do not report the corresponding hyperplane of SRKDA under the various outlier strengths.

Actual Dataset

Gene Datasets

In real applications, gene data are usually contaminated by outliers due to measurement errors and random fluctuations at various manufacturing stages. Thus, in the following experiments, we tested the classification performance of the eleven classifiers on two gene datasets: one is a colon cancer dataset35, and the other is a leukemia dataset36.

The first dataset, the colon cancer dataset, contains the expression levels of 2000 genes measured in 62 different samples, i.e., a 62 × 2000 data matrix. Each sample is labeled according to whether it was taken from a tumor biopsy: 40 samples are positive, and 22 samples are negative. In our experiment, a 5-fold CV is adopted and repeated 100 times. In each 5-fold CV run, approximately 49 samples are included in the training set, and the 2000-dimensional feature vector of each sample is reduced to 49 dimensions to serve as the input for classification, following previous studies37,38. The total time consumptions for the eleven classifiers to estimate the mapping information are 4.14 s, 1.08 s, 11.69 s, 35.30 s, 14.93 s, 0.90 s, 2.51 s, 0.91 s, 69.92 s, 2.17 s and 618.65 s. The mean accuracies over the 100 runs for the eleven classifiers are given in Fig. 2(a).
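
A sketch of the repeated 5-fold CV protocol is given below; the `fit_reduce` and `classify` callables are hypothetical placeholders standing in for any of the compared dimension-reduction methods and the subsequent classification step:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def repeated_cv_accuracy(X, y, fit_reduce, classify, n_repeats=100, seed=0):
    """Repeated 5-fold CV protocol used for the gene datasets (a sketch).

    `fit_reduce(X_tr, y_tr)` is assumed to return a function that maps samples
    into the reduced feature space; `classify(F_tr, y_tr, F_te)` is assumed to
    return predicted labels for the test-fold features.
    """
    accs = []
    for r in range(n_repeats):
        skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed + r)
        for tr, te in skf.split(X, y):
            transform = fit_reduce(X[tr], y[tr])
            pred = classify(transform(X[tr]), y[tr], transform(X[te]))
            accs.append(np.mean(pred == y[te]))
    return np.mean(accs)
```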

Figure 2

The gene dataset classification based on the eleven classifiers. (a) Colon cancer data. (b) Leukemia data.

The second dataset, the leukemia dataset, is a collection of expression measurements reported by Golub et al. (1999). The dataset contains 72 samples divided into two variants of leukemia: 25 samples of acute myeloid leukemia (AML) and 47 samples of acute lymphoblastic leukemia (ALL). The gene expression measurements were taken from 63 bone marrow samples and 9 peripheral blood samples and were obtained using high-density oligonucleotide microarrays; the expression levels of 7129 genes are reported. A 5-fold CV is used to evaluate the classification performance of the eleven classifiers on this dataset. In each 5-fold CV run, approximately 58 samples are included in the training set, and the 7129-dimensional feature vector of each sample is reduced to 58 dimensions to serve as the input for classification. The times required for all the classifiers to estimate the mapping information are 5.25 s, 1.18 s, 14.84 s, 38.94 s, 17.91 s, 0.97 s, 3.12 s, 0.96 s, 62.05 s, 1.93 s and 637.33 s. The corresponding recognition accuracies are shown in Fig. 2(b). Both Fig. 2(a) and (b) consistently indicate that L1 GE, L1 SR, L1 LDA and RLDA have markedly better recognition performance than the other seven classifiers.

BCI Dataset

EEG is often contaminated by noise from eye blinks or head movement, which largely influences the performance of EEG-based brain computer interfaces. In this section, we compare the performance of the eleven classifiers on two motor imagery datasets: Dataset IVa of BCI Competition III and a dataset recorded in our lab.

The first dataset consists of EEG signals recorded from five subjects using 118 electrodes. In each trial, a visual cue indicating the motor imagery task is shown for 3.5 s. Three types of imagery are cued, i.e., left hand, right hand and right foot. The imagery tasks of the right hand and the left hand are adopted to evaluate the algorithm performance in the current work. For these two tasks, the total number of EEG trials for each subject is 280, and the data are bandpass filtered between 0.05 and 200 Hz and down-sampled to 100 Hz. Following the results reported in11,39,40, the time interval between 0.5 s and 2.5 s after trial onset is selected for task recognition, and the training and testing trial numbers used in previous studies41 are also adopted in the current work.

The second dataset is from the MI BCI system developed in our group and consists of EEG data from 19 subjects. During the online experiment, the subjects were required to sit in a comfortable armchair in front of a computer screen and were asked to perform motor imagery with the left or right hand according to the instructions appearing on the screen. Motor imagery lasted for 5 s, followed by 5 s of rest. Fifteen Ag/AgCl electrodes covering the sensorimotor area were used to record the EEG, and the signals were sampled at 1000 Hz and bandpass filtered between 0.5 Hz and 45 Hz. Four runs were recorded for each subject on the same day, with each run consisting of 50 trials (i.e., 25 trials per class), and there was a 3-minute break between two consecutive runs. The time interval between 0.5 s and 5 s after trial onset was selected for task recognition. The 50 trials in the first run were used to train the classifiers, and the remaining 150 trials were used to test their performance. The experiment was approved by the Ethical Committee of the University of Electronic Science and Technology of China (UESTC). Informed consent was obtained from all participants in accordance with the Declaration of Helsinki.
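
For illustration, a minimal preprocessing sketch, assuming a continuous recording `eeg` of shape (channels, samples) and a list of trial-onset sample indices, could band-pass filter and epoch the data as follows (function names and the filter order are our own assumptions, not the exact pipeline used here):

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 1000  # sampling rate in Hz

def bandpass(eeg, low=0.5, high=45.0, fs=FS, order=4):
    """Zero-phase band-pass filter applied channel-wise (channels x samples)."""
    b, a = butter(order, [low / (fs / 2), high / (fs / 2)], btype="band")
    return filtfilt(b, a, eeg, axis=1)

def epoch(eeg, onsets, t_start=0.5, t_end=5.0, fs=FS):
    """Cut the 0.5-5 s post-onset window for every trial onset (in samples)."""
    s0, s1 = int(t_start * fs), int(t_end * fs)
    return np.stack([eeg[:, on + s0:on + s1] for on in onsets])
```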

For both datasets, the common spatial pattern (CSP)42 method is adopted to extract the MI rhythm-related features, and the 6 features corresponding to the 6 most discriminative CSP filters43 are used for task recognition. The corresponding classification accuracies for the 24 subjects are listed in Table 2, where L1 GE shows better performance than the other classifiers.
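
A compact sketch of one common CSP formulation (normalized trial covariances, a generalized eigenproblem, filters taken alternately from both ends of the spectrum, and log-variance features) is given below; this is our own illustration of the technique rather than the exact implementation of refs 42,43:

```python
import numpy as np
from scipy.linalg import eigh

def csp_filters(trials_a, trials_b, n_filters=6):
    """Return the n_filters most discriminative spatial filters (a sketch).

    trials_* : arrays of shape (n_trials, n_channels, n_samples), one per class.
    """
    def mean_cov(trials):
        covs = [t @ t.T / np.trace(t @ t.T) for t in trials]
        return np.mean(covs, axis=0)

    Ca, Cb = mean_cov(trials_a), mean_cov(trials_b)
    # Generalized eigenproblem Ca w = lambda (Ca + Cb) w
    vals, vecs = eigh(Ca, Ca + Cb)
    order = np.argsort(vals)                     # ascending eigenvalues
    picks = []
    for i in range(n_filters // 2):
        picks += [order[i], order[-(i + 1)]]     # take both spectrum extremes
    return vecs[:, picks].T                      # (n_filters, n_channels)

def csp_features(trial, W):
    """Log-variance features of the spatially filtered trial."""
    Z = W @ trial
    var = Z.var(axis=1)
    return np.log(var / var.sum())
```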

Table 2 Classification accuracy for the two BCI datasets.

UCI Dataset

To evaluate the generalization performance of L1 GE in other applications, we use nine binary-class datasets from the UCI machine learning repository; details of these nine datasets can be found in44. Specifically, five of these nine datasets are from a clinical environment, i.e., the heart disease, breast cancer, liver disorder, SPECT and thrombin data, while the remaining four datasets are from other applications.

For the purpose of our experiments, all of the features in each dataset are used for classification. We adopt the 5-fold CV proposed in17 for comparison, and the mean accuracies over 100 repetitions of the CV are reported in Table 3 for the eleven classification approaches. As shown in Table 3, L1 GE performs best on 5 of the 9 tested datasets, giving the highest accuracy (84%) and an improvement of approximately 1~5% compared to the other approaches.

Table 3 The classification for the UCI dataset.

Discussion

The simulation study quantitatively evaluates the possible influence of outliers on the performance of GE extensions when they are applied to pattern recognition. In the first experiment, we studied the influence of the ratio between samples and outliers on the hyperplane estimated with the eleven classifiers. As shown on the left side of Table 1, when the outlier occurrence rate is increased from 0% to 8%, the performances of the ten linear classifiers are all lowered. However, in terms of classification accuracy, L1 GE and the other L1 norm-based classifiers consistently show better performance than the L2 norm-based GE extensions. In essence, the accuracy difference among the eleven classifiers is determined by the hyperplane estimated for classification, and the hyperplane angles in Table 1 also confirm that the hyperplanes estimated in the L1 norm space are less influenced by the introduced outliers and that their angles are closest to the theoretical 135 degrees. Specifically, Fig. 1 visually reveals the different effects of the outlier occurrence rate on the eleven classifiers, where the hyperplanes of all the classifiers are deflected under noise conditions. However, the hyperplanes estimated in the L1 norm space are more robust to outlier influence, with less deflection than the other six linear classifiers. In the non-outlier case (i.e., 0% occurrence rate), the eleven classifiers, except for SRKDA, can find boundaries that discriminate the two classes well. However, if a dataset is contaminated with outliers, the corresponding hyperplanes estimated in the L2 norm space are obviously biased toward the outliers, resulting in the misclassification of some samples. Compared to the classifiers estimated in the L2 norm space, although L1 GE and the other L1 norm-based classifiers are also influenced by outliers, their hyperplanes can still provide good discrimination ability (i.e., remain close to the diagonal 135-degree direction) for the two classes.

In the second experiment, we studied the influence of various outlier strengths on the classification hyperplane. Similar to the results of the first experiment, all the linear classifiers used in this experiment were influenced by the outlier strength. As shown on the right side of Table 1, when the outlier strength varied from \(6.00\times \sqrt{2}\) to \(8.60\times \sqrt{2}\), their performances decreased. However, in terms of classification accuracy, the L1 norm-based classifiers consistently show better performance than the traditional GE extensions. Similar to the results on the left side, the hyperplane angles on the right side of Table 1 confirm that the L1 norm-based classifiers are robust toward the influence of the outlier strength and that the angles of their estimated hyperplanes are closest to the theoretical value of 135 degrees. Specifically, the hyperplanes estimated under the different outlier-strength conditions in Fig. 1 also reveal the robustness of the L1 norm-based classifiers. Another point to note is that even a small outlier, e.g., \(6.00\times \sqrt{2}\), will lead to a biased hyperplane estimated by the L2 norm-based GE extensions. Compared to the traditional GE extensions, although the L1 norm-based classifiers are also influenced by the outliers, their hyperplanes can still provide good discrimination ability (i.e., remain close to the diagonal 135-degree direction) for the two classes.

Considering the results in Fig. 1, we can see that the hyperplanes estimated by HLDA under outlier conditions deviate much more from the ideal boundary than those of the other nine classifiers, which accounts for the relatively lower accuracies of this classifier in the simulation studies. Among the existing LDA variants, HLDA aims to extend LDA to problems with distributions more complex than the homoscedastic case with equal variance29,30,45. The motivation of HLDA is to address heteroscedastic situations in which the classes may differ in both variance and mean, which means that this method needs to estimate both parameters from the training samples29. However, when the training samples are contaminated by outliers, both the mean and variance are estimated inaccurately to fit a false sample distribution, which ultimately induces a biased hyperplane. Similar to HLDA, NDA is also designed to address problems with unequal variance distributions. For NDA, the critical step is to estimate the distance between the samples in one class and their nearest neighbor centers from another class30. When the training samples are contaminated with outliers, the distance between the noisy samples and their neighbor centers is usually larger than that of the other, original samples, and it plays a dominant role in the between-class scatter estimation. Consequently, a larger number of outliers leads to a more biased hyperplane estimated by NDA. RLDA was developed to address the possible singularity of the covariance matrix31, and SRDA combined with regularization operators can decrease the time and memory consumption required by the eigendecomposition of dense matrices. In this simulation study, the covariance matrix has relatively larger coefficients along the diagonal, indicating that it is non-singular. Therefore, when small regularization parameters (α = 1.0 or α = 0.70) are added to the diagonal coefficients in RLDA and SRDA, they have little effect on the covariance properties, resulting in the similar results observed for LDA, RLDA and SRDA in the simulation study. Nevertheless, the effect of the different regularization strategies used in RLDA and SRDA on the classification performance can be revealed by the actual datasets, where the covariance matrix may be close to singular. As SRKDA estimates hyperplanes in the kernel space, whose dimension is usually greater than 2, they cannot be illustrated in Fig. 1 like those of the other linear GE extensions. However, Table 1 shows that SRKDA behaves similarly in resisting outlier influence. In essence, SRKDA and SRDA share the same parameter estimation strategy, which indicates that when the outlier influence cannot be restricted by the kernel space, SRKDA will also be influenced. It is worth noting that the Gaussian kernel seems to be less sensitive than the other LS-based GE extensions as both the strength and occurrence rate of the outliers increase, which may be attributed to the Gaussian space used for kernel construction. In fact, the kernel space reflects the interactions between different samples. For a Gaussian kernel, the weights of the outliers tend to be small values. When the outlier ratio is small, it will not influence the real neighbor relationships in the kernel space. However, when the outlier ratio increases, more kernel components will tend toward small values, which will greatly influence the stability of the kernel matrix.
In addition, the types and parameters of the kernels will also have an effect on the performance of related methods. Thus, it might be cumbersome to find proper kernels and the corresponding parameters for different applications. LPP is proposed to yield a nearest neighbor structure in the low-dimensional space similar to that in the high-dimensional space by preserving the local structure32. When the outlier strength is small, it has little influence on the main neighbor structure, resulting in a relatively robust hyperplane, as illustrated by the red line in Fig. 1(g). However, when the outliers are stronger, the neighbor structure measured by the Euclidean distance is destroyed, and biased mapping information is estimated, as illustrated in Fig. 1(g). Evidently, these linear GE extensions were not derived with the outlier problem in mind. As a result, they are all sensitive to the outliers in our simulation studies. It is worth noting that both the L1 norm-based classifiers and SRKDA consistently show a good ability to suppress the outlier effect. Compared with SRKDA, L1 GE, L1 LDA and L1 SR are nonparametric methods, which indicates their convenience in real applications.

Compared with the L2 norm-based GE extensions, all the L1 norm-based classifiers mentioned in this work performed better, as illustrated in Fig. 1(e),(h–j), which can be attributed to the robustness of the L1 norm space to outlier influence. In addition to the better performance, we also observe that differences in the objective function influence the hyperplanes, although the four L1 norm classifiers mentioned here estimate their parameters in the same norm space. From Fig. 1, we can see that the hyperplanes estimated by L1 SR and Lasso SRDA are almost the same, as shown in Fig. 1(i) and (j), which may be attributed to the spectral regression model utilized in their objective functions. Meanwhile, the hyperplanes estimated by L1 GE and L1 LDA also show some similarity in the different outlier conditions, as shown in Fig. 1(e) and (h). In fact, with the coding matrix H defined for supervised learning in the current work, (11) can be transformed into an L1 norm-based DA structure that is similar to L1 LDA, and this similarity can account for their close performance as outlier restriction methods in the simulation study. Although L1 GE and L1 LDA have a similar DA structure under supervised conditions, there are some differences in the definition of the scatter matrix in the DA denominator, which accounts for their different classification results when applied to the actual datasets. In essence, our proposed L1 GE can be flexibly transformed into other L1 norm-based GE extensions (i.e., L1 LDA, L1 LPP, L1 PCA, etc.) with a variety of coding matrices H. We will explore this in future work.

In biomedical engineering, gene data are usually measured for cancer diagnosis. However, gene expression datasets usually have a large number of variables but only a small number of samples46, which requires robust classifiers for the clinical diagnosis of diseases. RLDA is one of the popular methods for such situations20,28, and the classification results on these two datasets demonstrate the efficiency of RLDA for gene recognition. For gene datasets, specific problems are encountered, such as information redundancy and a low signal-to-noise ratio (i.e., strong noise artifacts)36,37,47. The comparison on the two gene datasets confirms that L1 GE may also have a stable ability to address these problems and has the performance closest to that of RLDA. On the gene datasets, we also observe that Lasso SRDA performs worse than the other L1 norm-based classifiers. This may be due to the objective function used in Lasso SRDA. In essence, the distance measurement of Lasso SRDA is still defined in the L2 norm space, although an L1 norm constraint is imposed on the parameters, which may indicate that when a dataset suffers from the curse of dimensionality and strong noise artifacts, Lasso SRDA will be influenced regardless of how strongly the parameters are constrained.

In BCI applications, artifacts are the main cause of reduced signal quality. Artifacts can be caused by various factors, such as electrode wire movement during signal recording and measurement errors in biochemistry. These artifacts usually have a strength several orders of magnitude larger than the signal of interest and appear as spike-like waveforms of very short duration. Such artifacts can usually be treated as outliers and must be eliminated before further study. In studies based on biomedical signals, the easiest way is to reject the contaminated samples. However, this is not preferable because it may lead to the loss of other useful information. Other favorable methods, such as BSS, can also be applied for artifact rejection. However, it is difficult to determine the extent to which the samples may be contaminated by outliers, and it is also not trivial to eliminate the artifact components with BSS methods. In addition, there are few reports on the application of graph embedding analysis to scalp EEG BCIs. To investigate the feasibility of L1 GE in EEG BCI applications, we evaluated its performance on two independent BCI datasets (24 subjects in total). Table 2 shows that 9 of the 24 subjects obtain their highest accuracy when L1 GE is used, which is the highest ratio among the eleven classifiers. Moreover, L1 GE shows a 1~9% accuracy improvement compared to the other ten classifiers. The results in Table 2 show the potential of L1 GE for actual BCI applications.

The presence of outliers is encountered not only in biomedical signal recordings or gene expression data but also in a variety of other clinical applications. To evaluate the generalization performance of L1 GE, we used 9 UCI datasets, 5 of which come from clinical applications. The results for the UCI datasets presented in Table 3 reveal that L1 GE has the highest recognition accuracy (84%). Table 3 also demonstrates that no classifier consistently shows the best performance across the 9 different UCI datasets and that L1 GE achieves the best performance on 5 of the 9 datasets, thus outperforming the ten other classifiers. In contrast to the conventional L2 norm-based GE and its extensions, the L1 norm-based classifiers use an iterative procedure, which results in relatively higher complexity and accounts for their greater time consumption.

The applications to various datasets demonstrate that the L1 norm-based GE extensions are robust in dealing with artifact-influenced classification. Compared to conventional graph embedding based on the L2 norm structure, the developed L1 norm GE maximizes the dispersion in the L1 norm space, which provides a more powerful ability to suppress artifact effects. In this work, we mainly evaluated the performance of L1 GE in supervised learning. For unsupervised learning, one crucial step is to estimate the symmetric factor H, which holds the structural information of the similarity matrix W48,49. In essence, our proposed L1 GE can be flexibly transformed into other L1 norm-based GE extensions with a variety of coding matrices H. In future work, we will explore the efficiency of L1 GE when it is transformed into other L1 norm-based GE extensions, and we will further analyze its efficiency in unsupervised and semi-supervised learning.

Methods

Graph embedding

Define \(m\) samples \(X={\{{x}_{i},{x}_{i}\in {R}^{n\times 1}\}}_{i=1}^{m}\subset {R}^{n\times m}\) from \(C\) classes, where n denotes the sample dimension. In general graph embedding, each sample point is treated as a vertex of an adjacency graph \(G\in {R}^{m\times m}\), and the corresponding edges of G represent the statistical relationship between each pair of sample points. The aim of graph embedding is to represent each vertex of G in a lower-dimensional space while preserving the original edge information between vertex pairs. Essentially, graph embedding estimates the response vector \(y\in {R}^{m\times 1}\) that maximizes the following function:

$$J(y)=\frac{{y}^{T}Wy}{{y}^{T}Dy}\,\,$$
(1)

where T denotes the transpose and \(W\in {R}^{m\times m}\) is a sparse symmetric matrix reflecting the weight of the joining edge between vertices i and j as

$${W}_{ij}=\{\begin{array}{c}1/{m}_{k},\,\,\,{\rm{if}}\,{x}_{i}\,{\rm{and}}\,{{\rm{x}}}_{j}\,{\rm{both}}\,\\ \,\,\,\,\,\,\,{\rm{belong}}\,{\rm{to}}\,{\rm{the}}\,k-\mathrm{th}\,\mathrm{class};\\ 0,\,\,\,\,\,\,\,\,{\rm{otherwise}}.\end{array}$$
(2)

and \({m}_{k}\) is the sample number of the k-th class. D is a diagonal matrix whose entries are the column (or row) sums of W31. Note that the scaling of the projection y has no effect on the objective value. Thus, maximizing J(y) is equivalent to the following constrained optimization problem:

$$\{\begin{array}{c}\mathop{\text{arg}\,\max }\limits_{y}\,\,\,{y}^{T}Wy\\ subject\,to\,{y}^{T}Dy=1\end{array}\,\,\,\,$$
(3)

By introducing the Lagrange multiplier, the objective function can be rewritten as

$$L(y;\lambda )={y}^{T}Wy-\lambda ({y}^{T}Dy-1).$$
(4)

Taking the derivative of (4) with respect to y and setting \(\partial L/\partial y=0\), the response vector y can be estimated from the generalized eigenvalue equation

$$Wy=\lambda Dy\,$$
(5)

where \(\lambda \) denotes the eigenvalue of the generalized eigenproblem, and y is the corresponding eigenvector. For multiple response vectors, the above equation (5) can be solved as

$${D}^{-1}WY={\rm{\Sigma }}Y$$
(6)

where Y is the matrix consisting of the eigenvectors of \({D}^{-1}W\), and \({\rm{\Sigma }}=diag({\lambda }_{1},{\lambda }_{2},\mathrm{...},{\lambda }_{m})\) is a diagonal matrix consisting of the eigenvalues of \({D}^{-1}W\). For classification purposes, only the \((C-1)\) eigenvectors corresponding to the largest \((C-1)\) eigenvalues are used. However, the response vectors \({\{{y}_{i}\}}_{i=1}^{C-1}\subset {R}^{m\times (C-1)}\) inferred from (6) only provide mapping information for the training set. To extend the mapping information to testing samples, a simple way is to estimate a projection between the response vector and the sample points. By replacing y with \({X}^{T}\alpha \), the objective function in (1) can be rewritten as

$$\mathop{\text{arg}\,\max \,}\limits_{\alpha }\,J(\alpha )=\mathop{\text{arg}\,\max }\limits_{\alpha }\frac{{\alpha }^{T}XW{X}^{T}\alpha }{{\alpha }^{T}XD{X}^{T}\alpha }$$
(7)

where \(\alpha \in {R}^{n\times 1}\) is the mapping projection between the defined graph and the samples. Therefore, the optimal solution of equation (7) is a mapping \(\alpha \) that transforms the samples X to Y while preserving the manifold structure defined in W as much as possible. More details about graph embedding can be found in Appendix B.
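
A minimal sketch of this supervised graph embedding, building W from Eq. (2), D from the row sums of W, and solving the generalized eigenproblem implied by Eq. (7), is given below (the small ridge term is our own addition for numerical stability; for C classes only C - 1 components would be kept):

```python
import numpy as np
from scipy.linalg import eig

def supervised_weight_matrix(labels):
    """W from Eq. (2): W_ij = 1/m_k if samples i and j share class k, else 0."""
    labels = np.asarray(labels)
    W = np.zeros((len(labels), len(labels)))
    for k in np.unique(labels):
        idx = np.where(labels == k)[0]
        W[np.ix_(idx, idx)] = 1.0 / len(idx)
    return W

def graph_embedding_projections(X, labels, n_components):
    """L2 graph embedding sketch: solve (X W X^T) a = lambda (X D X^T) a.

    X : (n_features, n_samples); returns projections (n_features, n_components).
    """
    W = supervised_weight_matrix(labels)
    D = np.diag(W.sum(axis=1))
    Sw = X @ W @ X.T
    Sd = X @ D @ X.T + 1e-6 * np.eye(X.shape[0])   # ridge for numerical stability
    vals, vecs = eig(Sw, Sd)
    order = np.argsort(-vals.real)                 # largest eigenvalues first
    return vecs[:, order[:n_components]].real
```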

L1 Graph embedding

Noting that W is a symmetric matrix and D is a diagonal matrix, equation (7) can be reformulated as

$$\begin{array}{c}\mathop{\text{arg}\,\max }\limits_{\alpha }\,J(\alpha )=\mathop{\text{arg}\,\max }\limits_{\alpha }\frac{{\alpha }^{T}XW{X}^{T}\alpha }{{\alpha }^{T}XD{X}^{T}\alpha }\\ \,\,\,\,\,\,\,\,\,\,\,=\,\mathop{\text{arg}\,\max }\limits_{\alpha }\frac{{\alpha }^{T}XH{H}^{T}{X}^{T}\alpha }{{\alpha }^{T}X\sqrt{D}{\sqrt{D}}^{T}{X}^{T}\alpha }=\frac{{\Vert {\alpha }^{T}XH\Vert }_{2}^{2}}{{\Vert {\alpha }^{T}X\sqrt{D}\Vert }_{2}^{2}}\end{array}$$
(8)

where \({\Vert \bullet \Vert }_{2}\) denotes the L2 norm, \(W=H{H}^{T}\) and \({D}_{ii}={\sum }_{j=1}^{m}{W}_{ij}\). Because W is a symmetric matrix, there are several effective strategies for implementing the symmetric factorization, so H can be estimated easily48,49. As revealed in equation (8), graph embedding is essentially derived from the L2 norm structure. However, the L2 norm has been proven to be prone to the presence of outliers17, which indicates that in practical applications, outliers will have an unexpected effect on related analyses such as signal processing and feature extraction. To improve the robustness of the parameter estimation in this framework, schemes such as a sparse constraint with the L1 norm have been proposed to alleviate the outlier effect50,51. However, most of these schemes focus on imposing restrictions on the parameters but still leave the main structure of the objective function in the L2 norm space, which will inevitably exaggerate the outlier effect regardless of how strongly the parameters are constrained. Motivated by the merit of the L1 norm in suppressing the outlier effect, we propose estimating the mapping denoted in (7) and (8) in the L1 norm space instead of the L2 norm as

$$\mathop{\text{arg}\,\max \,}\limits_{\alpha }\,\tilde{J}(\alpha )=\mathop{\text{arg}\,\max }\limits_{\alpha }\frac{{\Vert {\alpha }^{T}XH\Vert }_{1}}{{\Vert {\alpha }^{T}X\sqrt{D}\Vert }_{1}}$$
(9)

where \({\Vert \bullet \Vert }_{1}\) denotes the L1 norm. We refer to (9) as L1 graph embedding (L1 GE). For pattern recognition, when all the training samples are labeled, we have \({\sum }_{k=1}^{C}{m}_{k}=m\). Suppose that all of the data points in X are ordered according to their labels as \(X=[{X}^{(1)},{X}^{(2)},\mathrm{...},{X}^{(C)}]\), where \({X}^{(k)}=[{x}_{1}^{(k)},{x}_{2}^{(k)},\mathrm{...},{x}_{{m}_{k}}^{(k)}]\) with \({x}_{i}^{(k)}\) being the ith feature vector of the kth class; then, the weight matrix W can be defined as a C-block diagonal matrix31, with each block being a symmetric matrix:

$$W=[\begin{array}{cccc}{W}^{(1)} & 0 & 0 & 0\\ 0 & {W}^{(2)} & 0 & 0\\ 0 & 0 & \ddots & 0\\ 0 & 0 & 0 & {W}^{(C)}\end{array}]$$
(10)

where \({W}^{(k)}\) is an \({m}_{k}\times {m}_{k}\) matrix with all of its elements equal to \(1/{m}_{k}\), and \(D\in {R}^{m\times m}\) is the identity matrix. Letting \(\tilde{X}=X-\bar{X}\), we can rewrite (9) as

$$\mathop{\text{arg}\,\max }\limits_{\alpha }\,\tilde{J}(\alpha )=\mathop{\text{arg}\,\max }\limits_{\alpha }\frac{{\Vert {\alpha }^{T}{{\rm{\Phi }}}_{b}\Vert }_{1}}{{\Vert {\alpha }^{T}\tilde{X}\Vert }_{1}}$$
(11)

where \(\bar{X}\) is the mean of \(X\), and \({{\rm{\Phi }}}_{b}=\tilde{X}H\). \(H\in {R}^{m\times c}\) is a coding matrix whose columns indicate the class types and whose rows correspond to the samples. The values in each column of H are defined as

$${H}_{k}={[\mathop{\underbrace{0,\mathrm{...},0}}\limits_{{\sum }_{i=1}^{k-1}{m}_{i}},\mathop{\underbrace{1/\sqrt{{m}_{k}},\mathrm{...},1/\sqrt{{m}_{k}}}}\limits_{{m}_{k}},\mathop{\underbrace{0,\mathrm{...},0}}\limits_{{\sum }_{i=k+1}^{c}{m}_{i}}]}^{T}.$$
(12)
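
For illustration, the coding matrix H of Eq. (12) can be constructed directly from the labels; in this sketch the samples need not be pre-ordered by class, and H H^T still reproduces the supervised weight matrix W of Eq. (2):

```python
import numpy as np

def coding_matrix(labels):
    """H from Eq. (12): one column per class, with entries 1/sqrt(m_k) on the
    rows belonging to class k and zeros elsewhere."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    H = np.zeros((len(labels), len(classes)))
    for j, k in enumerate(classes):
        idx = labels == k
        H[idx, j] = 1.0 / np.sqrt(idx.sum())
    return H
```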

In essence, by using a logarithmic transformation, equation (11) can be reformulated as

$$\mathop{\text{arg}\,\max }\limits_{\alpha }\,\tilde{J}(\alpha )=\mathop{\text{arg}\,\max }\limits_{\alpha }[\,\mathrm{ln}({\Vert {\alpha }^{T}{{\rm{\Phi }}}_{b}\Vert }_{1})-\,\mathrm{ln}({\Vert {\alpha }^{T}\tilde{X}\Vert }_{1})].$$
(13)

By introducing the sign functions R and Q, we define the iterative direction for (13) as

$$d(\alpha (t))=\frac{\sum _{i=1}^{C}{R}_{i}(t){{\rm{\Phi }}}_{b}(:,i)}{{\Vert \alpha {(t)}^{T}{{\rm{\Phi }}}_{b}\Vert }_{1}}-\frac{\sum _{j=1}^{m}{Q}_{j}(t)\tilde{X}(:,j)}{{\Vert \alpha {(t)}^{T}\tilde{X}\Vert }_{1}},$$
(14)

where R(t) and Q(t) are defined as

$$\begin{array}{c}{R}_{i}(t)=\{\begin{array}{c}1,\,\alpha {(t)}^{T}{{\rm{\Phi }}}_{b}(:,i)\ge 0\\ -1,\alpha {(t)}^{T}{{\rm{\Phi }}}_{b}(:,i) < 0\end{array}\\ {Q}_{j}(t)=\{\begin{array}{c}1,\,\alpha {(t)}^{T}\tilde{X}(:,j)\ge 0\\ -1,\alpha {(t)}^{T}\tilde{X}(:,j) < 0,\end{array}\end{array}$$
(15)

In the above equations, \({{\rm{\Phi }}}_{b}(:,i)\) and \(\tilde{X}(:,j)\) denote the ith column of \({{\rm{\Phi }}}_{b}\) and the jth column of \(\tilde{X}\), respectively.
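
The iterative direction of Eqs. (14)-(15) translates almost directly into code; the following sketch (variable names are our own) assumes alpha as a length-n vector, Phi_b of shape (n, C) and X_tilde of shape (n, m):

```python
import numpy as np

def l1ge_direction(alpha, Phi_b, X_tilde):
    """Ascent direction d(alpha(t)) from Eqs. (14)-(15).

    Phi_b   : (n_features, C)  -- between-class term, Phi_b = X_tilde @ H
    X_tilde : (n_features, m)  -- centered samples
    """
    pb = alpha @ Phi_b                      # alpha^T Phi_b, one value per class
    px = alpha @ X_tilde                    # alpha^T X_tilde, one value per sample
    R = np.where(pb >= 0, 1.0, -1.0)        # sign terms R_i(t), Eq. (15)
    Q = np.where(px >= 0, 1.0, -1.0)        # sign terms Q_j(t), Eq. (15)
    num = Phi_b @ R / np.abs(pb).sum()      # first term of Eq. (14)
    den = X_tilde @ Q / np.abs(px).sum()    # second term of Eq. (14)
    return num - den
```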

Based on the gradient in (14), the objective function in (11) can be solved by the iterations below:

  (1) Initialization. Set t = 0. Set the stopping tolerances \({\varepsilon }_{1}\) and \({\varepsilon }_{2}\) to small positive numbers. Initialize \(\alpha (0)\) with n random numbers and normalize it as \(\alpha (0)=\alpha (0)/{\Vert \alpha (0)\Vert }_{2}\).

  (2) Updating. Update the projection vector as \(\alpha (t+1)=\alpha (t)+\eta d(\alpha (t))\), where \(\eta \) is a small number accounting for the learning rate of the iteration.

  (3) Convergence criterion. If \(|J(\alpha (t+1))-J(\alpha (t))| < {\varepsilon }_{1}\) or \({\Vert \alpha (t+1)-\alpha (t)\Vert }_{2} < {\varepsilon }_{2}\), set \(\alpha =\alpha (t+1)/{\Vert \alpha (t+1)\Vert }_{2}\) and stop the iteration; otherwise, set t = t + 1 and go to Step (2).

Using the above steps, the optimal \(\alpha \) can be estimated. The computational complexity of the L1 GE parameter estimation for a single iteration is \(O(n\times (2m+2c+mc))\).
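
Putting Steps (1)-(3) together with the direction function sketched after Eq. (15) gives the following hedged implementation of the L1 GE iteration; the learning rate, tolerances and maximum iteration count are illustrative defaults rather than values prescribed by the text:

```python
import numpy as np

def l1_graph_embedding(Phi_b, X_tilde, eta=0.01, eps1=1e-6, eps2=1e-6,
                       max_iter=1000, seed=0):
    """Iterative estimation of the L1 GE projection alpha (Steps 1-3), a sketch.

    Maximizes J(alpha) = ln||alpha^T Phi_b||_1 - ln||alpha^T X_tilde||_1
    by gradient ascent using l1ge_direction (defined in the earlier sketch).
    """
    rng = np.random.default_rng(seed)
    n = Phi_b.shape[0]
    alpha = rng.normal(size=n)
    alpha /= np.linalg.norm(alpha)                       # Step (1)

    def J(a):
        return (np.log(np.abs(a @ Phi_b).sum())
                - np.log(np.abs(a @ X_tilde).sum()))

    for _ in range(max_iter):
        new_alpha = alpha + eta * l1ge_direction(alpha, Phi_b, X_tilde)  # Step (2)
        if (abs(J(new_alpha) - J(alpha)) < eps1
                or np.linalg.norm(new_alpha - alpha) < eps2):            # Step (3)
            alpha = new_alpha
            break
        alpha = new_alpha
    return alpha / np.linalg.norm(alpha)
```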

With the gradient in equation (14) and the above iteration steps, the proposed objective function J(α(t)) is non-decreasing at each iteration step t, as proven in Appendix A.