Introduction

With the rapid growth of image data in real-world computer vision, feature extraction has become a central research topic in image classification. The various changes of lighting condition, background, and viewpoint make the task more challenging [1]. Meanwhile, singling out efficient and robust representative features to deal with such variations is of great importance to image classification, in which feature extraction plays a crucial role.

According to the means of extraction, feature extraction methods can be divided into those producing handcrafted features and those producing automatically learned features. Among the former, the most successful methods are the scale invariant feature transform (SIFT) [2] and the histogram of oriented gradients (HOG) [3]. They have usually been combined with geometric and statistical methods for specific tasks, and their limitations have been exposed in different real-world applications. On the other hand, although the latter methods, such as sparse representation and deep learning, have been widely used in the past decade and have achieved dramatic progress in real-world computer vision applications [4], they still face some difficulties. For sparse representation methods, the performance of dictionary learning directly influences the sparse coding vectors. Deep learning approaches [5,6,7] have always suffered from complicated parameter tuning and local minima.

Although sparse representation is susceptible to dictionary learning performance [8], it is very efficient in feature learning when no prior information is given, and it has proven particularly robust in solving image processing problems owing to two incomparable merits [9,10,11]: (1) sparse coding can model the receptive fields of cells in the visual cortex; (2) the rich representation can recover the subspace of image patches, leading naturally to sparse representations.

Among all the essential components of high-performing deep learning approaches, feedforward neural networks (FNNs) [12], especially convolutional neural networks (CNNs) [13] and neural response (NR) [14], have achieved excellent performance in various real-world tasks such as face recognition, object tracking, and speech recognition [15], based on the multilayer perceptron. During parameter optimization and tuning, existing FNNs rely heavily on the backpropagation algorithm and always suffer from slow convergence [16]. Although the tuning time is reduced to some extent, CNNs still require a tedious weight and bias optimization process, which prolongs the run time and aggravates the local minimum phenomenon. The performance of NR methods is influenced by the design and selection of the template.

To overcome these unavoidable drawbacks of sparse representation and deep learning methods, Huang et al. [17] proposed the extreme learning machine (ELM), in which the hidden layer neurons are generated randomly without a tuning process and the output weights can be determined analytically. The ELM is widely used in various computer vision applications such as object recognition and image classification. Typically, the original ELM has a single-layer structure and can be extended to a multilayer framework called the multi-layer ELM (ML-ELM) [18] to improve its generalization performance. The ML-ELM can be constructed by stacking ELM-based autoencoders, but it loses the full universal approximation merit of ELM. Huang et al. [19] proposed the local receptive fields-based ELM to address the universal approximation problem, but it has only one feature mapping layer and one pooling layer, which fail to extract sufficiently representative features. So further research is expected to focus on improving classification performance and learning efficiency.

In the learning process of ELM, the output weights are crucial and are normally computed with the Moore–Penrose generalized inverse by minimizing the classification error on the training data. Recent studies have shown that this method tends to fail for certain data distributions. Several studies have tried to address this problem in real-world applications using different techniques. Huang et al. [20] used the \({l}_{2}\) norm (ridge regression) regularization to address the minimization problem. Tang et al. [21] used the \({l}_{1}\) norm (Lasso) regularization to derive sparse solutions that restrict the output weights. However, in real applications the hidden nodes usually outnumber the labeled data, so the traditional Lasso method can fail to realize group selection.

In summarizing the existing learned feature extraction methods, it is found that they all have drawbacks and suffer from either low classification performance or high time consumption. In this paper, inspired by sparse representation and the ELM, we propose a novel nonlinear feature extraction approach called weighted extreme learning machine exponential regularized discriminative dictionary learning (WELM-ERDDL). The proposed WELM-ERDDL consists of two stages: the WELM-ERDDL feature mapping stage and the WELM learning stage. In the feature mapping stage, the WELM is embedded with exponential regularized discriminative dictionary learning via exponential regularized linear discriminative analysis (ERLDA) and sparse coding, so the input features can be transformed via nonlinear feature mapping. In the WELM learning stage, elastic net regularization combining the \({l}_{1}\) norm and the \({l}_{2}\) norm is used to update the output weights in order to obtain more compact and meaningful features. Finally, a flexible weight update criterion is designed for the WELM.

Overall, the main contributions of this paper are outlined as follows:

  1. A nonlinear WELM-embedded feature projection strategy via exponential regularized discriminative dictionary learning is given to achieve feature diversity at low computational cost.

  2. During the ELM learning stage, the output weights are updated through elastic net regularization to enhance their compactness and meaningfulness.

  3. An effective adaptive online weight update criterion is designed for the WELM.

The rest of this paper is organized as follows: section “Preliminaries” provides some prior knowledge related to this paper. In section “The proposed WELM-ERDDL”, the proposed WELM-ERDDL feature extraction framework is given in detail. In section “The ELM Learning”, the efficient ELM learning process is discussed. In section “Classification with the WELM-ERDDL”, the classification scheme with the proposed WELM-ERDDL is presented. In section “Experimental results and analysis”, the experimental results are shown and analyzed. Finally, in section “Conclusion”, conclusions are drawn and some potential directions for future work are indicated.

Preliminaries

In this section, a brief introduction to some preliminaries is presented, including the concepts of the extreme learning machine (ELM) and the weighted extreme learning machine (WELM), dictionary discriminative learning (DDL), and elastic net regularization.

Extreme learning machine (ELM)

Suppose the training set \({\left\{{x}_{i},{t}_{i}\right\}}_{i=1}^{N}\) is composed of \(N\) training samples, where the input \({x}_{i}\) has dimension \(d\) and \({t}_{i}\) is the corresponding output label. Then the output of the ELM is [22]:

$$ \mathop \sum \limits_{j = 1}^{L} \beta_{j} h_{j}\left( {x_{i} } \right) = \mathop \sum \limits_{j = 1}^{L} \beta_{j} g\left( {x_{i} \cdot w_{j} + b_{j} } \right) = t_{i} , $$
(1)

where the parameter \(w_{j} = \left[ { w_{j1} , w_{j2} , \ldots , w_{jd} } \right]\) is the input weight vector of the \(j{\text{th}}\) hidden node; \(b_{j}\) is the bias of the \(j{\text{th}}\) hidden node; \(\beta_{j}\) is the output weight of the \(j{\text{th}}\) hidden node. Equation (1) can be simplified into

$$H\beta =T,$$
(2)

where \(H\) is the hidden layer output matrix:

$$H=\left[\begin{array}{c}h\left({x}_{1}\right)\\ \vdots \\ h\left({x}_{N}\right)\end{array}\right]=\left[\begin{array}{ccc}g\left({x}_{1}\cdot {w}_{1}+{b}_{1}\right)& \dots & g\left({x}_{1}\cdot {w}_{L}+{b}_{L}\right)\\ \vdots & \ddots & \vdots \\ g\left({x}_{N}\cdot {w}_{1}+{b}_{1}\right)& \dots & g\left({x}_{N}\cdot {w}_{L}+{b}_{L}\right)\end{array}\right].$$
(3)

To improve the generalization ability of the ELM, a penalty factor \(C\) is introduced, and the output weight matrix \(\beta \) is

$$ \beta = H^{T} \left( {\frac{I}{C} + HH^{T} } \right)^{ - 1} T. $$
(4)

ELM aims to minimize the training error and the \({l}_{2}\) norm of the output weights, namely

$$\mathrm{min}:{L}_{\mathrm{ELM}}=\frac{1}{2}{\Vert \beta \Vert }^{2}+\frac{C}{2}\sum \limits_{i=1}^{N}{\Vert {\xi }_{i}\Vert }^{2}.$$
(5)

Then the output of the extreme learning machine can be expressed as

$$ f(x) = h(x)\beta = h(x)H^{T} \left( {\frac{I}{C} + HH^{T} } \right)^{ - 1} T. $$
(6)
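For clarity, Eqs. (3), (4), and (6) can be sketched in a few lines of NumPy. This is only an illustrative sketch that assumes a sigmoid activation \(g\); the function and variable names are ours and not taken from any published implementation.

```python
import numpy as np

def elm_train(X, T, L=200, C=1.0, seed=0):
    """Minimal regularized ELM following Eqs. (3)-(4).

    X: (N, d) inputs; T: (N, c) one-hot targets;
    L: number of hidden nodes; C: penalty factor.
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    W = rng.standard_normal((d, L))            # random input weights w_j
    b = rng.standard_normal(L)                 # random biases b_j
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))     # hidden layer output matrix, Eq. (3)
    # beta = H^T (I/C + H H^T)^{-1} T, Eq. (4)
    beta = H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, T)
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Network output f(x) = h(x) beta, Eq. (6)."""
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta
```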

Weighted extreme learning machine (WELM)

The main idea of the WELM is to assign different penalties to different classes, and it can be viewed as a cost-sensitive version of the ELM for handling the troublesome imbalanced data problem. In the WELM, the penalty factor \(C\) of the ELM is retained, while the minority class receives a greater effective penalty. A weight matrix \(W\) is then used to regulate \(C\), so (5) can be modified as [23]

$$\mathrm{min}:{L}_{\mathrm{WELM}}=\frac{1}{2}{\Vert \beta \Vert }^{2}+\frac{CW}{2}\sum \limits_{i=1}^{N}{\Vert {\xi }_{i}\Vert }^{2}.$$
(7)

In (7), the critical problem is to determine an appropriate weight matrix. Zong et al. proposed two different weighting schemes:

$${W}_{\mathrm{ELM}1}=\frac{1}{\mathrm{num}\left({t}_{i}\right)},$$
(8)
$${W}_{\mathrm{ELM}2}=\left\{\begin{array}{cc}\frac{0.618}{\mathrm{num}\left({t}_{i}\right)}& \mathrm{if \,\,num}\left({t}_{i}\right)>\mathrm{AVG}\left({t}_{i}\right)\\ \frac{1}{\mathrm{num}\left({t}_{i}\right)}& \mathrm{if\,\, num}\left({t}_{i}\right)\le \mathrm{AVG}\left({t}_{i}\right)\end{array}\right.,$$
(9)

where \(\mathrm{num}({t}_{i})\) is the number of samples belonging to the \(i\)th class and \(\mathrm{AVG}({t}_{i})\) denotes the average number of samples per class. Finally, (6) can be modified as

$$ f(x) = h(x)\beta = h(x)H^{T} \left( {\frac{I}{C} + WHH^{T} } \right)^{ - 1} WT. $$
(10)
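As a sketch of how the weighting enters the solution, the following NumPy fragment builds the diagonal weight matrix from Eqs. (8)–(9) and computes the output weights in the standard weighted form \(\beta = H^{T}(I/C + WHH^{T})^{-1}WT\); the names and the choice of this weighted form are our assumptions, not the authors' code.

```python
import numpy as np

def welm_weight_matrix(labels, scheme=1):
    """Diagonal sample-weight matrix from class counts, Eqs. (8)-(9).

    labels: (N,) integer class labels starting at 0.
    """
    counts = np.bincount(labels)
    avg = counts.mean()
    w = np.empty(len(labels))
    for i, t in enumerate(labels):
        if scheme == 1 or counts[t] <= avg:
            w[i] = 1.0 / counts[t]      # W_ELM1, and the balanced branch of W_ELM2
        else:
            w[i] = 0.618 / counts[t]    # majority-class branch of W_ELM2
    return np.diag(w)

def welm_output_weights(H, T, W, C=1.0):
    """beta = H^T (I/C + W H H^T)^{-1} W T, the weighted counterpart of Eq. (4)."""
    N = H.shape[0]
    return H.T @ np.linalg.solve(np.eye(N) / C + W @ H @ H.T, W @ T)
```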

Dictionary discriminative learning (DDL)

For the sparse coding problem, a classical dictionary learning problem is shown as follows [24,25,26]:

$$\underset{D,A}{\mathrm{min}}\frac{1}{2}{\Vert X-DA\Vert }^{2}+{F}_{\mathrm{s}}{\left(A\right)}^{2}+{F}_{\mathrm{d}}{\left(A\right)}^{2},$$
(11)

where \({F}_{\mathrm{s}}\) stands for the sparsity-inducing term and \({F}_{\mathrm{d}}\) stands for the discriminative term. Research has proven that adding discriminative information to sparse coding can significantly enhance classification performance. Liu et al. proposed a specific \({l}_{\mathrm{1,2}}\) norm to learn the discriminative term \({F}_{\mathrm{d}}\):

$${F}_{\mathrm{d}}\left(A\right)=\sum_{c=1}^{C}{\Vert {A}_{c}\Vert }_{\mathrm{1,2}},$$
(12)

where \({A}_{c}\) represents the coding vectors from class \(c\); the coding vectors from the same class will share the same sparse pattern, achieved simultaneously by the sparsity and discriminative encoding.

Elastic net regularization

Elastic net regularization can solve the variable selection problem effectively by combining the \({l}_{1}\) norm and the \({l}_{2}\) norm to obtain a better solution. The elastic net regularization problem can be expressed as follows [27, 28]:

$$P\left(x;\omega \right)=\omega {\Vert x\Vert }_{1}+\left(1-\omega \right)\frac{1}{2}{\Vert x\Vert }_{2}^{2},$$
(13)

where the parameter \(\omega \) controls the proportion between the \({l}_{1}\) norm and the \({l}_{2}\) norm. Evidently, there are two special cases: if \(\omega =0\), the elastic net regularization reduces to the pure \({l}_{2}\) (ridge) penalty; if \(\omega =1\), it reduces to the pure \({l}_{1}\) (Lasso) penalty.
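A short sketch of the penalty in Eq. (13) makes the role of \(\omega \) concrete (illustrative code only, not from the original implementation):

```python
import numpy as np

def elastic_net_penalty(x, omega):
    """P(x; omega) = omega * ||x||_1 + (1 - omega) * 0.5 * ||x||_2^2, Eq. (13)."""
    x = np.asarray(x, dtype=float).ravel()
    return omega * np.abs(x).sum() + (1.0 - omega) * 0.5 * np.dot(x, x)

# omega = 1 recovers the pure l1 (Lasso) penalty, omega = 0 the pure l2 (ridge) penalty.
print(elastic_net_penalty([1.0, -2.0, 0.0], omega=0.5))   # 0.5*3 + 0.5*0.5*5 = 2.75
```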

The proposed WELM-ERDDL

In this section, the proposed WELM-ERDDL is deduced in detail. The whole structure of the proposed WELM-ERDDL consists of two parts: the discriminative projection term and the discriminative sparse regularization term. For the discriminative projection term, exponential regularized linear discriminative analysis (ERLDA) is employed. Therefore, the proposed approach can not only acquire high-dimensional features for high performance without a parameter tuning process but also perform the dictionary learning process in a low-dimensional subspace.

Inspired by the dictionary learning problem given in (11), we formulate the following objective function of the proposed WELM-ERDDL approach:

$$\underset{D,A}{\mathrm{min}}\frac{1}{2}{\Vert \beta HW-DA\Vert }^{2}+{F}_{1}{\left(\beta \right)}^{2}+{F}_{2}{\left(A\right)}^{2},$$
(14)

where \(H\) is the nonlinear transform of the original input \(X\); \({F}_{1}\) stands for the discriminative projection term and \({F}_{2}\) stands for the discriminative sparse regularization term, which are regularizations for \(\beta \) and \(A,\) respectively. From (14), one may find that the proposed WELM-ERDDL has two merits:

  1. Using the nonlinear transform, the original input is mapped into high-dimensional features \(H\) without a parameter tuning process, so that the universal approximation capability can be guaranteed.

  2. The dictionary learning is performed solely in a lower dimensional space.

The discriminative projection term \({{\varvec{F}}}_{1}\)

In this part, the discriminative projection term is explained comprehensively. From the objective function given in (14), it can be seen that the discriminative subspace \(\beta HW\) is crucial for dictionary learning and is determined by the discriminative projection term \({F}_{1}\). Conventionally, the LDA approach is utilized for regularization, but it is always confronted with the small sample size problem. In this paper, ERLDA is adopted for the discriminative projection term. For the ERLDA approach, the discriminant criterion is given by:

$$J(W,\alpha )_{\mathrm{ERLDA}}=\frac{\left|{W}^{T}\mathrm{exp}({S}_{b})W\right|}{\left|{W}^{T}\mathrm{exp}({S}_{w}+\alpha I)W\right|}.$$
(15)

The orientation matrix is then computed by eigenvalue decomposition (EVD) of \({\left[\mathrm{exp}\left({S}_{w}+\alpha I\right)\right]}^{-1}\left[\mathrm{exp}\left({S}_{b}\right)\right]\), while the discriminative projection term \({F}_{1}\) is computed as follows:

$${F}_{1}\left(\beta \right)=\frac{{\lambda }_{1}}{2}\mathrm{tr}\left[\beta \left(\mathrm{exp}\left({S}_{w}+\alpha I\right)-\mathrm{exp}\left({S}_{b}\right)\right){\beta }^{T}\right],$$
(16)

where \({\lambda }_{1}\) is a hyperparameter; \({S}_{w}\) and \({S}_{b}\) are the intraclass scatter matrix and the interclass scatter matrix in the hidden space, respectively. The effect of (16) is to minimize the intraclass scatter and maximize the interclass scatter, so that the classes of features are separated alongside dictionary learning.
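The regularizer in Eq. (16) can be sketched as follows, assuming \(\beta \) acts as a \(k\times L\) projection of the hidden outputs and that \({S}_{w}\) and \({S}_{b}\) are the usual LDA scatter matrices computed in the hidden space; the matrix exponential comes from SciPy. This is a sketch of our reading of the equation, not the authors' code.

```python
import numpy as np
from scipy.linalg import expm

def scatter_matrices(H, labels):
    """Intraclass (S_w) and interclass (S_b) scatter matrices in the hidden space.

    H: (N, L) hidden-layer outputs; labels: (N,) class labels.
    """
    mean_all = H.mean(axis=0)
    L = H.shape[1]
    Sw, Sb = np.zeros((L, L)), np.zeros((L, L))
    for c in np.unique(labels):
        Hc = H[labels == c]
        mean_c = Hc.mean(axis=0)
        diff = Hc - mean_c
        Sw += diff.T @ diff                       # within-class scatter
        d = (mean_c - mean_all)[:, None]
        Sb += Hc.shape[0] * (d @ d.T)             # between-class scatter
    return Sw, Sb

def erlda_term(beta, Sw, Sb, alpha=1e-3, lam1=0.1):
    """F_1(beta) = (lam1/2) * tr[beta (exp(S_w + alpha*I) - exp(S_b)) beta^T], Eq. (16)."""
    M = expm(Sw + alpha * np.eye(Sw.shape[0])) - expm(Sb)
    return 0.5 * lam1 * np.trace(beta @ M @ beta.T)
```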

The discriminative sparse regularization term \({{\varvec{F}}}_{2}\)

The traditional \({l}_{\mathrm{1,2}}\) norm in (12) is inspired by multitask learning, in which similar tasks share similar sparse patterns; this means the row-sparse structure in (12) must select the same dictionary atoms within the same class. The drawback of this scheme is that it is hard to optimize and is sensitive to the optimization method. In this section, a simple and novel sparse regularization strategy is presented for dictionary learning. Then \({F}_{2}\) can be expressed as follows:

$${F}_{2}= \sum \limits_{c=1}^{C}\left({\lambda }_{2}+{\lambda }_{3}{\Vert {a}_{-c}^{\left(i\right)}\Vert }_{2}\right){\Vert {a}_{c}^{\left(i\right)}\Vert }_{2},$$
(17)

where \({a}_{c}^{\left(i\right)}\) and \({a}_{-c}^{\left(i\right)}\) are the rows of \({A}_{c}\) and \({A}_{-c}\), respectively, which stand for the coding vectors belonging to and not belonging to class \(c\); \({\lambda }_{2}\) and \({\lambda }_{3}\) are two hyperparameters. Defining \({\left[{W}_{C}\right]}_{ii}={\lambda }_{2}+{\lambda }_{3}{\Vert {a}_{-c}^{\left(i\right)}\Vert }_{2}\), (17) can be transformed into

$$ F_{2} = \mathop \sum \limits_{c = 1}^{C} \left\| {W_{C} A_{c} } \right\|_{1,2} . $$
(18)
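Reading Eq. (17) as a sum over classes and over dictionary atoms (rows of \({A}_{c}\)), the term can be sketched as below; the implicit sum over atoms and all the names are our assumptions.

```python
import numpy as np

def f2_term(A, labels, lam2=0.1, lam3=0.1):
    """Discriminative sparse regularization term of Eq. (17).

    A: (m, N) coding matrix whose columns are coding vectors;
    labels: (N,) class label of each column.
    """
    total = 0.0
    for c in np.unique(labels):
        A_c = A[:, labels == c]                    # coding vectors of class c
        A_rest = A[:, labels != c]                 # coding vectors of the other classes
        row_c = np.linalg.norm(A_c, axis=1)        # ||a_c^(i)||_2 per atom i
        row_rest = np.linalg.norm(A_rest, axis=1)  # ||a_{-c}^(i)||_2 per atom i
        total += np.sum((lam2 + lam3 * row_rest) * row_c)
    return total
```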

Formulation

In this part, after the discriminative projection term \({F}_{1}\) and the discriminative sparse regularization term \({F}_{2}\) are determined, the final formula for the proposed WELM-ERDDL is given as follows:

$$ \begin{gathered} \mathop {{\text{min}}}\limits_{D,A} \frac{1}{2}\left\| {\beta HW - DA} \right\|^{2} + F_{1} \left( \beta \right)^{2} + F_{2} \left( A \right)^{2} = \mathop {{\text{min}}}\limits_{D,A} \frac{1}{2}\left\| {\beta HW - DA} \right\|^{2} \hfill \\ \quad + \frac{{\lambda_{1} }}{2}{\text{tr}}\left[ {\beta \left( {\exp \left( {S_{w} + \alpha I} \right) - \exp \left( {S_{b} } \right)} \right)\beta^{T} } \right] + \mathop \sum \limits_{c = 1}^{C} \left\| {W_{C} A_{c} } \right\|_{1,2} . \hfill \\ \end{gathered} $$
(19)

Remarks

In (19), \({F}_{1}\) and \({F}_{2}\) are both designed as discriminative regularization terms, but they serve different purposes: the discriminative projection term \({F}_{1}\) is used to learn a suitable projection for feature representations, while the discriminative sparse regularization term \({F}_{2}\) is designed for discriminative dictionary learning by regularizing the sparse coding vectors.

The ELM learning

In this section, the parameter learning process of the proposed WELM-ERDDL framework is explained in detail. Firstly, the output weights \(\beta \) are derived with a more effective strategy using the elastic net regularization method. Next, a more robust adaptive online weight update rule for the WELM is given.

Update of \({\varvec{\beta}}\)

Normally, the output weights \(\beta \) are computed by minimizing the approximation error on the training data, but this requires the Moore–Penrose generalized inverse of \(H\). To solve this troublesome problem, Huang et al. added an \({l}_{2}\) norm regularization term to (5), while Tang et al. used the \({l}_{1}\) norm to restrict \(\beta \) and obtain a more meaningful and sparser value. However, the feature maps may outnumber the training data, and pairwise columns of \(H\) may be strongly correlated. Fortunately, elastic net regularization, combining the \({l}_{1}\) norm and the \({l}_{2}\) norm, provides an appropriate solution to the variable selection problem. In this part, \(\beta \) is determined through elastic net regularization.

In the output weight learning problem, the elastic net regularization can be expressed as

$$P\left(\beta ;\omega \right)=\omega {\Vert \beta \Vert }_{1}+\left(1-\omega \right)\frac{1}{2}{\Vert \beta \Vert }_{2}^{2},$$
(20)

Combined with (20), the final formula for the proposed WELM-ERDDL given in (19) can be transformed as follows:

$$ \begin{aligned} & \mathop {{\text{min}}}\limits_{D,A} \frac{1}{2}\left\| {\beta HW - DA} \right\|^{2} + F_{1} \left( \beta \right)^{2} + F_{2} \left( A \right)^{2} + \lambda_{4} P\left( {\beta ;\omega } \right) \\ & \quad = \mathop {{\text{min}}}\limits_{D,A} \frac{1}{2}\left\| {\beta HW - DA} \right\|^{2} + \frac{{\lambda_{1} }}{2}{\text{tr}}\left[ {\beta \left( {\exp \left( {S_{w} + \alpha I} \right) - \exp \left( {S_{b} } \right)} \right)\beta^{T} } \right] \\ & \qquad + \mathop \sum \limits_{c = 1}^{C} \left\| {W_{C} A_{c} } \right\|_{1,2} + \lambda_{4} \left[ {\omega \left\| \beta \right\|_{1} + \left( {1 - \omega } \right)\frac{1}{2}\left\| \beta \right\|_{2}^{2} } \right], \\ \end{aligned} $$
(21)

where \({\lambda }_{4}\) is the regularization parameter of the elastic net penalty. By introducing an auxiliary variable \(\gamma \) and a dual variable \(u\), and applying the augmented Lagrangian multiplier strategy with \(D\) and \(A\) held constant, we have

$$ \begin{aligned} & \mathop {{\text{min}}}\limits_{{\gamma ,\beta ,u}} \frac{1}{2}\left\| {\gamma HW - DA} \right\|^{2} + \frac{{\lambda _{1} }}{2}{\text{tr}}\left[ {\gamma \left( {\exp \left( {S_{w} + \alpha I} \right) - \exp \left( {S_{b} } \right)} \right)\gamma ^{T} } \right] \\ & \quad + \lambda _{4} \left[ {\omega \left\| \gamma \right\|_{1} + \left( {1 - \omega } \right)\frac{1}{2}\left\| \gamma \right\|_{2}^{2} } \right] + \frac{\rho }{2}\left\| {\gamma - \beta + u} \right\|_{2}^{2} , \\ \end{aligned} $$
(22)

Furthermore, the problem in (22) can be decomposed into three subproblems:

$$ \begin{aligned} \gamma^{k + 1} & = \mathop {{\text{argmin}}}\limits_{\gamma } \frac{1}{2}\left\| {\gamma HW - DA} \right\|^{2} + \frac{{\lambda_{1} }}{2}{\text{tr}}\left[ {\gamma \left( {\exp \left( {S_{w} + \alpha I} \right) - \exp \left( {S_{b} } \right)} \right)\gamma^{T} } \right] \\ & \quad + \lambda_{4} \left[ {\omega \left\| \gamma \right\|_{1} + \left( {1 - \omega } \right)\frac{1}{2}\left\| \gamma \right\|_{2}^{2} } \right] + \frac{\rho }{2}\left\| {\gamma - \beta^{k} + u^{k} } \right\|_{2}^{2} , \\ \end{aligned} $$
(23)
$${\beta }^{k+1}={\mathrm{argmin}\Vert {\gamma }^{k+1}-\beta +{u}^{k}\Vert }_{2}^{2},$$
(24)
$${u}^{k+1}={u}^{k}+{\gamma }^{k+1}-{\beta }^{k+1}.$$
(25)

Consequently, among these three subproblems, the first subproblem in (23) is a sparse coding problem with Lasso regularization, which can be solved with the shrinkage (soft-thresholding) function.
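As a reference point, the shrinkage (soft-thresholding) operator used for the \({l}_{1}\) part of the \(\gamma \)-subproblem is sketched below; the threshold and variable names are illustrative.

```python
import numpy as np

def soft_threshold(v, tau):
    """Element-wise shrinkage operator, the proximal map of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

# Example: entries smaller than the threshold are set to zero,
# larger ones are shrunk toward zero.
print(soft_threshold(np.array([1.2, -0.3, 0.7]), 0.5))   # approx. [0.7, 0.0, 0.2]
```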

The second subproblem in (24) is a quadratic optimization problem whose closed-form solution is given as follows:

$$ \begin{gathered} \beta^{k + 1} = \left( {2H^{T} HW + 2\lambda_{4} \left( {1 - \omega } \right)I + \lambda_{1} \left( {\exp \left( {S_{w} + \alpha I} \right) - \exp \left( {S_{b} } \right)} \right) + \rho I} \right)^{ - 1} \hfill \\ \quad \times \left( {2H^{T} HT + \rho \gamma^{k + 1} - u^{k} } \right). \hfill \\ \end{gathered} $$
(26)

Finally, the optimization problem of (21) can be summarized in Algorithm 1.


Adaptive online weight update for the WELM

This part focuses on the weight setting problem for the novel WELM proposed in this paper. In the previous work of Zong et al., the weights are computed from the number of samples belonging to each class. The drawback of this method is obvious: as the labeled samples increase, the weights used to penalize the newly added samples decrease sharply, so the final classification model focuses on the previous model irrespective of the newly added samples. Therefore, in this paper, an adaptive online weight learning rule for the WELM is given. For the newly added samples, the weights can be taken as follows:

$${w}_{i}=\left\{\begin{array}{ll}\frac{{N}^{+}}{{N}^{+}+{N}^{-}}& \text{if } {x}_{i} \text{ belongs to the majority class}\\ \frac{{N}^{-}}{{N}^{+}+{N}^{-}}& \text{if } {x}_{i} \text{ belongs to the minority class}\end{array}\right.,$$
(27)

where \({N}^{+}\) and \({N}^{-}\) denote the numbers of samples belonging to the positive (majority) class and the negative (minority) class, respectively. During the adaptive online weight update procedure, the weight mainly depends on the ratio \({N}^{+}:{N}^{-}\) (or \({N}^{-}:{N}^{+}\)).
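A minimal sketch of the rule in Eq. (27) for a newly added batch is given below, assuming binary labels and a known majority class; names are illustrative.

```python
import numpy as np

def adaptive_weights(labels, majority_label):
    """Per-sample weights for newly added samples, following Eq. (27) as stated."""
    n_pos = int(np.sum(labels == majority_label))   # N^+ (majority class)
    n_neg = len(labels) - n_pos                     # N^- (minority class)
    total = n_pos + n_neg
    return np.where(labels == majority_label, n_pos / total, n_neg / total)

# Example: with 8 majority and 2 minority samples, the weights are 0.8 and 0.2.
```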

Classification with the WELM-ERDDL

In this section, after \(\beta \) and \(w\) have been learned on the training set, each test sample \({x}_{\mathrm{test}}\) is mapped with the WELM-ERDDL, and its label can be predicted as

$$Y=H{\beta }^{*}.$$
(28)

Then, the class number \({c}_{i}\) of the unlabeled sample data can be determined by finding the maximum in the corresponding row:

$${c}_{i}= \underset{j}{\mathrm{argmax}}\ {Y}_{ij}.$$
(29)
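As a sketch, Eqs. (28)–(29) amount to one matrix product followed by a row-wise argmax (illustrative names):

```python
import numpy as np

def predict_classes(H_test, beta):
    """Y = H * beta (Eq. (28)); the class index is the row-wise argmax (Eq. (29))."""
    Y = H_test @ beta
    return np.argmax(Y, axis=1)
```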

Finally, the whole process of the proposed WELM-ERDDL method is summarized in Algorithm 2.


Experimental results and analysis

In this section, several experiments are provided in diverse ways to demonstrate the effectiveness of the proposed WELM-ERDDL approach.

Each experiment is run on a PC with an Intel Core i7-8700 at 3.40 GHz and 16 GB RAM. The proposed method is implemented in MATLAB 2013a, with the code for the other models taken directly from the code published by the respective authors. To verify the effectiveness and robustness of the WELM-ERDDL algorithm, the experiments are divided into five parts:

  1. In Section A, the databases used in the experiments and the corresponding parameter settings are described.

  2. In Section B, the performance of the proposed WELM-ERDDL approach is evaluated on four classical image classification databases: the AR face and Extended Yale B databases for face recognition, and the Fashion-MNIST and COIL-100 datasets for object classification. Specifically, the proposed method is compared with SRC, K-SVD, D-KSVD, FDDL, SVGDL, SDDL, ELM, WELM, and HI-DKIELM.

  3. In Section C, the learned representation \(\beta HW\) is compared with the original data and the classical ELM output \(\beta H\) to test its effectiveness.

  4. In Section D, the importance of the choice of \({\lambda }_{2}\) and \({\lambda }_{3}\) in the discriminative sparse regularization term \({F}_{2}\) of the WELM-ERDDL method is validated. In the meantime, the effect of the sparse representation classifier (SRC) for image classification is also tested.

  5. In Section E, the performance of the WELM-ERDDL in practical application learning tasks is verified and compared with other baseline methods.

Database and experiment parameter setting

In these experiments, four classical image classification databases are used: the AR and Extended Yale B databases for face recognition, and the Fashion-MNIST and COIL-100 datasets for object classification.

The AR face database consists of 4000 color images from 126 people with wide variations in brightness and pose. The sub-datasets are shown in Fig. 1a. Following the common experiment setting, a widely used subset is adopted which includes 2600 images from 50 males and 50 females, each of whom has 26 facial images of size \(165\times 120\). For each subject, 20 images are selected at random for training and the remaining 6 images are used for testing.

Fig. 1

The four benchmark databases

The Extended Yale B face database has 2414 frontal face images collected from 38 people, each of whom has about 64 images. This face database is challenging because all the images are captured with different facial expressions, occlusions, and lighting variations. Specific illustrations of the database are shown in Fig. 1b. Following the common experiment parameter setting, each image is normalized to \(192\times 168\) pixels. Half of the images are selected at random for training, with the rest used for testing.

The performance of the proposed method on object recognition is also evaluated on the Fashion-MNIST database as a substitute for the classical MNIST database. This database contains 70,000 images, 60,000 of which serve for training and the rest for testing. Each image is a \(28\times 28\) gray-scale image belonging to one of 10 classes. Illustrations of the data are shown in Fig. 1c.

Another dataset for object recognition is the COIL-100 dataset, which contains 7200 images of 100 objects, each captured from different views against a clean background. Examples from the dataset can be found in Fig. 1d. Following the common setting, 10 images are selected at random from each object for training and the rest for testing, and each image is resized to \(32\times 32\) pixels.

For the proposed method, the optimal parameters for each database are selected using cross-validation. The hidden dimension of the ELM is set to \(L=2000\), while the output dimension is set to \(n\). The number of dictionary atoms \(m\) and the regularization parameters \(\lambda_{1}\)–\(\lambda_{4}\) for each dataset are shown in Table 1.

Table 1 Parameter settings of each database

For the ELM, the main parameter is the number of hidden neurons. The optimal \({\lambda }_{1}\) and number of hidden neurons are selected through contrast experiments, with \({\lambda }_{1}\in \{0.3, 0.5, 0.7, 0.9\}\) and the number of hidden neurons varied between 1 and 100. Figure 2 shows the classification accuracy for different values of \({\lambda }_{1}\) and numbers of hidden neurons on the four benchmark databases. In the following experiments, \({\lambda }_{1}=0.9\) and 40 neurons are chosen, because this setting yields higher accuracy on all four databases.

Fig. 2

Classification accuracy at varying \({\lambda }_{1}\) and number of neurons on the four benchmark databases

Evaluation of the performance of the proposed WELM-ERDDL

In this part, the focus is on evaluating the performance of the proposed WELM-ERDDL method on the four benchmark databases. Specifically, the proposed method is compared with SRC, K-SVD, D-KSVD, FDDL, SVGDL, SDDL, ELM, WELM, and HI-DKIELM. Among these compared methods, SRC is nonparametric, with the training samples used directly as the dictionary; D-KSVD is based on the K-SVD method with different discriminative regularizations; FDDL employs Fisher discrimination, while SVGDL further extends this method using a support vector formulation; SDDL applies the \({l}_{\mathrm{1,2}}\) norm in the regularization to constrain the supports of the coding vectors; WELM is the weighted version of ELM with the weights computed from the number of samples belonging to each class; and HI-DKIELM is covered in our former work [29]. The experimental results of the proposed WELM-ERDDL and of the compared methods are summarized in Fig. 3. For each method, all experimental results are measured with the optimal hyperparameter settings and averaged over 10 runs. In the meantime, a detailed analysis is made with respect to each database.

Fig. 3

Recognition accuracy of the four benchmark databases

The detailed analysis of the experimental results is as follows:

  1. For the Extended Yale B database, the proposed WELM-ERDDL method achieves the best recognition result among all the compared methods. A closer look at the results reveals that the margin between the proposed method and SDDL is not large, about 0.82%. The proposed approach avoids \({l}_{0}\) minimization and is more stable over multiple runs, with a standard deviation of 0.24% over 10 runs. The SDDL method, by contrast, suppresses the overlapping support of different classes via \({l}_{0}\) minimization, which is approximated by the \({l}_{2}\) minimization, so it is unstable and sensitive to the regularization factor. It can also be seen that SDDL, SVGDL, and HI-DKIELM achieve high accuracy compared with the other methods. Moreover, our method has an accuracy gain of over 2.11% compared with SRC, which means the proposed method can achieve effective discriminative sparse representation.

  2. For the AR face database, the proposed WELM-ERDDL method outperforms all other compared methods, while the SDDL method achieves larger margins than the other dictionary learning methods, showing better performance in support discrimination.

  3. For the COIL-100 database, our method also outperforms the other compared methods. From the results in Fig. 3, the K-SVD method has the worst recognition performance, with a recognition rate of only 71.50%; meanwhile, the D-KSVD method exhibits significantly superior performance, which demonstrates the importance of discrimination for dictionary learning. From the detailed results in Fig. 3, it is also found that the SRC method outperforms the SVD-based methods, probably because the COIL-100 images are captured against a clean background with regular structure, so the images can be reconstructed nicely.

  4. For the Fashion-MNIST database, the performance is tested under a restricted experimental environment, with 600 images selected at random for training and the whole test set used for evaluation. From the experimental results in Fig. 3, our proposed method achieves the best accuracy. Unlike on the previous three databases, the SDDL method fails to yield results superior to the other compared methods, because the support size is not large enough to describe the test partition.

Testing of the effectiveness of the learned representation

In this part, the learned representations \(\beta HW\) are visualized to evaluate the effectiveness of the proposed WELM-ERDDL method on the Extended Yale B database and compared with the outputs \(\beta H\) of the traditional ELM. The output weights \(\beta \) are learned via (26), and the class weights \(W\) are learned via (27). Specifically, the learned representations \(\beta HW\) are further used to compute the sparse representations \(A\); then the original samples, the traditional ELM outputs \(\beta H\), and the sparse coding vectors obtained in this paper are embedded into two dimensions using t-SNE, with the results shown in Fig. 4.
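The two-dimensional embedding described here can be reproduced with any t-SNE implementation; a sketch using scikit-learn (our tooling assumption, since the paper's experiments were run in MATLAB) is:

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_2d(features, seed=0):
    """Embed representations (raw inputs, ELM outputs, or sparse codes) into 2-D for plots like Fig. 4."""
    return TSNE(n_components=2, random_state=seed).fit_transform(features)

# Hypothetical usage: tighter same-class clusters indicate more discriminative features.
points_2d = embed_2d(np.random.rand(200, 64))
```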

Fig. 4

Visualizations of learned representations using t-SNE. Data points of the same color belong to the same class. a Corresponds to the original data points. b Corresponds to the ELM outputs. c Corresponds to the sparse coding vectors

From the results in Fig. 4a, it can be clearly seen that the original data points are cluttered, with a mixed structure. However, when projected with the learned output weights \(\beta \) from (26), the representations become more separable. This phenomenon shows that the learned projection is better able to generate discriminative outputs. Nevertheless, Fig. 4b shows that some data points are still far apart and that some classes are even mixed with points of other classes. The outputs of sparse coding remedy these drawbacks and exhibit clear clusters with better separated representations, as shown in Fig. 4c, which means that the class weights learned via (27) can effectively improve discriminative learning and empirically justifies that the WELM-ERDDL scheme designed in this paper is beneficial for image classification.

Testing of the effectiveness of \({\lambda }_{2}\), \({\lambda }_{3}\), and SRC

This part begins with a test on the importance of properly choosing \({\lambda }_{2}\) and \({\lambda }_{3}\), which control the sparsity of the coding vectors in the proposed WELM-ERDDL method. Illustrations on a log scale for the four databases are shown in Fig. 5.

Fig. 5

Average accuracy of WELM-ERDDL for different choices of \({\lambda }_{2}\) and \({\lambda }_{3}\) for each database

From the results in Fig. 5, it can be clearly seen that the algorithm is more sensitive to \({\lambda }_{2}\), which controls the intraclass sparsity. On the other hand, as long as a suitable value of \({\lambda }_{3}\) is selected, the proposed method exhibits reasonable performance. It can also be noted that for Extended Yale B, AR Face, and COIL-100, the performance is more stable to parameter choices, while for Fashion-MNIST the parameters should be selected with greater care.

Finally, the effect of the sparse representation classifier (SRC) for image classification is evaluated. The SRC method is chosen because it uses the nonparametric coding vectors from the training set, learned with the class-specific weighted \({l}_{\mathrm{1,2}}\) norm, while computing the test code using the \({l}_{1}\) norm. In the experiments, it is compared against three other cases: using no ELM embedding (abbreviated as "w.o.ELM"), using no MMC (abbreviated as "w.o.MMC"), and using a linear predictor instead of SRC at test time (abbreviated as "Lin Pred"). The detailed results are shown in Fig. 6.

Fig. 6

Average test accuracy of WELM-ERDDL under different experimental settings

From the results in Fig. 6, the use of SRC improves performance. For AR Face and COIL-100, SRC does not give a significant boost (only about 0.1%). On the other hand, SRC is critical for the Fashion-MNIST database, where it improves the performance by almost 2.9%.

Real-world application learning tasks

In this section, the performance of the proposed WELM-ERDDL is evaluated in a real-world application: a content-based image classification task.

The original data are color images from the Corel dataset. Each image is segmented, using the Blobworld system, into fragments that represent instances. The fragments containing specific visual content (e.g., an elephant) are labeled positive, while the remaining fragments are labeled negative. Therefore, the fragments (i.e., instances) from the same kind of images (e.g., elephant images) constitute a binary learning problem. For the five image datasets Tiger, Elephant, Fox, Bikes, and Cars, the numbers of instances are 1096, 1259, 1474, 5215, and 5600, respectively. The instances in the Tiger, Elephant, and Fox datasets are described by a 230-dimensional feature vector representing the color, texture, and shape of the region, while the instances in the Bikes and Cars datasets are represented by a 90-dimensional feature vector.

Visual content-based image retrieval is an important application of image classification, for example, finding pictures containing an elephant in a dataset. Sample images from the benchmark datasets are shown in Fig. 7.

Fig. 7

Example images used in the experiments from the COREL image categorization database

The detailed experimental results for the image classification tasks are shown in Tables 2 and 3 in terms of classification accuracy (ACC) and area under the curve (AUC), respectively. In both tables, the proposed WELM-ERDDL method achieves the best ACC and AUC on all image datasets, indicating that it is superior to the other methods in content-based image retrieval tasks. This performance is owing to the many local approximations created by the proposed method. The results also show that the naive Bayes (NB) method has the worst ACC and AUC on the Tiger, Elephant, Bikes, and Cars datasets, whereas the SVM method has the worst performance on the Fox dataset. For the other baselines, more detailed experimental results can be found in Tables 2 and 3.

Table 2 Experimental results on image classification datasets in terms of classification accuracy (ACC, %) and runtime (s)
Table 3 Experimental results on image classification datasets in terms of the area under the ROC curve (AUC, %) and runtime (s)

Conclusion

In this paper, a novel nonlinear feature extraction approach called weighted extreme learning machine exponential regularized discriminative dictionary learning (WELM-ERDDL) has been proposed. Evaluations on common benchmark datasets have shown that the proposed method achieves better results than state-of-the-art dictionary learning algorithms. The proposed method has several features that distinguish it from existing ELM-based methods.

  1. A nonlinear WELM-embedded feature projection strategy via exponential regularized discriminative dictionary learning has been given to achieve feature diversity at low computational cost.

  2. During the ELM learning stage, the output weights have been updated through elastic net regularization to enhance their compactness and meaningfulness.

  3. An effective adaptive online weight update criterion has been designed for the WELM.

In future work, there is room for further development of the proposed method. First, it remains challenging to bring more insight into the ELM to explore its deep learning capability. Second, the dropout technique or a locality encoding method may be considered to further improve the performance of the algorithm, and more effective approaches will be needed to cope with large-scale image classification problems.