1 Introduction

Artificial neural networks (ANNs) have been widely used in various fields owing to their strong learning ability and fast optimization capability [1, 2]. However, ANN parameters are usually estimated by gradient descent, which increases the time complexity of the algorithm and makes it prone to falling into local optima. Because of these shortcomings, training the network may take a long time and the best solution is not guaranteed. Searching for efficient, real-time neural networks has therefore become a main research direction for many scholars.

The Extreme Learning Machine (ELM) [3] is a single-hidden-layer feed-forward neural network (SLFN) proposed by Huang et al. in 2006. The method randomly generates the input weights and hidden-node biases and computes the output weights analytically, without iterative training, so its learning speed is much higher than that of traditional neural network algorithms. ELM also differs from traditional neural networks in how the model is constructed: it obtains the minimum-norm output weights while reaching the minimum training error. According to neural network theory, for feed-forward networks, the smaller the training error and the norm of the output weights, the stronger the generalization ability of the network [4]; the generalization ability of ELM can therefore be argued theoretically. In recent years, because it effectively alleviates the defects of traditional neural networks, ELM has been applied in many fields, such as face recognition [5], classification, regression, image processing [6], ground reconstruction [7], and so on.

Although ELM has a fast learning speed and good generalization performance, some column vectors of the hidden layer design matrix may be approximately linearly dependent; that is, the hidden layer design matrix may be multicollinear or ill-conditioned. Estimating the solution of such an ill-conditioned system by ordinary least squares results in poor generalization performance and stability. To this end, many scholars have improved the ELM model; for example, Li et al. proposed an improved extreme learning machine regression algorithm based on condition indexes and variance decomposition proportions (CV-ELM) [8], and Ceng Lin et al. proposed an extreme learning machine based on principal component estimation (PC-ELM) [9].

The CV-ELM algorithm improves the generalization performance of the extreme learning machine to a certain extent and ensures good robustness. However, the algorithm still has defects in some cases, so the model cannot reach the minimum error. Our analysis suggests two main reasons. First, high-dimensional data may obscure the noise components in the data, so the method cannot completely isolate the relevant variables; that is, the algorithm cannot fully remove the noise. Second, some initial parameters of CV-ELM, such as the input weights and hidden layer biases, are generated randomly, which affects the generalization ability and robustness of the model. Therefore, this paper introduces the ensemble learning method [10] into the CV-ELM algorithm. The method trains a number of similar learners in parallel and then selects the best subset of these learners for integration [11]. Regression experiments on multiple data sets show that the method achieves good generalization performance and stability.

2 Review of ELM and CV-ELM

In this section, we briefly review ELM [3] and CV-ELM [8].

2.1 Extreme Learning Machine

For N arbitrary distinct samples \( (x_i, y_i) \), where \( x_i = [x_{i1}, x_{i2}, \ldots, x_{in}]^T \in R^n \) is the n-dimensional feature vector of the i-th sample and \( y_i = [y_{i1}, y_{i2}, \ldots, y_{im}]^T \in R^m \) is the corresponding target, the output of a feed-forward neural network with L hidden nodes and activation function G(x) can be expressed as

$$ f_L(x) = \sum_{i=1}^{L} \beta_i G(a_i \cdot x + b_i), \quad a_i \in R^n, \; \beta_i \in R^m, $$
(1)

where \( a_i = [a_{i1}, a_{i2}, \ldots, a_{in}]^T \) is the weight vector connecting the i-th hidden node to the input nodes, \( b_i \) is the bias of the i-th hidden node, \( \beta_i = [\beta_{i1}, \beta_{i2}, \ldots, \beta_{im}]^T \) is the weight vector connecting the i-th hidden node to the output nodes, and \( a_i \cdot x \) denotes the inner product of \( a_i \) and \( x \). The activation function G(x) can be chosen as the sigmoid, sine, or RBF function, among others.
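As an illustration only (not part of the original formulation), the hidden-node output for these common activation choices could be computed as follows in Python/NumPy; the function names are ours, and the RBF form follows the usual ELM convention of treating \( a_i \) as a centre and \( b_i \) as an impact width:

```python
import numpy as np

def sigmoid_node(a, b, x):
    # Additive sigmoid node: G(a, b, x) = 1 / (1 + exp(-(a . x + b)))
    return 1.0 / (1.0 + np.exp(-(np.dot(a, x) + b)))

def sine_node(a, b, x):
    # Additive sine node: G(a, b, x) = sin(a . x + b)
    return np.sin(np.dot(a, x) + b)

def rbf_node(a, b, x):
    # RBF node with centre a and width b > 0: G(a, b, x) = exp(-b * ||x - a||^2)
    return np.exp(-b * np.sum((x - a) ** 2))
```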

If this feed-forward neural network with L hidden nodes and m output nodes can approximate these N samples with zero error, then the above N equations can be expressed as

$$ f_L(x_j) = \sum_{i=1}^{L} \beta_i G(a_i \cdot x_j + b_i) = y_j, \quad j = 1, 2, \ldots, N, $$
(2)

Equation (2) can be written compactly as

$$ H\beta = Y $$
(3)

where

$$ H = \left[ \begin{array}{ccc} G(a_1, b_1, x_1) & \cdots & G(a_L, b_L, x_1) \\ \vdots & \ddots & \vdots \\ G(a_1, b_1, x_N) & \cdots & G(a_L, b_L, x_N) \end{array} \right] $$
(4)
$$ \beta = \left[ \begin{array}{c} \beta_1^T \\ \vdots \\ \beta_L^T \end{array} \right]_{L \times m} \qquad Y = \left[ \begin{array}{c} y_1^T \\ \vdots \\ y_N^T \end{array} \right]_{N \times m} $$
(5)

H is called the hidden layer output matrix of the network, and ELM training can thus be transformed into the problem of solving the least-squares solution for the output weights. The output weight matrix \( \hat{\beta} \) can be obtained from (6):

$$ \hat{\beta} = \left( H^T H \right)^{-1} H^T Y = H^{+} Y $$
(6)

where \( H^{+} \) denotes the Moore-Penrose generalized inverse of the hidden layer output matrix H.
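As a concrete illustration, a minimal NumPy sketch of the ELM training and prediction steps described above is given below; the function and variable names are ours, and a sigmoid activation is assumed:

```python
import numpy as np

def elm_train(X, Y, L, seed=0):
    """Minimal ELM: X is an N x n input matrix, Y is an N x m target matrix, L hidden nodes."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    A = rng.uniform(-1.0, 1.0, size=(L, n))      # random input weights a_i
    b = rng.uniform(-1.0, 1.0, size=L)           # random hidden biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ A.T + b)))     # hidden layer output matrix, N x L
    beta = np.linalg.pinv(H) @ Y                 # least-squares output weights, cf. Eq. (6)
    return A, b, beta

def elm_predict(X, A, b, beta):
    """Predict with the trained ELM."""
    H = 1.0 / (1.0 + np.exp(-(X @ A.T + b)))
    return H @ beta
```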

2.2 CV-ELM

The hidden layer output matrix H is first computed as in the ELM model, and its columns are then partitioned according to the condition indexes and variance decomposition proportions to obtain

$$ H = \left( H_1, H_2 \right) $$
(7)

where \( H_1 \) contains the non-interfering columns of the hidden layer matrix H and \( H_2 \) contains the interfering columns.

According to the least-squares (LS) principle, the output weight matrix \( \hat{\beta} \) can be obtained from (8):

$$ \begin{aligned} \hat{\beta} & = \left( H^T H \right)^{-1} H^T Y \\ & = \left( [H_1, H_2]^T [H_1, H_2] \right)^{-1} [H_1, H_2]^T Y \\ & = \left[ \begin{array}{cc} H_1^T H_1 & H_1^T H_2 \\ H_2^T H_1 & H_2^T H_2 \end{array} \right]^{-1} [H_1, H_2]^T Y \end{aligned} $$
(8)

In order to enhance the generalization performance and stability of the model without distorting the non-interfering data, a small constant is added to the diagonal elements of the interfering block only. The output weight matrix \( \hat{\beta} \) is then obtained from (9):

$$ \hat{\beta} = \left[ \begin{array}{cc} H_1^T H_1 & H_1^T H_2 \\ H_2^T H_1 & H_2^T H_2 + kI \end{array} \right]^{-1} [H_1, H_2]^T Y $$
(9)

where k is a small constant (the ridge parameter) and \( I \) is the identity matrix.

The CV-ELM algorithm is described as follows:

Given training samples \( (x_i, y_i), \; i = 1, \ldots, N \), the number of hidden nodes L, and the activation function \( G(x) \):

1. Randomly set the input weights \( a_i \) and biases \( b_i \), \( i = 1, \ldots, L \);

2. Compute the hidden layer output matrix H;

3. Partition H into \( (H_1, H_2) \) using the condition indexes and variance decomposition proportions;

4. Determine the ridge parameter k;

5. Compute the output weight matrix \( \hat{\beta} \) by Eq. (9).
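The following NumPy sketch illustrates this procedure. The exact column-partitioning rule of [8] is not reproduced here; `split_columns` is a simplified stand-in that flags columns whose variance decomposition proportion is dominated by a large condition index, and all names are ours:

```python
import numpy as np

def split_columns(H, cond_threshold=30.0):
    """Crude stand-in for the condition-index / variance-decomposition split:
    a column is treated as 'interfering' (H2) if a singular value with a large
    condition index accounts for most of its coefficient variance."""
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    cond_index = s[0] / s                        # condition index of each singular value
    phi = (Vt.T / s) ** 2                        # phi[j, k] = v_{jk}^2 / s_k^2
    prop = phi / phi.sum(axis=1, keepdims=True)  # variance decomposition proportions
    bad = np.any((cond_index >= cond_threshold) & (prop > 0.5), axis=1)
    return np.where(~bad)[0], np.where(bad)[0]   # column indices of H1 and H2

def cv_elm_solve(H, Y, k=1e-3):
    """Compute Eq. (9): ridge-augment only the interfering block H2."""
    idx1, idx2 = split_columns(H)
    order = np.concatenate([idx1, idx2])
    Hs = H[:, order]                             # reordered H = (H1, H2)
    reg = np.zeros(H.shape[1])
    reg[len(idx1):] = k                          # add k only to the H2 diagonal block
    beta_s = np.linalg.solve(Hs.T @ Hs + np.diag(reg), Hs.T @ Y)
    beta = np.empty_like(beta_s)
    beta[order] = beta_s                         # restore the original column order
    return beta
```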

3 Improved Ensemble Extreme Learning Machine

The CV-ELM algorithm overcomes the deterioration of generalization performance and robustness that occurs when the hidden layer design matrix is ill-conditioned. However, the random generation of the input weights and the incomplete removal of noise in high-dimensional data can still degrade generalization performance and robustness. Therefore, this paper proposes a CV-ELM regression algorithm based on ensemble learning (ECV-ELM), which exploits the complementarity among multiple learners to achieve better ensemble performance. ECV-ELM overcomes the poor model stability caused by the random generation of the input weights and biases and by the incomplete noise removal of CV-ELM. It combines the ensemble learning method with the CV-ELM regression algorithm and selects appropriate sub CV-ELM models with common selection methods, which further improves the performance of the whole model.

Assume that the training set and test set are \( G = \{ (x_i, y_i) \mid i = 1, 2, \ldots, l \} \) and \( G' = \{ (x_i, y_i) \mid i = 1, 2, \ldots, l \} \), where \( x_i \) is the model input and \( y_i \) is the model output. First, different training subsets \( \{ G_1, \ldots, G_T \} \) are drawn from the training set G, and several different sub CV-ELM models are generated from these subsets. Then, a portion of the best sub CV-ELM models is selected according to the training results. Finally, their outputs are combined by the simple average method.

In summary, the proposed ECV-ELM ensemble regression algorithm can be summarized as follows (a minimal sketch is given after the list):

Input: Training sample set G

Output: Ensemble CV-ELM regression model

1. Use the training set G to randomly generate T overlapping data subsets \( \{ G_1, G_2, \ldots, G_T \} \); set the activation function of all sub-models to g(x) and the number of hidden layer neurons to L;

2. Initialize t = 1;

3. Check whether the number of iterations has been reached, i.e., whether t <= T; if yes, execute step (4); otherwise, execute step (7);

4. Use a random function to generate the input weights a and the hidden layer biases b;

5. Train the t-th sub CV-ELM model using the randomly generated a and b and the t-th data subset, then set t = t + 1;

6. Return to step (3);

7. Calculate the MSE of all the sub-models by Eq. (10) and select the k best sub CV-ELM models according to these values;

    $$ MSE = \frac{1}{n} \left( Y - H\hat{\beta} \right)^{T} \left( Y - H\hat{\beta} \right) $$
    (10)

8. Integrate the k selected sub CV-ELM models with the simple average method to obtain the final model, i.e., the ECV-ELM model.
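A compact NumPy sketch of this ensemble procedure is given below. It reuses the hypothetical `cv_elm_solve` helper sketched earlier in Sect. 2.2; the subset sampling scheme and the choice of evaluating the MSE on the full training set are our own simplifications, and all names are illustrative:

```python
import numpy as np

def ecv_elm_train(X, Y, L, T=20, k_best=10, subset_ratio=0.75, ridge_k=1e-3, seed=0):
    """Train T CV-ELM sub-models on random overlapping subsets and keep the k_best by MSE."""
    rng = np.random.default_rng(seed)
    N, n = X.shape
    models = []
    for t in range(T):
        idx = rng.choice(N, size=int(subset_ratio * N), replace=False)  # t-th data subset G_t
        A = rng.uniform(-1.0, 1.0, size=(L, n))            # random input weights a
        b = rng.uniform(-1.0, 1.0, size=L)                 # random hidden biases b
        H = 1.0 / (1.0 + np.exp(-(X[idx] @ A.T + b)))      # hidden layer output on G_t
        beta = cv_elm_solve(H, Y[idx], k=ridge_k)          # sub CV-ELM output weights, Eq. (9)
        H_full = 1.0 / (1.0 + np.exp(-(X @ A.T + b)))
        mse = np.mean((Y - H_full @ beta) ** 2)            # training MSE, cf. Eq. (10)
        models.append((mse, A, b, beta))
    models.sort(key=lambda m: m[0])                        # keep the k_best sub-models
    return models[:k_best]

def ecv_elm_predict(X, models):
    """Simple average of the selected sub-models' predictions."""
    preds = [1.0 / (1.0 + np.exp(-(X @ A.T + b))) @ beta for _, A, b, beta in models]
    return np.mean(preds, axis=0)
```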

4 Experiment and Analysis

This section analyzes the ensemble-learning-based CV-ELM regression method (ECV-ELM) proposed in the previous section. For the experimental analysis, the running time and prediction results of the standard ELM, CV-ELM, and ECV-ELM algorithms are compared. The experiments use five regression data sets from the UCI repository and LIACC: the Balloon, California House, Cloud, Strike, and Bodyfat data sets. The numbers of input attributes and of samples differ greatly among these five data sets, which allows a better analysis of the algorithms' performance, as shown in Table 1. The number of hidden layer nodes required for each data set in the ECV-ELM sub-models is shown in Table 2. All experiments in this section were run under the Windows 7 64-bit operating system in the Matlab 2016 environment on a 3.30 GHz i5-4590 CPU with 4 GB RAM.

Table 1. Regression analysis dataset.
Table 2. Number of required hidden layer nodes in each dataset.

Table 3 compares the training and testing times of ELM, ECV-ELM, and CV-ELM on the data sets. Table 4 compares their test RMSE, and Table 5 compares their test DEV. The activation function of the standard ELM, CV-ELM, and ECV-ELM models is the sigmoid function in all cases. In the ECV-ELM model, the number of trained sub-models is T = 20, the number of integrated sub-models is k = 10, and the training subset of each sub-model is three quarters of the training set, drawn at random.

Table 3. Comparison of training and testing time of ELM, ECV-ELM, CV-ELM.
Table 4. Comparison of testing RMSE of ELM, ECV-ELM, CV-ELM.
Table 5. Comparison of testing DEV of ELM, ECV-ELM, CV-ELM.

5 Conclusions

In this paper, a CV-ELM regression algorithm based on ensemble learning is proposed. The algorithm uses the ensemble learning method to overcome the poor model stability caused by the random input weights and biases of CV-ELM and by its incomplete noise removal. The performance of the ECV-ELM algorithm is analyzed on different regression data sets. The results show that although the time cost of ECV-ELM is higher than that of CV-ELM, the generalization performance and robustness of the algorithm are greatly improved.

Compared with the CV-ELM algorithm, the proposed ECV-ELM algorithm trains multiple CV-ELM sub-models and then uses ensemble learning to exploit the complementarity among these learners, thus improving the generalization ability and robustness of the algorithm.