SGB-ELM: An Advanced Stochastic Gradient Boosting-Based Ensemble Scheme for Extreme Learning Machine

A novel ensemble scheme for extreme learning machine (ELM), named Stochastic Gradient Boosting-based Extreme Learning Machine (SGB-ELM), is proposed in this paper. Rather than plugging the stochastic gradient boosting method into the ELM ensemble procedure naively, SGB-ELM constructs a sequence of weak ELMs in which each individual ELM is trained additively by optimizing a regularized objective. Specifically, we design an objective function based on the boosting mechanism and add a regularization term to alleviate overfitting. The derivation formula for the output-layer weights of each weak ELM is then obtained using second-order optimization. Because this formula is hard to solve analytically and the regularized objective favors simple functions, we take the output-layer weights learned from the current pseudo residuals as an initial heuristic item and obtain the optimal output-layer weights by using the derivation formula to update the heuristic item iteratively. In comparison with several typical ELM ensemble methods, SGB-ELM achieves better generalization performance and prediction robustness, which demonstrates its feasibility and effectiveness.


Introduction
Extreme learning machine (ELM) was proposed by Huang [1][2][3] as a promising learning algorithm for single-hidden-layer feedforward neural networks (SLFN). It randomly chooses the weights and biases of the hidden nodes and analytically determines the output-layer weights using the Moore-Penrose (MP) generalized inverse [4]. Because it avoids iterative parameter adjustment and time-consuming weight updating, ELM learns extremely fast and has therefore attracted a lot of attention. However, random initialization of the input-layer weights and hidden biases may produce suboptimal parameters, which harm its generalization performance and prediction robustness.
To alleviate this weakness, many works have been proposed to further improve the generalization capability and stability of ELM, among which ELM ensemble algorithms are the representative ones. Three representative ELM ensemble algorithms are summarized as follows. The earliest ensemble-based ELM (EN-ELM) method was presented by Liu and Wang in [5]. EN-ELM introduced a cross-validation scheme into its training phase: the original training dataset was partitioned into subsets, and pairs of training and validation sets were formed so that each training set consists of all but one of the subsets. Individual ELMs, each with updated input weights and hidden biases, were then trained on each pair of training and validation sets, so the total number of ELMs constructed for decision-making in EN-ELM is the number of parameter updates times the number of training/validation pairs. Cao et al. [6] proposed a voting-based ELM (V-ELM) ensemble algorithm, which makes the final decision by majority voting in classification applications. All the individual ELMs in V-ELM were trained on the same training dataset, and the learning parameters of each basic ELM were randomly initialized independently. Moreover, a genetic ensemble of ELM (GE-ELM) method was designed by Xue et al. in [7], which used a genetic algorithm to produce optimal input weights and hidden biases for the individual ELMs and selected, from the candidate networks, ELMs with not only higher fitness values but also a smaller norm of output weights. In GE-ELM, the fitness value of each individual ELM was evaluated on a validation set randomly selected from the entire training dataset. Several other types of ELM ensemble algorithms can be found in the literature [8][9][10][11][12][13].
As for ensembles of traditional neural networks, the most prevalent approaches are Bagging and Boosting. Bagging [14] generates several training datasets from the original training dataset and then trains a component neural network on each of them. Boosting [15] generates a series of component neural networks whose training datasets are determined by the performance of the former ones. There are also many other approaches for training the component neural networks. Hampshire [16] utilizes different objective functions to train distinct component neural networks. Xu et al. [17] introduce the stochastic gradient boosting ensemble scheme to bioinformatics applications. Yao et al. [18] regard all the individuals in an evolved population of neural networks as component networks.
In this paper, we propose a new ELM ensemble scheme called Stochastic Gradient Boosting-based Extreme Learning Machine (SGB-ELM), which makes use of the mechanism of stochastic gradient boosting [19,20]. SGB-ELM constructs an ensemble model by training a sequence of ELMs in which the output weights of each individual ELM are learned by optimizing a regularized objective in an additive manner. More specifically, we design an objective based on the training mechanism of the boosting method. To alleviate overfitting, we add to the objective function a regularization term that controls the complexity of our ensemble model. The derivation formula for the output weights of the newly added ELM is then obtained by optimizing the objective using second-order approximation. Because the output weights of the newly added ELM at each iteration are hard to calculate analytically from the derivation formula, we take the output weights learned on the pseudo-residuals-based training dataset as an initial heuristic item and obtain the optimal output weights by using the derivation formula to update the heuristic item iteratively. Because the regularized objective favors functions that are not only predictive but also simple, and because a randomly selected subset rather than the whole training set is used to minimize the training residuals at each iteration, SGB-ELM can continually improve the generalization capability of ELM while effectively avoiding overfitting. The experimental results in comparison with Bagging ELM, Boosting ELM, EN-ELM, and V-ELM show that SGB-ELM obtains better classification and regression performance, which demonstrates the feasibility and effectiveness of the SGB-ELM algorithm.
The rest of this paper is organized as follows. In Section 2, we briefly summarize the basic ELM model as well as the stochastic gradient boosting method. Section 3 introduces our proposed SGB-ELM algorithm. Experimental results are presented in Section 4. Finally, we conclude this paper and make some discussions in Section 5.

Preliminaries
In this section, we briefly review the principles of basic ELM model and the stochastic gradient boosting method to provide necessary backgrounds for the development of SGB-ELM algorithm in Section 3.

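Extreme Learning Machine.
ELM is a special learning algorithm for SLFN, which randomly selects the weights (linking the input layer to the hidden layer) and biases of the hidden nodes and analytically determines the output weights (linking the hidden layer to the output layer) using the MP generalized inverse. Suppose we have a training dataset with $N$ instances
$$\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N}, \quad \mathbf{x}_i \in \mathbb{R}^{d}, \quad \mathbf{y}_i \in \mathbb{R}^{m}.$$
It is known that $m = 1$ for regression and $m > 1$ for classification. In ELM, the input weights and hidden biases can be randomly chosen according to any continuous probability distribution [2]. Namely, we randomly select the learning parameters within the range of $[-1, 1]$,
$$\mathbf{W} = [\mathbf{w}_1, \mathbf{w}_2, \cdots, \mathbf{w}_L] \in [-1, 1]^{d \times L}$$
and
$$\mathbf{b} = [b_1, b_2, \cdots, b_L] \in [-1, 1]^{L},$$
where $L$ is the number of hidden-layer nodes in the SLFN. According to the theory proved in [2], the output-layer weights $\boldsymbol{\beta}$ in the ELM model can be analytically calculated by
$$\boldsymbol{\beta} = \mathbf{H}^{\dagger} \mathbf{T}.$$
Here, $\mathbf{H}^{\dagger}$ is the MP generalized inverse of the hidden-layer output matrix $\mathbf{H} = g(\mathbf{X}\mathbf{W} + \mathbf{b})$, where $g(\cdot)$ is the activation function, and $\mathbf{T} = [\mathbf{y}_1; \mathbf{y}_2; \cdots; \mathbf{y}_N]$ is the target matrix. Generally, for an unseen instance $\hat{\mathbf{x}} = (\hat{x}_1, \hat{x}_2, \cdots, \hat{x}_d)$, ELM predicts its output $\hat{\mathbf{y}}$ as follows:
$$\hat{\mathbf{y}} = \mathbf{h}(\hat{\mathbf{x}}) \boldsymbol{\beta},$$
where $\mathbf{h}(\hat{\mathbf{x}}) = [g(\hat{\mathbf{x}}\mathbf{w}_1 + b_1), \cdots, g(\hat{\mathbf{x}}\mathbf{w}_L + b_L)]$ is the hidden-layer output vector of $\hat{\mathbf{x}}$. Because it avoids iterative adjustment of the input-layer weights and hidden biases, ELM can train thousands of times faster than traditional gradient-based learning algorithms [2]. At the same time, ELM also produces good generalization performance. It has been verified that ELM can achieve generalization performance comparable to that of the typical Support Vector Machine algorithm [3].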
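The training and prediction steps described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the sigmoid activation, the toy regression target, and the function names `elm_fit` and `elm_predict` are assumptions introduced here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_fit(X, T, n_hidden, rng):
    """Basic ELM: random hidden layer, output weights via the MP generalized inverse."""
    d = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(d, n_hidden))  # input weights drawn from [-1, 1]
    b = rng.uniform(-1.0, 1.0, size=n_hidden)       # hidden biases drawn from [-1, 1]
    H = sigmoid(X @ W + b)                          # hidden-layer output matrix (N x L)
    beta = np.linalg.pinv(H) @ T                    # beta = H^dagger T
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Predict with y_hat = h(x) beta for every row of X."""
    return sigmoid(X @ W + b) @ beta

# Toy regression problem (m = 1); the data are synthetic and purely illustrative.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 3))
T = np.sin(X.sum(axis=1, keepdims=True))
W, b, beta = elm_fit(X, T, n_hidden=20, rng=rng)
train_rmse = float(np.sqrt(np.mean((elm_predict(X, W, b, beta) - T) ** 2)))
```

Only `beta` is learned from data; `W` and `b` stay at their random values, which is why no iterative tuning is needed.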

Stochastic Gradient Boosting.
The stochastic gradient boosting scheme was proposed by Friedman in [20] as a variant of the gradient boosting method presented in [19]. Given a training set $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N}$, the goal is to learn a hypothesis $F_K(\mathbf{x})$ that maps $\mathbf{x}$ to $\mathbf{y}$ and minimizes the training loss as follows:
$$F_K = \arg\min_{F} \sum_{i=1}^{N} l(F(\mathbf{x}_i), \mathbf{y}_i),$$
where $l(\cdot, \cdot)$ is the loss function, which evaluates the difference between the predicted value and the target, and $K$ denotes the number of iterations. In the boosting mechanism, $K$ additive individual learners are trained sequentially by
$$f_k = \arg\min_{f} \sum_{i=1}^{N} l(F_{k-1}(\mathbf{x}_i) + f(\mathbf{x}_i), \mathbf{y}_i)$$
and
$$F_k(\mathbf{x}) = F_{k-1}(\mathbf{x}) + f_k(\mathbf{x}),$$
where $k = 1, 2, \cdots, K$. This optimization problem depends heavily on the loss function and becomes intractable when $l(\cdot, \cdot)$ is complex. Gradient boosting instead constructs the weak individuals based on the pseudo residuals, which are the negative gradient of the loss function with respect to the model values predicted at the current learning step. For instance, let $\mathbf{r}_i^{(k)}$ be the pseudo residual of the $i$th sample at the $k$th iteration, written as
$$\mathbf{r}_i^{(k)} = -\left[ \frac{\partial l(F(\mathbf{x}_i), \mathbf{y}_i)}{\partial F(\mathbf{x}_i)} \right]_{F = F_{k-1}},$$
and thus the $k$th weak learner $f_k(\mathbf{x})$ is trained by
$$f_k = \arg\min_{f} \sum_{i=1}^{N} \left\| \mathbf{r}_i^{(k)} - f(\mathbf{x}_i) \right\|^2.$$
Because gradient boosting constructs the additive ensemble model by sequentially fitting a weak individual learner to the current pseudo residuals of the whole training dataset at each iteration, it costs much training time and may suffer from overfitting. In view of that, a minor modification named stochastic gradient boosting was proposed to incorporate some randomization into the procedure. Specifically, at each iteration a randomly selected subset instead of the full training dataset is used to fit the individual learner and compute the model update for the current iteration. Namely, let $\{\pi(i)\}_{1}^{N}$ be a random permutation of the integers $\{1, 2, \cdots, N\}$; the subset with size $\tilde{N} < N$ of the entire training dataset can then be given by $\{(\mathbf{x}_{\pi(i)}, \mathbf{y}_{\pi(i)})\}_{i=1}^{\tilde{N}}$.
Furthermore, the $k$th weak learner using the stochastic gradient boosting ensemble scheme is trained by solving the following optimization problem:
$$f_k = \arg\min_{f} \sum_{i=1}^{\tilde{N}} \left\| \mathbf{r}_{\pi(i)}^{(k)} - f(\mathbf{x}_{\pi(i)}) \right\|^2.$$
Given the base learner $f_0(\mathbf{x})$, which is trained on the initial training dataset, the final ensemble learning model constructed by the stochastic gradient boosting scheme predicts an unknown testing instance $\hat{\mathbf{x}}$ as follows:
$$\hat{\mathbf{y}} = F_K(\hat{\mathbf{x}}) = f_0(\hat{\mathbf{x}}) + \sum_{k=1}^{K} f_k(\hat{\mathbf{x}}).$$
Stochastic gradient boosting can also be regarded as a special line-search optimization algorithm, which makes the newly added individual learner fit the steepest descent direction of the partial training loss at each learning step.
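The procedure above can be illustrated with a short sketch. It assumes the squared loss $l(F, y) = \frac{1}{2}(F - y)^2$, for which the pseudo residual is simply $y - F_{k-1}(\mathbf{x})$, and it uses a small random-feature least-squares model as the weak learner. The names `fit_weak`, `sgb_fit`, and `sgb_predict`, the tanh features, and the default settings are hypothetical choices made for the example, not taken from the paper.

```python
import numpy as np

def fit_weak(X, r, n_hidden, rng):
    """Weak learner (assumption): random tanh features with a least-squares fit to r."""
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    beta = np.linalg.pinv(np.tanh(X @ W + b)) @ r
    return lambda Z: np.tanh(Z @ W + b) @ beta

def sgb_fit(X, y, n_iter=30, frac=0.7, n_hidden=5, seed=0):
    """Stochastic gradient boosting for the squared loss 0.5 * (F - y)^2."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    n_sub = int(frac * N)                   # subset size N~ < N
    f0 = fit_weak(X, y, n_hidden, rng)      # base learner trained on the full set
    learners = [f0]
    F = f0(X)                               # current ensemble prediction F_{k-1}(x_i)
    losses = [float(np.mean((F - y) ** 2))]
    for _ in range(n_iter):
        idx = rng.permutation(N)[:n_sub]    # first N~ entries of a random permutation
        r = y[idx] - F[idx]                 # pseudo residuals on the random subset
        f_k = fit_weak(X[idx], r, n_hidden, rng)
        learners.append(f_k)
        F = F + f_k(X)                      # F_k = F_{k-1} + f_k
        losses.append(float(np.mean((F - y) ** 2)))
    return learners, losses

def sgb_predict(learners, Z):
    """Final prediction: f_0(x) plus the sum of all weak learners."""
    return sum(f(Z) for f in learners)

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = np.sin(3.0 * X[:, 0]) + X[:, 1] ** 2    # synthetic target for illustration
learners, losses = sgb_fit(X, y)
```

Because each weak learner sees only a random subset, the full-set training loss need not fall at every single step, but it should trend downward over the iterations.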

Stochastic Gradient Boosting-Based Extreme Learning Machine (SGB-ELM)
SGB-ELM is a novel hybrid learning algorithm that introduces the stochastic gradient boosting method into the ELM ensemble procedure. Because the boosting mechanism focuses on gradually reducing the training residuals at each iteration and ELM is a network with many parameters (particularly for classification tasks), we do not combine ELM and stochastic gradient boosting naively but design an enhanced training scheme that alleviates possible overfitting. The detailed implementation of SGB-ELM is presented in Algorithm 2, and the determination of the optimal output weights for each individual ELM learner is illustrated in Algorithm 1. There are many existing second-order approximation methods, including sequential quadratic programming (SQP) [21] and the majorization-minimization (MM) algorithm [22]. SQP is an effective method for nonlinearly constrained optimization that works by solving quadratic subproblems. MM optimizes a local surrogate objective that is easier to solve than the original cost function. Instead of using second-order approximation directly, SGB-ELM designs an optimization criterion for the output-layer weights of each individual ELM. In view of that, quadratic approximation is merely employed as an optimization tool in SGB-ELM.
In SGB-ELM, the key issue is to determine the optimal output-layer weights of each weak individual ELM, which is expected to further decrease the training loss while keeping a simple network structure. Consequently, we design a learning objective that considers not only the fitting ability on the training instances but also the complexity of our ensemble model:
$$\mathcal{L} = \sum_{i=1}^{N} l(\hat{\mathbf{y}}_i, \mathbf{y}_i) + \lambda \sum_{k=1}^{K} \Omega(f_k),$$
where $l(\cdot, \cdot)$ is a differentiable loss function that measures the difference between the predicted output $\hat{\mathbf{y}}_i$ and the target value $\mathbf{y}_i$. The second term $\Omega$ represents the complexity of the ensemble model consisting of weak individual learners. Moreover, $\lambda$ is a regularization factor that balances the training loss against the architectural risk. The objective falls back to the traditional gradient boosting method when the regularization factor is set to zero.
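In the boosting training mechanism, each individual ELM is greedily added to the current ensemble model in sequence so that it most improves our model according to (8). Specifically, let $\hat{\mathbf{y}}_i^{(k-1)}$ be the predicted value of the $i$th instance at the $(k-1)$th iteration and $f_k(\mathbf{x})$ be the $k$th weak ELM learner that needs to be incorporated into the ensemble model; then the prediction of the $i$th instance at the $k$th iteration, $\hat{\mathbf{y}}_i^{(k)}$, can be written as
$$\hat{\mathbf{y}}_i^{(k)} = \hat{\mathbf{y}}_i^{(k-1)} + f_k(\mathbf{x}_i).$$
In order to obtain the newly added individual ELM, we first introduce $f_k(\mathbf{x})$ to the existing learned ensemble model and then minimize the following objective:
$$\mathcal{L}^{(k)} = \sum_{i=1}^{N} l\left(\hat{\mathbf{y}}_i^{(k-1)} + f_k(\mathbf{x}_i), \mathbf{y}_i\right) + \lambda \sum_{j=1}^{k} \Omega(f_j),$$
where $f_1, f_2, \cdots, f_{k-1}$ are already obtained at the previous iterations. As a consequence, the complexity of the learned ensemble model $\sum_{j=1}^{k-1} \Omega(f_j)$ is a constant, and we only need to take $\Omega(f_k)$ into consideration. Removing the constant item, the objective $\mathcal{L}^{(k)}$ is simplified as
$$\mathcal{L}^{(k)} = \sum_{i=1}^{N} l\left(\hat{\mathbf{y}}_i^{(k-1)} + f_k(\mathbf{x}_i), \mathbf{y}_i\right) + \lambda \Omega(f_k).$$
Stochastic gradient boosting selects a random subset with size $\tilde{N} < N$ of the whole training set to fit the individual learner at each iteration. Namely, let $\{\pi(i)\}_{1}^{N}$ be a random permutation of the integers $\{1, 2, \cdots, N\}$; then we can define a stochastic subset as $\{(\mathbf{x}_{\pi(i)}, \mathbf{y}_{\pi(i)})\}_{i=1}^{\tilde{N}}$. Accordingly, the objective using stochastic gradient boosting is transformed as
$$\mathcal{L}^{(k)} = \sum_{i=1}^{\tilde{N}} l\left(F_{k-1}(\mathbf{x}_{\pi(i)}) + f_k(\mathbf{x}_{\pi(i)}), \mathbf{y}_{\pi(i)}\right) + \lambda \Omega(f_k).$$
We use second-order approximation to optimize the above learning objective, where the loss function is expanded by Taylor expansion as follows:
$$l\left(F_{k-1}(\mathbf{x}_{\pi(i)}) + f_k(\mathbf{x}_{\pi(i)}), \mathbf{y}_{\pi(i)}\right) \approx l\left(F_{k-1}(\mathbf{x}_{\pi(i)}), \mathbf{y}_{\pi(i)}\right) + g_s f_k(\mathbf{x}_{\pi(i)}) + \frac{1}{2} h_s f_k^2(\mathbf{x}_{\pi(i)}),$$
where $s = \pi(i)$ is the new index for $(\mathbf{x}_{\pi(i)}, \mathbf{y}_{\pi(i)})$ in the randomly generated subset,
$$g_s = \frac{\partial l\left(F_{k-1}(\mathbf{x}_{\pi(i)}), \mathbf{y}_{\pi(i)}\right)}{\partial F_{k-1}(\mathbf{x}_{\pi(i)})}$$
is the first-order gradient statistic of the loss function with respect to the current predicted output $F_{k-1}(\mathbf{x}_{\pi(i)})$, and
$$h_s = \frac{\partial^2 l\left(F_{k-1}(\mathbf{x}_{\pi(i)}), \mathbf{y}_{\pi(i)}\right)}{\partial F_{k-1}(\mathbf{x}_{\pi(i)})^2}$$
is the second-order gradient statistic of the loss function with respect to the current predicted output $F_{k-1}(\mathbf{x}_{\pi(i)})$.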
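As a concrete instance of the gradient statistics above, consider the squared loss $l(F, y) = \frac{1}{2}(F - y)^2$ (an assumption made for illustration; the paper's scheme is stated for any differentiable loss). Then the first-order statistic is $F - y$, the second-order statistic is $1$, and the second-order Taylor expansion is exact, which the following sketch checks numerically. The function names are hypothetical.

```python
import numpy as np

def squared_loss(F, y):
    """Illustrative loss l(F, y) = 0.5 * (F - y)^2."""
    return 0.5 * (F - y) ** 2

def gradient_statistics(F_prev, y):
    """First- and second-order derivatives of the squared loss w.r.t. F_prev."""
    g = F_prev - y               # dl/dF evaluated at F_{k-1}
    h = np.ones_like(F_prev)     # d^2 l / dF^2 is constant for the squared loss
    return g, h

rng = np.random.default_rng(0)
y = rng.normal(size=8)           # targets
F_prev = rng.normal(size=8)      # current ensemble predictions F_{k-1}(x)
f_k = rng.normal(size=8)         # outputs of a candidate new learner f_k(x)

g, h = gradient_statistics(F_prev, y)
exact = squared_loss(F_prev + f_k, y)
approx = squared_loss(F_prev, y) + g * f_k + 0.5 * h * f_k ** 2
max_gap = float(np.max(np.abs(exact - approx)))
```

For losses that are not quadratic, `approx` would only agree with `exact` up to third-order terms in `f_k`.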
Owing to this approximation of the training loss, we can provide a general solution scheme regardless of the specific type of loss function. In addition, second-order optimization tends to achieve better convergence than the traditional gradient method [23]. Obviously, $l\left(F_{k-1}(\mathbf{x}_{\pi(i)}), \mathbf{y}_{\pi(i)}\right)$ is a fixed value, and thus the objective can be further expressed as
$$\mathcal{L}^{(k)} \approx \sum_{i=1}^{\tilde{N}} \left[ g_s f_k(\mathbf{x}_{\pi(i)}) + \frac{1}{2} h_s f_k^2(\mathbf{x}_{\pi(i)}) \right] + \lambda \Omega(f_k),$$
with $s = \pi(i)$ and $g_s$, $h_s$ the first- and second-order gradient statistics of the loss, and the objective can be rewritten in a matrix form […]. The $k$th individual learner $f_k(\mathbf{X})$ is a basic ELM model, which randomly selects input-layer weights $\mathbf{W}_k$ and hidden biases $\mathbf{B}_k$. Given the hidden-layer output matrix $\mathbf{H}_k = g(\mathbf{X}\mathbf{W}_k + \mathbf{B}_k)$, $f_k(\mathbf{X})$ can be expressed as
$$f_k(\mathbf{X}) = \mathbf{H}_k \boldsymbol{\beta}_k,$$
where $\boldsymbol{\beta}_k$ is the output-layer weight matrix that needs to be determined. As Bartlett [24] pointed out, networks tend to generalize better with not only a small training error but also a small norm of weights ($\|\boldsymbol{\beta}_k\|$), so we use the L2-norm to evaluate the complexity of a basic ELM model […]. Accordingly, the conclusive objective can be written as […]. From (30), we can find that the objective is only sensitive to $\boldsymbol{\beta}_k$ at the $k$th iteration. For single-variable optimization, the partial derivative is taken as […], where the partial derivative is taken with respect to each element of $\boldsymbol{\beta}_k$ separately. Thus we obtain the derivation formula […], where the two indices range over the rows and columns of $\boldsymbol{\beta}_k$. It turns out that $\boldsymbol{\beta}_k$ is difficult to calculate analytically. Since our regularized objective tends to generate an ensemble model that employs predictive as well as simple hypotheses, (32) derived from the objective can be used as an optimization criterion. Specifically, we take the output-layer weights determined on the pseudo-residuals dataset as an initial heuristic item and obtain the optimal output-layer weights by using the derivation formula to update the heuristic item iteratively. Algorithm 1 illustrates how the optimal output weight matrix is determined, and the detailed implementation of SGB-ELM is presented in Algorithm 2.
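The idea of starting from the pseudo-residual fit and then refining the output weights element by element can be sketched as follows. This is not the paper's Algorithm 1 or its formula (32), which are not reproduced here; it is one plausible reading under explicit assumptions: the complexity term is taken to be $\frac{1}{2}\|\boldsymbol{\beta}_k\|^2$, the loss is the squared loss (so $h = 1$), and each element of $\boldsymbol{\beta}_k$ is updated by setting the partial derivative of the second-order objective to zero while the other elements are held fixed. All function and variable names are hypothetical.

```python
import numpy as np

def objective(beta, H, G, Hs, lam):
    """Assumed objective: sum(g*f + 0.5*h*f^2) + 0.5*lam*||beta||^2 with f = H beta."""
    P = H @ beta
    return float(np.sum(G * P + 0.5 * Hs * P ** 2) + 0.5 * lam * np.sum(beta ** 2))

def coordinate_sweeps(H, G, Hs, lam, beta0, n_sweeps=20):
    """Refine beta row by row; each update zeroes the partial derivative for that row."""
    beta = beta0.copy()
    history = [objective(beta, H, G, Hs, lam)]
    for _ in range(n_sweeps):
        P = H @ beta                                   # current outputs f_k(X)
        for p in range(beta.shape[0]):
            col = H[:, p:p + 1]                        # outputs of hidden node p, shape (n, 1)
            P_rest = P - col * beta[p]                 # outputs without node p
            num = -np.sum(col * (G + Hs * P_rest), axis=0)
            den = np.sum(Hs * col ** 2, axis=0) + lam
            beta[p] = num / den                        # closed-form minimizer for row p
            P = P_rest + col * beta[p]
        history.append(objective(beta, H, G, Hs, lam))
    return beta, history

rng = np.random.default_rng(0)
n, d, L, m = 100, 4, 8, 2
X = rng.uniform(0.0, 1.0, size=(n, d))
W = rng.uniform(-1.0, 1.0, size=(d, L))
b = rng.uniform(-1.0, 1.0, size=L)
H = 1.0 / (1.0 + np.exp(-(X @ W + b)))                 # hidden-layer output matrix
Y = rng.normal(size=(n, m))                            # synthetic targets
F_prev = np.zeros((n, m))                              # current ensemble predictions
G, Hs = F_prev - Y, np.ones((n, m))                    # squared-loss gradient statistics
lam = 0.5

beta0 = np.linalg.pinv(H) @ (-G)                       # heuristic start: fit pseudo residuals
beta, history = coordinate_sweeps(H, G, Hs, lam, beta0)
# Reference minimizer, available in closed form only because h = 1 here.
beta_star = -np.linalg.solve(H.T @ H + lam * np.eye(L), H.T @ G)
```

Under these assumptions the objective is convex in $\boldsymbol{\beta}_k$, so every sweep can only lower it, and `beta_star` gives a lower bound against which the iterates can be compared.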
In Algorithm 2, all the input weights and hidden biases of the individual ELMs are randomly chosen within the range of [−1, 1]. For boosting-based ensemble methods, the initial base learner is expected to be enhanced by adding weak individual learners to the current ensemble model step by step. In view of that, a high-precision initial base learner might reduce the effectiveness of the ensemble. In order to control the fitting ability of the initial base learner $f_0(\mathbf{x})$ and meanwhile reduce the instability brought by the random determination of the input weights and hidden biases, SGB-ELM conducts multiple random initializations of the parameters in ELM 0 and takes their average. For instance, we take the average of 100 random initializations […]. For the weak individual ELMs, each of which plays a smaller role in the whole ensemble model, random initialization of the parameters increases the diversity between the weak individual learners.

Performance Validation
In this section, a series of experiments are conducted to validate the feasibility and effectiveness of our proposed SGB-ELM algorithm, and meanwhile we compare the generalization performance and prediction stability of several typical ensemble learning methods (EN-ELM [5], V-ELM [6], Bagging [14], and Adaboost [15]) on 4 KEEL [25] regression and 5 UCI [26] classification datasets. Among all the above-mentioned ensemble methods, the basic ELM model proposed in [2] is […]

Algorithm 1: The determination of $\boldsymbol{\beta}_{k\text{-Optimal}}$. The optimal output weights $\boldsymbol{\beta}_{k\text{-Optimal}} = \hat{\boldsymbol{\beta}}_k$.

Table 3 (classification datasets)
No.  Datasets            Condition attributes  Decision attributes  Training samples  Testing samples
1    Image segmentation  19                    7                    1155              1155
2    Texture             40                    11                   2750              2750
3    Spambase            57                    2                    2295              2294
4    Banana              2                     2                    2650              2650
5    Ring

[…] predicted value and the target of the $i$th instance, respectively, and the loss function $l(\cdot, \cdot)$ is given by […]
Since V-ELM and EN-ELM are designed for classification applications, we compare the generalization capability of SGB-ELM with the basic ELM, simple ensemble ELM, Bagging ELM, and Adaboost ELM in regression tasks.
Among them, simple ensemble ELM can be considered as a variant of the V-ELM method, which trains a number of individual ELMs independently and takes the simple average of all the predictions at last. Adaboost ELM is implemented by Adaboost.R2 method [27], which applies the primitive Adaboost algorithm designed for classification tasks [15] to the regression field. Furthermore, we adopt resampling the original training dataset rather than assigning a weight to every instance to train each individual learner in Adaboost.R2 ELM.

The performances of the traditional ELM, simple ensemble ELM, Bagging ELM, Adaboost ELM, and our proposed SGB-ELM are compared on 4 representative regression datasets selected from the KEEL [25] repository. All the inputs of each dataset are normalized into the range of [0, 1]. The characteristics of these datasets are summarized in Table 1, where each original dataset is divided into a training set (70%) and a testing set (30%). In our regression experiments, for each dataset, the number of hidden nodes is selected from […] Table 2. Figure 1 shows the training and testing RMSE of the different learning methods during 50 trials on the Friedman dataset. The detailed comparison results between SGB-ELM and the other learning algorithms on the 4 regression benchmark datasets are shown in Table 2. Furthermore, we compare the training and testing performances of SGB-ELM with those of Adaboost.R2 with regard to the number of iterations on the Mortgage dataset, which is presented in Figure 3(a).
As for classification problems, like other typical feedforward neural networks (for instance, BP neural networks [28]), SGB-ELM evaluates the predicted output by calculating the sum of squared errors. Specifically, let $\hat{\mathbf{y}}_i$ be the predicted output vector and $\mathbf{y}_i$ be the target encoded by the One-Hot scheme [29] of the $i$th sample, respectively, and we define the loss function $l(\cdot, \cdot)$ in SGB-ELM for classification as follows:
$$l(\hat{\mathbf{y}}_i, \mathbf{y}_i) = \sum_{j=1}^{m} (\hat{y}_{ij} - y_{ij})^2.$$
SGB-ELM thus aims at reducing the training RMSE step by step for classification problems. Accordingly, we compare SGB-ELM with several representative ensemble learning methods including V-ELM, EN-ELM, Bagging ELM, and Adaboost ELM. Among them, V-ELM and EN-ELM have been briefly summarized in Section 1, and Adaboost ELM is implemented by the Adaboost.SAMME method [30], which extends the original Adaboost designed for binary classification to multiclass classification problems.
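The One-Hot encoding and sum-of-squared-errors evaluation described above can be illustrated with a small example. It assumes the per-sample loss is the plain sum of squared differences with no constant factor and that the predicted class is the index of the largest output; the helper names and the toy numbers are hypothetical.

```python
import numpy as np

def one_hot(labels, n_classes):
    """Encode integer class labels as One-Hot target vectors."""
    Y = np.zeros((labels.size, n_classes))
    Y[np.arange(labels.size), labels] = 1.0
    return Y

def squared_error_loss(Y_hat, Y):
    """Per-sample sum of squared errors between predicted outputs and targets."""
    return np.sum((Y_hat - Y) ** 2, axis=1)

labels = np.array([0, 2, 1, 2])
Y = one_hot(labels, n_classes=3)
Y_hat = np.array([[0.9, 0.1, 0.0],      # toy network outputs, one row per sample
                  [0.2, 0.1, 0.7],
                  [0.3, 0.4, 0.3],
                  [0.6, 0.3, 0.1]])

per_sample_loss = squared_error_loss(Y_hat, Y)          # [0.02, 0.14, 0.54, 1.26]
predicted = Y_hat.argmax(axis=1)                        # class with the largest output
accuracy = float(np.mean(predicted == labels))          # 0.75: the last sample is wrong
```

A sample can be classified correctly while still carrying a sizeable squared error (the third row), which is why the loss keeps driving the boosting updates even when accuracy has stopped changing.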
Similarly, we select 5 popular classification datasets from the UCI Machine Learning Repository [26] to verify the performance of our proposed SGB-ELM algorithm. For each dataset, all the decision attributes are encoded by the One-Hot scheme [29]. The characteristics of these datasets are described in Table 3, where each original dataset is equally divided into a training set (50%) and a testing set (50%). […] Table 4. Figure 2 shows the training and testing accuracy of the different algorithms during 50 trials on the Segmentation dataset. The detailed performances of SGB-ELM in comparison with the other learning algorithms on the 5 classification benchmark datasets are summarized in Table 4. Lastly, the training and testing accuracy of SGB-ELM and Adaboost.SAMME with regard to the number of iterations on the Segmentation dataset are presented in Figure 3(b).
Tables 2 and 4 present the comparison results, including training time, training RMSE/accuracy, and testing RMSE/accuracy, for the regression and classification tasks, respectively. SGB-ELM obtains better generalization capability in most cases without significantly increasing the training time. At the same time, SGB-ELM tends to have smaller training Dev and testing Dev than the comparative learning algorithms, which validates the robustness and stability of our proposed SGB-ELM algorithm. In particular, since SGB-ELM adopts a training mechanism similar to that of Adaboost, which integrates multiple weak individual learners sequentially, the number of hidden nodes is set to a smaller value in both SGB-ELM and the Adaboost method. It is worth noting that SGB-ELM achieves better performance than the existing methods with fewer hidden nodes and outperforms Adaboost with the same number of hidden nodes.
From Figures 1 and 2, we can find that SGB-ELM is more stable than the traditional ELM, simple ensemble, Bagging, and Adaboost.R2 in regression problems and also more robust than V-ELM, EN-ELM, Bagging, and Adaboost.SAMME in classification problems. SGB-ELM not only focuses on reducing the prediction bias, as other boosting-like methods do, but also generates a robust ensemble model with a low variance. As observed in Figure 2, although Adaboost.SAMME achieves higher training accuracy than SGB-ELM in most of the 50 trials, SGB-ELM obtains better generalization capability (testing accuracy). This can be explained by two reasons: (1) we introduce a regularization term (L2-norm) into the learning objective to control the complexity of our ensemble learning model; (2) a randomly selected subset rather than the whole training dataset is used to minimize the training loss at each iteration in our proposed SGB-ELM algorithm. Figure 3 shows the training RMSE/accuracy and testing RMSE/accuracy of Adaboost (Adaboost.R2 for regression and Adaboost.SAMME for classification) and SGB-ELM with regard to the number of iterations. The fixed reference line denotes the training and testing performance of a traditional ELM equipped with many more hidden nodes. As shown in Figure 3, SGB-ELM clearly improves the generalization capability of the initial base ELM in both regression and classification tasks. From Figure 3 […] and the testing accuracy curve shows an increasing trend in Figure 3(b). Because we conduct multiple random initializations of the parameters in the initial base learner $f_0$ and take their average, the fitting ability of $f_0$ is artificially weakened to some extent. As a result, the initial training and testing performance of SGB-ELM is much worse than that of the initial Adaboost model. Both SGB-ELM and Adaboost outperform the traditional ELM equipped with more hidden nodes after a small number of learning steps.
Furthermore, we can find that SGB-ELM produces better performance than Adaboost after only 5 iterations in regression tasks and 10 iterations in classification tasks. This supports the fast convergence of the second-order optimization method incorporated into the procedure of SGB-ELM.
From the experimental results of both regression and classification problems, we can conclude that our proposed SGB-ELM algorithm not only achieves better generalization capability (low prediction bias) than the typical existing variants of ELM but also yields a sufficiently robust ELM ensemble learning model (low prediction variance).

Impact of Learning Parameters on Training SGB-ELM.
To achieve good generalization performance, three learning parameters of SGB-ELM, namely the number of hidden nodes $L$, the regularization factor $\lambda$, and the subset size $\tilde{N}$, need to be chosen appropriately. In this section, we evaluate the impact of these learning parameters on training the SGB-ELM algorithm and provide some empirical guidance for choosing them.
For the basic ELM model, the number of hidden nodes $L$ decides the model's capacity. In other words, an ELM with more hidden nodes is more complex and can deal with more training instances. However, it tends to overfit when $L$ is set too large. The regularization factor $\lambda$ balances the training loss against the complexity of the model, which means that $\lambda$ can control the capacity or complexity of our model. The subset size $\tilde{N}$ represents the number of training instances used at each iteration, and it introduces some randomization into the training procedure of SGB-ELM. Firstly, we use a grid search to observe the training and testing performance of SGB-ELM with different $L$ and $\lambda$. Specifically, we set $L = \{10, 15, \cdots, 160\}$, $\lambda = \{0.001, 0.002, \cdots, 0.03\}$, and a fixed $\tilde{N} = 0.8N$. The training and testing performance of SGB-ELM with regard to the combination of $(L, \lambda)$ on the Spambase dataset is shown in Figure 4. Secondly, as we empirically find that the optimal $\tilde{N}$ depends heavily on the size of the training dataset, we conduct two experiments (on a small dataset and a large dataset) to measure the impact of $\tilde{N}$ on training SGB-ELM. We choose the optimal value of $(L, \lambda)$ according to the grid-search results and set $\tilde{N}/N = \{0.1, 0.2, \cdots, 1\}$. Figure 5 shows the training and testing performance of SGB-ELM with different sampling fractions ($\tilde{N}/N$) on the Wizmir and Spambase datasets.
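The grid search over the number of hidden nodes and the regularization factor can be organized as in the sketch below. The two grids follow the ranges given above; the scoring function `toy_score` is a hypothetical stand-in for "train SGB-ELM with these settings and return the testing accuracy", chosen only so that the example runs on its own.

```python
import itertools

# Grids from the text: hidden nodes 10, 15, ..., 160 and factors 0.001, 0.002, ..., 0.03.
hidden_grid = list(range(10, 161, 5))
lambda_grid = [round(0.001 * i, 3) for i in range(1, 31)]

def grid_search(score_fn):
    """Return (best_score, best_n_hidden, best_lambda) over the full grid."""
    best = None
    for n_hidden, lam in itertools.product(hidden_grid, lambda_grid):
        score = score_fn(n_hidden, lam)
        if best is None or score > best[0]:
            best = (score, n_hidden, lam)
    return best

def toy_score(n_hidden, lam):
    """Hypothetical placeholder score with a single peak at (80, 0.01)."""
    return -((n_hidden - 80) / 100.0) ** 2 - 1000.0 * (lam - 0.01) ** 2

best_score, best_n_hidden, best_lambda = grid_search(toy_score)
n_combinations = len(hidden_grid) * len(lambda_grid)
```

In practice `score_fn` would wrap model training and evaluation on held-out data, and the same loop could then be repeated over the sampling fraction with the selected pair held fixed.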
As shown in Figure 4, changing the value of $(L, \lambda)$ has a significant effect on the training and testing accuracy of the SGB-ELM algorithm. SGB-ELM with excess hidden nodes is more likely to overfit when the regularization factor $\lambda$ is set to a small value, and SGB-ELM can effectively reduce overfitting when $\lambda$ is assigned a proper value. In addition, from Figure 4 we can find that SGB-ELM achieves better performance with enough hidden nodes and a proper $\lambda$. The explanation is that although SGB-ELM with a small number of hidden nodes avoids overfitting, it is also hindered from fitting the current training residuals adequately.
From Figure 5, it is clear that randomization improves the performance of SGB-ELM substantially. As each weak individual ELM is learned on a randomly selected subset of the whole training dataset, the diversity between the individuals increases. On the other hand, randomization introduces a noisy estimate of the total training loss. As a result, it slows down the convergence and even makes the learning curve fluctuate (higher variance) if $\tilde{N}$ is too small. The best value of the sampling fraction is approximately 50% on the Wizmir dataset and 70% on the Spambase dataset, where there is a typical improvement in testing performance compared with no sampling at all. Since the optimal values of $\tilde{N}/N$ differ between the Wizmir and Spambase datasets, the sampling fraction ($\tilde{N}/N$) should be determined according to the specific learning task and assigned a bigger value on training datasets containing more instances.

Conclusions
In this paper, we proposed a novel ensemble model named Stochastic Gradient Boosting-based Extreme Learning Machine (SGB-ELM). Instead of combining ELM and stochastic gradient boosting primitively, we construct an ELM flow or ELM sequence where the output-layer weights of each weak ELM are determined by optimizing the regularized objective additively. Firstly, by minimizing the objective using second-order approximation, the derivation formula aimed at solving the output-layer weights of each individual ELM is determined. Then we take the output-layer weights learned by the current pseudo residuals as a heuristic item and thus obtain the optimal output-layer weights by updating the heuristic item iteratively. The performance of SGB-ELM was evaluated on 4 regression and 5 classification datasets. In comparison with several typical ELM ensemble methods, SGB-ELM obtained better performance and robustness, which demonstrated the feasibility and effectiveness of SGB-ELM algorithm.