Regularized ELM bagging model for Tropical Cyclone Tracks prediction in South China Sea

This paper aims to improve the prediction accuracy of Tropical Cyclone Tracks (TCTs) over the South China Sea (SCS) at a 24 h lead time. The proposed model is a bagging ensemble of regularized extreme learning machines (ELMs). A method that turns the lasso and elastic net problems in ELM into quadratic programming (QP) problems is proposed in this paper. The forecast error on the TCTs data set is the distance between the real position and the forecast position. Compared with the stepwise regression method widely used for TCTs, our model achieves an 8.26 km accuracy improvement on the data set with 1680 training and 70 testing records; on a smaller data set with 720 training and 30 testing records, the improvement is 16.49 km. The results show that the regularized ELM bagging model has, in general, better generalization capacity on the TCTs data set.


Introduction
The Tropical Cyclone (TC) ranks first among the top ten natural disasters of the world and often causes heavy casualties and property losses. Delivering accurate Tropical Cyclone Track (TCT) forecasts is therefore extremely important for reducing disaster losses.
In the past, forecasting techniques for TCTs fell into three major categories: (1) statistical; (2) dynamical and numerical; (3) statistical-dynamical (Roy & Kovordányi, 2012). In recent years, techniques based on Artificial Neural Networks (ANNs) have been developed that report better forecasting accuracy for TCTs (Ali, Kishtawal, & Jain, 2007; Chaudhuri, Dutta, Goswami, & Middey, 2015; Wang et al., 2011; Zhu, Jin, Cannon, & Hsieh, 2016). However, short-term TCT prediction should be done as efficiently as possible, and the learning speed of ANNs is in general far slower than desired. The Extreme Learning Machine (ELM) is a learning algorithm with fast learning speed and good generalization ability proposed by Huang et al., which has been successfully applied to a number of real-world applications in recent years (Huang, Zhu, & Siew, 2004; Huang, Wang, & Lan, 2011b). Regularization can further improve the generalization of ELM; the typical penalties are the $l_2$ penalty (ridge), the $l_1$ penalty (lasso), and a mixture of the two (elastic net). Huang et al. derived the closed-form expression for the ridge problem in ELM (Huang, Zhou, Ding, & Zhang, 2012). The lasso and elastic net problems in ELM are usually solved with the coordinate descent method owing to its high speed (Friedman, Hastie, & Tibshirani, 2010).
Inspired by Elad (2010), who explained how to solve the lasso problem for linear systems, we propose a new method from a new perspective in this paper, which turns the lasso and elastic net problems in ELM into quadratic programming (QP) problems by expanding the scale of the independent variable in the $l_1$ norm. Quadratic programming problems can be solved with many methods from convex optimization (Boyd, Vandenberghe, & Faybusovich, 2006). Encouraging results have been achieved on TCTs data sets with ensemble methods (Goerss, 2000; Huang, Jin, & Shi, 2011a; Wang, 2012; Lee & Wong, 2002). The performance of a single neural network can be expected to improve using an ensemble of neural networks with a plurality consensus scheme (Hansen, 1990). Bagging is one of the popular ensemble machine-learning methods, and it has two key advantages: (1) the learning time is greatly reduced because the networks can be trained in parallel; (2) the out-of-bag examples produced by bootstrap sampling can be used to estimate each expert (Breiman, 1996; Zhou, 2012). Here we apply the regularized ELM bagging to the short-term forecasting of TCTs recorded over a 44-year period (1960-2003). The forecasting accuracy of our proposed model is analysed in comparison with that obtained by stepwise regression. This work extends an earlier study that used a smaller data set, with more description and analysis provided (Zhang & Jin, 2018).

Basic ELM

The basic ELM was proposed by Huang et al. (Huang et al., 2004; Huang, Zhu, & Siew, 2006). The calculation flow diagram is shown in Fig. 1. The structure of the ELM network is shown in Fig. 2.
For $N$ arbitrary distinct samples $(\mathbf{x}_i, \mathbf{t}_i)$, where $\mathbf{x}_i \in \mathbb{R}^n$ and $\mathbf{t}_i \in \mathbb{R}^m$, $i = 1, \dots, N$, the hidden layer output matrix $\mathbf{H}$ is given by

$$\mathbf{H} = \begin{bmatrix} G(\mathbf{w}_1 \cdot \mathbf{x}_1 + b_1) & \cdots & G(\mathbf{w}_L \cdot \mathbf{x}_1 + b_L) \\ \vdots & \ddots & \vdots \\ G(\mathbf{w}_1 \cdot \mathbf{x}_N + b_1) & \cdots & G(\mathbf{w}_L \cdot \mathbf{x}_N + b_L) \end{bmatrix}_{N \times L}$$

where $G(\cdot)$ is the activation function, $\mathbf{W}_{n \times L} = [\mathbf{w}_1, \dots, \mathbf{w}_L]$ is the weight matrix connecting the input layer to the hidden layer, $L$ is the number of hidden nodes, and $b_i$ is the bias of the $i$th hidden node. $\mathbf{W}$ and $b_i$ can be initialized randomly without tuning.
Only the weights $\mathbf{B}$ in the output layer need to be adjusted, which can be done with the Moore-Penrose generalized inverse (Banerjee, 1971):

$$\mathbf{B} = \mathbf{H}^{\dagger} \mathbf{T}$$

where $\mathbf{H}^{\dagger}$ is the Moore-Penrose inverse of the matrix $\mathbf{H}$ and $\mathbf{T} = [\mathbf{t}_1, \dots, \mathbf{t}_N]^T$ is the target matrix. Thus, the essence of the basic ELM can be summarized in Algorithm 1.
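The basic ELM described above can be sketched in a few lines. This is a minimal illustration (the function names, activation choice, and random initialization scheme are ours, not the paper's): hidden parameters are drawn randomly and only the output weights are fitted via the pseudo-inverse of the hidden-layer output matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_fit(X, T, L):
    """X: (N, n) inputs, T: (N, m) targets, L: hidden nodes. Returns (W, b, B)."""
    n = X.shape[1]
    W = rng.standard_normal((n, L))   # input-to-hidden weights, never trained
    b = rng.standard_normal(L)        # hidden biases, never trained
    H = np.tanh(X @ W + b)            # hidden-layer output matrix (G = tanh here)
    B = np.linalg.pinv(H) @ T         # output weights via Moore-Penrose inverse
    return W, b, B

def elm_predict(X, W, b, B):
    return np.tanh(X @ W + b) @ B
```

Because only a single pseudo-inverse is computed, training is essentially non-iterative, which is the source of ELM's speed advantage over back-propagation.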

Regularized ELM
The ELM algorithm is based on the empirical risk minimization principle, which is known to lack robustness. To overcome this drawback, ridge ELM was proposed by Huang et al. (2012), and lasso ELM and elastic net ELM were proposed by Escandell-Montero et al. (2011).

Ridge ELM
Ridge regression adds the $l_2$ norm of the output layer's weights to the loss function, which drives all the output-layer weights towards small values. This coincides with Bartlett's theory: the smaller the weights in the output layer are, the better the generalization performance feedforward neural networks tend to have (Bartlett, 1998). The mathematical model is

$$\min_{\mathbf{B}} \; \frac{1}{2}\|\mathbf{H}\mathbf{B} - \mathbf{T}\|_2^2 + \frac{\lambda}{2}\|\mathbf{B}\|_2^2$$

Based on the Karush-Kuhn-Tucker theorem, different solutions can be obtained according to the size of the training set (Huang et al., 2012).
In the case that the size of the training set is relatively small ($N < L$), we have

$$\mathbf{B} = \mathbf{H}^T \left(\mathbf{H}\mathbf{H}^T + \lambda \mathbf{I}\right)^{-1} \mathbf{T} \quad (8)$$

In the other case, the number of training samples is much larger than the dimension of the feature space ($N \gg L$), and we have

$$\mathbf{B} = \left(\mathbf{H}^T\mathbf{H} + \lambda \mathbf{I}\right)^{-1} \mathbf{H}^T \mathbf{T} \quad (9)$$

In theory, the above solutions can be used for training sets of any size. The generalization performance of ELM is not sensitive to the dimension of the feature space $L$ (Huang et al., 2012). Thus, one may prefer Eq. 9 in order to reduce computational costs. The detailed algorithm of ridge ELM is shown in Algorithm 2.
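The two closed forms above are algebraically equivalent but invert matrices of different sizes ($N \times N$ versus $L \times L$), so the cheaper branch depends on the data. A minimal sketch of this branching (our illustration; the function name and branching rule are assumptions following the text, not the paper's code):

```python
import numpy as np

def ridge_elm_output_weights(H, T, lam):
    """Ridge-ELM output weights, choosing the smaller linear system.

    H: (N, L) hidden-layer output matrix, T: (N, m) targets, lam: ridge penalty.
    """
    N, L = H.shape
    if N <= L:
        # Few samples: solve the (N x N) system, B = H^T (H H^T + lam I)^{-1} T
        return H.T @ np.linalg.solve(H @ H.T + lam * np.eye(N), T)
    else:
        # Many samples: solve the (L x L) system, B = (H^T H + lam I)^{-1} H^T T
        return np.linalg.solve(H.T @ H + lam * np.eye(L), H.T @ T)
```

Since both branches yield the same weights, the choice affects only the computational cost, which is why the text suggests keeping $L$ modest and using Eq. 9.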

Algorithm 2. Ridge ELM
Require: training set $\{(\mathbf{x}_i, \mathbf{t}_i)\}_{i=1}^{N}$; activation function $G(\cdot)$ in the hidden layer; the number of hidden nodes: $L$; hyper-parameter: $\lambda$. Ensure: weights of the input layer: $\mathbf{W}$; biases of the hidden layer: $\mathbf{b}$; weights of the output layer: $\mathbf{B}$. 1: Randomly generate the parameters $(\mathbf{w}_i, b_i)$, $i = 1, \dots, L$; 2: Calculate the hidden layer output matrix $\mathbf{H}$; 3: Calculate the output weight matrix $\mathbf{B}$ based on Eq. 8 or Eq. 9; 4: return $\mathbf{W}$, $\mathbf{b}$, $\mathbf{B}$.

Lasso ELM and elastic net ELM
The lasso method tends to prefer sparse solutions: more of the output-layer weights become exactly zero, so the resulting network structure is more compact. The loss function of the lasso problem in ELM can be written as

$$\min_{\mathbf{B}} \; \frac{1}{2}\|\mathbf{H}\mathbf{B} - \mathbf{T}\|_2^2 + \lambda \|\mathbf{B}\|_1 \quad (10)$$

Writing $\mathbf{B}$ and $\mathbf{T}$ column by column,

$$\mathbf{B} = [\mathbf{b}_1, \dots, \mathbf{b}_m], \quad \mathbf{T} = [\mathbf{t}_1, \dots, \mathbf{t}_m] \quad (11)$$

and bringing Eq. 11 into Eq. 10, the lasso function can be rewritten as

$$\min_{\mathbf{b}_1, \dots, \mathbf{b}_m} \; \sum_{j=1}^{m} \left( \frac{1}{2}\|\mathbf{H}\mathbf{b}_j - \mathbf{t}_j\|_2^2 + \lambda \|\mathbf{b}_j\|_1 \right) \quad (12)$$

Eq. 12 can be divided into $m$ sub-problems,

$$\min_{\mathbf{b}_j} \; \frac{1}{2}\|\mathbf{H}\mathbf{b}_j - \mathbf{t}_j\|_2^2 + \lambda \|\mathbf{b}_j\|_1, \quad j = 1, \dots, m \quad (13)$$

and $\mathbf{B}$ can be obtained by solving these $m$ sub-problems separately. Each of them has the same form and can be solved in the same way. Taking the first one as an example,

$$\min_{\mathbf{b}_1} \; \frac{1}{2}\|\mathbf{H}\mathbf{b}_1 - \mathbf{t}_1\|_2^2 + \lambda \|\mathbf{b}_1\|_1 \quad (14)$$

we illustrate how to solve the lasso problem in ELM with a new trick.
We write $\mathbf{b}_1 = \mathbf{u} - \mathbf{v}$ in Eq. 14, with $\mathbf{u}, \mathbf{v} \in \mathbb{R}^L$. Under non-negativity constraints on $\mathbf{u}$ and $\mathbf{v}$, we have $\mathbf{u}^T \mathbf{v} = 0$ at the optimum: if the $k$th entries of both $\mathbf{u}$ and $\mathbf{v}$ were positive, say $u_k > v_k$, replacing them by $u_k - v_k$ and $0$ would decrease the loss, so the final solution cannot have both entries non-zero. Thus

$$\|\mathbf{b}_1\|_1 = \mathbf{1}^T(\mathbf{u} + \mathbf{v}) \quad (15)$$

In this manner, Eq. 14 can be represented by

$$\min_{\mathbf{u}, \mathbf{v} \ge 0} \; \frac{1}{2}\|\mathbf{H}(\mathbf{u} - \mathbf{v}) - \mathbf{t}_1\|_2^2 + \lambda \mathbf{1}^T(\mathbf{u} + \mathbf{v}) \quad (16)$$
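The u - v trick above can be demonstrated directly: splitting b = u - v with u, v >= 0 replaces the non-smooth l1 term with the linear term lam * sum(u + v), yielding a smooth QP with simple non-negativity bounds. A minimal sketch follows; the solver choice (scipy's L-BFGS-B with box constraints, standing in for MATLAB's quadprog) and the function name are our assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize

def lasso_qp(H, t, lam):
    """Solve min 0.5*||H b - t||^2 + lam*||b||_1 via the b = u - v split."""
    N, L = H.shape

    def obj(z):
        u, v = z[:L], z[L:]
        r = H @ (u - v) - t
        # Smooth QP objective: the l1 norm became the linear term lam * sum(z)
        return 0.5 * r @ r + lam * np.sum(z)

    def grad(z):
        u, v = z[:L], z[L:]
        g = H.T @ (H @ (u - v) - t)
        return np.concatenate([g, -g]) + lam

    z0 = np.zeros(2 * L)
    res = minimize(obj, z0, jac=grad, method="L-BFGS-B",
                   bounds=[(0, None)] * (2 * L))  # u, v >= 0
    u, v = res.x[:L], res.x[L:]
    return u - v
```

For an orthonormal H the known closed-form lasso solution is soft-thresholding, which gives a convenient sanity check on the reformulation.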

Elastic net ELM
The elastic net is a mixture of ridge and lasso. The loss function can be described as

$$\min_{\mathbf{B}} \; \frac{1}{2}\|\mathbf{H}\mathbf{B} - \mathbf{T}\|_2^2 + \lambda \left( \eta \|\mathbf{B}\|_1 + \frac{1 - \eta}{2} \|\mathbf{B}\|_2^2 \right) \quad (17)$$

Similar to the lasso problem, this equation can be decomposed into $m$ sub-equations of the same form; one of them can be written as

$$\min_{\mathbf{b}_1} \; \frac{1}{2}\|\mathbf{H}\mathbf{b}_1 - \mathbf{t}_1\|_2^2 + \lambda \left( \eta \|\mathbf{b}_1\|_1 + \frac{1 - \eta}{2} \|\mathbf{b}_1\|_2^2 \right), \quad \lambda \in (0, 1), \; \eta \in (0, 1) \quad (18)$$

Using the same trick as for Eq. 16, Eq. 18 can be expressed as

$$\min_{\mathbf{u}, \mathbf{v} \ge 0} \; \frac{1}{2}\|\mathbf{H}(\mathbf{u} - \mathbf{v}) - \mathbf{t}_1\|_2^2 + \lambda \left( \eta \mathbf{1}^T(\mathbf{u} + \mathbf{v}) + \frac{1 - \eta}{2} \|\mathbf{u} - \mathbf{v}\|_2^2 \right) \quad (19)$$

The loss functions of Eq. 16 and Eq. 19 are convex, and obviously both are QP problems (Boyd et al., 2006). A QP problem can be described as

$$\min_{\mathbf{z}} \; \frac{1}{2}\mathbf{z}^T \mathbf{Q} \mathbf{z} + \mathbf{c}^T \mathbf{z} \quad \text{s.t.} \quad \mathbf{A}\mathbf{z} \le \mathbf{d} \quad (20)$$

and the quadprog function in MATLAB's Optimization Toolbox can be used to solve Eq. 20. The detailed algorithm of lasso and elastic net ELM is shown in Algorithm 3.
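The elastic-net counterpart of the u - v reformulation differs from the lasso case only in the extra (smooth) l2 term, so the same bounded-QP approach applies. This sketch mirrors the lasso version above; the solver (scipy's L-BFGS-B, in place of MATLAB's quadprog) and the function name are our assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def elastic_net_qp(H, t, lam, eta):
    """min 0.5*||H b - t||^2 + lam*(eta*||b||_1 + 0.5*(1-eta)*||b||^2)."""
    L = H.shape[1]

    def obj(z):
        u, v = z[:L], z[L:]
        b = u - v
        r = H @ b - t
        return 0.5 * r @ r + lam * (eta * np.sum(z) + 0.5 * (1 - eta) * b @ b)

    def grad(z):
        u, v = z[:L], z[L:]
        b = u - v
        g = H.T @ (H @ b - t) + lam * (1 - eta) * b
        return np.concatenate([g, -g]) + lam * eta

    z0 = np.zeros(2 * L)
    res = minimize(obj, z0, jac=grad, method="L-BFGS-B",
                   bounds=[(0, None)] * (2 * L))  # u, v >= 0
    u, v = res.x[:L], res.x[L:]
    return u - v
```

For an orthonormal H the elastic-net solution has a closed form (soft-thresholding followed by shrinkage), which again gives a simple check.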

Bagging
The two key components of bagging are bootstrap sampling and aggregation. Bagging adopts bootstrap sampling to generate different experts: the original training set is resampled with replacement into a training set and a validation set that differ for each expert, and the model for each expert is then trained with the base learning algorithm. Bagging adopts the most popular strategies for aggregating the experts' outputs: voting for classification and averaging for regression (Breiman, 1996; Zhou, 2012). The bagging algorithm is summarized in Algorithm 4.
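The bootstrap-then-average procedure can be sketched generically. In this minimal illustration (function names are ours), `fit` stands for any base learning algorithm such as a regularized ELM, and `predict` for its inference routine; aggregation is a plain average, the regression rule named above.

```python
import numpy as np

rng = np.random.default_rng(0)

def bagging_fit(X, T, fit, n_experts):
    """Train n_experts models, each on a bootstrap resample of (X, T)."""
    N = X.shape[0]
    experts = []
    for _ in range(n_experts):
        idx = rng.integers(0, N, size=N)   # sample N indices with replacement
        experts.append(fit(X[idx], T[idx]))
    return experts

def bagging_predict(experts, X, predict):
    # Aggregation for regression: average the experts' outputs.
    return np.mean([predict(m, X) for m in experts], axis=0)
```

Because the experts are independent given their resamples, the loop parallelizes trivially, which is the first advantage of bagging cited in the introduction.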

Algorithm 4. Bagging
Require: training set D; base learning algorithm U; the number of experts: E. Ensure: aggregation of E experts. Process: 1: for i = 1:E do (i) obtain training set DATA_TRA and validation set DATA_VAL by bootstrap sampling in D; (ii) train expert i with U on DATA_TRA and DATA_VAL. end for 2: return the models of the E experts.

Data set
The TCTs data set used in the experiments is published by the China Meteorological Administration and covers TCs that formed in or moved into the SCS in July, August and September during 1960-2003 and lasted for at least 48 h. The TCT data are sampled every 12 h after the TC is generated in the SCS.
A TCT and its changes are associated with the TC's intensity, the accumulation and replenishment of energy, and various nonlinear changes in its environmental flow field, which are referred to as variables in this paper. The variables can be divided into two categories: (1) the climatology and persistence factors representing changes of the TC itself, such as changes in the latitude, longitude, and intensity of a TC at 12 and 24 h before the prediction time, and (2) the physical variables calculated from the NCEP/NCAR global reanalysis data, representing the ambient flow field around the TC center (Jin, Huang, & Shi, 2010). Each observation of a TCT is a position on the Earth's surface, given as latitude and longitude. The predictors (v1 to v16) and observations (Lat.t and Lon.t) of the TCTs data set are listed in Table 1. Our objective is to predict the latitude and longitude 24 h ahead based on the predictors.
The TCTs data set contains 1750 samples in total. The first 1680 samples are used as the training set and the last 70 samples as the testing set.

Experiment setup
From the above description of the data set, the problem in this paper is clearly a regression problem; thus, the output of our model is the average of all the experts' outputs. To obtain better generalization capability, the experts should differ from one another while each produces good forecasting accuracy. The experts differ in two respects: (i) the training set DATA_TRA and the validation set DATA_VAL, and (ii) the hyper-parameters of the base learning algorithm.
The hyper-parameters, which have a great influence on the performance of the model, need to be set before model training. Table 2 shows the hyper-parameters shared by the three regularized ELMs, and Table 3 shows the hyper-parameters unique to each of them. In this paper, the hyper-parameters of the model are determined by grid search combined with Sobol sequences (Joe & Kuo, 2003; Sobol, 1967), because the points sampled from a Sobol sequence are more uniformly distributed in a multidimensional space. To choose reasonable values for the hyper-parameters, we first set the initial range of each hyper-parameter; then C candidate points generated by the Sobol sequence in the initial hyper-parameter space are used as hyper-parameter inputs to the model; next, the k best-performing models on DATA_VAL are selected among the trained models, and the minimum and maximum values of each hyper-parameter over these k models are used as its final range. These operations can be viewed as hyper-parameter selection aimed at finding wider reasonable ranges of the hyper-parameters. From the perspective of probability, as long as the number of model candidates C is large enough, the final model performance tends to improve. The process of hyper-parameter selection is summarized as Algorithm 5.

Algorithm 5. Hyper-parameters selection
Require: original training set D; base learning algorithm U; the initial ranges of the hyper-parameters RH; the number of model candidates in hyper-parameter selection: C; the number of best-performing candidates: k. Ensure: the final ranges of the hyper-parameters, DATA_TRA, DATA_VAL. 1: obtain DATA_TRA and DATA_VAL by bootstrap sampling in D. 2: generate C points in RH by Sobol sampling. 3: obtain C models by using each point as the hyper-parameters in U. 4: train the C models on DATA_TRA. 5: choose the k best-performing models on DATA_VAL. 6: take the minimum and maximum values of each hyper-parameter over the k models as its final range. 7: return the final ranges of the hyper-parameters, DATA_TRA, DATA_VAL.
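The Sobol-based candidate generation in the selection procedure above can be sketched as follows. This is our illustration: `scipy.stats.qmc.Sobol` stands in for whatever Sobol generator the authors used, and the function name is hypothetical.

```python
import numpy as np
from scipy.stats import qmc

def sobol_candidates(ranges, n_candidates, seed=0):
    """Low-discrepancy hyper-parameter candidates.

    ranges: list of (lo, hi) pairs, one per hyper-parameter.
    Returns an (n_candidates, len(ranges)) array of candidate points.
    """
    sampler = qmc.Sobol(d=len(ranges), seed=seed)
    unit = sampler.random(n_candidates)        # points in the unit hypercube
    lo = np.array([r[0] for r in ranges])
    hi = np.array([r[1] for r in ranges])
    return qmc.scale(unit, lo, hi)             # stretch to the given ranges
```

Compared with independent uniform draws, the Sobol points cover the hyper-parameter box more evenly, which is exactly the property the text cites for preferring them in a multidimensional search (powers of two for `n_candidates` preserve the sequence's balance properties).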
Finally, taking an expert of ridge ELM as an example, we explain the choice of the specific ranges of each hyper-parameter in the experiment. The hyper-parameter $\lambda$ in ridge ELM is the trade-off between $\frac{1}{2}\sum_{i=1}^{m} \boldsymbol{\xi}_i^T \boldsymbol{\xi}_i$ and $\frac{1}{2}\|\mathbf{B}\|_2^2$, and $L$ is the number of hidden nodes. The initial ranges of $\lambda$ and $L$ are $(0, 1000)$ and $(1, 1000)$. Obviously, $\lambda \in (0, 1)$ and $\lambda \in (1, 1000)$ affect the experimental results in two different ways, and performing hyper-parameter selection directly on $\lambda \in (0, 1000)$ would be unreasonable and a waste of computing resources, because the lengths of $(0, 1)$ and $(1, 1000)$ differ greatly. Therefore, 200 models are trained separately in each case to determine whether $\lambda \in (0, 1)$ or $\lambda \in (1, 1000)$ should be used; based on the experimental results, the initial ranges of $\lambda$ and $L$ are set to $(0, 1)$ and $(1, 1000)$, respectively. C = 200 and k = 5 are set empirically in Algorithm 5 to determine the final hyper-parameter ranges. The final hyper-parameters are determined by the best-performing candidate among 40 candidates whose hyper-parameters are generated in the final ranges. Taking an expert of elastic net ELM as an example, the ranges of $L$ are shown in Fig. 3. A final expert is obtained with the above algorithm.
The setting of the number of experts E is a problem worth studying. Owing to the lack of a parallel computing tool, E is set to 20 in the experiment in order to save computing resources. The final ensemble model is obtained after the 20 experts are trained. The lasso and elastic net ELM ensemble models are also trained following the above idea.

Result and analysis
The testing set DATA_TES consists of four TCs with 70 samples in total. 70 prediction points of latitude and longitude are obtained after making predictions on DATA_TES with the final ensemble model, where each prediction $\hat{\mathbf{t}} = [\hat{t}_1, \hat{t}_2]$ is the average of the $E$ experts' outputs. The mean distance error (MDE) $\Delta d$ evaluates the model more intuitively; it is computed from

$$\Delta d = \sqrt{\Delta x^2 + \Delta y^2}$$

where $\Delta x$ and $\Delta y$ are the mean absolute errors (MAE) between the predictions $\hat{\mathbf{t}} = [\hat{t}_1, \hat{t}_2]$ and observations $\mathbf{t} = [t_1, t_2]$ of longitude and latitude on DATA_TES, converted to distances on the Earth's surface.
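Converting latitude/longitude errors into a distance in kilometres requires a great-circle calculation. The sketch below uses the haversine formula; this is our illustration of one standard conversion, not necessarily the exact formula the paper uses for its MDE, and the mean spherical Earth radius (6371 km) is an assumption.

```python
import numpy as np

R_EARTH_KM = 6371.0  # mean Earth radius (assumed)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between (lat, lon) points given in degrees."""
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dphi = p2 - p1
    dlmb = np.radians(lon2 - lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dlmb / 2) ** 2
    return 2 * R_EARTH_KM * np.arcsin(np.sqrt(a))

def mean_distance_error(pred, obs):
    """pred, obs: (N, 2) arrays of [lat, lon] in degrees; returns MDE in km."""
    return float(np.mean(haversine_km(pred[:, 0], pred[:, 1],
                                      obs[:, 0], obs[:, 1])))
```

As a sanity check, one degree of longitude at the equator corresponds to roughly 111 km, so position errors of a fraction of a degree map to the tens-of-kilometre scale reported in the results.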
Stepwise regression, which performs well on meteorological data, is a kind of multivariable regression that focuses specifically on variable selection; its results are commonly used as a benchmark in the meteorological area. Table 4 shows the MAE of longitude and latitude and the MDE of each ensemble model and of a single expert within it, indicating that the ensemble improves performance significantly compared with a single expert. The MDE of the lasso ELM bagging and of stepwise regression are shown in Fig. 4: for most test points, the MDE of the lasso ELM bagging is lower than that of stepwise regression. The MDE of the lasso ELM ensemble and of stepwise regression do not differ much, but the gap between the network structures is obvious. Table 5 shows the average number of hidden nodes in the regularized ELM ensemble models; the results show that the elastic net ELM ensemble finds more efficient networks (similar accuracy with a more compact structure). The MDE of the first TC in DATA_TES is nearly 400 km in all prediction models, which is an unacceptable forecast; we believe this first TC is a singular case that is difficult to predict accurately. Regularized ELM ensemble models show more consistent performance than their base learners; comparisons of the single and ensemble models are shown in Figs. 5-7. Fig. 8 compares ridge ELM's training error and testing error. Overall, the regularized ELM ensemble is more accurate than stepwise regression, with the MDE dropping by 8.26 km.

Table 2. The hyper-parameters shared by the three regularized ELMs.

hyper-parameter   meaning
L                 the number of hidden nodes
E                 the number of experts
C                 the number of model candidates

Table 3. The hyper-parameters unique to each of the three regularized ELMs.

Conclusion
In this study, a novel algorithm for solving lasso ELM and elastic net ELM is presented. If a compact network is desired, elastic net ELM is a good choice. Compared with lasso ELM and elastic net ELM, ridge ELM takes only a little training time because it does not need to solve a QP problem; however, its forecast accuracy is almost equivalent to that of stepwise regression. The regularized ELM bagging has been applied to the forecasting of TCTs over the SCS.
Stepwise regression analysis is then used for comparison. The comparison results show that most models in this paper are superior to the stepwise regression model in forecast accuracy. All data in the TCTs data set are trained together in the model, because we cannot recognize in advance whether a TC is singular. In further work, we will try to normalize and cluster the data first so as to avoid the mutual influence between singular and non-singular TCs (Chen, Wang, Zhang, Yang, & Wang, 2019; Jin, Li, & Jin, 2015).

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.