Stock Market Forecasting Using Restricted Gene Expression Programming

Stock index prediction has been regarded as a difficult task over the past decade. To predict stock indexes accurately, this paper proposes a novel prediction method based on the S-system model. Restricted gene expression programming (RGEP) is proposed to encode and optimize the structure of the S-system. A hybrid intelligent algorithm based on brain storm optimization (BSO) and particle swarm optimization (PSO) is proposed to optimize the parameters of the S-system model. Five real stock indexes, namely the Dow Jones Index, Hang Seng Index, NASDAQ Index, Shanghai Stock Exchange Composite Index, and SZSE Component Index, are collected to validate the performance of the proposed method. Experimental results reveal that our method performs better than deep recurrent neural network (DRNN), flexible neural tree (FNT), radial basis function (RBF) neural network, backpropagation (BP) neural network, and ARIMA on 1-week-ahead and 1-month-ahead stock prediction problems. In addition, the proposed hybrid intelligent algorithm converges faster than PSO and BSO.


Introduction
The stock market plays a leading and crucial role in the market mechanism, connecting savers and investors [1,2]. The operating mechanism of the stock market reflects the state of the national economy and is recognized as the signal system of the national economy [3,4]. Because of uncontrollable factors such as economic growth, the economic cycle, interest rates, fiscal revenue and expenditure, money supply, and prices, predicting the stock market index is considered a difficult job [5][6][7].
Many machine learning (ML) methods, including statistical models, artificial neural networks, and hybrid prediction models, have been proposed to model and predict the stock index. As a classical statistical model, the ARIMA model has been used to predict the New York Stock Exchange (NYSE) and Nigeria Stock Exchange (NSE), and the results revealed that the ARIMA model performed better for short-term prediction [8][9][10]. Compared with the ARIMA model, the artificial neural network (ANN) model has stronger prediction and modeling ability. Adebiyi et al. compared ARIMA and ANN models for stock price prediction and found that the stock forecasting model based on the ANN approach had superior performance over ARIMA models [11].
In the past decades, many ANN models have been employed for solving real problems, especially stock market price forecasting [12,13]. Dong et al. presented backpropagation (BP) neural networks for stock prediction [14]. A feedforward ANN was proposed to predict the price movement of the stock market [15]. Akita et al. proposed a novel deep learning method based on paragraph vector and long short-term memory (LSTM) to predict the Tokyo Stock Exchange [16]. Rout et al. used the radial basis function (RBF) neural network to forecast the DJIA and S&P 500 stock indices [17]. Wang et al. proposed a novel method based on a complex-valued neural network (CVNN) and the Cuckoo search (CS) algorithm to forecast stock prices [18]. Chen et al. presented the flexible neural tree (FNT) ensemble technique to analyze 7-year Nasdaq-100 main index values and 4-year NIFTY index values [19].
However, the existing methods mainly train a black box with the training samples. The model changes its internal structure and parameters to approximate the training samples. The resulting model cannot display a distinct input-output relationship or offer a deep understanding of the internal mechanisms of real-world problems. Moreover, in most of these methods, all variables are input into the model, which easily leads to overfitting. Recently, methods based on mathematical formulations have been proposed to predict time series; these can clearly indicate the mathematical relationship between the input data and output data. Zuo et al. utilized gene expression programming (GEP) to identify differential equations for time series prediction [20]. Graff et al. proposed genetic programming (GP) to forecast time series [21]. Grigioni et al. proposed a modified power-law mathematical model to predict the blood damage sustained by red cells with the load history [22]. Mina et al. proposed a beta-function formula to forecast the maxillary arch form [23]. Chen et al. identified ordinary differential equations (ODEs) to forecast small-time-scale traffic measurement data and showed that the ODE model was more feasible and efficient than ANN models [24].
As a classical nonlinear differential equation, the S-system model has been proposed to predict time series and identify genetic networks. Zhang and Yang proposed a restricted additive tree (RAT) to represent the S-system model for stock market index forecasting [25]. However, the RAT method has a nonlinear structure and is inconvenient to implement. In this paper, a novel stock index prediction method based on the S-system model is proposed. Restricted gene expression programming (RGEP) is proposed to encode and optimize the structure of the S-system. To optimize the parameters of the S-system model accurately, a new hybrid intelligent algorithm based on the brain storm optimization (BSO) algorithm and the particle swarm optimization (PSO) algorithm is proposed.
The Dow Jones Index, Hang Seng Index, and NASDAQ Index are long-established and well-known stock indexes, which are usually utilized to reflect the development of the global economy. The Shanghai Stock Exchange Composite Index and SZSE Component Index represent the general trend of China's stock market and economic development. These five stock indexes have been considered standard datasets for evaluating the performance of stock prediction models [26][27][28][29][30]. Thus, the Dow Jones Index, Hang Seng Index, NASDAQ Index, Shanghai Stock Exchange Composite Index, and SZSE Component Index are collected to validate the performance of our proposed method.

Background Concepts and Related Technologies
Data Description.
Let the stock time series data be $[X_1, X_2, \ldots, X_T]$ ($T$ is the number of time points). Generally, the data from past time points are used to predict the data at the current time point. Figure 1 shows an example of data partition with $m$ input variables. The data in the box are utilized as the input vector, and the value to the right of the box is the prediction target. Two forecasting strategies, 1-week-ahead ($m = 7$) and 1-month-ahead ($m = 30$), are utilized in this paper.
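The data partition described above can be sketched as a sliding window. This is a minimal illustration, not the authors' code: the function name `make_windows` and the toy series are hypothetical, and the sketch assumes the target is the single value immediately after each window of m points.

```python
def make_windows(series, m):
    """Pair each window of m consecutive past values with the value that follows it."""
    pairs = []
    for t in range(m, len(series)):
        inputs = series[t - m:t]   # the m values "in the box"
        target = series[t]         # the value to the right of the box
        pairs.append((inputs, target))
    return pairs


# Toy data for illustration only (not real index values).
prices = [float(v) for v in range(1, 11)]
week_pairs = make_windows(prices, m=7)   # 1-week-ahead setting, m = 7
```

With a series of length T, this produces T - m input/target pairs; the 1-month-ahead setting would use m = 30 in the same way.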

S-System Model.
The S-system model has a complex and powerful structure, which captures the dynamic nature of the real system and achieves good performance in terms of precision and flexibility [31,32]. The $i$th nonlinear differential equation in the S-system is described as follows:
$$\frac{dX_i}{dt} = \alpha_i \prod_{j=1}^{N} X_j^{g_{ij}} - \beta_i \prod_{j=1}^{N} X_j^{h_{ij}}, \quad i = 1, 2, \ldots, N, \tag{1}$$
where $N$ is the number of equations, $X_i$ is the $i$th variable, $\alpha_i$ and $\beta_i$ are the rate constants of the production function and consumption function, and $g_{ij}$ and $h_{ij}$ are the kinetic orders.
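As a rough illustration of the S-system form, the right-hand side can be evaluated as a production term minus a consumption term, each a rate constant times a product of powers. The function name `s_system_rhs` and all numeric values below are hypothetical and chosen only so the arithmetic is easy to check by hand.

```python
import math


def s_system_rhs(x, alpha, beta, g, h):
    """Return dX_i/dt for every i: alpha_i * prod_j x_j**g_ij - beta_i * prod_j x_j**h_ij."""
    n = len(x)
    derivatives = []
    for i in range(n):
        production = alpha[i] * math.prod(x[j] ** g[i][j] for j in range(n))
        consumption = beta[i] * math.prod(x[j] ** h[i][j] for j in range(n))
        derivatives.append(production - consumption)
    return derivatives


# Toy two-variable system (illustrative values, not taken from the paper).
x = [2.0, 3.0]
alpha = [1.0, 0.5]
beta = [2.0, 1.0]
g = [[1.0, 0.0], [0.0, 2.0]]   # kinetic orders of the production terms
h = [[0.0, 1.0], [1.0, 0.0]]   # kinetic orders of the consumption terms
dx = s_system_rhs(x, alpha, beta, g, h)
```

For these values the first derivative is 1·2 − 2·3 = −4 and the second is 0.5·9 − 1·2 = 2.5. A kinetic order of zero removes the corresponding variable from a term.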

Brain Storm Optimization Algorithm.
Brain storm optimization (BSO) is a swarm intelligence optimization algorithm proposed by Shi in 2011 [33]. In BSO, a clustering algorithm is used to search for local optimal solutions, and the global optimal solution is obtained by comparing all local optimal solutions. A mutation strategy is utilized to enhance the diversity of the population and avoid being trapped in a local optimum [34]. The BSO process is described as follows:
(1) Initialize the population and generate $N$ potential solutions $(x_1, x_2, \ldots, x_N)$.
(2) The k-means clustering algorithm is utilized to divide the $N$ individuals into $k$ classes. The fitness value of each individual is calculated. The best individual in each class is selected as the central individual.
(3) Generate a new individual in one of the following four ways:
(a) Randomly select a class (with probability proportional to the number of individuals in each class). A new individual $x_s'$ is generated by adding a random perturbation to the central individual $x_s$:
$$x_s' = x_s + \zeta \cdot N(\mu, \sigma), \tag{2}$$
where $N(\mu, \sigma)$ is the Gaussian random function and $\zeta$ is the factor that balances the random number, which is defined as follows:
$$\zeta = \operatorname{logsig}\!\left(\frac{0.5 \cdot \text{max\_iteration} - \text{current\_iteration}}{c}\right) \cdot \text{rand}(), \tag{3}$$
where logsig is the logarithmic sigmoid transfer function, max_iteration is the maximum number of iterations in the algorithm, current_iteration is the number of the current iteration, $c$ is the gradient that controls the slope of the logarithmic sigmoid function, and rand() is a random number in the interval [0, 1].
(b) Randomly select a class and an individual in the selected class. A new individual is created from the selected individual and a Gaussian value by equations (2) and (3).
(c) Randomly select two classes. The central individuals of the two classes are used as the candidate individuals $x_{s1}$ and $x_{s2}$, which are fused with the following formula:
$$x_s = \lambda x_{s1} + (1 - \lambda) x_{s2}, \tag{4}$$
where $\lambda$ is a random number in the interval [0, 1]. After the candidate individuals are merged, the individual is updated according to equation (2).
(d) Two candidate individuals $x_{s1}$ and $x_{s2}$ are selected randomly from two randomly selected classes. The fusion and updating operators are applied with equations (4) and (2).
(4) After the new individual is generated, its fitness value is calculated and compared with the fitness values of the candidate individuals; the individual with the better fitness value is kept for the next generation. When $N$ new individuals have been generated, enter the next iteration.
(5) When the maximum number of iterations is reached, the algorithm stops; otherwise, go to step (2).
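The perturbation and fusion operators referred to as equations (2) to (4) can be sketched as below. This is an illustrative reading rather than the authors' implementation: it assumes the step-size factor follows the standard BSO form (a logarithmic sigmoid of the iteration progress times a uniform random number), that a fresh factor is drawn per dimension, and the names `logsig`, `step_size`, `perturb`, `fuse`, and the `slope` argument are hypothetical.

```python
import math
import random


def logsig(a):
    """Logarithmic sigmoid transfer function."""
    return 1.0 / (1.0 + math.exp(-a))


def step_size(current_iteration, max_iteration, slope, rng):
    """Balancing factor: large early in the run, shrinking as iterations advance."""
    return logsig((0.5 * max_iteration - current_iteration) / slope) * rng.random()


def perturb(individual, current_iteration, max_iteration, slope, rng, mu=0.0, sigma=1.0):
    """Add scaled Gaussian noise to every dimension of an individual."""
    return [
        value + step_size(current_iteration, max_iteration, slope, rng) * rng.gauss(mu, sigma)
        for value in individual
    ]


def fuse(x_s1, x_s2, rng):
    """Convex combination of two candidates with a random weight in [0, 1)."""
    lam = rng.random()
    return [lam * a + (1.0 - lam) * b for a, b in zip(x_s1, x_s2)]


rng = random.Random(0)                               # fixed seed for repeatability
merged = fuse([0.0, 10.0], [1.0, 20.0], rng)         # way (c): fuse two centres
child = perturb(merged, current_iteration=10, max_iteration=100, slope=20.0, rng=rng)
```

The value `slope=20.0` is only a placeholder; the text does not state which gradient was used.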

Particle Swarm Optimization Algorithm.
The particle swarm optimization (PSO) algorithm is a classical swarm intelligence method [35]. In PSO, each potential solution is represented by a particle. A swarm of particles $[x_1, x_2, \ldots, x_N]$ moves in search of the food source with velocity vectors $[v_1, v_2, \ldots, v_N]$. At each step, each particle searches for its own best position in the space, which is recorded in a vector $P_{best_i}$. The best position found among all the particles is kept as $G_{best}$ [36].
At each step, a new velocity for particle $i$ is computed by the following equation:
$$v_i(t+1) = w\, v_i(t) + c_1 r_1 \left(P_{best_i} - x_i(t)\right) + c_2 r_2 \left(G_{best} - x_i(t)\right),$$
where $w$ is the inertia weight, which affects the convergence rate of PSO and is calculated adaptively as $w = (\text{max\_iteration} - \text{current\_iteration})/(2 \cdot \text{max\_iteration}) + 0.4$ (max_iteration is the maximum number of iterations in the algorithm and current_iteration is the number of the current iteration), $c_1$ and $c_2$ are positive constants, and $r_1$ and $r_2$ are uniformly distributed random numbers in [0, 1].
With the updated velocities, each particle changes its position according to the following equation:
$$x_i(t+1) = x_i(t) + v_i(t+1).$$
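A single PSO update can be sketched as follows. The sketch assumes the inertia weight decreases linearly from 0.9 to 0.4 over the run, which is how the adaptive formula in the text is read here, and that fresh random numbers are drawn per dimension. The function names and the values of `c1` and `c2` are hypothetical.

```python
import random


def inertia_weight(current_iteration, max_iteration):
    """Linearly decreasing inertia weight: 0.9 at the start, 0.4 at the end."""
    return (max_iteration - current_iteration) / (2.0 * max_iteration) + 0.4


def pso_step(x, v, p_best, g_best, w, c1, c2, rng):
    """Return the new position and velocity of one particle."""
    new_x, new_v = [], []
    for xd, vd, pd, gd in zip(x, v, p_best, g_best):
        r1, r2 = rng.random(), rng.random()
        velocity = w * vd + c1 * r1 * (pd - xd) + c2 * r2 * (gd - xd)
        new_v.append(velocity)
        new_x.append(xd + velocity)   # position update uses the new velocity
    return new_x, new_v


rng = random.Random(0)
w = inertia_weight(current_iteration=0, max_iteration=100)
position, velocity = pso_step(
    x=[0.0, 0.0], v=[0.1, -0.1], p_best=[1.0, 1.0], g_best=[2.0, 2.0],
    w=w, c1=2.0, c2=2.0, rng=rng,
)
```

A particle that already sits at both its personal best and the global best with zero velocity stays where it is, which is a quick sanity check on the update.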

Restricted Gene Expression Programming.
Restricted gene expression programming (RGEP), an improved version of GEP, was proposed to identify the S-system model for gene regulatory network (GRN) inference [37]. The procedure of RGEP is described as follows:
(1) Initialize the population. One example of a chromosome in the population is depicted in Figure 2. Each chromosome contains two genes, and each gene contains a head part and a tail part, which are created randomly using the function set ($F$) and variable set ($T$):
$$F = \{*1, *2, \ldots, *n\}, \quad T = \{x_1, x_2, \ldots, x_m, R\},$$
where $*n$ is the operation that multiplies $n$ variables, $x_i$ is a variable, $m$ is the number of input variables, and $R$ is a constant.
In order to make the chromosome similar to the S-system, each gene is allocated the corresponding parameters. For gene 1, $\alpha_i$ is given as its coefficient and each variable is given an exponent $g_{ij}$. For gene 2, $\beta_i$ is given as its coefficient and each variable is given an exponent $h_{ij}$. The two genes are connected by the subtraction operation ($-$). Figure 3 shows the expression tree (ET) of Figure 2; the corresponding S-system model is read from the tree, with gene 1 giving the production term and gene 2 the consumption term.
(2) According to the given fitness function, evaluate the population with the training samples. In this process, the S-system model is solved by the fourth-order Runge-Kutta method [38]. For the differential equation $dy/dt = f(x, y)$, the solution is as follows:
$$\begin{aligned}
y_{n+1} &= y_n + \frac{h}{6}\left(k_1 + 2k_2 + 2k_3 + k_4\right), \\
k_1 &= f(x_n, y_n), \\
k_2 &= f\!\left(x_n + \frac{h}{2},\; y_n + \frac{h}{2} k_1\right), \\
k_3 &= f\!\left(x_n + \frac{h}{2},\; y_n + \frac{h}{2} k_2\right), \\
k_4 &= f(x_n + h,\; y_n + h k_3),
\end{aligned}$$
where $h$ is the step size.
(3) If the optimal solution appears, RGEP terminates; otherwise, go to step (4).
(4) Selection, recombination, and mutation are used to reproduce each chromosome; these operators are introduced in Reference [37].
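The fourth-order Runge-Kutta method mentioned in step (2) can be sketched for a scalar equation dy/dt = f(x, y). The helper names `rk4_step` and `integrate` are hypothetical, and the test equation dy/dt = y is chosen only because its exact solution is known; it is not an S-system from the paper.

```python
import math


def rk4_step(f, x, y, h):
    """Advance y by one step of size h using the classical fourth-order Runge-Kutta scheme."""
    k1 = f(x, y)
    k2 = f(x + h / 2.0, y + h * k1 / 2.0)
    k3 = f(x + h / 2.0, y + h * k2 / 2.0)
    k4 = f(x + h, y + h * k3)
    return y + h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0


def integrate(f, x0, y0, h, steps):
    """Apply rk4_step repeatedly and return the final value of y."""
    x, y = x0, y0
    for _ in range(steps):
        y = rk4_step(f, x, y, h)
        x += h
    return y


# dy/dt = y with y(0) = 1 has the exact solution y(1) = e.
approx_e = integrate(lambda x, y: y, 0.0, 1.0, 0.1, 10)
error = abs(approx_e - math.exp(1.0))
```

For a multi-variable S-system the same scheme applies with `y` and the `k` values treated as vectors; the step size used in the experiments is not stated in the text.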
In the initial stage of structural optimization, the symbols of the chromosome in RGEP, including function symbols and variable symbols, are selected randomly. With the training data, reproduction operators are used to optimize and change the chromosomal symbols during the optimization process. The optimized S-system structure does not contain all the input variables: according to the training data, RGEP automatically selects the appropriate input variables. In Figure 2, the coefficients $\alpha_i$ and $\beta_i$ and the exponents $g_{i1}$, $g_{i2}$, $g_{i3}$, $h_{i1}$, $h_{i2}$, $h_{i3}$, and $h_{i4}$ need to be optimized. In this paper, the parameters in each chromosome are optimized by a hybrid intelligent algorithm based on the BSO algorithm and the PSO algorithm.

Hybrid Optimization Algorithm.
The BSO algorithm is suitable for solving multipeak and high-dimensional function problems. The PSO algorithm has the advantages of easy implementation, high accuracy, and fast convergence. However, both methods tend to converge prematurely and fall into local optima. To improve the diversity of the population, a novel hybrid intelligent algorithm based on BSO and PSO (BSO-PSO) is proposed. In the BSO-PSO algorithm, half of the individuals are selected randomly and optimized by BSO, and the remaining individuals are optimized by PSO. The flowchart is described in Figure 4.
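The random division of the population between the two optimizers can be sketched as below. Only the split itself is shown; the function name `split_population` is hypothetical, and the sketch assumes that with an odd population size the extra individual goes to the PSO group. Whether the split is redrawn at every iteration is not stated in the text.

```python
import random


def split_population(population_size, rng):
    """Randomly assign half of the individual indices to BSO and the rest to PSO."""
    indices = list(range(population_size))
    rng.shuffle(indices)
    half = population_size // 2
    bso_indices = sorted(indices[:half])
    pso_indices = sorted(indices[half:])
    return bso_indices, pso_indices


rng = random.Random(1)
bso_group, pso_group = split_population(10, rng)
```

Each group would then be updated by its own operators (BSO perturbation and fusion for one, PSO velocity and position updates for the other) before fitness is re-evaluated.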

Time Series Data Forecasting Using S-System.
The flowchart of time series forecasting using the S-system model is described in Figure 5. During the training phase, the S-system model is optimized using the genetic operators of RGEP, the hybrid intelligent algorithm, and the training dataset. During the test phase, the optimal S-system is used to predict the stock index. The detailed process is described as follows.

Training Phase.
(1) Initialize the S-system population with structures and parameters. Each S-system is encoded as an RGEP chromosome, as described in Figure 2.
(2) With the training samples, each S-system is solved by the fourth-order Runge-Kutta method and its fitness value is calculated. Search for the best S-system according to the fitness values. If the optimal model is found, the algorithm stops.
(3) Selection, recombination, and mutation are used to search for the optimal structure of the S-system. Go to step (2).
(4) At some iterations in RGEP, the BSO-PSO algorithm is used to optimize the parameters of the RGEP chromosomes. In this process, the structure of the S-system model is fixed. According to the structure of the model, the number of parameters ($\alpha_i$, $\beta_i$, $g_{ij}$, and $h_{ij}$) is counted. With the hybrid intelligent algorithm, search for and update the optimal parameters of each S-system.

Testing Phase.
With the data at the previous time points, the optimal S-system model obtained in the training phase is solved and the data at the current time point are predicted. This procedure is repeated until the data at all testing time points have been predicted. The prediction error is then calculated from the predicted data and the target data.

Results and Discussion
Figure 2: The phenotype of chromosome in RGEP with parameters.
RMSE (root mean square error), MAP (mean absolute percentage), MAPE (mean absolute percentage error), $R^2$ (coefficient of multiple determination for multiple regressions), ARV (average relative variance), and VAF (variance accounted for) are used to evaluate the performance of our method [30,39]. In these metrics, $N$ is the number of stock sample points, $f^{i}_{target}$ is the real stock value at the $i$th time point, $f^{i}_{forecast}$ is the predicted stock value at the $i$th time point, and $\bar{f}$ is the mean of the stock index values.
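For orientation, four of the listed metrics can be sketched under commonly used textbook definitions. These definitions are assumptions, since the exact formulas are not reproduced in the text and may differ in normalization; MAP and ARV are left out because their precise form cannot be pinned down here. The function names and toy values are hypothetical.

```python
import math


def rmse(target, forecast):
    """Root mean square error."""
    n = len(target)
    return math.sqrt(sum((t - f) ** 2 for t, f in zip(target, forecast)) / n)


def mape(target, forecast):
    """Mean absolute percentage error, relative to the target values (assumed form)."""
    n = len(target)
    return 100.0 * sum(abs(t - f) / abs(t) for t, f in zip(target, forecast)) / n


def r_squared(target, forecast):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed form)."""
    mean_t = sum(target) / len(target)
    ss_res = sum((t - f) ** 2 for t, f in zip(target, forecast))
    ss_tot = sum((t - mean_t) ** 2 for t in target)
    return 1.0 - ss_res / ss_tot


def _variance(values):
    """Population variance."""
    mean_v = sum(values) / len(values)
    return sum((v - mean_v) ** 2 for v in values) / len(values)


def vaf(target, forecast):
    """Variance accounted for, in percent (assumed form)."""
    errors = [t - f for t, f in zip(target, forecast)]
    return 100.0 * (1.0 - _variance(errors) / _variance(target))


# Toy values for illustration only.
target = [2.0, 4.0]
forecast = [1.0, 5.0]
scores = (rmse(target, forecast), mape(target, forecast))
```

Under these definitions a perfect forecast gives RMSE and MAPE of 0, $R^2$ of 1, and VAF of 100%, which matches how the results are interpreted in the discussion.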
For the 1-week-ahead prediction problem, the function set is $F = \{*1, *2, *3, *4\}$ and the variable set is $T = \{x_1, x_2, \ldots, x_7\}$ in the RGEP method. By optimizing the S-system models with our method, we obtain the optimal phenotypes and expression trees (ETs) for the five stock indexes, which are described in Figure 6. The five optimal S-system models are listed in Table 2. The forecasting results of the five stock indexes by our method are depicted in Figure 7. From Figure 7, it can be clearly seen that the predicted curves are very close to the target ones, and the errors are nearly zero.
Comparison results of the different prediction models on the five stock indexes are listed in Table 3. From Table 3, among the five existing state-of-the-art methods, the DRNN model performs best for the prediction of the five stock indexes. However, in terms of the six indicators (RMSE, MAP, ARV, MAPE, $R^2$, and VAF), our proposed method performs better than the DRNN model. In terms of RMSE, our method is 34[...] and 7.4% lower than DRNN for the SZSEI dataset. In terms of ARV, our method is 58.7% lower than DRNN for the DJI dataset, 67.1% lower for the HSI dataset, 68.8% lower for the NASI dataset, 36.9% lower for the SSEI dataset, and 16.5% lower for the SZSEI dataset. In terms of MAPE, our method is 37.5% lower than DRNN for the DJI dataset, 48% lower for the HSI dataset, 42.9% lower for the NASI dataset, 35.2% lower for the SSEI dataset, and 18% lower for the SZSEI dataset. In terms of VAF, our method is closer to 100% than DRNN for all five stock indexes. These results show that our proposed method improves the prediction accuracy substantially.
For the 1-month-ahead prediction problem, the function set is $F = \{*1, *2, *3, *4\}$ and the variable set is $T = \{x_1, x_2, \ldots, x_{30}\}$ in the RGEP method. With the five stock indexes, we obtain five optimal phenotypes and expression trees (ETs), which are described in Figure 8. According to the five ETs, the S-system models obtained are listed in Table 4. The forecasting results of the five stock indexes by our method are depicted in Figure 9. From Figure 9, we can see clearly that the predicted and target curves are very close.

Six prediction models are used to forecast the five stock indexes, and the prediction results are listed in Table 5. From Table 5, it can be seen that five indicators (RMSE, ARV, MAPE, $R^2$, and VAF) of our method are the best among the six methods on three datasets (DJI, HSI, and NASI). The DRNN model has the highest MAP values on these datasets, which are 2.1368, 2.9568, and 6.3901, respectively. For the SSEI and SZSEI datasets, our proposed method has the best performance in terms of RMSE, MAP, ARV, MAPE, $R^2$, and VAF. In terms of ARV, our method is closer to 0 than the other five methods. In terms of $R^2$, our method is closer to 1. In terms of VAF, our method is closer to 100%. Thus, our proposed forecasting model tends to be more accurate.

Hybrid Intelligent Algorithm Analysis.
In order to test the performance of our proposed hybrid intelligent algorithm, we use BSO and PSO to optimize the parameters of the S-system models in the comparison experiments. Over 20 runs with the DJI dataset, the 1-week-ahead prediction results of the three evolutionary methods are listed in Table 6, which contains the best value, worst value, mean value, and standard error (SD) of the mean of the 20-run RMSEs. From Table 6, we can see that the best RMSE values of the three evolutionary methods are very close, but the other three indicators differ considerably. Our hybrid intelligent algorithm obtains a smaller worst RMSE, mean RMSE, and SD than PSO and BSO, which indicates that it is more robust and less prone to falling into local optima than PSO and BSO. Figure 10 depicts the comparison of the RMSE convergence curves obtained by our hybrid intelligent algorithm, BSO, and PSO with the DJI dataset for 1-week-ahead prediction. Figure 10 reveals that our proposed intelligent algorithm converges faster than PSO and BSO in the early stage of the optimization process. When the number of iterations reaches 200, the RMSE drops to $10^{-3}$, which indicates a significant reduction of the error.
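The per-method summary reported over repeated runs can be computed as in the following sketch. The function name `summarize_runs` and the toy values are hypothetical; the sketch uses the sample standard deviation, which is an assumption, since the text does not say which dispersion estimate was reported.

```python
import statistics


def summarize_runs(rmse_values):
    """Best, worst, mean, and sample standard deviation of RMSEs from repeated runs."""
    return {
        "best": min(rmse_values),      # lower RMSE is better
        "worst": max(rmse_values),
        "mean": statistics.mean(rmse_values),
        "sd": statistics.stdev(rmse_values),
    }


# Toy RMSEs for illustration only (not results from the paper).
toy_rmses = [1.0, 2.0, 3.0]
summary = summarize_runs(toy_rmses)
```

A method with a similar best value but a smaller worst value and spread is the more robust one, which is the comparison made between the hybrid algorithm, PSO, and BSO.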

Restricted Gene Expression Programming Analysis.
In order to test the performance of restricted gene expression programming for S-system optimization, the restricted additive tree is used to optimize the structure of the S-system model in the comparison experiments. Over 20 runs with the five stock indexes, the 1-week-ahead prediction results of RGEP and RAT are depicted in Figure 11, which contains the best values, worst values, and mean values of the 20-run RMSEs. From Figure 11, it can be clearly seen that RGEP obtains smaller best, worst, and mean RMSE values than RAT, which reveals that RGEP can find the optimal S-system model more easily than RAT.

Conclusions
In this paper, a novel stock prediction method based on the S-system model is proposed to forecast the stock market. An improved gene expression programming (RGEP) is proposed to represent and optimize the structure of the S-system model. A hybrid intelligent algorithm based on BSO and PSO is used to optimize the parameters of the S-system model. Our proposed method is tested by predicting five real stock index datasets, namely DJI, HSI, NASI, SSEI, and SZSEI. The results of 1-week-ahead and 1-month-ahead prediction reveal that our method predicts the stock index accurately and performs better than DRNN, FNT, RBFNN, BPNN, and ARIMA. The convincing performance of our method is mainly due to three aspects. First, the nonlinear ordinary differential equation model, the S-system, has strong nonlinear modeling and forecasting ability. Second, Table 6 and Figure 10 show that our hybrid intelligent algorithm is more robust and less prone to falling into local optima than PSO and BSO. Third, from Tables 2 and 4, we can see that the optimal S-system models contain only a portion of the input variables. This is because our method automatically selects the proper input variables according to the stock data, which also helps prevent overfitting.
Data Availability
The five stock indexes can be downloaded freely at https://hk.finance.yahoo.com/.

Conflicts of Interest
There are no conflicts of interest regarding the publication of this paper.
Computational Intelligence and Neuroscience