A Novel Hybrid Deep Learning Model for Sugar Price Forecasting Based on Time Series Decomposition

Sugar price forecasting has attracted extensive attention from policymakers due to its significant impact on people's daily lives and on markets. In this paper, we present a novel hybrid deep learning model that combines a time series decomposition technique, empirical mode decomposition (EMD), with a hyperparameter optimization algorithm, the Tree of Parzen Estimators (TPE), for sugar price forecasting. The effectiveness of the proposed model is evaluated in a case study on the price of London Sugar Futures. Two experiments are conducted to verify the benefits of EMD and TPE. Moreover, the specific effects of EMD and TPE are analyzed using the Diebold-Mariano (DM) test and the improvement percentage. Finally, empirical results demonstrate that the proposed hybrid model outperforms the other models.


Introduction
Sugar is an important food commodity around the world, and fluctuations in food prices have a large impact on people's daily lives because they affect the overall inflation dynamics of many countries [1,2]. Therefore, it is essential to forecast the sugar price accurately.
Sugar price forecasting has attracted the interest of researchers for several decades, and existing approaches can be divided into statistical methods and machine learning methods. The statistical methods have the advantages of low complexity and fast calculation speed [3]. In 1975, Meyer and Kim [4] applied the autoregressive integrated moving average (ARIMA) method to sugar price forecasting. However, ARIMA requires the time series to be stationary, or stationary after differencing, which might limit the application of this method. In 2009, Xu et al. [5] used a neural network with multiple fully connected layers for sugar price forecasting on a Chinese dataset. In 2011, Ribeiro and Oliveira [6] introduced a hybrid model built upon artificial neural networks (ANNs) and the Kalman filter. In 2019, Silva et al. [7] investigated ANNs, extreme learning machines (ELMs), and echo state networks (ESNs) for sugar price forecasting. However, one limitation of these three methods is that they do not optimize the hyperparameters of the neural networks. Hyperparameter optimization is a commonly used strategy in machine learning [8], especially in time series forecasting [3], to improve the performance of machine learning models. This gap is largely due to the rapid growth of machine learning in recent years [9]: techniques that are now very common in the field, such as SGD [10] and Adam [11], came into wide use only around or after 2014, while most of the sugar price forecasting literature was published before 2014. On the other hand, the nonstationarity and nonlinearity of sugar prices [12] make it harder to accurately predict the future sugar price. To handle nonstationary features, the inputs to machine learning models need to be properly preprocessed [13]. Therefore, multiresolution analysis techniques are widely used in many forecasting problems [14,15].
Conventionally, the discrete wavelet transformation (DWT) has been used [16,17]. Hajiabotorabi et al. [18] improved the recurrent neural network (RNN) with a multiresolution analysis based on B-spline wavelets produced by an efficient DWT. Yong and Awang [19] used the DWT to improve forecast accuracy. However, the DWT generally requires a lengthy trial-and-error process [20]. In contrast, the empirical mode decomposition (EMD) [21] multiresolution technique, which has been introduced to time series analysis, is self-adaptive. EMD extracts the salient features via a temporally local decomposition and isolates these features into subseries that represent the physical structure of the time series [22]. EMD-based machine learning models have been adopted in time series forecasting. Ali and Prasad [20] predicted significant wave height with an ELM and improved complete ensemble EMD. Bedi and Toshniwal [23] adopted EMD for electricity demand forecasting. This broadly adopted technique can boost forecasting performance.
To this end, to investigate the power of hyperparameter optimization and multiresolution analysis in sugar price forecasting, we propose a hybrid deep learning model. The model uses the Tree of Parzen Estimators (TPE) [24] to optimize long short-term memory (LSTM) networks [25]. A time series decomposition technique, empirical mode decomposition (EMD), is used to decompose the sugar price and extract its salient features. The effectiveness of the proposed approach is tested on the daily sugar price of London Sugar Futures. To compare fairly with the mainstream methods for sugar price forecasting, we build deep neural networks (DNNs) with multiple fully connected layers, which correspond to the models in [5-7], and an ARIMA model, which corresponds to [4]. The results are compared against other machine learning algorithms, such as the support vector regression (SVR) machine [15,26-29] and the DNN, and against the traditional time series model ARIMA. The rest of this paper is organized as follows: Section 2 describes the theoretical background, including the LSTM, EMD, and TPE. Section 3 describes the proposed hybrid model in detail. Section 4 provides details of the experiments and evaluations. Section 5 discusses the experimental results, and Section 6 concludes the paper and points out possible future work.

Long Short-Term Memory (LSTM).
The LSTM neural network is heavily used as a basic building block in modern deep learning-based time series forecasting models [30]. It is an improved version of the recurrent neural network (RNN) that mainly addresses the vanishing gradient problem through its internal memory unit and gate mechanism, which allow the network to retain information for longer. It was proposed by Hochreiter and Schmidhuber [25] in 1997 and overcomes the inability of the RNN to learn long-term dependencies in time series data. It has been widely used in fields such as sentiment analysis [31], speech recognition [32], and early crop classification [33], and has achieved satisfactory results.
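The key mathematical equations of the LSTM model are as follows:

$$
\begin{aligned}
f_t &= \sigma\left(w_f [h_{t-1}, x_t] + b_f\right), \\
i_t &= \sigma\left(w_i [h_{t-1}, x_t] + b_i\right), \\
g_t &= \tanh\left(w_g [h_{t-1}, x_t] + b_g\right), \\
o_t &= \sigma\left(w_o [h_{t-1}, x_t] + b_o\right), \\
c_t &= f_t \circ c_{t-1} + i_t \circ g_t, \\
h_t &= o_t \circ \tanh(c_t),
\end{aligned}
$$

where $f_t$, $g_t$, $i_t$, and $o_t$ are the output values of the forget gate, update gate, input gate, and output gate, respectively. Moreover, $\circ$ and $w$ denote the element-wise product operation and the network parameters; $b_{f,i,g,o}$ are the bias vectors; $\sigma$ is the sigmoid activation function; and $c_t$ is the memory cell. The former LSTM output value $h_{t-1}$ and the input data $x_t$ are the inputs of the four gates.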
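As an illustration of the gate mechanism described above, a single LSTM time step can be sketched in plain Python. This is a minimal sketch and not the implementation used in the experiments: the `params` layout and the helper names `sigmoid`, `affine`, and `lstm_step` are assumptions made for this example, and the tanh activation of the update gate follows the standard LSTM formulation.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, used as the gate activation."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(w, v, b):
    """Compute w v + b for a weight matrix w (list of rows) and vectors v, b."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) + b_i
            for row, b_i in zip(w, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """Run one LSTM time step and return (h_t, c_t).

    params maps each gate name ("f", "i", "g", "o") to a (weights, bias)
    pair; this layout is an assumption made for illustration.
    """
    z = h_prev + x_t  # list concatenation, i.e. [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(params["f"][0], z, params["f"][1])]
    i_t = [sigmoid(a) for a in affine(params["i"][0], z, params["i"][1])]
    g_t = [math.tanh(a) for a in affine(params["g"][0], z, params["g"][1])]
    o_t = [sigmoid(a) for a in affine(params["o"][0], z, params["o"][1])]
    # The memory cell keeps part of its old state and adds the gated update.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t
```

With all weights and biases set to zero, every sigmoid gate outputs 0.5 and the update gate outputs 0, so the memory cell is simply halved at each step, which gives a quick sanity check of the recursion.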

Empirical Mode Decomposition (EMD).
Empirical mode decomposition (EMD) [21] is a time series decomposition technique that has proved effective in time series forecasting [20]. Given the nonlinearity and complexity of sugar price sequences, accurately capturing their characteristics is a difficult task. Thus, the time series decomposition strategy EMD is adopted to decompose the original sugar price sequences. The procedure of EMD is described as follows.
(1) Identify all local minima ($l_{\min}$) and local maxima ($l_{\max}$) in the sugar price sequence $x(t)$, $t = 1, 2, 3, \ldots, T$.
(2) Connect all $l_{\max}$ and all $l_{\min}$ to form the upper envelope $x_{up}(t)$ and the lower envelope $x_{low}(t)$, respectively.
(3) Compute the average $m(t) = (x_{up}(t) + x_{low}(t))/2$.
(4) Extract the intrinsic mode function $\mathrm{IMF}(t) = x(t) - m(t)$.
(5) Iterate on the residual $m(t)$.
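The five steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the decomposition code used in the experiments: the envelopes are built by linear interpolation instead of the cubic splines that EMD implementations typically use, each IMF is extracted with a single pass as in steps (1) to (5) rather than by repeated sifting, and the function names are hypothetical.

```python
def local_extrema(x):
    """Return the indices of strict interior local maxima and minima of x."""
    maxima, minima = [], []
    for t in range(1, len(x) - 1):
        if x[t - 1] < x[t] > x[t + 1]:
            maxima.append(t)
        elif x[t - 1] > x[t] < x[t + 1]:
            minima.append(t)
    return maxima, minima


def envelope(x, knots):
    """Interpolate x linearly through the knot indices, flat beyond the ends."""
    env = []
    j = 0
    for t in range(len(x)):
        if t <= knots[0]:
            env.append(x[knots[0]])
        elif t >= knots[-1]:
            env.append(x[knots[-1]])
        else:
            # Advance to the pair of knots that surrounds t.
            while knots[j + 1] <= t:
                j += 1
            k0, k1 = knots[j], knots[j + 1]
            weight = (t - k0) / (k1 - k0)
            env.append(x[k0] + weight * (x[k1] - x[k0]))
    return env


def emd(x, max_imfs=8):
    """Decompose x into a list of IMFs and a final residual."""
    imfs = []
    residual = list(x)
    for _ in range(max_imfs):
        maxima, minima = local_extrema(residual)        # step (1)
        if not maxima or not minima:
            break  # no oscillation left to extract
        upper = envelope(residual, maxima)               # step (2)
        lower = envelope(residual, minima)
        mean = [(u + l) / 2.0 for u, l in zip(upper, lower)]  # step (3)
        imfs.append([r - m for r, m in zip(residual, mean)])  # step (4)
        residual = mean                                   # step (5)
    return imfs, residual
```

By construction, summing all IMFs and the final residual recovers the original sequence, which is the property that allows the subseries to be used as additional model inputs.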

Tree of Parzen Estimators (TPEs).
As stated by Bergstra et al. [24], TPE is a sequential model-based global optimization algorithm. The algorithm uses a probabilistic model of the loss function to make informed guesses, over a specified number of iterations, about the best hyperparameters. When optimizing multiple hyperparameters, this algorithm has shown better performance than grid search and random search, especially for deep learning models, which usually have more hyperparameters than traditional machine learning models [34].
TPE uses the Bayes rule, $p(y|x) = p(x|y)p(y)/p(x)$, for the probabilistic model, and $p(x|y)$ is broken down into $l(x)$ and $g(x)$, such that

$$
p(x|y) =
\begin{cases}
l(x), & y < y^{*}, \\
g(x), & y \ge y^{*},
\end{cases}
$$

where $l(x)$ is the distribution of the hyperparameters whose objective function value is less than the threshold $y^{*}$, and $g(x)$ is the distribution of the hyperparameters whose objective function value is not less than the threshold. The expected improvement (EI) metric is used to identify which hyperparameters to choose based on the probabilistic model. Given a set of hyperparameters $x$ and a threshold value for the objective function, $y^{*}$, the EI is given by

$$
EI_{y^{*}}(x) = \int_{-\infty}^{y^{*}} (y^{*} - y)\, p(y|x)\, dy.
$$

When EI is positive, the hyperparameter set $x$ is expected to obtain an improvement over the threshold $y^{*}$. Therefore, the working principle of TPE is to draw sample hyperparameters from $l(x)$, evaluate them according to $l(x)/g(x)$, and then return the set $x$ that gives the best EI value.
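To make this working principle concrete, the following sketch proposes the next value of a single continuous hyperparameter. It is a minimal illustration, not the optimizer used in the experiments: the fixed-bandwidth Gaussian kernel densities, the quantile `gamma` used to set the threshold, and all function names are assumptions made for this example.

```python
import math
import random


def gaussian_kde(points, bandwidth):
    """Return a density function built from equal-weight Gaussian kernels."""
    norm = len(points) * bandwidth * math.sqrt(2.0 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2)
                   for p in points) / norm

    return density


def fit_l_and_g(history, gamma=0.25, bandwidth=0.1):
    """Split (x, loss) observations at the gamma quantile and fit l and g."""
    ranked = sorted(history, key=lambda pair: pair[1])
    n_good = max(1, math.ceil(gamma * len(ranked)))
    good = [x for x, _ in ranked[:n_good]]   # losses below the threshold
    bad = [x for x, _ in ranked[n_good:]]    # remaining observations
    return good, gaussian_kde(good, bandwidth), gaussian_kde(bad, bandwidth)


def tpe_suggest(history, rng, n_candidates=24, gamma=0.25, bandwidth=0.1):
    """Draw candidates from l(x) and return the one maximizing l(x)/g(x)."""
    good, l, g = fit_l_and_g(history, gamma, bandwidth)
    # Sampling from the kernel mixture l: pick a centre, then add noise.
    candidates = [rng.choice(good) + rng.gauss(0.0, bandwidth)
                  for _ in range(n_candidates)]
    # A small constant guards against division by zero far from all "bad" points.
    return max(candidates, key=lambda x: l(x) / (g(x) + 1e-12))
```

For example, if the recorded losses are smallest around x = 0.3, the densities fitted by `fit_l_and_g` give a much larger ratio l(x)/g(x) near 0.3 than elsewhere, so `tpe_suggest` concentrates new trials in that region.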

The Proposed LSTM Model.
In this paper, for a fair comparison with the mainstream sugar price forecasting models, we propose a two-layer LSTM model, which is illustrated in Figure 1. The DNN model used in the three sugar price forecasting studies [5-7] is compared with the proposed LSTM model under the same network structure. The effects of TPE and EMD are validated experimentally by comparing the LSTM and DNN models.

Hyperparameter Optimization.
For deep learning models, the hyperparameters are model parameters that are defined before training [35]. Several hyperparameters are used in this paper.
Number of Neurons. $l_{i,n}$ denotes the number of neurons in the $i$th hidden layer. As shown in Figure 1, deep learning models usually contain multiple hidden layers, each with several neurons.
Activation Function. $l_{i,a}$ denotes the activation function of the $i$th hidden layer. The activation function runs on the neurons of the neural network and is responsible for mapping the input of a neuron to its output [36]. Each neuron contains an input, an output, a weight, and a processing unit, and the output signal of the neuron is obtained after processing by the activation function. The tanh, ReLU, and LeakyReLU activation functions [37] are used in this paper.
Dropout Rate. Dropout is a very useful and successful technique that can effectively control the overfitting problem [38]. Generally speaking, dropout randomly deletes neurons with a probability equal to the dropout rate, so that different neural network architectures are trained on different batches. The dropout rate is a real number between 0 and 1.
Optimizer. In deep learning, the loss measures the performance of the model and is used to train the network; essentially, a lower loss means that the model performs better. The process of minimizing (or maximizing) a mathematical expression is called optimization, and the optimizer is used to minimize the loss. The RMSprop [39], Adam [11], and SGD [40] optimizers are used in this paper.
Batch Size. The deep learning model updates its parameters on each minibatch of the training dataset, and the size of the minibatch is the batch size [41]. For the same number of epochs, a larger batch size requires fewer batches, so training time can be reduced. Within a certain range, increasing the batch size helps the stability of convergence; beyond that range, however, the performance of the model decreases as the batch size increases [10]. Therefore, batch size is an important hyperparameter.
Learning Rate. Choosing the learning rate is important because it determines whether the neural network can converge to the global minimum. A learning rate that is too high may have undesirable consequences for the loss: the neural network may never reach the global minimum because it is likely to skip over it. A smaller learning rate helps the neural network converge to the global minimum, but training takes much longer because each step makes only a small adjustment to the weights of the network. A smaller learning rate is also more likely to trap the neural network in a local minimum, because small steps make it hard to escape from one. Therefore, the learning rate must be set very carefully.

Proposed Hybrid Forecasting Model.
Figure 2 illustrates the whole process of the proposed hybrid deep learning LSTM neural network model. The model is based on the EMD technique and the TPE algorithm. The modeling includes three steps as follows.
Step 1. The sugar price sequence is decomposed by EMD into a set of IMFs.
Step 2. The lagged values of the sugar price sequence and of the IMFs are used as inputs to the deep learning model. The TPE algorithm is applied to optimize the hyperparameters of the LSTM model; the LSTM model with the best hyperparameter combination is then obtained, and its forecasting performance is evaluated on the test set.
Step 3. The trained model can be deployed for real-world sugar price forecasting for policy makers.
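The data preparation in Step 2 and the hyperparameter search space can be sketched as follows. This is an illustrative sketch only: the candidate values in `SEARCH_SPACE` are hypothetical (the text specifies only the activation functions, the optimizers, and the range of the dropout rate), the number of lags is a free parameter, and `sample_config` draws configurations uniformly at random as a stand-in for the proposals that TPE would make.

```python
import random


def make_lagged_samples(price, imfs, n_lags):
    """Build (features, target) pairs for one-step-ahead forecasting.

    The features of sample t are the previous n_lags values of the price
    followed by the previous n_lags values of every IMF.
    """
    samples = []
    for t in range(n_lags, len(price)):
        features = list(price[t - n_lags:t])
        for imf in imfs:
            features.extend(imf[t - n_lags:t])
        samples.append((features, price[t]))
    return samples


# Hypothetical search space: the candidate values are assumptions made for
# this example, not the ranges used in the experiments.
SEARCH_SPACE = {
    "units": [50, 100, 150, 200],
    "activation": ["tanh", "relu", "leaky_relu"],
    "dropout_rate": [0.1, 0.2, 0.3, 0.4, 0.5],
    "optimizer": ["rmsprop", "adam", "sgd"],
    "batch_size": [16, 32, 64, 128],
    "learning_rate": [1e-4, 1e-3, 1e-2],
}


def sample_config(rng):
    """Draw one configuration uniformly at random (stand-in for TPE)."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
```

Each sampled configuration would be used to train a model on the lagged samples, and the validation loss of that model would be fed back to the optimizer as the objective value.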

Experiments and Evaluations
In this section, sugar price sequences from the market are used to evaluate the performance of the proposed hybrid deep learning model in two distinct tests. All the tests are run on the Ubuntu 18.04 operating system with an Intel Xeon E5-2650 v4 CPU, a GeForce GTX 1080 GPU, and 32 GB of RAM. Furthermore, to reduce the influence of random factors, each method is run 20 times, and the average value is used as the final result.

Data Set Description.
The dataset used in this paper is the price of London Sugar Futures from April 2010 to May 2020. The data were fetched from investing.com. Table 1 shows the statistical analysis of this dataset.
To implement the different experiments, we divide the data into three sets: a training set, a validation set, and a testing set. The training set is used to train the different models, the validation set is used to select the optimal hyperparameters, and the testing set is used to compare the models. The details of these three sets are described in Table 2.
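For time series data, such a three-way division is usually made in chronological order so that the validation and testing sets lie strictly after the training set. The sketch below illustrates this; the 80/10/10 proportions and the function name are assumptions made for this example and are not the proportions reported in Table 2.

```python
def chronological_split(series, train_frac=0.8, val_frac=0.1):
    """Split a sequence into train, validation, and test parts in time order.

    The fractions are hypothetical defaults; the remainder after the
    training and validation parts becomes the test part.
    """
    n = len(series)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = series[:n_train]
    val = series[n_train:n_train + n_val]
    test = series[n_train + n_val:]
    return train, val, test
```

No shuffling is applied, so information from later prices cannot leak into the data used for training or hyperparameter selection.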

Performance Evaluation.
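To assess the proposed hybrid deep learning model, the mean absolute error (MAE), mean absolute percentage error (MAPE), and root mean square error (RMSE) are employed for one-day-ahead sugar price forecasting. The formulas of MAE, MAPE, and RMSE are as follows:

$$
MAE = \frac{1}{m} \sum_{i=1}^{m} \left| y_i - \hat{y}_i \right|,
$$

$$
MAPE = \frac{100\%}{m} \sum_{i=1}^{m} \left| \frac{y_i - \hat{y}_i}{y_i} \right|,
$$

$$
RMSE = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2},
$$

where $m$ is the number of predicted data points, $y_i$ is the real value of the sugar price, and $\hat{y}_i$ is the predicted value.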
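The three evaluation metrics can be computed as in the following sketch, which uses the standard definitions with MAPE expressed in percent. The function names are chosen for this example.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    return 100.0 * sum(abs((y - p) / y)
                       for y, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((y - p) ** 2
                         for y, p in zip(actual, predicted)) / len(actual))
```

For instance, for real prices of 100 and 200 with predictions of 110 and 190, the MAE and RMSE are both 10, while the MAPE is 7.5 because the same absolute error weighs less against the larger price.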

Experiment I.
In this section, the dataset from London Sugar Futures is used to verify the benefits of the time series decomposition technique EMD and the hyperparameter optimization algorithm TPE, respectively. Three models are compared: the model combining EMD and TPE with LSTM (TPE-EMD-LSTM), the model combining EMD with LSTM (EMD-LSTM), and the single LSTM model. For a fair comparison, these models use the same structure, which is shown in Figure 3.
For the LSTM model, we set 100 neurons in each LSTM layer with the ReLU activation function, and the dropout rate of the dropout layer is 0.5. The EMD-LSTM model has the same hyperparameters as the LSTM model. After optimization, the optimal hyperparameters obtained for TPE-EMD-LSTM are summarized in Table 3. The RMSE, MAE, and MAPE are used to test the performance of the TPE-EMD-LSTM and the compared models, and the results are shown in Table 4. The TPE-EMD-LSTM model attained an MAE of 2.415, a MAPE of 0.682, and an RMSE of 2.969. These values are the smallest among the three models, indicating that the TPE-EMD-LSTM model surpasses the LSTM and EMD-LSTM models. Figure 4 shows the IMFs after decomposition. The EMD-LSTM performs better than the LSTM, achieving smaller MAE, MAPE, and RMSE, which further confirms the ability of EMD to handle nonstationary features in the sugar price series.
The DNN model has the same structure as the LSTM model, with 100 neurons in each DNN layer, the ReLU activation function, and a dropout rate of 0.5 in the dropout layer. For the second category, in order to fully test the hybrid model, two other prediction models are applied: TPE-EMD-DNN and EMD-DNN. After optimization, the optimal hyperparameters obtained for TPE-EMD-DNN are summarized in Table 5. The performance evaluation metrics for this test are listed in Table 6 [...] 3.193. This demonstrates the effectiveness of the TPE algorithm. Moreover, the TPE-EMD-DNN model shows a significant improvement over the EMD-DNN; this shows not only the effectiveness of TPE but also its generality (it is effective for both LSTM and DNNs).

Discussion
To evaluate the performance of the proposed hybrid deep learning model more comprehensively and to find ways to improve forecasting capability and accuracy, following [15,28,42,43], we perform the Diebold-Mariano (DM) test [44] and compute the improvement percentage.

DM Test.
The DM test is a method for comparing forecasting models that determines whether two sets of forecasts are significantly different. The DM statistic is defined as follows:

$$
DM = \frac{\bar{d}}{\sqrt{2\pi \hat{f}_d(0)/T}},
$$

where $\hat{f}_d(0)$ is a consistent estimate of the spectral density of the loss differential at frequency zero, $\bar{d}$ is the mean of the loss differential between the two forecasts, and $T$ is the length of the forecast time series. The results of the DM test between our proposed TPE-EMD-LSTM and the other models are shown in Table 7. According to Table 7, the DM values range from 4.2 to 30.4, all far above 2.58, the critical value at the 1% significance level. That is to say, there is a significant difference between our proposed TPE-EMD-LSTM model and the other models. This means that the proposed TPE-EMD-LSTM model delivers significantly more accurate forecasts than the compared models.
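For one-step-ahead forecasts, the DM statistic can be sketched as follows. This is a simplified illustration and not the test code used for Table 7: it assumes a squared-error loss and estimates the long-run variance of the loss differential by its sample variance, which is the usual simplification at horizon one; the function name is chosen for this example.

```python
import math


def dm_statistic(actual, forecast_a, forecast_b):
    """Diebold-Mariano statistic for one-step forecasts under squared error.

    A positive value means forecast_a has the larger average loss, i.e.
    forecast_b is the more accurate of the two. The loss differential must
    not be constant, otherwise its variance is zero.
    """
    d = [(y - a) ** 2 - (y - b) ** 2
         for y, a, b in zip(actual, forecast_a, forecast_b)]
    n = len(d)
    d_bar = sum(d) / n
    # At horizon one, 2 * pi * f_d(0) reduces to the variance of d.
    variance = sum((v - d_bar) ** 2 for v in d) / n
    return d_bar / math.sqrt(variance / n)
```

The statistic is compared with standard normal critical values, so an absolute value above 2.58 indicates a difference that is significant at the 1% level.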

Improvement Percentage.
As discussed in the DM test section, the TPE-EMD-LSTM model performs significantly better than all other models. However, the details of the improvement in forecasting performance are not clear. Therefore, in this section, we apply three evaluation metrics ($IP_{MAE}$, $IP_{RMSE}$, and $IP_{MAPE}$) to discuss the benefits of TPE and EMD more specifically. $IP_{MAE}$, $IP_{RMSE}$, and $IP_{MAPE}$ represent the improvement percentages of MAE, RMSE, and MAPE, respectively. They are defined as follows:

$$
IP_{MAE} = \frac{MAE_1 - MAE_2}{MAE_1} \times 100\%,
$$

$$
IP_{RMSE} = \frac{RMSE_1 - RMSE_2}{RMSE_1} \times 100\%,
$$

$$
IP_{MAPE} = \frac{MAPE_1 - MAPE_2}{MAPE_1} \times 100\%,
$$

where the subscript 1 denotes the benchmark model and the subscript 2 denotes the model compared against it.
Two comparisons are conducted. First, EMD-DNN and EMD-LSTM are compared with DNN and LSTM, respectively. Second, TPE-EMD-DNN and TPE-EMD-LSTM are compared with EMD-DNN and EMD-LSTM, respectively. Moreover, we found that the traditional time series forecasting model ARIMA still performs strongly on sugar price forecasting. Therefore, ARIMA is compared with LSTM and DNN to contrast the traditional time series model with the deep learning-based models. Table 8 summarizes the improvement percentage results of the different models.
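The improvement percentage is the relative reduction in an error metric when moving from a benchmark model to a compared model. A minimal sketch is given below; the function name and the error values in the usage note are hypothetical and are not results from Table 8.

```python
def improvement_percentage(benchmark_error, model_error):
    """Relative error reduction, in percent, of a model over a benchmark.

    Positive values mean the model has the smaller error. The same formula
    applies to MAE, RMSE, and MAPE.
    """
    return (benchmark_error - model_error) / benchmark_error * 100.0
```

For example, with a hypothetical benchmark error of 4.0 and a model error of 3.0, the improvement percentage is 25; a negative value would indicate that the compared model is worse than the benchmark.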
First of all, it should be noted that EMD is more effective for the LSTM network than for the DNN. EMD-LSTM vs. LSTM shows 22.761, 24.722, and 37.401 for $IP_{MAE}$, $IP_{MAPE}$, and $IP_{RMSE}$, respectively. However, EMD-DNN vs. DNN shows an improvement only in $IP_{RMSE}$, which indicates that the DNN is not as good as the LSTM network at capturing EMD features for time series prediction. TPE optimization yields a large improvement for both EMD-DNN and EMD-LSTM.
The TPE optimization enhances the forecasting precision to a great extent, as the improvement percentages relative to the contrasted models are prominent. The comparison of TPE-EMD-LSTM with TPE-EMD-DNN shows that, even though TPE optimization brings a large improvement to the DNN, the LSTM network is still better than the DNN. Finally, ARIMA still outperforms the plain DNN and LSTM, which shows the effectiveness of the traditional time series model, as has been found in many time series applications [30]. However, it should also be noted that, after hyperparameter optimization, the deep learning-based model shows a significant improvement in forecasting accuracy and surpasses the ARIMA model.
In the next step, ensemble forecasting will be investigated, as it has surged in recent years and shows state-of-the-art performance on many time series forecasting tasks [33]. Also, to verify its generalizability and robustness, the proposed model needs to be applied to predict the prices of other food commodities.

Conclusions
Sugar price forecasting plays a vital role in policy making for the sugar industry. To predict the sugar price accurately, a hybrid deep learning model that combines a time series decomposition technique with a hyperparameter optimization algorithm is proposed. This combination enhances the forecasting performance of the proposed model compared with all other models. Extensive experiments have been conducted to demonstrate the effectiveness of TPE and EMD. Moreover, DM tests and improvement percentages are used to reveal their specific effects. The DM values from the Diebold-Mariano tests between our proposed TPE-EMD-LSTM model and the other models range from 4.2 to 30.4, all far above 2.58, the critical value at the 1% significance level, indicating highly significant differences between the compared models. The improvement percentages of the proposed TPE-EMD-LSTM model over the other three models (EMD-DNN, EMD-LSTM, and TPE-EMD-DNN) in terms of $IP_{RMSE}$ are 78, 42, and 7, respectively, indicating a significant improvement.
This enhanced forecasting performance will contribute to sugar factories' recruitment plans for new employees and to sugarcane farmers' planting plans. Our future work is to develop ensemble forecasting, as it has surged in recent years and shows state-of-the-art performance on many time series forecasting tasks [33]. Also, to verify its generalizability and robustness, the proposed model needs to be applied to predict the prices of other food commodities.
Data Availability
The dataset used in this paper is the price of London Sugar Futures from April 2010 to May 2020. The data were fetched from investing.com.

Conflicts of Interest
The authors declare that they have no conflicts of interest.