EMDFormer model for time series forecasting

Abstract: Accurate prediction of economic values is essential in the global economy. In recent years, researchers have shown increasing interest in forecasting this type of time series, partly because its non-linear nature makes prediction a complicated task. The evolution of artificial neural network models enables us to study the suitability of models originally developed for other purposes, applying their potential to time series prediction with promising results. In particular, the application of transformer models to this field is proving to be an innovative approach with strong results. To improve the performance of this type of network, in this work the empirical mode decomposition (EMD) methodology was used as a data preprocessing step before prediction with a transformer-type network. The results confirmed a better performance of this approach compared with networks widely used in this field, the bidirectional long short-term memory (BiLSTM) and long short-term memory (LSTM) networks with and without EMD preprocessing, as well as with a Transformer network applied without EMD, yielding a lower error in all the metrics used: the root mean squared error (RMSE), the mean squared error (MSE), the mean absolute percentage error (MAPE), and the R-squared (R²). Finding a model that improves on the results in the literature allows for better-adjusted predictions with minimal preprocessing.


Introduction
Time series forecasting has become a key issue in research fields related to energy production and economic or financial analysis due to the need to know the time evolution of the variables under study. Many forecasting tools have been developed over time to deal with the prediction of future values of those variables. They can be roughly classified into two classes [1]: statistical methods and those based on artificial intelligence. The former are usually known as classical techniques since they provided the first tools to process time series [2]. The most widely used tool is the autoregressive integrated moving average (ARIMA). This tool has been widely used to predict time series of energy-related variables [3,4]; nevertheless, as this model has a linear structure, it has problems when dealing with strongly nonlinear time series.
Although statistical models were able to provide good predictions with many time series, tools based on artificial intelligence are more widely used in this field because of their ability to deal with highly nonlinear time series. Several tools in this field have been used for forecasting: random forest, gradient and extreme gradient boosting [5], or support vector machines [6]. Among them, artificial neural networks (ANNs) clearly stand out, and they have been widely used to forecast variables related to the economy or energy, becoming one of the most popular forecasting methods in these fields.
Many works have been published where ANNs clearly outperform classical tools. In [7], the behavior of ANNs was compared with six traditional statistical methods for predicting time series, pointing out a better performance of neural networks compared with the other techniques. In [8], the existing literature was complemented with new studies using a novel model that combines traditional statistical techniques and ANNs, obtaining empirical results that show it to be an effective way to improve the accuracy of time series prediction. There are many works in which ANNs have facilitated complex time series prediction tasks with results that improve on classical statistical techniques [9][10][11], even by incorporating hybrid models, such as the one presented in [12], in which a recurrent hybrid model was used for the prediction of time series of different types with results that improve on those obtained with other models.
Many of these applications use a simple neural model, the multilayer perceptron (MLP) [13][14][15], because of its ability to approximate any measurable function [16], even when the data form a nonlinear time series or contain noise or missing values.
New complex neural structures, capable of processing large amounts of data with strong temporal relationships between them, have been developed to address problems closely related to human abilities, such as text and speech processing or object identification. Due to the high number of layers and neurons they have, they are usually known as deep learning neural networks. One of these structures is the long short-term memory (LSTM) network, originally proposed to process written text or speech. Since these problems present a high time dependence among the data they process, it seemed logical to assume that LSTMs could also provide accurate predictions in time series forecasting, and thus they have been widely used to do so [17,18]. Although this model has been able to provide good results, a new structure based on it has been proposed to try to improve its performance: the bidirectional long short-term memory (BiLSTM) network, which, unlike traditional LSTM networks, performs additional training by traversing the input twice, from left to right and then from right to left. In [19], the possibility of incorporating additional layers during the training phase was explored, and, as a result, the BiLSTM model provided greater efficiency compared with the results obtained by an LSTM. Along the same lines, [20] made predictions from financial time series with similar results. Similarly, [21] reached the same conclusion by comparing BiLSTM, support vector regression, and ARIMA models in forecasting economic time series.
In recent years, transformer neural networks (TNNs) have been presented as a revolutionary alternative in the field of prediction, although they were originally developed for applications such as natural language processing and computer vision. Their introduction has meant a change in the way in which sequential problems are addressed, with research on the application of TNNs to time series prediction accelerating in recent years, demonstrating their efficiency in modeling complex temporal patterns. Their novelty lies in their ability to handle sequences of variable length and to capture complex temporal relationships more effectively than recurrent neural networks (RNNs), such as LSTM, eliminating the need to maintain long-term memory thanks to the incorporation of attention mechanisms. In [22], it was noted that although RNNs were proposed as an effective alternative, even with variants such as LSTM it was difficult to capture long-term dependencies in time series data. Unlike RNNs, Transformer networks allow the model to access any part of the history regardless of distance, making them potentially suitable for capturing recurring patterns with long-term dependencies.
Transformers have seen significantly increased use for time series forecasting tasks [23], as they appear to be more successful at extracting complex correlations between data than other models used for these tasks, such as LSTM. To date, markedly superior performance has been demonstrated in many natural language processing and computer vision tasks [24,25], a fact that has sparked researchers' interest in using this type of network for time series forecasting [26,27]. Some works have also tested the performance of variations of the basic Transformer model by comparing them with other forecasting tools [28].
Other ANN models have also been used to forecast time series, although they have not received the same attention as those described above. Some works [29,30] have used convolutional neural networks (CNNs) to forecast time series with good performance. Gated recurrent units (GRUs), a simplification of the basic LSTM structure, have also been used [31,32]. Even though CNNs and GRUs have been able to provide better performance than other neural models when forecasting specific time series, they have not been as widely used as the previously mentioned models because they are case-oriented tools, CNNs for image processing and GRUs for text processing, which makes them less adapted to the time series forecasting task. Among the most recent models studied for time series prediction are graph convolutional networks (GCNs), which have shown their effectiveness in taking advantage of the relationships between the data by representing each time series as if it were a graph (see [33,34]).
In addition to the use of new forecasting models, new, more sophisticated structures have been proposed. They combine two or more forecasting tools trying to achieve more accurate predictions. Thus, we can find in the literature several combinations of forecasting models: CNN-LSTM [35], LSTM-MLP-extreme gradient boosting [36], ARIMA-CNN-LSTM [37], and CCNN-MLP-transformers [38]. Although all these combinations were able to perform well with the time series they were designed to handle, it is not clear that they would be able to outperform other, simpler models with time series different from those they were used with.
Following the strategy of combining several forecasting tools in a hybrid model, some authors have proposed an alternative way to improve the accuracy of predictions: preprocessing the time series to be forecast. Apart from applying statistical tools to improve the quality of the data to facilitate forecasting [39], the time series can be decomposed into sub-series with a uniform behavior, making them easier to forecast. This assumption is based on the fact that many energy or economic variables are influenced by social and weather factors that have a certain periodic behavior. Therefore, the objective should be to decompose the time series into sub-series that are closely related to those periodic behaviors. In this way, each of these sub-series should retain a certain periodic behavior closely related to specific frequencies embedded in the overall behavior of the time series; thus, they could be more accurately predicted. Empirical mode decomposition (EMD) is one of the techniques that has attracted the attention of many researchers because it has been able to improve the performance of the forecasting tools with which it has been used. It has been applied to several neural models such as MLP [40], LSTM [40,41], or Transformers [42]. In [43], an LSTM was combined with two attention mechanisms to process a time series previously decomposed with complete ensemble EMD with adaptive noise (CEEMDAN) to forecast several datasets. This last variation of the basic EMD model has also been used with transformers [26]. All these works showed that the models with preprocessing were able to outperform those without it.
Since EMD has been shown to significantly improve the performance of neural networks for time series forecasting, it is used in this work along with a transformer to forecast economic and energy-related time series. The aim is to prove that the combination of a preprocessing stage with a forecasting neural network can provide better performance than the forecasting tool alone. It is worth noting that the basic structure of both tools has been used instead of the more sophisticated modifications proposed in the literature, with the aim of also showing that it is not necessary to use very sophisticated models to achieve good performance. The transformer was chosen as the forecasting tool because it is being tested to find out whether it can outperform other neural models in these tasks. In this work, it will be demonstrated that it can outperform both LSTM and BiLSTM, tools that have been widely used for these tasks.
This paper is organized as follows: Section 2 provides a detailed description of the methodology and its architecture. Section 3 focuses on the experiments carried out and the comparison of the results. Finally, Section 4 describes the major conclusions and future lines of work.

Methodology
In this work, a Transformer is used with EMD to forecast time series, defining a unique forecasting tool called EMDFormer. Its performance will be compared with that of the transformer without EMD. Two other neural models, LSTM and BiLSTM, are also tested with and without EMD for comparison. The proposed EMDFormer model is described below.
BiLSTM and LSTM were chosen as contrast models because of the results reported in the existing literature, which reflect highly accurate predictions, essential in this type of time series.
Zhao et al. [44] describe how LSTM networks treat the hidden layer as a memory unit so that it can cope with correlation within both short-term and long-term time series.

EMD Model
EMD decomposes the original data into a collection of intrinsic mode functions (IMFs) and a residual based on the local characteristics of the time series, including maxima, minima, and zero crossings [45]. IMFs must meet two conditions: (1) the number of extrema and the number of zero crossings must differ by at most one; (2) the mean value of the envelope defined by the local maxima and the envelope defined by the local minima must be zero at every point.
The decomposition method is called the sifting process [46] and is described as follows:
(1) Given a time series x(t), identify all the maxima and minima and form an upper envelope e_max(t) from the maxima and a lower envelope e_min(t) from the minima using an interpolation technique.
(2) Calculate the average value of the two envelopes, m(t) = (e_max(t) + e_min(t)) / 2.
(3) Calculate the difference between x(t) and m(t) to obtain a detail component h(t) = x(t) − m(t).
(4) Repeat steps 1-3 with h(t) as the new input until it satisfies the two conditions mentioned above or the number of iterations reaches the maximum defined by the user. The resulting h(t) is then defined as the first IMF, c_1(t).
(5) Subtract c_1(t) from x(t) to obtain a new sequence without the high-frequency components: r_1(t) = x(t) − c_1(t).
(6) Repeat the process on the residue until all the IMFs and a final residue r_n(t) are obtained.
(7) In this way, the original time series is decomposed as
x(t) = Σ_{i=1}^{n} c_i(t) + r_n(t).
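The sifting steps above can be sketched in a few lines of Python. This is a minimal illustration using cubic-spline envelopes, not the implementation used in the paper; it also uses a fixed number of sifting passes as a simple stopping rule instead of checking the two IMF conditions exactly.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def sift_once(x):
    """One sifting pass: subtract the mean of the upper/lower envelopes."""
    t = np.arange(len(x))
    # Interior local maxima and minima (step 1 of the procedure).
    maxima = np.where((x[1:-1] > x[:-2]) & (x[1:-1] > x[2:]))[0] + 1
    minima = np.where((x[1:-1] < x[:-2]) & (x[1:-1] < x[2:]))[0] + 1
    if len(maxima) < 4 or len(minima) < 4:
        return None  # too few extrema: treat x as a residual trend
    upper = CubicSpline(maxima, x[maxima])(t)  # envelope through the maxima
    lower = CubicSpline(minima, x[minima])(t)  # envelope through the minima
    mean_env = (upper + lower) / 2.0           # average envelope m(t), step 2
    return x - mean_env                        # detail component h(t), step 3

def emd(x, max_imfs=10, n_sift=8):
    """Decompose x into IMFs plus a residual (steps 4-7)."""
    imfs, residual = [], np.asarray(x, dtype=float).copy()
    for _ in range(max_imfs):
        if sift_once(residual) is None:        # residual is monotone-like
            break
        h = residual.copy()
        for _ in range(n_sift):                # fixed sift count as stop rule
            h_new = sift_once(h)
            if h_new is None:
                break
            h = h_new
        imfs.append(h)
        residual = residual - h                # step 5: remove the IMF
    return imfs, residual

# Toy signal: two oscillatory modes plus a slow trend.
t = np.linspace(0, 4 * np.pi, 400)
x = np.sin(t) + 0.4 * np.sin(7 * t) + 0.1 * t
imfs, residual = emd(x)
print(len(imfs))
```

By construction the decomposition is exactly additive: summing the IMFs and the residual recovers the original series, which is the property step 7 expresses.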

Transformer
Transformer neural networks are a deep learning architecture based on attention mechanisms. The scaled dot-product attention algorithm [23] was introduced with the goal of ensuring that models are able to focus only on the most relevant elements of long sequences. To achieve this, a weighted sum of the values V is computed, where the weights are calculated by applying the softmax function to the scalar products of the queries Q with the keys K, scaled by the square root of the dimension of the keys, d_k:

Attention(Q, K, V) = softmax(QK^T / √d_k) V.

Transformers use a variant of this algorithm called multi-head attention. This version applies h learnable linear projections to the queries, keys, and values before applying individual attention to each of the projections. After this step, the results obtained by the individual attention heads are concatenated before a last linear projection.

The input at each time step is first transformed by an embedding layer into a vector, a representation of the information in a high-dimensional space. The vector is then combined with the positional information to form the input to the multi-head attention layer (Figure 1). For each attention head, three parameter matrices are learned by the transformer: the key weights W_k, the query weights W_q, and the value weights W_v. The embedding X is multiplied by these three matrices to obtain the key matrix K, the query matrix Q, and the value matrix V [47].
A tuple of weight matrices (W_k, W_q, W_v) is called an attention head, and a multi-head attention layer contains several heads. As seen in the figure, the results of each head are added and normalized before passing to the next layer. The feedforward layer is a weight matrix that is learned during training and applied at each time step position.
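The scaled dot-product and multi-head attention equations above can be illustrated with plain NumPy. This is a sketch of the mechanism only, with random untrained weights; it is not the network used in the experiments.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Project X, split into heads, attend per head, concat, project."""
    T, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # (T, d_model) each

    def split(M):                               # -> (n_heads, T, d_head)
        return M.reshape(T, n_heads, d_head).transpose(1, 0, 2)

    heads = scaled_dot_product_attention(split(Q), split(K), split(V))
    concat = heads.transpose(1, 0, 2).reshape(T, d_model)  # concat heads
    return concat @ Wo                          # final linear projection

rng = np.random.default_rng(0)
T, d_model, n_heads = 6, 8, 2
X = rng.normal(size=(T, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads)
print(out.shape)  # (6, 8)
```

The output keeps the input shape (T, d_model), which is what lets the residual addition and normalization described above be applied directly.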

EMDFormer
The proposed model combines the predictions obtained for the different IMFs generated by applying EMD to the time series under study. The process flow is represented in Figure 2. Initially, EMD is applied to the time series, resulting in 11 IMFs, represented in Figure 3, which subsequently become the input of the transformer neural network.
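The paragraph above does not spell out the recombination rule; since EMD is additive, a common choice is to forecast each IMF and the residual separately and sum the per-component forecasts. The sketch below illustrates that additive recombination, with a hypothetical `fit_predict` callable standing in for the per-component Transformer.

```python
import numpy as np

def emdformer_forecast(imfs, residual, fit_predict):
    """Forecast each EMD component independently, then sum the forecasts.

    `fit_predict` stands in for the per-component model (the Transformer
    in the proposed scheme); any callable mapping a component series to
    its one-step-ahead forecast can be plugged in.
    """
    parts = [fit_predict(c) for c in list(imfs) + [residual]]
    return float(np.sum(parts))  # additive recombination mirrors EMD

# Toy stand-in model: persistence (predict the last observed value).
naive = lambda series: series[-1]

imfs = [np.array([0.1, -0.2, 0.3]), np.array([1.0, 1.1, 0.9])]
residual = np.array([5.0, 5.1, 5.2])
print(emdformer_forecast(imfs, residual, naive))  # 0.3 + 0.9 + 5.2 = 6.4
```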

Data description
Next, the data used in the evaluation process of the TNNs, which predict the IMFs resulting from the EMD process, are described. To test the robustness of the proposed model, the experiments are carried out with two time series. The first contains the data for the West Texas Intermediate (WTI) crude price index obtained through the Thomson Reuters Eikon platform, in daily form from January 10, 1983 to June 15, 2022. WTI prices are, along with Brent spot prices, the most widely used spot reference for setting the price of oil.
The total number of observations analyzed is 10,289, included in a DataFrame from the Pandas library in Python.
The second time series used is the Bloomberg Commodities Total Return (BCTR) commodity index on a daily basis, from January 2, 1991 to May 25, 2022. The data were obtained from the Thomson Reuters Eikon database. The index represents commodities related to energy, livestock, soft commodities, industrial metals, precious metals, and grains.
The forecast is made on the 8,192 observations collected in the indicated interval, which are imported into a Python DataFrame generated with the Pandas library.
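As an illustration of how such a daily series can be prepared for the networks, a sliding-window transformation of a pandas Series might look like the sketch below. The price values and the window length are made up for the example; the real data comes from the Eikon platform and is not reproduced here.

```python
import numpy as np
import pandas as pd

def make_windows(series, lookback):
    """Turn a 1-D price series into (samples, lookback) inputs and
    next-step targets, the supervised form the networks consume."""
    values = np.asarray(series, dtype=float)
    X = np.stack([values[i:i + lookback]
                  for i in range(len(values) - lookback)])
    y = values[lookback:]
    return X, y

# Toy daily series standing in for the WTI prices.
prices = pd.Series([70.1, 70.4, 69.9, 71.2, 71.0, 72.3, 72.1, 73.0])
X, y = make_windows(prices, lookback=3)
print(X.shape, y.shape)  # (5, 3) (5,)
```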

Evaluation metrics
In order to evaluate the performance of the different experiments, the following error metrics are used:

Root mean squared error: RMSE = sqrt( (1/n) Σ_i (y_i − ŷ_i)² )
Mean squared error: MSE = (1/n) Σ_i (y_i − ŷ_i)²
Mean absolute percentage error: MAPE = (100/n) Σ_i |(y_i − ŷ_i) / y_i|
R-squared: R² = 1 − Σ_i (y_i − ŷ_i)² / Σ_i (y_i − ȳ)²

In these expressions, y_i is the i-th element of the original time series, ŷ_i is its corresponding forecasted value, ȳ is the mean value, and n is the number of elements. The MAPE and RMSE metrics measure the error in the predictions, which allows us to assess the fit of the neural network; thus, their values should be as low as possible. Frechtling [48] stated that the optimal value for the MAPE metric should be between 10% and 20%, while a prediction between 20% and 30% is considered acceptable.
The R² metric, like the MSE, will be used to evaluate the performance of the model. It reflects the proportion of the variation in the dependent variable that is explained by the prediction. The closer it is to 1, the better the network performance.
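The four metrics can be written directly from their definitions; a minimal sketch, with symbol names following the expressions above:

```python
import numpy as np

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

def mape(y, y_hat):
    return float(np.mean(np.abs((y - y_hat) / y)) * 100)  # in percent

def r2(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)           # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)      # total sum of squares
    return float(1 - ss_res / ss_tot)           # can be negative for bad fits

y = np.array([100.0, 102.0, 101.0, 105.0])
y_hat = np.array([99.0, 103.0, 100.0, 106.0])
print(rmse(y, y_hat), r2(y, y_hat))
```

Note that, unlike RMSE or MAPE, R² is unbounded below: a model that fits worse than the constant mean predictor produces a negative value, which is how the negative R² results reported later should be read.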

Model parameters
The best performance of the model is obtained by adjusting the hyperparameters of the network; it is necessary to keep in mind that the risk of overfitting must be minimized.
(1) Learning rate: A value that is too low may make it necessary to increase the number of epochs and make training slower.
(2) Batch size: Defines the number of samples that will be analyzed before updating the internal parameters of the model.
(3) Epoch: Defines the number of times that the learning algorithm will run on the entire set of training data.
(4) Hidden layers: The number of hidden layers and the number of neurons largely determine the complexity of the model and thus its potential learning capacity.For the selection of the number of hidden layers, different units were experimented with, selecting the optimal value by comparing the evaluation metrics.
(5) Optimization algorithm: The choice of the optimization algorithm can have a notable impact on the learning of the models. It updates the parameter values based on the set learning rate. In this case, Adam was selected because it tries to combine the advantages of RMSProp (similar to gradient descent) with the advantages of gradient descent with momentum [49].
(6) Num_heads: This parameter refers to the number of attention heads in the multi-head attention layer of the transformer network. Multi-head attention allows the network to focus on different parts of the input sequence simultaneously, and the number of heads controls how many different perspectives the network can consider when processing information.
(7) Ff_dim: This parameter indicates the dimension of the feedforward layer within the transformer network structure. The feedforward layer is a dense layer that is applied after the attention layer. The selection of this parameter can affect the network's ability to learn more complex or simpler patterns in the data.
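The hyperparameter names above map naturally onto a small grid search over candidate values, selecting the configuration with the lowest validation error. The sketch below is illustrative only: the candidate values are made up (the values actually used appear in Table 1), and a toy scoring function replaces the real training-and-validation step.

```python
import itertools
import numpy as np

# Hypothetical search space mirroring the parameters listed above.
search_space = {
    "learning_rate": [1e-3, 1e-4],
    "batch_size": [32, 64],
    "num_heads": [2, 4],
    "ff_dim": [32, 64],
}

def validation_rmse(config):
    """Stand-in for training the model with `config` and scoring it on a
    validation split; a toy deterministic-per-config score replaces training."""
    seed = hash(tuple(sorted(config.items()))) % 2**32
    return float(np.random.default_rng(seed).uniform(0.5, 2.0))

# Enumerate every combination and keep the one with the lowest score.
best = min(
    (dict(zip(search_space, combo))
     for combo in itertools.product(*search_space.values())),
    key=validation_rmse,
)
print(best)
```

As noted below via Beck and Arnold [51], this exhaustive enumeration grows exponentially with the number of parameters, which is why comparisons are usually restricted to a few parameters at a time.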
Goodfellow [50] made recommendations for the optimal parameters for predictions with neural networks, suggesting comparisons using different numbers of cycles. These are determined according to the computational limitations and the possible overfitting of the model. The values selected for each of these parameters are given in Table 1. Beck and Arnold [51] stated that the choice of parameters can be easily estimated and differentiated if the parameters are not dependent on each other. This approximation is possible when two parameters are compared since, if there were more parameters, the computational cost would increase exponentially. Smith [52] highlights the importance of an appropriate choice of the hyperparameters of a DNN to minimize the error obtained, describing a new method for choosing the learning rate that eliminates the need to experiment with different values to find the maximum network performance.

Experimental results and discussion
The results obtained by the combination of EMD and Transformer are compared with those obtained by analyzing the same time series with a Transformer-type network without applying EMD, and then with the results obtained by a BiLSTM network under the same conditions, both with and without EMD. The choice of a BiLSTM network is justified by recent literature, where these models have shown great accuracy in predicting time series [21,53]. This process is repeated with an LSTM network, which allows the results of the proposed model to be contrasted with two models widely used for this purpose.
Table 2 shows the results obtained for the four processes used. A better performance of the Transformer network after applying EMD to the time series is observed, with lower values in all evaluation metrics, demonstrating its effectiveness for forecasting the WTI price.
It is observed that the Transformer models have better performance and greater precision than the predictions made with BiLSTM networks, obtaining a significantly lower error. The RMSE metric is reduced by 82.47% by EMDFormer compared with the model applying EMD to a BiLSTM network, with a reduction of 46.11% with respect to a traditional BiLSTM model and of 40.53% compared with the Transformer model without EMD. These results are mainly due to the ability of transformer networks to capture long-term patterns and to efficiently handle dependencies at different distances, facilitating their learning. Furthermore, their lower propensity for overfitting and lower need for hyperparameter adjustment make this model well suited to this type of time series, improving the analysis and allowing the error to be minimized.
A negative result in the R² metric for the EMD-BiLSTM and EMD-LSTM models may be due to overfitting of the network, which coincides with insufficient performance in the rest of the metrics obtained.
The graphs belonging to the six model-based approaches with and without EMD show predictions very close to the actual WTI index price values in Figure 4, especially in the predictions obtained by the EMDFormer model.
Table 3 shows the results obtained in the predictions of the BCTR time series using the same models. Once again, the EMDFormer model improves on the remaining results, which confirms that the combination of EMD with Transformer models provides new possibilities in time series prediction. The BCTR time series shows less adjusted results than those obtained with the other time series; however, once it is decomposed into IMFs, the results obtained by the EMDFormer model show much lower error metrics than the rest of the models, improving the results by 96% with respect to the transformer model without EMD.
The ability of Transformer networks to efficiently handle dependencies at different distances is revealed once again, as well as how the application of EMD preprocessing allows the error to be minimized.
The rest of the models present less accurate results, with negative values of the R² metric for the Transformer and EMD-BiLSTM models, which shows a lower capacity to capture the characteristics of the time series.
The detailed results of the six approaches are shown in Figure 5, with a comparison of the BCTR values against the predictions returned by the models. Despite the positive results obtained, it is worth noting that the use of empirical mode decomposition as a data preprocessing method for time series prediction with TNNs, while promising, poses several challenges. Implementation complexity arises because it requires advanced technical knowledge, and its effectiveness heavily depends on the quality of the input data, making it susceptible to inaccuracies or biases. Additionally, employing EMD adds time and computational resources to the modeling process, which can be prohibitive in resource-constrained environments. Finally, the interpretability of the results may be affected, as the EMD decomposition can obscure underlying patterns, making it difficult for end users to understand and trust the generated predictions.

Conclusions and future lines of research
We propose a new methodology for time series forecasting that combines the potential of Transformer networks with EMD to achieve greater precision. First, EMD is applied to the time series to obtain the resulting IMFs, which are later processed by the Transformer network, which is able to capture the characteristics of each of the sub-series and make more accurate forecasts. This type of network is able to capture the characteristics of different time series, such as WTI oil price data or raw materials data from the BCTR index, and is not limited to the economic field; its performance can also be applied to other types of time series.
Based on this study, further investigation will continue into the feasibility of the proposed methodology for predicting various economic values, as well as time series of different characteristics, considering other recent transformer-based models such as PatchTST and DLinear. Different prediction horizons will be considered to identify whether there are significant differences in each period.

Table 2.
Model performance in terms of MSE, RMSE, R² and MAPE. The best values for each metric are in bold.

Table 3.
Model performance in terms of MSE, RMSE, R² and MAPE for the BCTR time series. The best values for each metric are in bold.