The Use of LSTM Neural Networks to Implement the NARX Model. A Case Study of EUR-USD Exchange Rates

The paper focuses on financial data forecasting in terms of a one-step-ahead nonlinear model with exogenous inputs. The main aim is the development of a methodology to forecast the exchange rate between the EURO and the US Dollar. The prediction task is carried out by two recurrent neural networks: the standard NARX neural network and an LSTM-based approach. The exogenous inputs consist of historical trading data and three widely used technical indicators, namely a variant of the moving average, the Upper Bollinger Frequency Band and the Lower Bollinger Frequency Band. In order to obtain accurate forecasting algorithms, the exogenous inputs are filtered using the well-known Gaussian low-pass filter. The quality of each method is evaluated in terms of both quantitative and qualitative metrics, namely the root mean squared error, the mean absolute percentage error, and the prediction of change in direction. Extensive experiments point out that the most suitable forecasting method is the proposed LSTM neural network implementation of the NARX model.


Introduction
Modern financial markets are chaotic, hard-to-predict systems, mainly due to their volatility and non-linearity. Therefore, stock market prediction is regarded as one of the most challenging applications of time series forecasting, and it has been under continuous investigation for years. A variety of statistical methods have been introduced to the time series domain and then applied to the problem of financial data analysis and forecasting, such as the autoregressive (AR) family of models, exponential smoothing, and moving average (MA) techniques. However, conventional statistical models have mostly failed to capture the complexity and non-stationary structure of financial data. During the past decades, machine learning (ML) models have turned into serious competitors to classical statistical models in the time series analysis and forecasting field. The most successful ones, such as artificial neural networks (ANN) and support vector machines (SVM), have been widely used to predict financial time series due to their ability to learn from nonlinear data and successfully perform classification and prediction tasks. Several machine-learning techniques with high forecasting accuracy have been reported in the literature ([1], [2], [3], [4]). Nowadays, the field of financial forecasting is developing fast, financial data being the main valuable source of information for investors and traders. Most modern studies on financial data forecasting focus on developing hybridized models which combine the advantages of statistical, data mining and machine learning algorithms (see [5], [6], [7]). On the one hand, the aim of using combined models is to process the financial datasets with data mining techniques before using them in the learning process, in order to remove noise or to smooth the data. Several data pre-processing techniques, for instance feature extraction and feature selection, can decrease the redundancy and noise in time series data and consequently increase the performance of the prediction models ([8], [9], [10], [11]). On the other hand, since the exact price of stocks is unpredictable, a series of research works focus on predicting price movements, thereby reducing the problem of data forecasting to a classification problem [12].

In the literature, one of the most recently explored ML fields is the deep learning (DL) research area. A series of deep architectures, such as convolutional neural networks and long short-term memory (LSTM) networks, have been developed to solve classification and prediction tasks ([13], [14], [15], [16]). The research reported in this paper aims to investigate the potential of LSTM-based approaches to implement the NARX model. The outline of the paper is as follows. In Section 2 the NARX prediction model is discussed briefly. Next, the NARX neural network and the LSTM networks are described. The proposed methodology is provided in the fourth section of the paper. In Section 5 the experimental results of the proposed methodology, together with a comparative analysis, are presented. The concluding remarks are given in the final section of the paper.

The NARX prediction model. The NARX Neural Networks
The NARX model (Nonlinear AutoRegressive with eXogenous inputs) is well suited for modelling non-linear time series as well as for one-step and multi-step ahead prediction. To develop the general forecasting model in the case of non-stationary time series, one can use the Partial Autocorrelation Function (PACF) to establish the delay variables. In the following we denote by:
- T: the total number of time periods;
- Y: the variable to be forecast;
- X = (X(1), X(2), ..., X(N)): the set of N exogenous variables used to forecast Y;
- d_Y and d_X = (d_X(1), ..., d_X(N)): the corresponding delays.

Note that the delay variable of a time series X corresponds to the lag d at which the PACF computed for X drops immediately after the d-th lag. For 1 <= t <= T, Y(t) is the value at the moment of time t, and we denote by Ŷ(t) the predicted value of Y(t). The general NARX forecasting model is expressed by [17]:

Ŷ(t + p) = f(Y(t), ..., Y(t - d_Y), X(1)(t), ..., X(1)(t - d_X(1)), ..., X(N)(t), ..., X(N)(t - d_X(N)))   (1)

where f is a non-linear function. In our study we consider p = 1, equation (1) describing the one-step-ahead prediction model. In order to simplify (1), we define d = max{d_Y, max{d_X(1), ..., d_X(N)}} and obtain the prediction model

Ŷ(t + 1) = f(Y(t), ..., Y(t - d), X(1)(t), ..., X(1)(t - d), ..., X(N)(t), ..., X(N)(t - d))   (2)
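As an illustration of the one-step-ahead model (2), the lagged regressor matrix it requires can be sketched in a few lines of Python with NumPy. Synthetic series stand in for the actual data, and the function name is ours:

```python
import numpy as np

def narx_design_matrix(y, X, d):
    """Build the regressor matrix for the one-step-ahead NARX model.

    Each row t holds Y(t), ..., Y(t-d) and X(j)(t), ..., X(j)(t-d) for every
    exogenous series; the corresponding target is Y(t+1).
    """
    T = len(y)
    rows, targets = [], []
    for t in range(d, T - 1):
        lags = [y[t - k] for k in range(d + 1)]        # Y(t), ..., Y(t-d)
        for x in X:                                    # exogenous inputs
            lags += [x[t - k] for k in range(d + 1)]   # X(j)(t), ..., X(j)(t-d)
        rows.append(lags)
        targets.append(y[t + 1])
    return np.array(rows), np.array(targets)

# Synthetic example: one target series, two exogenous series, delay d = 3.
rng = np.random.default_rng(0)
y = rng.standard_normal(100)
X = [rng.standard_normal(100) for _ in range(2)]
A, b = narx_design_matrix(y, X, d=3)
print(A.shape, b.shape)   # each row holds (1 + 2) * (3 + 1) = 12 lagged values
```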
The non-linear function f can be computed by a neural network, the most common choice being the NARX networks. The NARX neural networks (NARXNN) are recurrent dynamic networks (RNN) specially tailored to model nonlinear systems, as for instance time series. Moreover, NARX networks have high generalization performance and good learning ability, which makes them well suited for financial data. The NARXNN topology consists of three layers connected with each other, namely an input layer L_i, a hidden layer L_h, and an output layer L_o. The general architecture of the NARXNN is displayed in Figure 1. The training of a NARXNN is of supervised type and usually uses a gradient descent approach. In most cases, the weights are determined by the Levenberg-Marquardt variant of the backpropagation learning algorithm [18]. Since the actual output is available during the training of the network, a series-parallel architecture is created, where the estimated target is replaced by the actual output. After the training step, the series-parallel architecture is converted into a parallel configuration in order to perform the prediction task. The standard performance function is defined in terms of the mean square of the network errors. Denoting by |.| the number of elements of the argument, the size of the hidden layer L_h can be computed in many ways, the most frequently used expressions being functions of |L_i| and |L_o|.
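The series-parallel/parallel distinction described above can be illustrated with a deliberately simplified stand-in for the network: a linear least-squares map instead of a trained NARXNN, fitted open loop on the actual outputs and then run closed loop on its own predictions:

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 200, 2
y = np.sin(0.1 * np.arange(T)) + 0.01 * rng.standard_normal(T)

# Series-parallel (open loop): the ACTUAL outputs feed the regressors.
A = np.column_stack([y[d - k:T - 1 - k] for k in range(d + 1)])  # y(t), ..., y(t-d)
b = y[d + 1:T]                                                   # target y(t+1)
w, *_ = np.linalg.lstsq(A, b, rcond=None)                        # fit the map f

# Parallel (closed loop): the model's OWN predictions are fed back.
history = list(y[:d + 1])
for _ in range(20):                      # 20-step free-running forecast
    lags = history[-1:-d - 2:-1]         # most recent first: y(t), ..., y(t-d)
    history.append(float(np.dot(w, lags)))
print(history[-5:])
```

The same two-phase scheme applies to the actual NARXNN, with the least-squares fit replaced by Levenberg-Marquardt training.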

The LSTM Neural Networks
Long Short-Term Memory (LSTM) is a type of RNN architecture that allows long-term learning of time-step dependencies. The general idea of the LSTM is to recurrently project the input sequence into a sequence of hidden representations. At each time step, the LSTM learns the hidden representation by jointly considering the associated input and the previous hidden representation, in order to capture the sequential dependency [19]. The learning process is accomplished using specific memory blocks located in the recurrent hidden layer. The memory blocks are built from self-connected cells in which the temporal states of the network are saved. Each block also has an input structure and an output structure. The LSTM has the ability to remove or add information to the cell state. A memory cell consists of four units: an input gate, a forget gate, an output gate, and a self-recurrent neuron. Each unit controls the interactions between neighbouring memory cells and the memory cell itself. The input gate decides whether the input modifies the state of the memory cell. The forget gate chooses whether to remember or to forget the previous state of the memory cell. The output gate decides whether the state of the memory cell should alter the state of other memory cells [20].

An LSTM neural network maps an input sequence x = (x_1, ..., x_T) to an output sequence h = (h_1, ..., h_T) iteratively for t = 1, ..., T. The topology of an LSTM network includes one or multiple hidden layers. The mathematics behind the most common LSTM architecture is described as follows [20]. We denote by ∘ the Hadamard product. First, one has to define the function used to remove a part of the cell-state information. The decision belongs to the so-called forget gate layer, usually modelled in terms of the sigmoid function σ:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)   (4)

where W_f is the weight matrix, b_f is the bias vector, and f_t is the value of the forget gate at time t. Next, the network decides what new information should be preserved in the cell state. The process consists of two steps: the selection of the values that should be updated and the computation of a new vector of candidate values. The updated values are usually computed by an input gate layer of sigmoid type (5), while the vector containing the new candidate values is computed by a tanh layer (6):

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)   (5)
c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)   (6)

where W_i and W_c are the weight matrices, b_i and b_c are the biases, and c_{t-1} is the value of the memory cell at time t - 1. Note that i_t is the value of the input gate, while c̃_t represents the candidate state of the memory cell at time t.

Based on the first two steps, an update of the state of the memory cell c_{t-1} is computed as follows:

c_t = f_t ∘ c_{t-1} + i_t ∘ c̃_t   (7)

Finally, the output, representing a filtered version of the cell state, is given by:

o_t = σ(W_o · [h_{t-1}, x_t] + b_o)   (8)
h_t = o_t ∘ tanh(c_t)   (9)

where W_o is the weight matrix, b_o is the bias vector, and o_t is the value of the output gate at time t. The inner structure of an LSTM cell, including the external dependencies, is shown in Figure 2. Note that an additional layer representing a memory block can be added to the network in order to prevent a continuous processing of data without a corresponding segmentation.

Fig. 2. Standard LSTM cell
Training difficulties occur when there are temporal dependencies at large distances between the time-series values. In this case, small changes in the iterative flow can have large effects on the results of future iterations. The number of hidden neurons is usually computed based on the number of training samples N_s, the input size N_i, and the output size N_o, as follows [21]:

N_h = N_s / (α (N_i + N_o))   (10)

where α is an arbitrary scaling factor, α ∈ [2, 10]. Note that smaller values of the parameter α could lead to non-overfitting learning schemes, at the cost of a significantly larger computational effort.
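The heuristic above can be wrapped in a short helper; the sample sizes below are illustrative, not the paper's actual configuration:

```python
def hidden_neurons(n_samples, n_inputs, n_outputs, alpha=2):
    """Heuristic for the hidden layer size: N_h = N_s / (alpha * (N_i + N_o))."""
    if not 2 <= alpha <= 10:
        raise ValueError("alpha should lie in [2, 10]")
    return int(n_samples / (alpha * (n_inputs + n_outputs)))

# E.g. 3600 training samples, 6 inputs, 1 output:
print(hidden_neurons(3600, 6, 1, alpha=2))    # -> 257
print(hidden_neurons(3600, 6, 1, alpha=10))   # -> 51
```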

The Proposed Methodology

4.1 Technical Analysis and Variable Selection
Technical analysis is used in studies of financial evolution, especially to determine various patterns of fluctuation, in order to give the investor the advantage of profitable purchasing and selling decisions. The premise behind technical analysis is that all of the internal and external factors that affect a market at any given point in time are already factored into that market's price [22]. At the same time, the ability to predict future price fluctuations is still a challenge. In order to establish a suitable methodology, we investigated several technical indicators to obtain more information about the monitored time series and, consequently, more accurate prediction values. From the point of view of the NARX model, the considered indicators are exogenous variables representing a subset of the neural network inputs. In the following we briefly describe the motivation and the computation formula for some of the most commonly used technical indicators involved in our study.

A moving average (MA) is a time series constructed by taking averages of several sequential values of another time series [23]. A moving average is an indicator that shows the average value of a time series over a specified period of time. The n-period MA for the time series V is computed by averaging the past n recorded values, as follows:

MA_n(t) = (1/n) Σ_{i=0}^{n-1} V(t - i)   (11)

Note that all the previous values used to compute the MA indicator (11) are weighted by the same value w = 1/n. In case n is large, one can instead consider weights such that higher values are associated with more recent data and lower values with older data. The version of MA used in our study is inspired by the exponential ranking probability distribution: it is a weighted average (12) whose weights (13) decrease gradually from the most recent value towards the oldest one. The Upper Bollinger Frequency Band (UPBB) is obtained by adding twice the n-period standard deviation σ_n to the moving average,

UPBB(t) = MA_n(t) + 2 σ_n(t)   (14)

while the Lower Bollinger Frequency Band (LOBB) is given by

LOBB(t) = MA_n(t) - 2 σ_n(t)   (15)
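The plain n-period MA and the two Bollinger bands can be computed directly from the price series; a NumPy sketch follows (the exponentially weighted MA variant used in the study would only change the averaging weights, and the synthetic prices are illustrative):

```python
import numpy as np

def ma_bollinger(v, n=20):
    """n-period moving average with upper and lower Bollinger bands (MA +/- 2*sigma_n)."""
    v = np.asarray(v, dtype=float)
    ma = np.full(v.shape, np.nan)   # undefined until a full window is available
    up = np.full(v.shape, np.nan)
    lo = np.full(v.shape, np.nan)
    for t in range(n - 1, len(v)):
        window = v[t - n + 1:t + 1]
        m, s = window.mean(), window.std()   # sigma_n over the same window
        ma[t], up[t], lo[t] = m, m + 2 * s, m - 2 * s
    return ma, up, lo

prices = 1.10 + 0.01 * np.sin(0.2 * np.arange(60))   # synthetic price series
ma, up, lo = ma_bollinger(prices, n=20)
print(ma[-1], up[-1], lo[-1])   # the bands bracket the moving average
```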

4.2 Data Pre-processing
The methodology is developed to analyse historical data representing the exchange rate between the EURO and the US Dollar, recorded between October 1999 and October 2019 [26]. The targeted series is the EURO/USD Closing Price, which represents the last price recorded on every bank day. To forecast the targeted values, we use the following exogenous variables:
- EURO/USD LOW, representing the minimum value recorded on every bank day;
- EURO/USD HIGH, representing the maximum value recorded on every bank day;
- three technical indicators computed from the targeted time series, namely MA, UPBB and LOBB.

In order to estimate a suitable window for the delay values, we computed the PACF of each time series. Let d be such that, for all considered variables, the PACF drops immediately after the d-th lag; the delay should then be set in a certain neighbourhood of d. Next, all recorded time-series values are normalized, so that each variable has zero mean and unit variance. Finally, each exogenous input is filtered using the standard Gaussian low-pass filter. We evaluated the forecasting capacity of the NARX-based model by splitting the available data into training data (70%) and test data (30%), the latter considered as new, as yet unseen samples.
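The pre-processing chain, that is normalization to zero mean and unit variance, Gaussian low-pass filtering, and the 70/30 split, can be sketched as follows. NumPy only; the PACF-based delay selection is done separately, a random walk stands in for the EURO/USD series, and the filter is applied here to a single series for illustration (in the paper only the exogenous inputs are filtered):

```python
import numpy as np

def gaussian_lowpass(x, sigma=2.0, radius=None):
    """Gaussian low-pass filter via convolution with a normalized kernel."""
    radius = radius or int(3 * sigma)
    k = np.exp(-0.5 * (np.arange(-radius, radius + 1) / sigma) ** 2)
    k /= k.sum()
    return np.convolve(x, k, mode='same')

def preprocess(series, train_frac=0.7):
    """Zero-mean/unit-variance normalization, Gaussian smoothing, 70/30 split."""
    z = (series - series.mean()) / series.std()   # zero mean, unit variance
    z = gaussian_lowpass(z)
    split = int(train_frac * len(z))
    return z[:split], z[split:]                   # train, unseen test

rng = np.random.default_rng(0)
close = np.cumsum(0.001 * rng.standard_normal(1000)) + 1.1   # stand-in series
train, test = preprocess(close)
print(len(train), len(test))
```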

4.3 Performance Measures
To establish meaningful conclusions regarding the research work, various performance metrics have been selected to evaluate the prediction capabilities of models.We used two classes of measures to assess the forecasting capacity of the developed methods, namely quantitative indicators and qualitative (trend-based) metrics.
In the following we denote by {y(1), ..., y(n)} the set of actual values of Y and by {ŷ(1), ..., ŷ(n)} the collection of forecasted values. The mean absolute percentage error (MAPE) measures the magnitude of the forecasting error in percentage terms:

MAPE = (100/n) Σ_{t=1}^{n} |y(t) - ŷ(t)| / |y(t)|   (16)

Since the predicted values deviate both positively and negatively with respect to the actual values, the use of an absolute-value-based measure such as MAPE becomes necessary in the analysis process [27].
The root mean square error (RMSE) evaluates the standard deviation of the differences between forecasted values and actual ones. The advantage of using RMSE (17) is that it penalizes big errors relatively more than small errors [28]:

RMSE = sqrt( (1/n) Σ_{t=1}^{n} (y(t) - ŷ(t))² )   (17)

The prediction of change in direction (POCID) is a metric used to measure the percentage of correct decisions related to the changes in direction or trend:

POCID = (100/(n-1)) Σ_{t=2}^{n} D_t   (18)

where

D_t = 1 if (y(t) - y(t-1)) · (ŷ(t) - ŷ(t-1)) > 0, and D_t = 0 otherwise.   (19)

Obviously, good forecasting performances correspond to small values of the RMSE and MAPE measures and high POCID percentages.
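The three measures can be sketched as small helper functions with NumPy; the tiny series below is illustrative, not the paper's data:

```python
import numpy as np

def rmse(y, yhat):
    """Root mean square error of the forecast."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def mape(y, yhat):
    """Mean absolute percentage error of the forecast."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return float(100.0 * np.mean(np.abs(y - yhat) / np.abs(y)))

def pocid(y, yhat):
    """Percentage of correctly predicted changes in direction."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    hits = (np.diff(y) * np.diff(yhat)) > 0   # same sign of consecutive changes
    return float(100.0 * hits.mean())

y    = [1.10, 1.12, 1.11, 1.13, 1.15]
yhat = [1.11, 1.13, 1.10, 1.14, 1.14]
print(rmse(y, yhat), mape(y, yhat), pocid(y, yhat))
```

On this toy series, three of the four direction changes are predicted correctly, so POCID is 75%.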

Experimental Results
In order to develop a comparative analysis of the use of the NARX neural network and the LSTM neural network to implement the NARX forecasting model, a long series of experiments has been conducted. The results are presented in the following. First, we had to establish the delay parameter in the prediction model (2). To solve this problem, we computed the Partial Autocorrelation Function (PACF) corresponding to each time series. We implemented the NARX forecasting model (2) using either the standard NARX neural network described in Section 2 or the LSTM neural network presented in Section 3.
The results are shown in Table 3 and Table 4.
The graphic representation of forecasted values versus the true ones corresponding to the data in Table 2, Table 3 and Table 4 is depicted in Figure 3, Figure 4 and Figure 5, respectively.
The POCID values vary between 55% in the case of the NARX NN and 60% when the LSTM NN is considered.
Based on the obtained results, we conclude that the LSTM neural network is better suited than the standard NARX neural network to forecast future values of the analysed data, from both the accuracy and the stability points of view.

Conclusions and Suggestions for Further Work
The reported work focuses on the development of financial data forecasting methods implementing the one-step-ahead nonlinear model with exogenous inputs. The main aim of our study is to obtain a prediction methodology to forecast the exchange rate between the EURO and the US Dollar. The prediction task is carried out by two recurrent neural networks, the standard NARX neural network and an LSTM-based approach. We analysed five exogenous time series, two of them of stock-based type and the other three belonging to the class of technical indicators. In order to obtain accurate results, the exogenous inputs are filtered using the well-known Gaussian low-pass filter.
To establish meaningful conclusions regarding the accuracy of the obtained results, various performance metrics have been selected. Extensive experiments pointed out that the most suitable forecasting method is based on the proposed LSTM neural network for the NARX model. We conclude that the results are encouraging and entail future work toward extending this approach to more complex NN-based models as well as hybrid techniques.

Fig. 1. The architecture of the three-layered NARX network model

Table 1.

Table 2.

Table 3. The forecasting results using NARX NN for NARX model

Table 4.