A COMPARISON OF LEARNING ALGORITHMS FOR SEASONAL TIME SERIES FORECASTING USING NARX MODEL

In this study, we propose a nonlinear autoregressive network with exogenous inputs (NARX) model with two deterministic seasonal dummy approaches, namely binary dummy variables and sine-cosine pairs. The significant lags are selected using a stepwise AIC method, with the deterministic seasonal dummies included, while the number of neurons in the hidden layer is determined by trial and error over one to five neurons. The NARX model is trained using five types of algorithms with a hyperbolic tangent activation function. Each algorithm is compared across all approaches in terms of convergence speed and forecasting accuracy. In addition, the time series data are modeled both without and with the first differencing process. The results of the case study show that the best approach for the NARX model is to use binary dummy variables and data with the first differencing process. The GRPROP algorithm gives the least computation time, the fewest training steps, and the best MAPE value, so overall it is the best training algorithm in this case. However, the GRPROP algorithm is not stable under variations of its parameters, whereas the RPROP algorithm shows better speed and stability of convergence under parameter variations than backpropagation and GRPROP. The backpropagation algorithm occasionally outperforms GRPROP under its parameter variations.


INTRODUCTION
Forecasting time series data is mostly done using statistical methods. However, these statistical methods are limited to linear models, whereas in many cases time series data have a nonlinear tendency. As an alternative, neural network (NN) models can be used on data where the assumptions of the linear model are not met [1,2]. The NN model is a flexible method for modeling both linear and nonlinear relationships. A popular NN model for modeling and forecasting time series data is the nonlinear autoregressive neural network (NARNN), better known as the feed-forward neural network (FFNN) [3].
In this study, we propose a NARNN model with external inputs, that is, a nonlinear autoregressive network with exogenous inputs (NARX) model. The external input of the NARX model is constructed using two deterministic seasonal dummy approaches, namely binary dummy variables and sine-cosine pairs. The effects of these two approaches on forecasting accuracy are compared. In addition, the time series data were given two treatments, namely raw data (without the first differencing process) and data with the first differencing process. In this case, the NARX model focuses on monthly time series data that have trend and seasonal patterns. The time series data are then linearly scaled between -0.8 and 0.8 to facilitate learning of the NARX model. The main data lags were selected using the same method as [4,5]. Next, the significant lags were selected using the stepwise AIC method. A similar approach was applied by [6] to the NARNN model.
The architecture of the NARX model consists of 1 input layer, 1 hidden layer with a hyperbolic tangent activation function, and 1 output layer with a linear activation function. The initial weights and biases are assigned randomly. The constant parameter values used refer to [7,8,9]; see section 3 for more detail. The maximum number of training steps (epochs, called stepmax in [7]) is 1000000. The target value of the error function is a threshold of 0.01; the iteration is stopped when the value of the error function falls below this threshold. The number of neurons in the hidden layer is determined by trial and error. In this study, we followed [10], who suggested the use of one to five neurons for time series data. Furthermore, the NARX model is trained using five types of algorithms, namely backpropagation, resilient backpropagation with weight backtracking (RPROP with WB), resilient backpropagation without weight backtracking (RPROP without WB), the modified globally convergent algorithm with the smallest absolute gradient (GRPROP with SAG), and the modified globally convergent algorithm with the smallest learning rate (GRPROP with SLR). Each algorithm is compared in computational terms, that is, convergence speed and forecasting accuracy. Forecasting accuracy is measured with the mean absolute percentage error (MAPE). A method has excellent performance if the MAPE value is below 10% and good performance if the MAPE value is between 10% and 20% [2]. This paper is structured as follows. In this section, we have explained the background as well as the contribution of our paper. In section 2, we provide a brief description of the theory required for a complete description of the paper. In section 3, we provide an empirical study with two real data sets, namely monthly data on the number of international airline passengers and monthly data on the number of accidental deaths in the United States of America (USA). The conclusions are given in section 4.
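To make the preprocessing and evaluation steps concrete, the following is a minimal Python sketch (our illustration, not the authors' code) of the linear scaling to [-0.8, 0.8] and the MAPE computation; the function names and example values are our own assumptions.

import numpy as np

def scale_linear(x, lo=-0.8, hi=0.8):
    """Linearly scale a series to [lo, hi]; also return the inverse transform."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    scaled = lo + (x - x_min) * (hi - lo) / (x_max - x_min)
    inverse = lambda z: x_min + (np.asarray(z) - lo) * (x_max - x_min) / (hi - lo)
    return scaled, inverse

def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return 100.0 * np.mean(np.abs((actual - forecast) / actual))

# Illustrative values only: MAPE below 10% counts as excellent, 10-20% as good [2].
y_true = np.array([112.0, 118.0, 132.0, 129.0])
y_hat = np.array([110.0, 120.0, 130.0, 131.0])
print(mape(y_true, y_hat))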

PRELIMINARIES
2.1. NARX Model. The nonlinear autoregressive network with exogenous inputs (NARX) has been reported to be very useful for discrete-time nonlinear systems and is defined by the following mathematical relationship [11]:

y(t) = f(y(t-1), ..., y(t-n_y), u(t-1), ..., u(t-n_u); w),

where u(t) and y(t) denote the input and output of the model at time t, respectively, n_u ≥ 1 and n_y ≥ 1 (n_y ≥ n_u) represent the input and output memory orders, w is the weight matrix, and f is the nonlinear function to be estimated using a multilayer perceptron (MLP) [12].
Basically, the NARX network is trained under one of two models [13,14]. The first is the series-parallel architecture (also called the parallel architecture without feedback), in which the regressors of the output are formed using only the actual output values:

ŷ(t+1) = f(y(t), ..., y(t-n_y+1), u(t), ..., u(t-n_u+1)).

The second is the parallel architecture (with feedback), in which the estimated output is fed back and becomes part of the standard feed-forward input:

ŷ(t+1) = f(ŷ(t), ..., ŷ(t-n_y+1), u(t), ..., u(t-n_u+1)).

NARX networks trained according to these two formulations are referred to as NARX-SP and NARX-P networks, respectively. This research focuses on NARX models with the series-parallel architecture, that is, without feedback (NARX-SP).
This research also applied two conventional approaches for the exogenous input variables of the NARX model. The first approach for modeling deterministic seasonal patterns is to use S − 1 binary dummy variables for each time period t, where S is the length of the seasonal period. The second approach is to use a pair of sine and cosine dummy variables, which has been shown to capture deterministic seasonal components of a time series well. Two inputs x_{s,1} and x_{s,2} encode seasonality using explanatory variables created with sine and cosine functions, giving an explicit representation of the position in time within an identified seasonality of length s, see the discussion in [15,16], with

x_{s,1} = sin(2πt/s),   x_{s,2} = cos(2πt/s).
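As an illustration (not the authors' code), the following Python sketch builds both seasonal encodings for monthly data (s = 12); the function names are our own.

import numpy as np

def seasonal_binary_dummies(t, s=12):
    """S-1 binary dummy variables for the season index of each time point t."""
    t = np.asarray(t)
    season = t % s                       # 0, 1, ..., s-1
    dummies = np.zeros((len(t), s - 1))
    for j in range(s - 1):
        dummies[:, j] = (season == j).astype(float)
    return dummies                       # the last season acts as the reference level

def seasonal_sine_cosine(t, s=12):
    """Sine-cosine pair encoding the position within a seasonal cycle of length s."""
    t = np.asarray(t, dtype=float)
    x1 = np.sin(2.0 * np.pi * t / s)
    x2 = np.cos(2.0 * np.pi * t / s)
    return np.column_stack([x1, x2])

t = np.arange(24)                        # two years of monthly time indices
X_bin = seasonal_binary_dummies(t)       # shape (24, 11)
X_sc = seasonal_sine_cosine(t)           # shape (24, 2)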

Backpropagation Learning. This is the algorithm most commonly applied to supervised learning with multi-layered feed-forward networks. It is fundamentally built on the repeated application of the chain rule to compute the influence of each weight in the network on an arbitrary error function E [17]:

∂E/∂w_{i,j} = (∂E/∂s_i)(∂s_i/∂net_i)(∂net_i/∂w_{i,j}),

where w_{i,j} represents the weight from neuron i to neuron j, s_i denotes the output, and net_i is the weighted sum of the inputs of neuron i. After the partial derivative of each weight has been determined, a simple gradient descent is used to minimize the error function:

w_{i,j}(t+1) = w_{i,j}(t) − ε ∂E/∂w_{i,j}(t).
The learning rate ε, which scales the derivative, significantly affects the time required to achieve convergence. If it is set too small, many steps are needed to reach a satisfactory solution, while a value that is too large leads to oscillation, which can prevent the error from falling below a defined value.
An early method proposed to solve this problem is the introduction of a momentum term:

Δw_{i,j}(t+1) = −ε ∂E/∂w_{i,j}(t) + μ Δw_{i,j}(t),

where the momentum parameter μ scales the effect of the previous step on the current step. The momentum term has been found to make the learning procedure more stable and to accelerate convergence in shallow regions of the error function.
In practice, however, this has been found not to hold in every case, since the optimal value of the momentum parameter turns out to be just as problem-dependent as the learning rate, and no general improvement can be achieved [9].
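The following is a minimal Python sketch (our illustration, not the paper's implementation) of a plain gradient-descent weight update with a momentum term, as described above; the parameter values are arbitrary examples.

import numpy as np

def momentum_step(w, grad, prev_delta, eps=0.01, mu=0.9):
    """One gradient-descent update with momentum.

    w          current weight vector
    grad       gradient of the error function at w
    prev_delta previous weight change
    eps        learning rate
    mu         momentum parameter
    """
    delta = -eps * grad + mu * prev_delta
    return w + delta, delta

# Toy usage on a quadratic error E(w) = 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([1.0, -2.0])
delta = np.zeros_like(w)
for _ in range(100):
    w, delta = momentum_step(w, grad=w, prev_delta=delta)
print(w)   # approaches the minimum at the origin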

RPROP Learning. RPROP is an acronym for resilient propagation, an efficient learning scheme that directly adapts the size of the weight step based on local gradient information. An individual update-value Δ_{i,j} was introduced by [9] so that each weight solely determines the size of its own weight-update. This adaptive update-value evolves during the learning process according to the local behaviour of the error function E, through the following learning rule:

Δ_{i,j}(t) = η⁺ Δ_{i,j}(t−1)   if ∂E/∂w_{i,j}(t−1) · ∂E/∂w_{i,j}(t) > 0,
Δ_{i,j}(t) = η⁻ Δ_{i,j}(t−1)   if ∂E/∂w_{i,j}(t−1) · ∂E/∂w_{i,j}(t) < 0,
Δ_{i,j}(t) = Δ_{i,j}(t−1)      otherwise,

where 0 < η⁻ < 1 < η⁺. The adaptation rule works as follows: every time the partial derivative of the corresponding weight w_{i,j} changes its sign, indicating that the last update was too large and the algorithm has jumped over a local minimum, the update-value Δ_{i,j} is decreased by the factor η⁻. If the sign is retained, the update-value is slightly increased in order to accelerate convergence in shallow regions.
After the update-values have been adapted, the weight-update Δw_{i,j} follows a straightforward rule: if the derivative is positive (increasing error), the weight is decreased by its update-value; if the derivative is negative, the update-value is added:

Δw_{i,j}(t) = −Δ_{i,j}(t)   if ∂E/∂w_{i,j}(t) > 0,
Δw_{i,j}(t) = +Δ_{i,j}(t)   if ∂E/∂w_{i,j}(t) < 0,
Δw_{i,j}(t) = 0             otherwise,

with w_{i,j}(t+1) = w_{i,j}(t) + Δw_{i,j}(t). There is, however, one exception: if the partial derivative changes sign, meaning that the previous step was too large and the minimum was missed, the previous weight-update is reverted,

Δw_{i,j}(t) = −Δw_{i,j}(t−1)   if ∂E/∂w_{i,j}(t−1) · ∂E/∂w_{i,j}(t) < 0.

Because of this backtracking weight-step, the derivative is expected to change its sign once again in the following step; to avoid a double punishment of the update-value, the adaptation is not applied in this case, which is achieved in practice by setting ∂E/∂w_{i,j}(t−1) := 0 in the adaptation rule.

For the convergence analysis, the error function, here denoted f, is assumed to be differentiable and its gradient g = ∇f to be Lipschitz continuous on R^n; that is, for any two points x and y ∈ R^n,

||g(x) − g(y)|| ≤ L ||x − y||,

where L > 0 denotes the Lipschitz constant. Starting from a point x_0, the following general iterative scheme is considered:

x_{k+1} = x_k + τ_k d_k,   k = 0, 1, 2, ...,

where d_k is the search direction and τ_k > 0 is a step length. The search direction d_k is required to satisfy the condition g(x_k)^T d_k < 0, which ensures that d_k is a descent direction of f(x) at x_k. The step length in this scheme is defined using rules such as Wolfe's conditions,

f(x_k + τ_k d_k) − f(x_k) ≤ σ_1 τ_k g(x_k)^T d_k,
g(x_k + τ_k d_k)^T d_k ≥ σ_2 g(x_k)^T d_k,

where g is the gradient of f and 0 < σ_1 < σ_2 < 1, and Wolfe's theorem is applied to produce the convergence results [18,19].
Provided that these assumptions are fulfilled, for any w_0 ∈ R^n the sequence {w_k}, k = 0, 1, 2, ..., produced by the RPROP scheme

w_{k+1} = w_k − τ_k diag(η_1^k, ..., η_n^k) sign(g(w_k))

is globally convergent, where sign(g(w_k)) denotes the column vector of the signs of the components of g(w_k) = (g_1(w_k), g_2(w_k), ..., g_n(w_k)), τ_k > 0 satisfies Wolfe's conditions, and η_m^k (m = 1, 2, ..., i − 1, i + 1, ..., n) are small positive real numbers generated by RPROP's learning-rate schedule, while the learning rate η_i^k of one chosen coordinate i is modified so that the weight-update remains a descent direction. The modified RPROP obtained in this way is known as GRPROP. The modification can be applied to the local learning rates randomly or in a cyclic pattern; in other situations, selecting the ith coordinate with the smallest nonzero absolute value of g_i(w_k) may be better, since it produces a larger value of η_i^k. Another applicable method, used in all the experiments in [8], is to replace, at each step, the smallest learning rate produced by the RPROP schedule with the modified value η_i^k.
The GRPROP algorithm used here comes in two globally convergent variants, one based on the smallest absolute gradient (SAG) and one based on the smallest learning rate (SLR). Both are built on resilient backpropagation without weight backtracking and modify one learning rate per step, namely the one associated with the smallest absolute gradient or with the smallest learning rate, respectively. Moreover, the learning rates in this algorithm are restricted to boundaries defined by a vector or a list containing the minimum and maximum limits.
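To make the RPROP rules above concrete, the following is a minimal Python sketch (our illustration, not the implementation used in the paper) of one RPROP-with-weight-backtracking update for a vector of weights. The factors η⁺ = 1.2 and η⁻ = 0.5 match the constant values cited in section 3; the update-value bounds delta_min and delta_max are illustrative defaults.

import numpy as np

def rprop_step(w, grad, prev_grad, delta, prev_dw,
               eta_plus=1.2, eta_minus=0.5, delta_min=1e-6, delta_max=50.0):
    """One RPROP update with weight backtracking.

    w, grad, prev_grad, delta, prev_dw are NumPy arrays of the same shape.
    Returns the updated weights and the state needed for the next step.
    """
    sign_change = grad * prev_grad
    dw = np.zeros_like(w)

    # Same sign: enlarge the update-value, then step against the gradient sign.
    pos = sign_change > 0
    delta[pos] = np.minimum(delta[pos] * eta_plus, delta_max)
    dw[pos] = -np.sign(grad[pos]) * delta[pos]

    # Sign change: shrink the update-value and revert the previous weight change.
    neg = sign_change < 0
    delta[neg] = np.maximum(delta[neg] * eta_minus, delta_min)
    dw[neg] = -prev_dw[neg]
    grad = grad.copy()
    grad[neg] = 0.0          # avoid a double punishment in the next step

    # No sign information (or zero gradient): plain step with the current update-value.
    zero = sign_change == 0
    dw[zero] = -np.sign(grad[zero]) * delta[zero]

    return w + dw, grad, delta, dw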

MAIN RESULTS
This research was conducted on two real cases. The first is the monthly number of international airline passengers. As can be seen in Figure 1, the data have an upward trend and a seasonal pattern with a multiplicative nature.
In this study, the autoregressive lags were determined using the following approach: if the length of the seasonal period is m, then the m consecutive lags starting from lag 1 are used. For example, lags 1:4 are used for quarterly data and lags 1:12 for monthly data.
According to [4,5], seasonal patterns can be captured more easily through this approach. Next, the significant lags were selected using the stepwise AIC method; a similar approach was applied to the NARNN model in a previous study [6]. In this study, the NARNN model is given external inputs, so it is better known as the NARX model. The external input of the NARX model is constructed using a deterministic seasonal dummy, namely a binary dummy variable or a sine-cosine pair. The effects of these two approaches on the forecasting accuracy of the NARX model are compared for data with and without the first differencing process. An illustrative sketch of the lag construction and stepwise AIC selection is given below.
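The following Python sketch is our illustration only: the function names, the placeholder series, and the use of an OLS fit from statsmodels as a linear stand-in for the stepwise AIC selection are our assumptions, not the authors' code. It builds the lagged input matrix for monthly data and performs a simple backward stepwise selection by AIC.

import numpy as np
import pandas as pd
import statsmodels.api as sm

def lag_matrix(y, lags):
    """Return a DataFrame of lagged values of y for the given lags."""
    return pd.DataFrame({f"lag{k}": pd.Series(y).shift(k) for k in lags})

def backward_stepwise_aic(y, X):
    """Drop one regressor at a time while doing so lowers the AIC of an OLS fit."""
    cols = list(X.columns)
    best_aic = sm.OLS(y, sm.add_constant(X[cols])).fit().aic
    improved = True
    while improved and len(cols) > 1:
        improved = False
        for c in cols:
            trial = [k for k in cols if k != c]
            aic = sm.OLS(y, sm.add_constant(X[trial])).fit().aic
            if aic < best_aic:
                best_aic, cols, improved = aic, trial, True
                break
    return cols

# Monthly data: use lags 1:12 as candidate autoregressive inputs.
y = pd.Series(np.random.randn(144))          # placeholder series
X = lag_matrix(y, range(1, 13)).dropna()
y_target = y.loc[X.index]
selected = backward_stepwise_aic(y_target, X)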
The architecture of the NARX model used consists of 1 input layer, 1 hidden layer with a hyperbolic tangent activation function, and 1 output layer with a linear activation function. The initial weights and biases of the network are assigned randomly. The constant parameter values used refer to [7,8,9], namely the increase factor η⁺ = 1.2 and the decrease factor η⁻ = 0.5. In this study, the NARX model focuses on monthly data that have trend and seasonal patterns.
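As an illustration of the architecture just described (not the authors' implementation), the following Python sketch computes the forward pass of a network with one hidden layer using the hyperbolic tangent activation and a linear output; all names and dimensions are our own assumptions.

import numpy as np

def narx_forward(x, W1, b1, W2, b2):
    """Forward pass: one tanh hidden layer followed by a linear output layer.

    x  : input vector (selected lags plus seasonal dummy variables)
    W1 : hidden-layer weights, shape (n_hidden, n_inputs)
    b1 : hidden-layer biases, shape (n_hidden,)
    W2 : output-layer weights, shape (1, n_hidden)
    b2 : output-layer bias, shape (1,)
    """
    h = np.tanh(W1 @ x + b1)      # hidden layer with hyperbolic tangent activation
    return W2 @ h + b2            # linear output layer

# Random initialization of weights and biases, as in the setup described above.
rng = np.random.default_rng(0)
n_inputs, n_hidden = 14, 3        # e.g. selected lags plus dummies, 3 hidden neurons
W1, b1 = rng.normal(size=(n_hidden, n_inputs)), rng.normal(size=n_hidden)
W2, b2 = rng.normal(size=(1, n_hidden)), rng.normal(size=1)
y_hat = narx_forward(rng.normal(size=n_inputs), W1, b1, W2, b2)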
The data were then scaled linearly between -0.8 and 0.8 to facilitate the training of the NARX model. In the case of the first data set, four approaches are taken. In the first approach, the NARX model with binary dummy variables as external input is applied to the data without the first differencing process, while the second approach uses the first differencing process. In the third approach, the NARX model with sine-cosine pairs as external input is applied to the data without the first differencing process, while the fourth approach uses the first differencing process. In the first and second approaches, the lags selected by the stepwise AIC method are 1, 4, 5, 8:12 and 1:8, 10, 12, respectively, together with the binary dummy variables. Meanwhile, in the third and fourth approaches, the lags are 1:12 and the dummy variables are the sine-cosine pair. A small sketch of the differencing transform used in the second and fourth approaches is given below.
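As a reminder of how the first differencing process and its inversion work, here is a minimal Python sketch (our illustration, not the authors' code); forecasts made on the differenced scale are transformed back by cumulatively adding the last observed level.

import numpy as np

def difference(y):
    """First differencing: d(t) = y(t) - y(t-1)."""
    y = np.asarray(y, dtype=float)
    return y[1:] - y[:-1]

def undifference(d_forecast, last_level):
    """Invert first differencing for a block of forecasts on the differenced scale."""
    return last_level + np.cumsum(np.asarray(d_forecast, dtype=float))

y = np.array([112.0, 118.0, 132.0, 129.0, 121.0])
d = difference(y)                    # [6, 14, -3, -8]
y_back = undifference(d, y[0])       # recovers y[1:]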
The results of the comparison of the NARX model with binary dummy variables for each algorithm can be seen in Tables 1 and 2, while those of the NARX model with sine-cosine dummy variables can be seen in Tables 3 and 4. The comparison was carried out using data without the first differencing process (Tables 1 and 3) and with the first differencing process (Tables 2 and 4). For each algorithm, the parameter setting that gives the best results is used for the comparison. Overall, the MAPE values on the testing data show that the NARX model with binary dummy variables is more accurate. For the data with the first differencing process, the GRPROP with SAG algorithm provides the least computation time (0.1093 secs) and the fewest training steps (303 steps). The convergence speed of the RPROP with WB algorithm ranks second with 0.1353 secs and 490 steps. Overall, the best results were obtained using the GRPROP algorithm.
However, the RPROP algorithm is much more robust to variations of its parameters than backpropagation and GRPROP. The backpropagation algorithm occasionally outperforms GRPROP with SAG with respect to robustness to parameter variations (indicated by whether the threshold is reached). In this case, the GRPROP with SAG algorithm often does not reach convergence for the stepmax or threshold used.
Furthermore, the forecasting accuracy of each algorithm is compared using the MAPE value on the testing data. In this case, the GRPROP with SLR algorithm consistently gives the best MAPE value on the data both without and with the first differencing process. The best MAPE values obtained are 0.0210 and 0.0101, respectively, while the GRPROP with SAG algorithm ranks second by a slight margin. On the other hand, the number of neurons in the hidden layer required to achieve convergence differs between algorithms.
The results generally show that the greater the number of neurons in the hidden layer, the smaller the MAPE value obtained on the testing data. This indicates that the number of neurons in the hidden layer affects forecasting accuracy. Overall, the GRPROP algorithm is more accurate than the backpropagation and RPROP algorithms. However, the RPROP algorithm outperforms backpropagation in terms of convergence speed and forecasting accuracy.

Accidental Deaths in the USA. Monthly data on the number of accidental deaths in the United States of America (USA) are shown in Figure 2. Monthly data from January 1973 to December 1978 were used as training data, and data from January 1979 to June 1979 were used as test data. This data set is quite popular because of its complex pattern and has been discussed in several studies. Brockwell and Davis [24] used these data as an example of the application of the ARIMA model and exponential smoothing.
Based on the results of the first case, the NARX model in the second case is only applied to data with the first differencing process and with binary dummy variables as external input. In this case, the lags selected by the stepwise AIC method are 1, 2, 8, 11, and 12, together with the binary dummy variables. The results of the comparison of each algorithm can be seen in Table 5.
In Table 5, the GRPROP with SAG algorithm provides the least computation time (0.1561 secs), the fewest training steps (311 steps), and the best MAPE value (0.26%). Overall, the best results with respect to convergence speed and forecasting accuracy were obtained using the GRPROP with SAG algorithm.

CONCLUSION
In this study, the best approach for the NARX model was obtained using data with the first differencing process and binary dummy variables as external input. The shortest computation time and the fewest training steps were obtained using the GRPROP with SAG algorithm.
The best forecasting accuracy was obtained with different GRPROP variants: GRPROP with SLR on the monthly data on the number of international airline passengers and GRPROP with SAG on the monthly data on the number of accidental deaths in the USA. Overall, the GRPROP algorithm (either with SAG or SLR) is the best training algorithm in this study. However, the GRPROP with SAG algorithm often fails to reach convergence under parameter variations, while GRPROP with SLR requires more computation time and training steps. On the other hand, the RPROP algorithm shows better speed and stability of convergence under parameter variations than GRPROP. The RPROP algorithm (both with and without WB) is also easy to implement; in addition, its training computation time is very small and the number of training steps is significantly reduced.
On several data patterns, the RPROP algorithm was found to be very good with respect to convergence and robustness. Meanwhile, the backpropagation algorithm occasionally outperforms GRPROP with SAG under its parameter variations.