Prediction by a Hybrid of Wavelet Transform and Long-Short-Term-Memory Neural Network

Data originating from some specific fields, for instance tourist arrivals, may exhibit a high degree of fluctuation as well as non-linear characteristics due to time-varying behaviors. This paper proposes a new hybrid method to perform prediction for such data. The proposed hybrid model of wavelet transform and long-short-term memory (LSTM) recurrent neural network (RNN) is able to capture non-linear attributes in tourist arrival time series. Firstly, the data is decomposed into constitutive series through wavelet transform. The decomposition is expressed as a function of a combination of wavelet coefficients, which have different levels of resolution. Then, an LSTM neural network is used to train and simulate the value at each level to find the bias vectors and weighting coefficients for the prediction value. A sliding windows model is employed to capture the time series nature of the data. An evaluation is conducted to compare the proposed model with other RNN algorithms, i.e., Elman RNN and Jordan RNN, as well as the combination of wavelet transform with each of them. The result shows that the proposed model has better performance in terms of training time than the original LSTM RNN, while its accuracy is better than that of the hybrid of wavelet-Elman and the hybrid of wavelet-Jordan.
Keywords: Wavelet Transform; Long-Short-Term Memory; Recurrent Neural Network; Time Series Prediction


I. INTRODUCTION
The growth in the number of visitors and in tourism investments has made tourism a key factor in export earnings, job creation, business development and infrastructure. Tourism has become one of the largest and fastest-growing economic sectors in the world. Despite recurring global crises, the number of international tourist trips continues to show positive growth. As shown by data from BPS, the Indonesian Central Agency for Statistics, the number of tourists visiting Indonesia has increased from year to year. Travel and tourism directly contributes 2.1 trillion dollars to global GDP. This is more than double the contribution of the automotive industry and nearly 40 percent larger than that of the global chemical industry [1]. The travel and tourism sector is worth three quarters of the education sector, the banking sector, the mining sector, and the communications services sector. By knowing the number of visitors to a country, the income of the country from the tourism sector can be predicted.
Fluctuation in the number of tourists visiting Indonesia each year is not easy to predict. This has become a major problem for parties such as hotels, restaurants and travel agents, and it prevents them from devising sound plans for their business. The difficulty in determining the patterns in the data is due to the existence of noise.
To overcome this problem, a technique is needed to separate the low-frequency pattern from the high-frequency pattern through the processes of translation (shifting) and dilation (scaling) [2]. The wavelet transform can reveal aspects of frequencies in the frequency decomposition process [2], [3]. In [4], the use of wavelet sequence prediction models improves the effectiveness of a multi-layer perceptron (MLP) neural network. Merging a wavelet model with a Kalman filter produces a powerful estimation technique [5], as does merging a wavelet model with spectral analysis [6]. Evaluations of several wavelet-RNN models show that the combination of wavelet and RNN usually produces a smaller error value [7].
Recurrent neural networks (RNNs) have the capability to dynamically incorporate past experience due to internal recurrence [2]. RNNs can project the dynamic properties of the system automatically, so they are computationally more powerful than feed-forward networks, and valuable approximation results have been obtained for chaotic time series prediction [8], [9]. One RNN model is long-short-term memory (LSTM). This model works when there is a long delay, and is able to handle signals that have a mixture of low- and high-frequency components. The learning process of RNN models, however, requires a relatively long time because there is a context layer in the network architecture [10].
LSTM is a successful RNN architecture that addresses the vanishing gradient problem in neural networks [11]. Sequence-based prediction of protein localization achieves high prediction accuracy with LSTM and a bidirectional model [12]. A comparison of the LSTM RNN model with random walk (RW), support-vector machine (SVM), single-layer feed-forward neural network (FFNN) and stacked autoencoder (SAE) models shows that the LSTM RNN model produces higher accuracy and generalizes well [13]. Prediction of time series data for securities in the Shanghai ETF180 obtains good accuracy: using LSTM RNN, the result improves by 4 percent compared to the previous model, while data normalization can also improve accuracy [14].
In this paper, a new hybrid algorithm for prediction, based on a combination of wavelet analysis and LSTM neural networks, is proposed. The proposed method is then applied to predicting tourist arrivals. The wavelet is employed to denoise the original signal and decompose the historical number of tourist arrivals into series with better patterns for prediction. An LSTM neural network is used as a non-linear pattern recognizer to estimate the training data signal and to compensate for the error of the wavelet-LSTM prediction. The proposed method is applied to tourist arrival data, which is a vector of 240 data records.
The rest of the paper is organized as follows. The principles of the proposed method are described in Section II. The simulation result and the comparison of the proposed model with Elman and Jordan recurrent neural networks are presented in Section III. Finally, we conclude the paper and present future work in Section IV.

II. METHODOLOGY
Improving the accuracy of prediction can be achieved by combining several different methods [15]. In this paper, the LSTM neural network model is used to identify data patterns, while the wavelet method is employed to decompose the input data. The prediction model using the hybrid of wavelet transform and LSTM neural network consists of the following phases:
• Phase 1: normalizing the data to values ranging between 0 and 1,
• Phase 2: decomposing the data into constitutive series through wavelet transform,
• Phase 3: applying sliding windows to the data to form several variables, and
• Phase 4: recognizing data patterns using the LSTM neural network model through data training and data testing.
The result of the proposed model is then compared with Elman RNN, Jordan RNN and LSTM RNN, as well as the hybrid of wavelet-Elman and the hybrid of wavelet-Jordan. The following subsections further elaborate the general design of the proposed model, the details of the wavelet transform as well as LSTM, data normalization and the evaluation metrics used.

A. Design of the proposed model
Fig. 1 depicts the flowchart of the proposed model. The model is divided into two processes, i.e., training and testing, each of which contains several steps. The first step in data training is to normalize the data with Min-Max normalization. The second step is data decomposition using the wavelet transform, whose purpose is to divide the data into high and low frequencies. The third step is segmentation of the time series input using sliding windows. The next step is training using the LSTM RNN. The training process uses 90 percent of the data. The result of the training process is a weight value for each neuron in the neural network and an error value, reported as MSE and RMSE. After the weight value for each neuron is obtained, the testing process starts and produces its own error value. To restore the decomposed values to the original values, reconstruction and denormalization are performed. Reconstruction recombines the high- and low-frequency components, while denormalization restores the data to its original range.
In neural networks, data training is performed to obtain the bias values and weights for predicting the approximation and detail coefficients [3]. The bias and weighting coefficients generated from the training data are used in the testing process. The process is conducted iteratively to generate the prediction coefficients. The learning does not use the input from the beginning, but is performed on the approximation coefficients at the lowest level (DWT 3). Once the simulation result has reached the desired target or the maximum epoch, the learning finishes. The learning process uses the back-propagation algorithm with the addition of a context layer [16] to accelerate convergence towards the desired minimum error value.

B. Data normalization
Prior to the decomposition process, the research data obtained from BPS must be normalized. The technique used applies a linear transformation to the original range of the data, and is called Min-Max normalization [17], [18]. The technique preserves the relationships among the original data. Min-Max normalization is a simple technique that fits the data into a pre-defined boundary. A normalized value $x'$ of a data point $x_i$ in a predefined boundary $[C, D]$ is defined by:

$$x' = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}} (D - C) + C$$

where $x_{\min}$ is the smallest value of the data, $x_{\max}$ is the biggest value of the data, and $[x_{\min}, x_{\max}]$ is the range of the original data. With $[C, D] = [0, 1]$, the normalization process produces values ranging between 0 and 1: the biggest value maps to 1, while the smallest value maps to 0.
To restore a normalized value to the original value, a denormalization process is conducted. This returns the output to the range it had beforehand. Given a normalized value $x'$, its denormalization, namely the original data point $x_i$, can be calculated by:

$$x_i = \frac{x' - C}{D - C} (x_{\max} - x_{\min}) + x_{\min}$$
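The normalization and its inverse can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and the sample values are hypothetical, and the target boundary defaults to [0, 1] as used in this paper.

```python
def min_max_normalize(values, c=0.0, d=1.0):
    """Map values linearly from [x_min, x_max] onto the boundary [c, d]."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        raise ValueError("cannot normalize a constant series")
    span = x_max - x_min
    normalized = [(x - x_min) / span * (d - c) + c for x in values]
    # The original range is returned so the mapping can be inverted later.
    return normalized, x_min, x_max


def min_max_denormalize(normalized, x_min, x_max, c=0.0, d=1.0):
    """Invert min_max_normalize using the stored original range."""
    return [(x - c) / (d - c) * (x_max - x_min) + x_min for x in normalized]


# Made-up values for illustration only (not BPS data).
arrivals = [400.0, 550.0, 700.0, 1000.0]
normalized, lo, hi = min_max_normalize(arrivals)
restored = min_max_denormalize(normalized, lo, hi)
```

Keeping the original minimum and maximum alongside the normalized series is what makes the later denormalization of predictions possible.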

C. Data decomposition with wavelet transform
A wave is a function that oscillates up and down in space or time on a periodic basis, while a wavelet is a restricted or localized wave [19]. A wavelet can also be regarded as a short wave. The model provides a depiction of the frequency content of the signal over time.
A wavelet transform (WT) is a time-frequency decomposition that provides a useful basis for time series in both time and frequency [8], [20], particularly when the time series, like the tourist arrival series, is non-stationary. The mother wavelet is the basic function used in the wavelet transform [2], as it produces all functions used in the transformation through translation and scaling. The mother wavelet determines the characteristics of the resulting wavelet transform. Therefore, the type of mother wavelet must be selected carefully in order to perform the transformation efficiently.
The wavelet transform can identify and analyze how the signal moves. The purpose of this analysis is to obtain time information and the frequency spectrum at the same time. The discrete wavelet transform (DWT) is one of the developments of the wavelet transform. The DWT works on two collections of functions, called scaling functions and wavelet functions, which are associated with a low-pass filter and a high-pass filter, respectively [20]. The decomposition structure of the wavelet transform for level 3 is shown in Fig. 2.

Fig. 2. Wavelet decomposition tree at level 3 (nodes S, A1, D1, A2, D2, A3, D3)

The type of wavelet used in this research is the Haar wavelet transform. It is a simple Daubechies wavelet, which is suitable for detecting time-localized information and increases the performance of the prediction technique [21]. The Haar wavelet has two functions, called the approximation function and the difference function. The approximation function produces a sequence of the averages of each two consecutive data points in the input, while the difference function produces the sequence of differences with respect to the current approximation sequence. Both functions are executed recursively, and the process stops when the difference sequence contains a single element [22]. The i-th approximation sequence ($A_i$) is given by:

$$A_i(j) = \frac{A_{i-1}(2j-1) + A_{i-1}(2j)}{2}, \quad j = 1, 2, \ldots, n/2$$

where $A_{i-1}(j)$ is the j-th element in the sequence $A_{i-1}$ for $j = 1, 2, \ldots, n$.
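The recursive averaging can be illustrated with a short Python sketch. It is a sketch under stated assumptions rather than the implementation used in this research: the function names are hypothetical, and the detail value is assumed to be the first element of each pair minus the pair average (that is, half the pairwise difference).

```python
def haar_step(seq):
    """One Haar level: pairwise averages and pairwise half-differences."""
    if len(seq) % 2 != 0:
        raise ValueError("sequence length must be even")
    approx = [(seq[j] + seq[j + 1]) / 2 for j in range(0, len(seq), 2)]
    # Assumed convention: detail = first element of the pair minus the average.
    detail = [(seq[j] - seq[j + 1]) / 2 for j in range(0, len(seq), 2)]
    return approx, detail


def haar_decompose(seq, levels):
    """Apply haar_step repeatedly to the latest approximation sequence."""
    approx, details = list(seq), []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details


# Made-up eight-point signal: three levels reduce it to a single average.
signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
a3, (d1, d2, d3) = haar_decompose(signal, levels=3)
```

Each level halves the length of the approximation sequence, so an eight-point signal yields sequences of length 4, 2 and 1.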
The decomposition process can proceed through one or more levels. A discrete wavelet series contains an approximation series ($A_t$) and a detail series ($D_t$). The signal can be divided into two parts, the high-frequency part and the low-frequency part. The high-frequency part is analyzed with the high-pass filter, while the low-frequency part is analyzed with the low-pass filter. Both filters are used to analyze the signal at different resolutions. The signal can be subsampled by 2, by discarding every second sample. The decomposition for each layer is represented by [8]:

$$y_{\mathrm{high}}[k] = \sum_{i} x(i)\, g(2k - i)$$

$$y_{\mathrm{low}}[k] = \sum_{i} x(i)\, h(2k - i)$$

where $y_{\mathrm{low}}$ and $y_{\mathrm{high}}$ are the outputs of the low-pass and high-pass filters, respectively, both subsampled by 2 [8]. The index $k$ refers to the time position in the decomposition, and the original signal $x(i)$ is passed through a high-pass filter $g(\cdot)$ and a low-pass filter $h(\cdot)$. The same functions can be reused in the next level of decomposition. The DWT coefficients consist of the outputs of the high-pass and low-pass filters.
The reconstruction applies the high-pass and low-pass filtering steps in reverse order. The signal at each layer is upsampled by two, passed through the synthesis filters $g'(\cdot)$ and $h'(\cdot)$, and the results are added to each other [8]. The reconstruction for each layer is given by:

$$x(i) = \sum_{k} \left( y_{\mathrm{high}}[k]\, g'(2k - i) + y_{\mathrm{low}}[k]\, h'(2k - i) \right)$$

The reconstruction process aims at returning the original values of the data. Reconstruction starts by combining the DWT coefficients from the end of the previous decomposition, upsampled by 2 (↑ 2), through a high-pass filter and a low-pass filter. The reconstruction process is exactly the opposite of the decomposition process at each level of decomposition [19].
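One level of filtering, subsampling and reconstruction can be sketched as follows. The sketch assumes the orthonormal Haar filter pair (both coefficients scaled by 1/√2), which differs from plain averaging only by a constant factor; the names `analyze` and `synthesize` are hypothetical and the input is made up.

```python
import math

S = 1 / math.sqrt(2)
H = [S, S]    # low-pass (scaling) filter, assumed orthonormal Haar
G = [S, -S]   # high-pass (wavelet) filter, assumed orthonormal Haar


def analyze(x):
    """Filter with H and G and keep one output per pair (subsample by 2)."""
    pairs = range(len(x) // 2)
    y_low = [H[0] * x[2 * k] + H[1] * x[2 * k + 1] for k in pairs]
    y_high = [G[0] * x[2 * k] + G[1] * x[2 * k + 1] for k in pairs]
    return y_low, y_high


def synthesize(y_low, y_high):
    """Upsample by 2 and add the two branches through the synthesis filters."""
    x = []
    for lo, hi in zip(y_low, y_high):
        x.append(H[0] * lo + G[0] * hi)  # even-indexed sample
        x.append(H[1] * lo + G[1] * hi)  # odd-indexed sample
    return x


x = [4.0, 6.0, 10.0, 12.0]
y_low, y_high = analyze(x)
x_rebuilt = synthesize(y_low, y_high)
```

With this filter pair the reconstruction returns the input up to floating-point rounding, which is the property the reconstruction step relies on.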

D. Sliding windows technique
The sliding windows technique is a method for handling concept drift in data streams and has many applications in intrusion detection. The essence of the technique is its data update mechanism. The data stream (x) is divided into several data blocks. When the sliding window moves to the next block, the new block is added to the window and the oldest block is deleted. Through this dynamic sample selection method, the sample used for modeling is updated [23].
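For a time series, the window update can be sketched as below. This is an illustrative assumption of how the input-output pairs may be formed, not the authors' code: `sliding_windows` is a hypothetical helper, the series is a placeholder, and the sizes (120 records, window of 3) follow the figures quoted in this section.

```python
def sliding_windows(series, window=3):
    """Pair each run of `window` consecutive values with the value that follows."""
    pairs = []
    for i in range(window, len(series)):
        inputs = series[i - window:i]   # the `window` values before position i
        target = series[i]              # the value to be predicted
        pairs.append((inputs, target))
    return pairs


# Placeholder series of 120 records (integers stand in for real data).
series = list(range(120))
pairs = sliding_windows(series, window=3)
first_inputs, first_target = pairs[0]
```

Each step drops the oldest value and admits the next one, so a series of 120 records with a window of 3 yields 117 input-output pairs.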
The sliding windows technique comes with a particular window size, which affects the size of the sample data. Suppose that $x_0, x_1, x_2, \ldots, x_{n-1}, x_n, x_{n+1}, \ldots$ is a time series. When the window size is fixed at $k$, the data interval changes to $x_{i-k+1}, \ldots, x_i, x_{i+1}$ and differs from the older data streams. As shown in Fig. 3, the data interval is $x_{i-2}, x_{i-1}, x_i, x_{i+1}$ for a window size of 3, where the value of $x_{i+1}$ is obtained from the three preceding values $x_{i-2}, x_{i-1}, x_i$.
In the proposed model, the window size depends on the number of data records and determines the prediction accuracy. Based on several experiments, the window size selected in this paper is 3, because it provides the best value and minimizes data reduction. After the wavelet decomposition process, there are 120 data records. The sliding windows process then generates 3 input variables and 1 output variable, leaving 117 records. The number of records is reduced by the window size because a complete set of three inputs and one output cannot be formed for the records at the boundary of the series.
The implementation of sliding windows technique on RNN is shown in

E. Long-short-term memory (LSTM)
Each memory block in the original architecture contains an input gate (G) and an output gate (H). The input gate (G) controls the flow of input activations ($X_t$) into the memory cell. The output gate controls the output flow ($C_i$) of cell activations into the rest of the network. A forget gate (w) was later added to the memory block [24].
The forget gate (w) scales the internal state of the cell before adding it as input to the cell through the self-recurrent connection of the cell, thereby adaptively forgetting or resetting the cell's memory. In addition, modern LSTM architectures contain peephole connections from the internal cells to the gates in the same cell to learn the precise timing of the outputs [25].
An LSTM network computes from an input sequence $x = x_1, \ldots, x_T$ an output sequence $y = y_1, \ldots, y_T$ by calculating the network unit activations using the following equations, for $t = 1, \ldots, T$ [25]:

$$\begin{aligned}
i_t &= \sigma(W_{ix} x_t + W_{im} m_{t-1} + W_{ic} c_{t-1} + b_i) \\
f_t &= \sigma(W_{fx} x_t + W_{fm} m_{t-1} + W_{fc} c_{t-1} + b_f) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g(W_{cx} x_t + W_{cm} m_{t-1} + b_c) \\
o_t &= \sigma(W_{ox} x_t + W_{om} m_{t-1} + W_{oc} c_t + b_o) \\
m_t &= o_t \odot h(c_t) \\
y_t &= \phi(W_{ym} m_t + b_y)
\end{aligned}$$

We use $W$ to denote the weight matrices (e.g., $W_{ix}$ is the matrix of weights from the input gate to the input); $W_{ic}$, $W_{fc}$, and $W_{oc}$ are diagonal weight matrices for the peephole connections. The $b$ terms denote bias vectors (e.g., $b_i$ is the input gate bias vector). Function $\sigma$ is the logistic sigmoid function, $i$ is the input gate, $f$ is the forget gate, $o$ is the output gate, and $c$ denotes the cell activation vector. Vectors $i$, $f$, $o$ and $c$ have the same size as the cell output activation vector $m$. Operator $\odot$ is the element-wise product of vectors. Function $g$ is the cell input activation function, while function $h$ is the cell output activation function. The activation functions $g$ and $h$ used in this paper are tanh, and $\phi$ denotes the network output activation function.
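A single time step of a peephole LSTM cell of the kind described in [25] can be sketched in plain Python. This is a sketch under stated assumptions, not the Theano implementation used in the experiments: the parameter dictionary `p`, its key names and the all-zero weights are hypothetical, the diagonal peephole matrices are stored as vectors, and the output layer $\phi$ is omitted.

```python
import math


def sigmoid(v):
    """Element-wise logistic sigmoid of a list."""
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def tanh(v):
    """Element-wise hyperbolic tangent of a list."""
    return [math.tanh(a) for a in v]


def matvec(W, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def add(*vectors):
    """Element-wise sum of several equally long vectors."""
    return [sum(parts) for parts in zip(*vectors)]


def mul(u, v):
    """Element-wise product; also applies a diagonal peephole matrix."""
    return [a * b for a, b in zip(u, v)]


def lstm_step(x, m_prev, c_prev, p):
    """One time step of a peephole LSTM cell; `p` is a hypothetical parameter dict."""
    i = sigmoid(add(matvec(p["W_ix"], x), matvec(p["W_im"], m_prev),
                    mul(p["w_ic"], c_prev), p["b_i"]))
    f = sigmoid(add(matvec(p["W_fx"], x), matvec(p["W_fm"], m_prev),
                    mul(p["w_fc"], c_prev), p["b_f"]))
    g = tanh(add(matvec(p["W_cx"], x), matvec(p["W_cm"], m_prev), p["b_c"]))
    c = add(mul(f, c_prev), mul(i, g))      # new cell state
    o = sigmoid(add(matvec(p["W_ox"], x), matvec(p["W_om"], m_prev),
                    mul(p["w_oc"], c), p["b_o"]))
    m = mul(o, tanh(c))                     # cell output activation
    return m, c


def zero_params(n_in, n_cell):
    """All-zero parameters, used only to make the arithmetic easy to check."""
    p = {}
    for gate in "ifco":
        p["W_" + gate + "x"] = [[0.0] * n_in for _ in range(n_cell)]
        p["W_" + gate + "m"] = [[0.0] * n_cell for _ in range(n_cell)]
        p["b_" + gate] = [0.0] * n_cell
    for gate in "ifo":
        p["w_" + gate + "c"] = [0.0] * n_cell
    return p


params = zero_params(n_in=3, n_cell=2)
m, c = lstm_step([0.1, 0.2, 0.3], [0.0, 0.0], [1.0, -2.0], params)
```

With all weights at zero every gate evaluates to 0.5, so the new cell state is half of the previous one; in a trained network the gates instead depend on the input window and on the previous state.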

F. Prediction accuracy
To measure and evaluate the prediction accuracy of the proposed hybrid model, the mean square error (MSE) and root mean square error (RMSE) are used. Let $y_i$ be the measured value at time $i$ (the target value for the neural network), $f_i$ be the predicted value at time $i$ obtained from a particular model $M$, and $n$ be the number of sample data. Then, MSE is given by [26]:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - f_i)^2$$

and RMSE is given by:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - f_i)^2}$$

RMSE is used because it represents the error distribution well and satisfies the triangle inequality requirement for a distance metric [27]. MSE is an error function for evaluating the performance and efficiency of forecasting methods. It compares the actual time series values and the forecast values point by point to give an overall performance measure [28]. The lower the values of RMSE and MSE, the better the accuracy of the model.
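Both metrics are straightforward to compute. The following minimal sketch uses hypothetical function names and made-up normalized values, and is meant only to make the definitions concrete.

```python
import math


def mse(actual, predicted):
    """Mean of the squared differences between measured and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((y - f) ** 2 for y, f in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Square root of the MSE, expressed in the same units as the data."""
    return math.sqrt(mse(actual, predicted))


# Made-up normalized values for illustration only.
actual = [0.2, 0.4, 0.6, 0.8]
predicted = [0.1, 0.4, 0.7, 0.8]
error_mse = mse(actual, predicted)
error_rmse = rmse(actual, predicted)
```

Because RMSE is the square root of MSE, the two metrics always rank models in the same order; RMSE is simply easier to read against the scale of the data.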

III. RESULT AND ANALYSIS
The proposed model is applied to predict tourist arrivals in Indonesia. The data used in this research consists of the monthly number of tourist visits to Indonesia from 1995 to 2014. Hence, there are 240 data records in total. Ninety percent of the data is used for training, while 10 percent is used for testing. The model is then compared to the original LSTM RNN and other types of RNN, i.e., Elman RNN and Jordan RNN, as well as the hybrid of wavelet-Elman and the hybrid of wavelet-Jordan. These models are compared in terms of accuracy and the time required for data training and data testing.
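The 90/10 partition can be sketched as follows. The paper does not state how the records are assigned to each set, so the chronological split below (earliest records for training, latest for testing) is an assumption, and `chronological_split` is a hypothetical helper.

```python
def chronological_split(records, train_percent=90):
    """Split a time-ordered list into training and testing parts."""
    # Integer arithmetic avoids floating-point rounding at the cut point.
    cut = len(records) * train_percent // 100
    return records[:cut], records[cut:]


# Placeholder indices for the 240 monthly records (1995 to 2014).
months = list(range(240))
train, test = chronological_split(months)
```

For 240 monthly records this gives 216 training records and 24 testing records, i.e., the last two years of the series are held out under this assumption.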
The first process is data normalization, which aims to simplify calculations, reduce the value range, and make the training process faster. The normalization, which uses the Min-Max technique, produces data in the range [0, 1]. The data is then transformed using the Haar wavelet function, and 3 levels of decomposition are applied. The wavelet transform is implemented in Matlab. The proposed model is compared to hybrid wavelet-Elman, hybrid wavelet-Jordan, the original Elman RNN, Jordan RNN, and LSTM RNN. Table II depicts the results of the comparison between the proposed model and the other models.
Fig. 8 depicts the comparison of the real data with the decomposition of the signal produced using the wavelet transform, as well as the prediction of the signal using the LSTM neural network. In the prediction where all data is used, the result is very similar to the original data. Due to decomposition, the data is reduced at each level of decomposition. Using the data produced by DWT level 1, the prediction result is still recognizable and the accuracy remains acceptable. However, using the data of DWT levels 2 and 3, the prediction results become less recognizable; hence, there is a big gap between the real data and the predicted data. This means that the deeper the decomposition process, the less recognizable the predicted data becomes. The hybrid model at level 1 ($A_1$) generates an error value that is smaller than that of the hybrid model at levels 2 ($A_2$) and 3 ($A_3$).
From the experiments we have conducted, it can be inferred that the advantage of the proposed model lies mainly in shortening the time for data training rather than in increasing prediction accuracy. The lower accuracy compared to the original LSTM is due to the data reduction that results from decomposition in the wavelet transform. Hence, the proposed model can reduce the time of the training process but not the error values.

IV. CONCLUSION AND FUTURE WORK
In this paper, a hybrid model of wavelet transform and LSTM neural network is proposed to predict the number of tourist arrivals in Indonesia. This model incorporates wavelet and LSTM neural network to predict the number of tourist arrivals each month. The wavelet algorithm is used to decompose the time series data into low-frequency and high-frequency data, which is shown to reduce the time for data training. The LSTM neural network is employed for training and testing on the results of the wavelet transform. The predicted outcome of the proposed hybrid model is compared to the original LSTM, Elman and Jordan RNN, as well as the hybrid of wavelet-Elman and the hybrid of wavelet-Jordan. The evaluation shows that the hybrid model of wavelet and LSTM gives a better training time than the original LSTM, Elman, and Jordan RNNs. Furthermore, this method is able to predict the number of tourist arrivals more accurately than the other hybrid methods.
One interesting issue for future work is to employ a clustering method, such as k-means, to form a hybrid of LSTM RNN and clustering for time series prediction. The purpose is to compare the training time and the accuracy, in order to know which hybrid model gives better results.

Fig. 6 shows the architecture of the LSTM-RNN in the proposed model. The model has one hidden layer, hence it must have one context layer. The training process is performed iteratively until the output value approximates the original value. The number of epochs in the training process is $10^5$ with a learning rate of 0.1. $\{x_1, x_2, x_3\}$ are the input data at the input layer, $\{z_1, z_2, z_3\}$ are the hidden layer units, $\{m_1, m_2, m_3\}$ are the context layer units, and $y$ is the output value (the result of the RNN architecture).

Fig. 7 shows three levels of wavelet decomposition of the tourist arrival time series data. Graph S shows the time series data after normalization by the Min-Max technique. The first level of the decomposition process using the Haar function produces high and low frequencies, as shown in graphs $A_1$ and $D_1$. Each level of decomposition halves the data, so each series at level 3 contains 30 data records. The level-three A-series data (graph $A_3$) has the lowest frequency content and tends to be unsuitable for prediction.
The comparison between the models is based on the accuracy and the time required for data training and data testing. The result shows that the hybrid of wavelet transform and LSTM performs better than the other hybrid models in terms of accuracy. Nevertheless, the proposed model takes the most training time among the hybrid models. On the other hand, the hybrid model reduces the training time of the original LSTM model by 28 minutes. All hybrid models reduce training time significantly; on average, training is about twice as fast as without the wavelet transform. Hybrid wavelet-Jordan has the shortest training time among the models, but its error value is quite high compared to the original LSTM and the hybrid wavelet-LSTM model.

Fig. 8. Data training and data testing using different levels of decomposition

Table I depicts the LSTM parameters for data training. The prediction is implemented using the Theano library for Python.

TABLE I. LSTM PARAMETERS FOR DATA TRAINING

Table III compares the MSE of training and testing in each decomposition level to show the influence of data reduction, as

TABLE III. COMPARISON OF ACCURACY WITH VARIOUS LEVELS OF DATA DECOMPOSITION WITH MSE