Investigation of denoising effects on forecasting models by statistical and nonlinear dynamic analysis

In this study, the denoising effect on the performance of prediction models is evaluated. Thirteen years (2002–2015) of daily hydrological time series for the Parishan Lake sub-basin of the Helle Basin in Iran were used. At first, predictions were made from the observational precipitation and temperature data using the ARIMA, ANN-MLP, RBF, QES, and GP models (the first scenario). Next, the time series were denoised using the wavelet transform, and predictions were then made from the denoised series (the second scenario). To assess the performance of the models in the first and second scenarios, statistical and nonlinear dynamic analysis, together with chaos theory, was used. Finally, the analysis results of the second scenario were compared with those of the first. The comparison revealed that denoising had a positive impact on the performance of all the models, although it had the least influence on the GP model. In the time series produced by all the models, the error rate, the embedding dimension needed to describe the attractor of the dynamical system, and the entropy decreased, while the correlation and autocorrelation increased.

Time series are not always deterministic; in hydrology in particular, they are often treated as stationary stochastic series. If a time series can be regarded as a deterministic series contaminated with colored or white noise, then it can be predicted.
A better model for a time series can be determined when the noise in it is removed as accurately as possible (a process called denoising). There are two common and convenient tools for denoising: the Fourier transform (FT) and the wavelet transform (WT). Owing to its fewer limitations and better performance, WT was used for denoising in this study. With WT, a time series is decomposed into several component series at different scales (Rioul & Vetterli ), so the original time series can be examined at varying resolutions.
WT can therefore be considered a multiresolution analysis (Alrumaih et al. ). In other words, the decomposition transfers the signal from the time domain to the time-scale domain, which describes local features in both the time and the frequency domain in detail (Guo et al. ).
In this study, the prediction models used to forecast the time series of daily precipitation and daily maximum and minimum temperatures were quadratic exponential smoothing (QES), the autoregressive integrated moving average (ARIMA), the artificial neural network multi-layer perceptron (ANN-MLP), the radial basis function neural network (RBF), and grid-partitioning ANFIS (GP). The inherent characteristics of the prediction models strongly influenced their selection for comparison (e.g., the QES model needs less input information for training and less modeling time than the other models (Duan & Niu )).
In this study, we investigate the effect of denoising on the performance of prediction models using statistical and nonlinear dynamic analysis as well as chaos theory. In the statistical analysis, R² and RMSE were used to estimate the correlation and the error of the predicted time series, respectively. In the nonlinear dynamic analysis based on chaos theory, the tools were approximate entropy (ApEn), for evaluating the regularity and predictability of fluctuations in the time series; the correlation dimension, for determining the degree of complexity of a nonlinear dynamical system by estimating the embedding dimension needed; the autocorrelation, for evaluating periodicity and irregularity in the time series; and, finally, surrogate data, phase space reconstruction, and the method of delays, for investigating the nature of the irregularity in the time series. The steps are as follows. In the first scenario, the time series of daily precipitation and daily maximum and minimum temperatures are predicted. In the second scenario, the time series are denoised with WT and then predicted again with the same prediction models. Next, the predicted time series of both scenarios are subjected to statistical, nonlinear dynamic, and chaotic analysis. Finally, by comparing the analysis results of the second scenario with those of the first, it becomes clear whether prediction based on the denoised time series improves the performance of the prediction models.
In the following, the Materials and methods section discusses the analytical and predictive methods theoretically and presents the results of the first scenario; the Results and discussion section compares the results obtained after denoising with those of the first scenario.
Finally, the Conclusions section states the final result of the research.

Study area and data
The data utilized in this study relate to Parishan Lake.
The Parishan wetland, a permanent freshwater lake fed by springs and seasonal streams, is located in Fars province, Iran, 12 km southeast of Kazeroun among the Famur mountains. The average annual precipitation of this basin is 450 mm, the average annual temperature is 22.2 °C, and the average annual evaporation is 2,470 mm. In the present study, daily precipitation and daily maximum and minimum temperature data were forecasted by the prediction models. All data relate to 2002–2015, and the daily time series were forecasted over 13 years.
This section discusses the prediction methods, the statistical and nonlinear dynamic analysis, and then WT. The nonlinear dynamic analysis was performed with the MATLAB codes provided by Kugiumtzis & Tsimpiris ().
Artificial neural network multi-layer perceptron (ANN-MLP)
Artificial neural networks (ANN) are intelligent systems that can discover patterns with few prior assumptions and learn a complex functional relationship from data to model a phenomenon. ANN was inspired by the principles of data processing in the brain, and the network is trained on experimental data so that it can precisely predict system performance. The most common ANN architecture is the multi-layer perceptron (MLP), a feed-forward network consisting of one input layer, one output layer, and a number of hidden layers. Each node first computes a weighted sum of its inputs and then passes the sum through a soft nonlinearity. The output of a hidden neuron can be written as follows (Paoli et al. ):

h_i = f(Σ_j w_ij x_j) (1)

where i is the index of neurons in the hidden layer, j is the index of an input to the neural network, f is the transfer function, x_j is the input, and w_ij is the weight.
In this study, ANN-MLP was utilized to forecast the minimum and maximum temperature time series; its basic architecture is shown in Figure 2.
As can be noted, it is composed of a single output unit, k hidden units, and n input units. T_j represents the connecting weight from the jth hidden unit to the output unit; similarly, w_ij is the connection weight from the ith input unit to the jth hidden unit. The hidden layer consists of nodes connected to both the input and output layers and can be considered the most substantial part of the network.
The training set is used to construct the neural network, and the test set is used to measure the predictive error of the model; the training process ultimately finds the connection weights of the network (Pao ). A measured value is an anomaly if its prediction error exceeds a pre-defined threshold. To select the most acceptable neural network model, a number of configurations were assessed, and the following was preferred: the model consisted of 12 hidden layers; the training, validation, and test ratios were 80, 10, and 10%, respectively; and the train parameter time and train parameter maximum fail were 2,000 and 1,000, respectively.
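The forward pass in Equation (1) can be sketched in a few lines of NumPy; the layer sizes, the tanh transfer function, and the random weights below are illustrative assumptions, not the configuration used in the study.

```python
import numpy as np

def mlp_forward(x, W, T, f=np.tanh):
    """One-hidden-layer MLP forward pass:
    hidden h_i = f(sum_j W[i, j] * x[j]), output y = f(sum_i T[i] * h[i])."""
    h = f(W @ x)      # hidden-layer activations
    return f(T @ h)   # single output unit

rng = np.random.default_rng(0)
n_inputs, n_hidden = 4, 12                  # sizes assumed for illustration
W = rng.normal(size=(n_hidden, n_inputs))   # input-to-hidden weights w_ij
T = rng.normal(size=n_hidden)               # hidden-to-output weights T_j
x = rng.normal(size=n_inputs)
y = mlp_forward(x, W, T)
```

In training, back-propagation would adjust W and T to minimize the prediction error on the training set; only the forward computation is shown here.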
Radial basis function neural network (RBF)
The numbers of input and output nodes of the MLP and RBF neural networks are determined in the same way, by the inherent features of the input and output variables; faster learning is one of the superiorities of RBF over MLP. The output of the RBF network is calculated as:

Y = Σ_r W_r φ(‖X − X_r‖) (2)

where X and Y are the input and output values, respectively, φ(·) is the radial basis function, and the hidden and output nodes are connected by the weights W. Depending on the observed input data, the center of each hidden node can be represented by X_r.
Finally, ‖X − X_r‖ represents the Euclidean distance between the input and a hidden node. Each hidden node represents a group of input nodes containing comparable information from the input data (Rezaeian-Zadeh et al. ).
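The RBF output above can be sketched as follows; the Gaussian basis function, the two centers, and the weights are illustrative assumptions (the study does not specify them).

```python
import numpy as np

def rbf_forward(X, centers, W, sigma=1.0):
    """RBF network output: Y = sum_r W[r] * phi(||X - X_r||),
    with a Gaussian basis phi(d) = exp(-d**2 / (2 * sigma**2))."""
    d = np.linalg.norm(centers - X, axis=1)   # Euclidean distances ||X - X_r||
    phi = np.exp(-d ** 2 / (2.0 * sigma ** 2))
    return W @ phi                            # weighted sum over hidden nodes

centers = np.array([[0.0, 0.0], [1.0, 1.0]])  # hidden-node centers X_r (illustrative)
W = np.array([1.0, 2.0])                      # hidden-to-output weights
y_at_center = rbf_forward(np.array([1.0, 1.0]), centers, W)
```

At an input equal to a center, that node's basis value is exactly 1, so its weight dominates the output.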

Fuzzy inference system with grid partitioning (GP) of the antecedent variables
The adaptive neuro-fuzzy inference system (ANFIS) structure has two trainable parameter sets: the premise (membership function) parameters and the consequent (polynomial) parameters.
The ANFIS structure uses a gradient descent algorithm to optimize the premise parameters and a least-squares algorithm to solve for the consequent parameters. Each rule in the ANFIS structure has the form:

If X_1 is A_1,j and X_2 is A_2,j and … and X_n is A_n,j, then y_j = C_0,j + C_1,j X_1 + … + C_n,j X_n (3)

In Equation (3), A_i,j is the jth linguistic expression of the ith input variable X_i, n is the number of inputs, y_j is the output of the rule, and the C_i,j are the consequent parameters, which are specified in the training process. Since every rule has a crisp output (as opposed to a fuzzy output), the total output is obtained by a weighted average. In ANFIS, sequential layers are assigned various tasks, creating a gradual model refinement process. The learning process consists of a forward pass and a backward pass.
During the forward pass, the premise parameters are fixed and the consequent parameters are optimized with the least-squares algorithm. The backward pass uses the gradient descent algorithm to modify the premise parameters of the membership functions of the input variables.
The output is calculated as a weighted average of the rule outputs, and the output error is back-propagated to adjust the premise parameters (Guan et al. ). Figure 3 shows the structure of the adaptive neuro-fuzzy inference system; circular nodes represent fixed nodes, and square nodes represent adaptive nodes whose parameters are learned.
The ANFIS system comprises five layers (Figure 4); the first layer performs the first step of the process on the input. The output of each node in this layer,

O_1,i = μ_A_i(x) (4)

is the membership degree of the respective input x or y.
Layer 5 computes the sum of the outputs of the previous layer (layer 4):

O_5,1 = Σ_i w̄_i f_i (5)

In Equation (5), w̄_i is the normalized firing strength of rule i and f_i is its consequent output. The MATLAB codes for forecasting time series with ANFIS are available at https://yarpiz.com/327/ypfz102-time-series-prediction-usinganfis.
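The weighted-average (Sugeno-type) inference that ANFIS implements can be sketched as follows. The Gaussian membership functions, the two illustrative rules, and their consequent coefficients are assumptions for demonstration, not parameters from the study.

```python
import numpy as np

def gaussmf(x, c, s):
    """Gaussian membership function with center c and width s."""
    return np.exp(-0.5 * ((x - c) / s) ** 2)

def sugeno_output(x, rules):
    """First-order Sugeno output: y = sum_j w_j * f_j / sum_j w_j, where the
    firing strength w_j is the product of antecedent membership degrees and
    f_j is the linear consequent C0 + C1*x1 + ... + Cn*xn."""
    w, f = [], []
    for mfs, coeffs in rules:
        # firing strength: product of membership degrees over all inputs
        w.append(np.prod([gaussmf(xi, c, s) for xi, (c, s) in zip(x, mfs)]))
        coeffs = np.asarray(coeffs)
        f.append(coeffs[0] + coeffs[1:] @ np.asarray(x))
    w = np.asarray(w)
    return (w @ np.asarray(f)) / w.sum()

# two illustrative rules over two inputs: ((c, s) per input, consequent coefficients)
rules = [([(0.0, 1.0), (0.0, 1.0)], [1.0, 0.5, 0.5]),
         ([(1.0, 1.0), (1.0, 1.0)], [2.0, 0.0, 0.0])]
y = sugeno_output([0.5, 0.5], rules)
```

At the midpoint [0.5, 0.5] both rules fire equally, so the output is the plain average of the two rule consequents.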

Autoregressive integrated moving average
In statistics, and in time series analysis in particular, the autoregressive integrated moving average (ARIMA) model is a generalization of the autoregressive moving average (ARMA) model. Both models are fitted to time series data either to better interpret the data or to predict future points. The general seasonal form can be written as

γ(B) μ(B^s) (1 − B)^d (1 − B^s)^D x_t = ρ(B) σ(B^s) ε_t (6)

where B is the backward shift operator, d is the non-seasonal order of differences, D is the seasonal order of differences, and γ, μ, ρ, σ are polynomials in B and B^s.
A basic hypothesis is that the time series data are statistically stationary, meaning that measured statistical properties such as the mean, variance, and autocorrelation remain constant over time (Yuan et al. ). If the training data display non-stationarity, the data must be differenced to be transformed to stationarity before the ARIMA model is fitted. The model is denoted ARIMA (p, d, q), where d gives the degree of differencing (Rosca ).
Prior to the analysis, since both the mean and the variance of the daily time series were considered nonstationary, a first-order difference was applied to obtain a stationary data set.
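The differencing-plus-autoregression core of ARIMA can be illustrated with a minimal ARIMA(1,1,0) sketch. In practice a statistics package would fit the full (p, d, q) model; the order (1,1,0), the least-squares AR fit, and the toy series below are simplifying assumptions.

```python
import numpy as np

def arima_110_forecast(x):
    """Minimal ARIMA(1,1,0) sketch: difference once (d = 1), fit an AR(1)
    coefficient by least squares, forecast one step, then undo the difference."""
    x = np.asarray(x, dtype=float)
    dx = np.diff(x)                                    # (1 - B) x_t
    phi = (dx[:-1] @ dx[1:]) / (dx[:-1] @ dx[:-1])     # least-squares AR(1) fit
    next_diff = phi * dx[-1]                           # forecast of the differenced series
    return x[-1] + next_diff                           # integrate back: x_{t+1} = x_t + dx_{t+1}

# illustrative nonstationary series: a linear trend, whose differences are constant
forecast = arima_110_forecast(np.arange(10.0))
```

For the pure trend the differenced series is constant, so the AR(1) coefficient is 1 and the forecast simply continues the trend.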

QES model
The exponential smoothing model is a kind of weighted average model: it is more sensitive to historical data close to the forecasting period, and the weights decrease with time. To utilize the QES model for forecasting, three sets of input training data and a parameter value are required, and the exponential smoothing recursion can be defined as

Q_t = a x_t + (1 − a) Q_{t−1} (7)

where Q_t represents the smoothed value at time t and a is the smoothing coefficient (or damping coefficient) (Spelta et al. ).
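The QES model can be sketched with Brown's standard quadratic exponential smoothing, which matches the description of three smoothed data sets plus one parameter; the extrapolation coefficients below follow that standard formulation and are an assumed reading of the study's QES model.

```python
import numpy as np

def smooth(x, a):
    """Simple exponential smoothing: Q_t = a * x_t + (1 - a) * Q_{t-1}."""
    q = np.empty_like(x)
    q[0] = x[0]
    for t in range(1, len(x)):
        q[t] = a * x[t] + (1.0 - a) * q[t - 1]
    return q

def qes_forecast(x, a, m=1):
    """Brown's quadratic exponential smoothing: smooth the series three
    times, then extrapolate with a quadratic in the lead time m."""
    s1 = smooth(np.asarray(x, dtype=float), a)   # single-smoothed
    s2 = smooth(s1, a)                           # double-smoothed
    s3 = smooth(s2, a)                           # triple-smoothed
    A = 3 * s1[-1] - 3 * s2[-1] + s3[-1]
    B = (a / (2 * (1 - a) ** 2)) * ((6 - 5 * a) * s1[-1]
        - 2 * (5 - 4 * a) * s2[-1] + (4 - 3 * a) * s3[-1])
    C = (a ** 2 / (1 - a) ** 2) * (s1[-1] - 2 * s2[-1] + s3[-1])
    return A + B * m + 0.5 * C * m ** 2

forecast = qes_forecast([1.0, 2.0, 3.0, 4.0, 5.0], a=0.5)
```

For a constant series all three smoothed values coincide, the trend and curvature terms vanish, and the forecast equals the constant.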

Correlation dimension
For distinguishing a deterministic process from a random one, the correlation dimension is a practical tool.
Although a random (completely stochastic) process may occupy all available dimensions of the phase space, a deterministic process consumes only a fraction of them (usually much smaller than the degrees of freedom of the system). The dimension is therefore used as a measure of the complexity of a system and as an indicator of the number of variables required to describe it. The correlation integral C(r) measures how often a trajectory in the phase space comes within a distance r of a specific point, and the correlation dimension d is estimated by taking the logarithm of C(r):

d = lim_{r→0} log C(r) / log r (8)
For a bounded data set, the correlation dimension is also known as the correlation exponent. If the number of points is sufficiently large and evenly distributed, a log-log plot of the correlation integral versus r yields an estimate of d for a given range of embedding dimensions m (in this study m = 1, 2, 3, …, 20) and radii r (e.g., r = 0.01, 0.1, 1, …, 10,000). These embedding dimensions are employed to find the minimum dimension required to explain the nonlinear dynamic system. A low-dimensional, parsimonious system is desirable, so larger m values need not be evaluated. The range of radii must be sufficient to cover all differences between the state vectors u(i) and u(j). A reliable correlation dimension for a low-dimensional chaotic system has two properties: first, it must be consistent across various values of the radius; second, it must converge over all examined embedding dimensions. Complexity in a system indicates randomness, and larger values of m indicate complexity; a system with small m is deterministic, and chaos is concomitant with determinism. Accordingly, lower m values are sought in this study.
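The correlation integral and the slope that estimates d can be sketched as follows; the delay embedding parameters, the radii, and the random test series are illustrative assumptions.

```python
import numpy as np

def embed(u, m, tau=1):
    """Method of delays: x(i) = [u(i), u(i + tau), ..., u(i + (m - 1) * tau)]."""
    n = len(u) - (m - 1) * tau
    return np.column_stack([u[i * tau:i * tau + n] for i in range(m)])

def correlation_integral(vectors, r):
    """C(r): fraction of pairs of state vectors closer than r."""
    n = len(vectors)
    d = np.linalg.norm(vectors[:, None, :] - vectors[None, :, :], axis=-1)
    pairs = d[np.triu_indices(n, k=1)]    # all i < j distances
    return np.mean(pairs < r)

# the slope of log C(r) versus log r estimates the correlation dimension d
rng = np.random.default_rng(1)
u = rng.random(500)                       # illustrative random series
X = embed(u, m=3)
radii = np.array([0.2, 0.4])
C = np.array([correlation_integral(X, r) for r in radii])
slope = np.diff(np.log(C)) / np.diff(np.log(radii))
```

For a purely random series the slope keeps growing with the embedding dimension m; for a low-dimensional deterministic system it saturates.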

Approximate entropy
To quantify the amount of regularity and the unpredictability of fluctuations in time series, approximate entropy (ApEn) can be utilized as an effective and practical technique.
Moment statistics, such as the mean and variance, cannot distinguish between a series (I) that alternates between 10 and 20 and a series (II) whose values are 10 or 20 chosen randomly, each with probability 1/2. Series (I) is perfectly regular and can be predicted with certainty, whereas series (II) is random. Given a series of N raw data values u(1), u(2), …, u(N), equally spaced in time, a sequence of vectors x(1), x(2), …, x(N − m + 1) of window length m is formed, and the distance between vectors x(i) and x(j) is the maximum difference of their scalar components. For each i ≤ N − m + 1, C_i^m(r) is constructed; it measures the regularity within a tolerance r, i.e., the frequency of patterns identical to a given pattern of window length m. ApEn can be defined as Equation (9) (Pincus & Goldberger ):

ApEn = Φ^m(r) − Φ^{m+1}(r) (9)

where Φ^m(r) is formed from N, m, and the C_i^m(r) (more detailed information is furnished in the supplementary file).
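A sketch of ApEn applied to the two example series above; the tolerance r = 0.2 and window length m = 2 are illustrative choices, not the study's settings.

```python
import numpy as np

def apen(u, m=2, r=0.2):
    """Approximate entropy: ApEn = Phi^m(r) - Phi^{m+1}(r), where Phi^m(r)
    is the average log frequency of template matches of length m under the
    maximum-component distance and tolerance r (self-matches included)."""
    u = np.asarray(u, dtype=float)

    def phi(m):
        n = len(u) - m + 1
        x = np.array([u[i:i + m] for i in range(n)])
        # C_i^m(r): fraction of vectors within tolerance r of x(i)
        d = np.max(np.abs(x[:, None, :] - x[None, :, :]), axis=-1)
        c = np.mean(d <= r, axis=1)
        return np.mean(np.log(c))

    return phi(m) - phi(m + 1)

regular = apen([10, 20] * 30)    # series (I): perfectly alternating
random_ = apen(np.random.default_rng(2).choice([10, 20], size=60))  # series (II)
```

The alternating series yields an ApEn near zero (length-3 patterns are fully determined by length-2 patterns), while the random series yields a markedly larger value.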
From Table 2

Autocorrelation
Autocorrelation, also called serial correlation, is the correlation of a signal with a delayed copy of itself, as a function of the delay. In other words, it is the similarity between observations as a function of the time lag between them. The autocorrelation function provides an analytical apparatus for distinguishing between periodic and stochastic behavior of a signal: when a signal is periodic its autocorrelation is periodic, whereas for a stochastic signal the autocorrelation is irregular. The autocorrelation function is

C(k) = (1/N) Σ_{i=1}^{N} u(i) u(i + k) (10)

where, to extend the time series beyond u_N, periodic boundary conditions are imposed (u_{N+k} = u_k). This function provides a simple quantitative measure of the linear correlation between data points.
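The autocorrelation with periodic boundary conditions can be sketched as follows; the sinusoidal test signal is an illustrative assumption chosen so the periodicity is exact.

```python
import numpy as np

def autocorrelation(u, k):
    """C(k) = (1/N) * sum_i u(i) * u(i + k), with periodic boundary
    conditions u(N + k) = u(k), implemented via a circular shift."""
    u = np.asarray(u, dtype=float)
    return np.mean(u * np.roll(u, -k))

# a periodic signal has a periodic autocorrelation
t = np.arange(100)
signal = np.sin(2 * np.pi * t / 20)   # period 20, exactly 5 cycles in 100 samples
c0 = autocorrelation(signal, 0)       # lag 0: mean of sin^2 = 0.5
c20 = autocorrelation(signal, 20)     # one full period: same as lag 0
c10 = autocorrelation(signal, 10)     # half period: fully anticorrelated
```

The autocorrelation repeats with the signal's period (c20 = c0) and flips sign at half the period, exactly the periodic signature described above.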

Surrogate data
The method of surrogate data was introduced into nonlinear time series analysis in 1992. To create a surrogate data set, the following steps are followed: (1) the FT of the experimental time series is constructed; (2) the phases are randomized; (3) the FT is inverted. In other words, the experimental time series x(t_j), j = 1, 2, …, N is placed in a complex array (Mazaraki ), its discrete FT Z(m) is constructed, a set of random phases φ_m is generated, and the randomized phases are applied to the Fourier-transformed data. Finally, the surrogate series is produced by the inverse FT, Equation (11):

x̃(t_j) = (1/N) Σ_{m=0}^{N−1} Z(m) e^{iφ_m} e^{i 2π m j / N} (11)

where Z(m) is the discrete FT (further details are given in the supplementary file).
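The three steps (FT, phase randomization, inverse FT) can be sketched with NumPy's real FFT. Keeping the zero-frequency and Nyquist phases fixed is an implementation detail assumed here so that the surrogate stays real-valued; the test signal is illustrative.

```python
import numpy as np

def phase_randomized_surrogate(x, rng):
    """FT surrogate: take the discrete FT, randomize the phases while
    keeping the amplitudes, then invert.  This preserves the power
    spectrum (and hence the linear correlations) of the original series."""
    z = np.fft.rfft(x)
    phases = rng.uniform(0.0, 2.0 * np.pi, size=len(z))
    phases[0] = 0.0                # keep the mean (zero-frequency) term real
    if len(x) % 2 == 0:
        phases[-1] = 0.0           # Nyquist term must stay real for even N
    z_rand = np.abs(z) * np.exp(1j * phases)
    return np.fft.irfft(z_rand, n=len(x))

rng = np.random.default_rng(3)
x = np.sin(np.linspace(0, 8 * np.pi, 256)) + 0.1 * rng.normal(size=256)
s = phase_randomized_surrogate(x, rng)
# the Fourier amplitudes (hence the power spectrum) are preserved
spec_diff = np.max(np.abs(np.abs(np.fft.rfft(x)) - np.abs(np.fft.rfft(s))))
```

Comparing a nonlinear statistic (such as the correlation dimension) between the original series and an ensemble of such surrogates then tests whether the irregularity is merely linear-stochastic.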
The daily precipitation phase space graphs show that the phase space graphs of all the models except QES and GP resemble that of the observational data. The results indicate poor performance of QES in forecasting the daily temperatures, while the ANN-MLP and RBF models perform best, in that order.

Statistical evaluation criteria
To evaluate the performance of the prediction models, the coefficient of determination (R²) and the root mean square error (RMSE) were used. Equations (12) and (13) designate how these indices are calculated:

R² = [Σ_i (x_i − x̄)(y_i − ȳ)]² / [Σ_i (x_i − x̄)² Σ_i (y_i − ȳ)²] (12)

RMSE = √((1/n) Σ_i (x_i − y_i)²) (13)

where x_i is the observational data, y_i is the predicted data corresponding to the observational data, x̄ and ȳ are the average observational and predicted data, respectively, and n is the number of data. A low RMSE value and a high R² coefficient indicate the higher precision of a model and its superiority over the others.
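Both indices can be computed in a few lines; the observed/predicted values below are illustrative, and R² is taken here as the squared Pearson correlation between the two series.

```python
import numpy as np

def rmse(x, y):
    """Root mean square error between observed x and predicted y."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.sqrt(np.mean((x - y) ** 2))

def r_squared(x, y):
    """Coefficient of determination as the squared Pearson correlation
    between the observed and predicted series."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    cov = np.sum((x - x.mean()) * (y - y.mean()))
    return cov ** 2 / (np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))

obs = [1.0, 2.0, 3.0, 4.0]     # illustrative observations
pred = [1.1, 1.9, 3.2, 3.8]    # illustrative predictions
err = rmse(obs, pred)
r2 = r_squared(obs, pred)
```

A perfect prediction gives RMSE = 0 and R² = 1; worse predictions raise RMSE and lower R².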
According to Table 3,

Denoising, WT
Wavelet coefficients correspond to details; when the details are small, they can be considered noise. A basic model for the noisy series can be specified in the following form:

s_t = f_t + e_t (14)

where e_t is a multivariate normal process with mean zero and covariance matrix Γ_t.
The object of denoising is to suppress the noise part of the series s_t and then recover f_t. The time series s(t) can be decomposed as

s(t) = Σ_k cA_{K,k} Φ_{K,k}(t) + Σ_{j=1}^{K} Σ_k cD_{j,k} Ψ_{j,k}(t) (15)

where K is the decomposition level, Φ_{K,k}(t) is the associated scaling function, and Ψ_{j,k}(t) is generated from the mother wavelet Ψ(t). The coefficients cA_{K,k} are called the approximation coefficients, and the cD_{j,k} are the detail coefficients (more details are provided in the supplementary file). There are two standard thresholding methods for denoising the series: (1) soft thresholding; (2) hard thresholding.
The mechanism of hard thresholding is that the wavelet coefficient is assumed zero if its amplitude is smaller than a pre-defined threshold; otherwise, it is kept unchanged.
Additionally, if the absolute value of the wavelet coefficient is shrunk toward zero by the threshold amount, we have soft thresholding.
Notice that, owing to the discontinuity of the hard-thresholding method at the threshold, additional oscillations might be imposed on the original series, whereas the continuous soft-thresholding method offers a smooth series.
Consequently, the soft-thresholding method is utilized to reduce noise and outliers.
There are several approaches to choosing the threshold; three of them are as follows.
Minimax thresholding. The threshold is derived from the minimax principle: a fixed threshold is chosen to attain minimax performance for the mean square error against an ideal procedure (Donoho & Johnstone ).
SURE thresholding. The threshold could be selected based on Stein's Unbiased Risk Estimate (SURE).
Universal thresholding. The threshold is calculated by the following formula:

λ_j = σ_j √(2 ln n) (16)

where σ_j is an estimate of the standard deviation of the wavelet coefficients of level j, and n is the number of wavelet coefficients at that level.
Subsequently, a denoised version of the original signal is reconstructed from the approximation coefficients cA′_{K,k} and detail coefficients cD′_{j,k}, utilizing the inverse WT.
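Both thresholding rules, and their use inside a decompose-threshold-reconstruct cycle, can be sketched as follows. The single-level Haar wavelet and the threshold value are illustrative simplifications of the multilevel WT denoising described above.

```python
import numpy as np

def hard_threshold(c, t):
    """Zero coefficients whose amplitude is below the threshold; keep the rest."""
    return np.where(np.abs(c) < t, 0.0, c)

def soft_threshold(c, t):
    """Shrink every coefficient toward zero by the threshold amount."""
    return np.sign(c) * np.maximum(np.abs(c) - t, 0.0)

def haar_denoise(x, t):
    """One-level Haar DWT denoising sketch: split the series into
    approximation (cA) and detail (cD) coefficients, soft-threshold the
    details, then invert the transform."""
    x = np.asarray(x, dtype=float)
    cA = (x[0::2] + x[1::2]) / np.sqrt(2)   # approximation coefficients
    cD = (x[0::2] - x[1::2]) / np.sqrt(2)   # detail coefficients
    cD = soft_threshold(cD, t)              # suppress small (noise-like) details
    y = np.empty_like(x)
    y[0::2] = (cA + cD) / np.sqrt(2)        # inverse Haar transform
    y[1::2] = (cA - cD) / np.sqrt(2)
    return y

denoised = haar_denoise([1.0, 1.1, 2.0, 1.9], t=0.2)
```

The small within-pair fluctuations fall below the threshold and are removed, leaving each pair at its local mean; the soft rule's continuity is what avoids the extra oscillations noted for hard thresholding.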

RESULTS AND DISCUSSION
A random signal produced by noise is fundamentally different from a random signal produced by a deterministic dynamical system. A signal generated by a deterministic dynamical system is explained by a small number of variables and is of moderate complexity; it may appear arbitrary, yet it can be described by a dynamical system with a few degrees of freedom. Statistical analysis is appropriate for determining the median, mean, variance, standard deviation, and linear properties of a time series, but it does not express nonlinear properties.
Statistical analysis cannot be employed to distinguish a random signal produced by noise from a random signal produced by a deterministic dynamic system; this requires dynamic analysis, which is based on phase space reconstruction. In the first scenario, the prediction is made based on the observational data, and the following steps are executed for the dynamic analysis. We first discuss the complexity of the system and the embedding dimension needed to describe it (Table 4).
Applying denoising revealed signs of an enhancement in the performance of the models; however, it had only a slight effect on the GP outcomes.
To sum up, the chaotic analysis revealed that denoising improved the performance of all the models, though some were less affected than others; in almost all cases, its effect on GP performance was neutral. ANN-MLP and RBF received the most positive influence and yielded the most accurate outcomes.

ApEn
Comparison between Tables 2 and 5 suggests that after running the second scenario, the ApEn values of the predicted time series decreased (table footnotes: a, precipitation; b, maximum temperature; c, minimum temperature).

Phase space reconstruction and surrogate data
Following the implementation of the second scenario, all models (in the case of the daily temperatures) produced promising results, and the outcomes became remarkably similar to those of the observational data. Furthermore, the performance of QES improved strongly, suggesting a higher degree of dynamic property than in the first scenario.
The second scenario graphs are presented in the appendix.
Generally speaking, denoising has a positive influence on all the models. In all cases, the performance of the RBF and ANN-MLP models was appropriate.

DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.