Data augmentation in economic time series: Behavior and improvements in predictions

The performance of neural networks and statistical models in time series prediction is conditioned by the amount of data available. The lack of observations is one of the main factors limiting how well the underlying patterns and trends are represented. Using data augmentation techniques based on classical statistical techniques and neural networks, it is possible to generate additional observations and improve the accuracy of the predictions. Because of the particular characteristics of economic time series, data augmentation techniques must not significantly alter these characteristics, since doing so would compromise the quality of the study. This paper analyzes the performance obtained by two data augmentation techniques applied to a time series and finally processed by an ARIMA model and a neural network model to make predictions. The results show a significant improvement in the predictions by the time series augmented by traditional interpolation techniques, obtaining a better fit and correlation with the original series.


Introduction
The performance of neural networks for time series prediction is conditioned by the amount of data available to train the network. The amount of data available plays a crucial role in the accuracy and reliability of the predictions made by statistical models. This aspect acquires significant importance in the academic and business fields, where the ability to correctly predict future patterns and behaviors from a time series is essential for strategic decision making.
An insufficient number of observations can leave the underlying patterns and trends in the data poorly represented. Time series are typically made up of data that evolves over time, and a limited number of observations may not adequately capture the inherent variability and complexity of this data. These aspects are fundamental for the construction of robust and accurate prediction models, since they provide crucial information about the behavior of the data. The literature has highlighted the importance of having an adequate number of observations to obtain reliable results in the prediction. For example, Box et al. [1] emphasized the need for at least 30 observations to adequately model a time series and make accurate forecasts. Similarly, Shumway and Stoffer [2] suggested that a minimum of 50 to 100 observations is necessary to build reliable models and avoid wrong conclusions.
The development of data augmentation techniques addresses this problem by providing the network with enough training data and, consequently, more accurate predictions. These data augmentation methods employ different techniques to augment time series observations, using estimates and predictions that generate data similar to the original. Economic time series are characterized by having a trend, autocorrelation and seasonality, and by being mostly non-stationary, which highlights the importance of data augmentation techniques that generate new observations without affecting the characteristics of the series.
In their study Iwana and Uchida [3] conducted a review of the literature regarding the use of data augmentation algorithms for time series classification. In their research, they reviewed 12 methods to achieve data improvement and 128 sets of classification and evaluation with 6 types of neural networks. However, the most recent literature focuses on the use of other types of techniques such as the use of GAN networks. Therefore, this study lacks a comparison with traditional data augmentation approaches.
The work carried out by Iglesias et al. [4] tries to compile various data augmentation techniques oriented to time series, allowing an exhaustive comparison of the methodologies in the domain of time series.
Research by Liu et al. [5] proposes various methodologies aimed at increasing time series data, such as AddNoise, permutation, scaling and warping, verifying these methods through two deep learning models with real-time time series and showing an improvement in classification tasks. From these investigations arises the need to compare the different data augmentation techniques and provide information on the performance of the resulting time series for prediction tasks.
The main objective of this research is to identify the most suitable model for increasing the number of observations of economic time series, one that generates a consistent number of observations and improves the performance of classical models and artificial neural networks (ANN) when making predictions in financial time series. After carrying out multiple experiments, it is observed that interpolation techniques achieve a substantial improvement in the error metrics of different prediction models, allowing more reliable forecasts to be generated when sufficient data is lacking.

Time series augmentation
In recent decades, data augmentation algorithms have become highly relevant in the changing fields of machine learning and artificial intelligence. These algorithms improve the generalization and performance of the trained models. They are used to generate new synthetic data from existing data, either through the application of simple transformations or through complex generative models.
Various authors address the number of observations necessary to obtain good predictions, carrying out experiments that measure model performance and the variation of the error metrics as a function of this number ([6][7][8]). They find significant differences between the numbers of observations they consider necessary, which vary between 100 and 300.
The use of data augmentation algorithms dates back to the 1990s, when researchers began using these techniques with the goal of increasing and improving the quality and quantity of data available to machine learning models. One of the first studies in the area was carried out by LeCun et al. [9], who presented a method for generating images from existing ones that significantly improved the performance of image classification models.
Over the next decade, more sophisticated data augmentation methods, such as slice [10] and rotation [11], were developed for augmenting image data sets using the application of simple transformations.
Data augmentation algorithms are also used in the generation of texts. Data augmentation techniques based on the substitution of synonyms and the random elimination of words have been developed in [12].
In their study, Wong and Leung [13] carried out an exhaustive review of the existing data augmentation techniques for neural networks. In the research they describe different data augmentation techniques such as rotation, translation and scaling, among others. Subsequently, the research carried out in [14] explored the deformation-based data augmentation technique by applying nonlinear deformations in small windows of the time series to generate new series with different patterns of variation.
This data augmentation technique was then introduced in the field of time series, making it possible to increase the size of the series and improve the performance of predictions by neural networks. Authors like Yoon et al. [15] present a data augmentation technique based on generative adversarial networks (GAN) for use in time series. The TimeGAN model is trained with the objective of generating new time series from an existing data set, and it is used to generate new time series that preserve the statistical characteristics of the original data set.
Different data augmentation techniques for time series are investigated in [16,17], where the authors propose techniques that increase the number of observations of the time series through generative models and thereby improve the accuracy of the predictions made by neural network models.

Data augmentation for time series
Over the last few years, different algorithms have been developed that increase the time series data available for training neural networks. This increase minimizes the risk of overfitting when data are scarce and improves the accuracy of the models. There are several techniques for increasing the available observations in a time series.
Most recent research is focused on augmenting data from images, video or natural language processing (NLP) ([18][19][20]). These techniques are focused on correcting the imbalance or lack of data in the information sets, but there are other application areas where these problems are more common, such as time series and especially economic ones.
The growing interest in this type of technique has driven innovation in methodologies, which include extrapolation, resampling, transformations, interpolation, imputation and simulation. Data imputation techniques make it possible to increase the size of a time series through interpolation, extrapolation, smoothing, regression and machine learning models that predict missing values. These techniques are reviewed in research such as that carried out in [21], which reviews data augmentation techniques, including imputation techniques. García-Molina et al. [22] carried out a study in which, using imputation techniques, they completed the missing values in a time series, managing to increase the size of the data set.
Interpolation is a technique by which new data is generated from existing data. It consists of generating new data points within the range of existing values by estimating values from known data. Guennec et al. [23] proposed using it to increase the observations of a time series for the training of a convolutional neural network for time series classification. Interpolation is commonly used in time series modeling to increase the available observations; some of the best known methods are spline and Fourier interpolation.
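As a minimal sketch of the idea, the following Python function inserts new points between consecutive observations. It uses plain linear interpolation rather than the spline or Fourier methods named above, purely to keep the example short; the function name `upsample_linear` and the `factor` parameter are illustrative assumptions, not part of any method described in this paper.

```python
def upsample_linear(series, factor):
    """Insert (factor - 1) linearly interpolated points between each pair
    of consecutive observations. Every new point lies between its two
    neighbours, so nothing is generated outside the observed range."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    if len(series) < 2:
        return list(series)
    out = []
    for left, right in zip(series, series[1:]):
        for step in range(factor):
            t = step / factor  # 0 <= t < 1, t == 0 reproduces the original point
            out.append(left + t * (right - left))
    out.append(series[-1])  # keep the final original observation
    return out


original = [10.0, 12.0, 11.0, 15.0]
augmented = upsample_linear(original, factor=2)
print(augmented)  # [10.0, 11.0, 12.0, 11.5, 11.0, 13.0, 15.0]
```

The original observations are preserved at every second position, so the augmented series has (n - 1) * factor + 1 points and keeps the trend of the source data.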
Through the extrapolation technique it is possible to generate new data, in this case outside the range of existing values; the Holt-Winters and Box-Jenkins techniques are the best known. In their research, Salinas et al. [24] compared this type of model with other classic and deep learning models. Gashler and Ashmore [25] explored the power of these models for data augmentation.
One of the simplest forms of extrapolation is linear extrapolation, which uses a straight line to extend the trend of the observed data. This model can be represented as:

$$y = mx + b \tag{1}$$

where y represents the value to be predicted, x represents the independent variable corresponding to the value to be predicted, m is the slope of the straight line and b is the y-intercept.
Monte Carlo simulation performs data augmentation from a probability distribution: a large number of random values are drawn from a given probability distribution and used as new data. Bootstrap and Marcoval are the most used techniques in Monte Carlo simulation. Studies such as those presented in [26,27] based their research on the use of this technique to increase the number of observations of a time series using machine learning techniques.
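Returning to the linear extrapolation model described above, the sketch below fits the slope m and intercept b by ordinary least squares and extends the line beyond the last observation. The choice of least squares, the time index x = 0, 1, ..., n-1 and the function names are assumptions made for illustration; the text does not state how the line is fitted.

```python
def fit_line(ys):
    """Least-squares fit of y = m*x + b, taking x = 0, 1, ..., n-1."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two observations")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


def extrapolate_linear(ys, steps):
    """Return `steps` new values that continue the fitted straight line."""
    m, b = fit_line(ys)
    n = len(ys)
    return [m * x + b for x in range(n, n + steps)]


observed = [2.0, 4.0, 6.0, 8.0]
new_points = extrapolate_linear(observed, steps=3)
print(new_points)  # values continue the line y = 2x + 2
```

Unlike interpolation, the generated values fall outside the observed time range, so any error in the fitted trend grows with the number of extrapolated steps.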
Monte Carlo simulation proceeds as follows: given a data set of size n, X = {x1, x2, ..., xn}, the probability distribution of the data p(X) is estimated, a set of m random numbers is drawn from the estimated probability distribution q(X), and the generated values are adjusted to match the range and scale of the original data.
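The three steps above can be sketched as follows. This is only an illustration under strong assumptions: the estimated distribution q(X) is taken to be a normal distribution with the sample mean and standard deviation, and the "adjustment" step is implemented as clipping to the observed range. Neither choice is specified in the text, and the function name `monte_carlo_augment` is hypothetical.

```python
import random
import statistics


def monte_carlo_augment(data, m, seed=0):
    """Draw m synthetic values from a normal distribution fitted to `data`."""
    rng = random.Random(seed)           # fixed seed for reproducibility
    mu = statistics.mean(data)          # step 1: estimate the distribution
    sigma = statistics.stdev(data)
    lo, hi = min(data), max(data)
    draws = [rng.gauss(mu, sigma) for _ in range(m)]  # step 2: sample m values
    # step 3 (assumed): clip the draws to the range of the original data
    return [min(max(v, lo), hi) for v in draws]


data = [100.0, 102.0, 101.0, 105.0, 103.0]
synthetic = monte_carlo_augment(data, m=8)
print(synthetic)
```

Note that this sketch treats the observations as exchangeable and ignores their order in time, so for a trending series it would more naturally be applied to differences or returns than to raw levels.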
Let θ be the parameter vector for f, which can be either a probability mass function (PMF) for discrete distributions or a probability density function (PDF) for continuous distributions, and denote the PDF/PMF as f_θ. Let the sample drawn from the distribution be x1, x2, ..., xn. Then, for independent draws, the likelihood of getting the sample from the distribution is given by Eq (2):

$$L(\theta; x_1, \dots, x_n) = \prod_{i=1}^{n} f_\theta(x_i) \tag{2}$$
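As a worked illustration of the likelihood of a sample, the sketch below evaluates the log-likelihood of independent draws under a normal density and computes the maximum likelihood estimates of its parameters. The normal family is an assumption chosen for simplicity; the text leaves f_θ general. Working with the logarithm turns the product over observations into a sum, which is numerically safer.

```python
import math


def gaussian_log_likelihood(sample, mu, sigma):
    """Log-likelihood of independent draws under a Normal(mu, sigma^2) density."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    n = len(sample)
    squared = sum((x - mu) ** 2 for x in sample)
    return -0.5 * n * math.log(2 * math.pi * sigma ** 2) - squared / (2 * sigma ** 2)


def gaussian_mle(sample):
    """Parameters (mu, sigma) that maximize the Gaussian likelihood."""
    n = len(sample)
    mu = sum(sample) / n
    variance = sum((x - mu) ** 2 for x in sample) / n  # MLE divides by n, not n - 1
    return mu, math.sqrt(variance)


sample = [1.0, 2.0, 3.0, 4.0]
mu_hat, sigma_hat = gaussian_mle(sample)
best = gaussian_log_likelihood(sample, mu_hat, sigma_hat)
print(mu_hat, sigma_hat, best)
```

Any other parameter values give a log-likelihood no larger than `best`, which is what makes the fitted distribution a reasonable source for the random draws used in the simulation.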
Another of the most used techniques for increasing the number of observations in a time series is synthetic data augmentation, which generates new data by applying random transformations to existing data. These transformations can include rotations, translations, scaling and deformations, often using the wavelet transform. Investigations such as those carried out in [28] and [29] show the usefulness of this type of model, combined with the use of machine learning techniques to generate new data.
The generation of synthetic data through deep learning (DL) techniques has proven to be highly effective. Several techniques used for this purpose can be distinguished.
Variational autoencoder (VAE): An autoencoder neural network learns to encode and decode data. The encoder maps the input data to a distribution of latent variables, and the decoder maps the latent variables to the desired output, in order to maximize the probability of the original data given the parameters of the latent distribution and minimize the difference between the two. This makes it possible to generate new data similar to the originals, including slight modifications.
This technique was developed in [30]. They presented an architecture that combined the use of an autoencoder with a Bayesian inference probabilistic model. The research by Deng [31] presented a variant that allowed learning multimodal latent representations of the data.
An encoder compresses an input x into a latent space z, while a decoder acts on that latent space z to obtain the reconstruction. Here, x is the training data and z is the hidden feature that cannot be observed in the x data.
Generative adversarial networks (GAN): These neural networks, proposed by Goodfellow et al. [32], are trained to generate new data from a random noise distribution. They consist of two networks, a generator and a discriminator: the first generates the synthetic data, while the second classifies whether the data is real or synthetic ([15]). The goal is for the generated data to be realistic enough that the discriminator is unable to distinguish it from the actual data. Research by Isola et al. [33] proposed a variant called conditional GAN, which allows controlling the output of the generator through a conditional input. The formula for the entire GAN is as follows:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

where E is the expected value of the distribution function, p_data(x) is the distribution of the real sample and p_z(z) is the distribution that generates the sample [34].
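To make the adversarial objective more concrete, the sketch below estimates the standard GAN value function of Goodfellow et al., the mean of log D(x) over real samples plus the mean of log(1 - D(G(z))) over generated samples, from lists of discriminator outputs. No network is trained here; the discriminator outputs are made-up numbers used only to show how the value behaves, and the function name `gan_value` is an assumption.

```python
import math


def gan_value(d_real, d_fake):
    """Sample estimate of V(D, G).

    d_real: discriminator outputs D(x) on real samples, each in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1)
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# Illustrative values only: a discriminator that cannot tell real from fake
confused = gan_value([0.5, 0.5], [0.5, 0.5])
# and one that separates the two kinds of sample well
confident = gan_value([0.9, 0.8], [0.1, 0.2])
print(confused, confident)
```

The discriminator tries to increase this value and the generator tries to decrease it; when the discriminator outputs 0.5 everywhere the value equals -2 log 2, the point at which generated data is indistinguishable from real data.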
The generation of synthetic data through recurrent neural networks (RNN) and long short-term memory (LSTM) networks, the latter developed by Hochreiter and Schmidhuber [35], allows new data to be generated by predicting the next element in the sequence from the previous elements.
WaveNet is a convolutional neural network (CNN) that works similarly to RNN and LSTM but through an autoregressive convolutional network. Oord et al. [36] used this technique to generate high-quality signals in an autoregressive way.
Other generative modeling techniques for time series are generative flow, which was proposed in [37] in 2014, and transformers, which have been used for text generation and adapted for time series in [38] in 2019.

Time series forecasting
Mathematical methods for time series forecasting have constantly evolved since the beginning of research in this field. Poynting [39] tried in his study to eliminate the trend and cyclical fluctuations by averaging over a given time interval. Later, other researchers wrote in [40,41] about the elimination of trends by including high-order polynomials.
In recent years, time series forecasting techniques have been developed and applied in a wide variety of fields, from finance and economics to engineering and physics. Initially, prediction techniques were based on classical statistical models such as exponential smoothing models [42] or the ARIMA models developed in [1], which through adjustment and smoothing techniques predict the series from its historical values.
In the 1980s, artificial neural network (ANN) models were included among the prediction techniques, which have the ability to model non-linearly complex relationships between the input and output variables, which makes them useful for this type of prediction [43].
The amount of data currently available makes neural networks a dominant technique for which there is an extensive literature. Zhang et al. [44] carry out an extensive review of the investigations in which neural network techniques have been used for the prediction of time series, concluding that neural networks are superior to classical techniques owing to the properties that characterize artificial neural networks.
One of these features, described in [45], is the ability to generalize from data and make predictions from it. One of the first applications of ANN to time series forecasting was presented in [46], where gold prices were predicted using backpropagation neural networks. Subsequently, different types of ANN have been developed, among which are recurrent neural networks (RNN), especially long short-term memory (LSTM) networks developed by Hochreiter and Schmidhuber [35], which allow modeling long-term temporal dependencies.
In the same line of research, [47] studied the behavior of artificial neural networks for time series prediction in comparison with six statistical methods, finding that the artificial neural networks performed better and achieved more accurate predictions.
This evolution has led to an improvement in predictions. Siami-Namini and Namin [48] compared the performance of an ARIMA model with that of an LSTM model, finding lower error metrics for the ANN model in the prediction of industrial production in Taiwan.
In recent decades, research has focused on hybrid models that allow combining the advantages of traditional statistical models with ANN, achieving greater precision in the results.
The research carried out by Ravi et al. [49] focused on a hybrid model that combined the ARIMA technique with the use of neural networks for the prediction of inflation in the United States (USA), demonstrating superior performance of the hybrid model with respect to the ARIMA and ANN models individually.
In this line, various authors have developed their research around the combination of different types of neural networks with the aim of improving these predictions ([50][51][52][53][54]).

Data description
In this section the data used in the experiment will be described. To observe the behavior of the economic time series in the face of an increase in data carried out with different techniques, experiments are carried out starting from three different data sets. These time series vary in the number of source observations as well as in the trend and characteristics of the data, allowing a global vision of how the increase in data influences this type of data.
First, the time series corresponding to the Morgan Stanley Capital International (MSCI) index was obtained from Thomson Reuters Eikon at a daily frequency, with data from November 15, 2007 to August 12, 2022. The MSCI index is used as a reference to evaluate the performance of equity markets worldwide and is made up of a series of indices that cover different geographical regions and sectors of the economy. The index is considered a key indicator of stock market performance and is closely watched by investors and financial analysts.
The second data set used is the China containerized freight index (CCFI). It is a measure used in the field of container shipping to assess changes in ocean freight prices on trade routes that involve China. Specifically, the CCFI focuses on freight prices for export containers from Chinese ports to destinations around the world.
The CCFI index has become a highly relevant tool for the shipping industry and for stakeholders involved in international trade, as it provides key information on trends and fluctuations in ocean freight costs. This allows companies and industry analysts to make informed decisions regarding logistics planning, cost management and competitiveness assessment.
This time series consists of 2855 observations between August 2, 2011 and August 11, 2022 with a daily temporality.
Lastly, the data augmentation models have been tested with the time series corresponding to the gold price, through the Gold Spot Price index, a reference used to determine the current value of gold in the spot market. It refers to the price at which gold can be bought or sold immediately, with immediate delivery and cash settlement. This index is widely followed and used as a key indicator of the price of gold in real time.
The spot price of gold is influenced by various factors, including supply and demand, global economic conditions, monetary policy, geopolitical stability and financial market movements. The gold supply comes from mining production, as well as the sale of recycled gold and from central banks.
For this time series, the available data is from May 13, 2015 to June 15, 2022, with 1850 observations.

Dataset augmentation
To carry out this experiment, two different data augmentation techniques are used, with the aim of verifying the prediction performance of neural networks with both techniques on an economic time series.

Interpolation
First, the interpolation technique is chosen. The interpolation technique is based on using mathematical methods to estimate unknown or missing values in a time series, based on the values observed at earlier and later moments. It can be used to fill in gaps in a time series, that is, to add missing observations, thus increasing the amount of data available for analysis and prediction.
The main objective of this technique is to find a function that approximates the missing values in a way that is coherent with the observed values. The interpolation method is defined in terms of k, the number of observed values on each side of the missing position i, and d_i, the distance between i and the closest observed value. This technique calculates the missing value as a linear combination of the closest observed values, weighted by a quadratic function that favors values that are closer together and assigns less importance to values that are further away.
Figure 1
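One way to read the weighting described above is as an inverse-squared-distance average over the k nearest observed values on each side of a gap. The sketch below implements that reading. The exact weighting function used in this work is not reproduced here, so the 1/d² weights, the use of `None` to mark missing values and the function name `fill_missing` are all assumptions.

```python
def fill_missing(series, k=2):
    """Fill None entries with a weighted average of the k nearest observed
    values on each side, using weights 1 / distance**2 (assumed form)."""
    observed = [(i, v) for i, v in enumerate(series) if v is not None]
    if not observed:
        raise ValueError("series has no observed values")
    filled = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = [(j, u) for j, u in observed if j < i][-k:]   # k nearest before i
        right = [(j, u) for j, u in observed if j > i][:k]   # k nearest after i
        neighbours = left + right
        weights = [1.0 / (i - j) ** 2 for j, _ in neighbours]
        total = sum(w * u for w, (_, u) in zip(weights, neighbours))
        filled[i] = total / sum(weights)
    return filled


print(fill_missing([1.0, None, 3.0], k=1))        # [1.0, 2.0, 3.0]
print(fill_missing([0.0, None, None, 3.0], k=1))  # gap values pulled toward the nearer neighbour
```

Only originally observed values are used as neighbours, so an estimate for one gap never feeds into the estimate for another.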

TSAUG
The Tsaug library for time series data augmentation, implemented in Python, provides a wide variety of transformations applicable to the data in order to generate new observations. Some of the transformations include time shift, amplitude shift, noise reduction, smoothing and frequency shift.
This implementation is based on the creation of transformation objects, each of which applies a specific operation to the input time series.
Tsaug also provides composition functions to combine transformations sequentially or randomly, which makes it possible to create complex sequences of transformations.
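The design described above, transformation objects that can be chained or applied at random, can be illustrated with a small self-contained sketch. The classes below are hypothetical stand-ins written for this example; they are not the Tsaug API, and their names and parameters should not be read as the library's own.

```python
import random


class GaussianNoise:
    """Add zero-mean Gaussian noise with the given standard deviation."""

    def __init__(self, scale):
        self.scale = scale

    def __call__(self, series, rng):
        return [x + rng.gauss(0.0, self.scale) for x in series]


class Rescale:
    """Multiply every observation by a constant factor."""

    def __init__(self, factor):
        self.factor = factor

    def __call__(self, series, rng):
        return [x * self.factor for x in series]


class Chain:
    """Apply the given transformations one after another."""

    def __init__(self, *steps):
        self.steps = steps

    def __call__(self, series, rng):
        for step in self.steps:
            series = step(series, rng)
        return list(series)


class Sometimes:
    """Apply a transformation with probability p, otherwise pass through."""

    def __init__(self, step, p):
        self.step = step
        self.p = p

    def __call__(self, series, rng):
        return self.step(series, rng) if rng.random() < self.p else list(series)


rng = random.Random(42)  # fixed seed so the augmentation is reproducible
pipeline = Chain(Rescale(1.05), Sometimes(GaussianNoise(scale=0.1), p=0.5))
series = [100.0, 101.5, 99.8, 102.3]
augmented = pipeline(series, rng)
print(augmented)
```

Keeping each transformation as a small callable object is what makes sequential and random composition straightforward: a composite is itself a transformation and can be nested inside another.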
Some studies have shown that data augmentation using TSAUG can significantly improve the performance of neural network models in time series forecasting. For example, Wang et al. [55] used TSAUG to augment the electric power time series dataset and achieved a significant improvement in the performance of neural network models. Several authors have demonstrated its usefulness in augmenting time series for classification tasks ([23,56,57]). TSAUG has also been shown to be effective in predicting financial time series [58] and predicting medical time series [59].
Figure 2 shows how the TSAUG technique adds the observations to the end of the time series, which allows working with the same observations for training, while in the test the time series will use different values when incorporating the new observations in the final part of the series. This technique will perform shock replay over the newly incorporated data.

Evaluation metrics
For the evaluation of the performance of the data augmentation presented in this work, the four most used error metrics have been chosen:
Root mean squared error (RMSE):

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

Mean squared error (MSE):

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

R-squared (R2):

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$

where y_i are the observed values, ŷ_i the predicted values, ȳ the mean of the observed values and n the number of observations.
RMSE is a metric used to measure prediction error, or neural network performance. The lower this value, the better the prediction.
The value of R2 is closely related to the MSE and is also used to assess model performance. It measures the proportion of the variation in the dependent (output) variable that is explained by the independent (input) variables. The closer this value is to 1, the better the network performance.
The use of RMSE and MSE is justified by their ability to measure the average difference between the predictions and the actual values. These metrics quantify the square root of the mean squared error and the mean squared error, respectively. Both metrics penalize the largest errors and provide a measure of the model's accuracy in terms of the spread of the errors [60].
These metrics have been widely adopted in the scientific literature and have been used in various research fields, such as stock price forecasting and electricity demand forecasting [61].
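For reference, the three metrics named in this section can be computed as in the sketch below, using their standard definitions: MSE is the mean of the squared errors, RMSE is its square root, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares. The function names and the sample values are illustrative only and are not taken from the experiments.

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared differences between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the MSE, expressed in the units of the series."""
    return math.sqrt(mse(y_true, y_pred))


def r2(y_true, y_pred):
    """One minus residual sum of squares over total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative values, not results from this study
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
print(mse(y_true, y_pred), rmse(y_true, y_pred), r2(y_true, y_pred))
```

Because RMSE is in the same units as the series, it is the easiest of the three to compare directly against the scale of an index such as MSCI or CCFI.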

Data augmentation results
The result of the increase in the data of the different time series is reflected in Table 1. For each time series, two tests are carried out. In the first, the data set is increased by 1,000 observations for the MSCI and CCFI time series (26% and 35% respectively) and by 500 for the GOLD series (27%). In the second test, there is an increase of 2,000 observations for MSCI and CCFI, being 52% and 70% for each series. For the GOLD series, there is an increase of 1,000, which represents 54% of the original series.
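The percentages for the CCFI and GOLD series can be checked against the series lengths given in the data description (2855 and 1850 observations). The short sketch below does this arithmetic; the MSCI series is left out because its number of observations is not stated in the text, and the helper name `pct_increase` is an assumption.

```python
def pct_increase(added, original):
    """Added observations as a percentage of the original series length."""
    return 100.0 * added / original


original_sizes = {"CCFI": 2855, "GOLD": 1850}          # from the data description
added_per_test = {"CCFI": (1000, 2000), "GOLD": (500, 1000)}

for name, size in original_sizes.items():
    first, second = added_per_test[name]
    print(name, round(pct_increase(first, size)), round(pct_increase(second, size)))
# CCFI 35 70
# GOLD 27 54
```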

Forecasting model results
The increase in data is carried out using the interpolation and TSAUG techniques. Later, the results are analyzed after processing all the time series with the ARIMA models and a multilayer perceptron network. Then, it will be possible to evaluate how the increase in observations of a time series makes it possible to improve the predictions produced by the models and the techniques that best model this increase.
Tables 2 and 3 reflect the results obtained by the ARIMA model after the different data increases carried out, showing an improvement in the metrics with the interpolation technique. The ARIMA model is able to better model the time series augmented with the interpolation technique: with an increase in the amount of data above 50%, an improvement of 44% in the RMSE metric is achieved for the MSCI time series, 65% for the GOLD series and 63% for the CCFI series. When the data are increased with the TSAUG library, the metrics worsen substantially, especially in the first round of increasing observations.
Tables 4 and 5 show the results obtained with a multilayer perceptron model after increasing the observations in all time series. As with the ARIMA model, the MLP artificial neural network is able to better model time series augmented with the interpolation technique. In this case the improvements in the second round of increases for the RMSE metric are 73%, 91% and 76% respectively.
Figure 3 shows the results observed in Tables 3 and 5: a decreasing trend as a function of the increase in the number of observations. When this increase occurs, the RMSE metric decreases significantly, providing more accurate predictions and allowing model improvement.
On the other hand, Figure 4 shows how a greater number of observations does not directly imply a lower error, resulting in a less efficient technique for economic time series.
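The percentage improvements quoted in this section can be read as relative reductions in RMSE with respect to the model fitted on the original series. The sketch below shows that calculation under this assumption; the text does not state the exact formula used, the function name is hypothetical, and the RMSE values are made up for illustration rather than taken from the tables.

```python
def rmse_improvement(baseline_rmse, augmented_rmse):
    """Relative RMSE reduction, in percent, of the augmented series
    with respect to the baseline (original) series."""
    if baseline_rmse <= 0:
        raise ValueError("baseline RMSE must be positive")
    return 100.0 * (baseline_rmse - augmented_rmse) / baseline_rmse


# Illustrative values only, not results from Tables 2 to 5
print(round(rmse_improvement(10.0, 5.6)))   # 44
print(round(rmse_improvement(10.0, 12.0)))  # -20, a negative value means the error got worse
```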

Conclusions and future lines of research
This research aims to analyze the impact of data augmentation techniques on economic time series and how it affects prediction algorithms, filling the gap in the existing literature that compares different data augmentation algorithms on economic time series.
The process of increasing data in time series is subject to the need to maintain the initial characteristics of the series so as not to interfere with the performance of predictions and statistical studies, achieving stability between the original time series and the augmented time series. This study compares the behavior of different data augmentation algorithms applied to economic time series and the results of making predictions with classical statistical techniques and neural network techniques.
Two data augmentation techniques are used to increase the observations of three economic time series with the purpose of observing if this increase has an impact on a better fit of the prediction models.
Subsequently, they are analyzed by an ARIMA model and a multilayer perceptron (MLP) ANN model, to observe how the use of augmented economic time series affects the performance of the models.
The results obtained reflect how the imputation models allow an increase in observations, subsequently improving the metrics returned by both the ARIMA and MLP models. The observations generated by the TSAUG library result in worse modeling, causing the metrics to show a higher error and not improving the training and testing process in either of the two cases.
These results coincide with the existing literature, highlighting the importance of increasing the amount of data available, presenting various data augmentation strategies used to improve the accuracy of forecast models [62].
Similarly, the conclusions coincide with Asem et al. [63], who through several experiments using the interpolation technique evaluated the effectiveness of these techniques in the generation of new data points to improve the accuracy of time series forecast models. The use of data augmentation techniques implies an improvement in the performance of the prediction techniques, improving the error metrics and therefore the predictions; this improvement is especially significant by using the interpolation technique. Based on the research carried out in this work, future lines are proposed in which combined approaches to increase data are sought, allowing an improvement in the performance of prediction models.

Use of AI tools declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.