Time Series (re)sampling using Generative Adversarial Networks

We propose a novel bootstrap procedure for dependent data based on Generative Adversarial Networks (GANs). We show that the dynamics of common stationary time series processes can be learned by GANs and demonstrate that GANs trained on a single sample path can be used to generate additional samples from the process. We find that temporal convolutional neural networks provide a suitable design for the generator and discriminator, and that convincing samples can be generated on the basis of a vector of iid normal noise. We demonstrate the finite sample properties of GAN sampling and the suggested bootstrap using simulations where we compare the performance to circular block bootstrapping in the case of resampling an AR(1) time series process. We find that resampling using the GAN can outperform circular block bootstrapping in terms of empirical coverage.


INTRODUCTION
Generative Adversarial Nets (GANs) were introduced by Goodfellow, Pouget-Abadie, et al. (2014). Based on an initial training sample, GANs can learn to generate additional data that looks similar. GANs are intensely researched in the deep learning literature (Radford et al., 2015; Salimans et al., 2016; Gulrajani et al., 2017; Arjovsky et al., 2017) but have received minor attention in time series analysis. There are examples of GANs being explored for structural models (Kaji et al., 2018) and estimation of treatment effects (Athey et al., 2019).
However, these are both in the cross-sectional iid setting. Hyland et al. (2017) suggest that GANs can be used for what they call "medical" time series but they lack a clear definition of the data generating process (DGP) and correspondingly measures of quality for the learned model. Recently, Wiese et al. (2020) described a GAN for financial time series that can reproduce some of the stylised facts of such series using a temporal convolution architecture related to the one suggested in this work. Smith and Smith (2020) also outlined a method for training GANs on time series using spectrograms. Unlike Wiese et al. (2020) and Smith and Smith (2020), our focus is on the general applicability of GANs as a bootstrap method for dependent processes. The potential usefulness of GANs for such time series bootstraps is briefly mentioned in recent work by Haas and Richter (2020).
GANs have frequently been applied for image synthesis (Goodfellow, Pouget-Abadie, et al., 2014; Radford et al., 2015; Arjovsky et al., 2017) and the generated samples are often evaluated using measures such as the Inception Score (Salimans et al., 2016) or the Fréchet Inception Distance (Heusel et al., 2017). Both measures utilise a neural network trained for image recognition to attempt to assess the visual quality of the generated samples. However, these are heuristics and it is difficult to construct a theoretically motivated notion of quality.
Contrary to Hyland et al. (2017), we argue that it is straightforward to assess the basic properties of the generated samples in a time series context, as the theoretical properties of many time series are well understood, contingent on considering explicitly defined data generating processes. A similar point is made by Wiese et al. (2020).
We show that stationary autoregressive time series processes, exemplified by the AR(1) process, can be learned by GANs trained on a single sample path, and find that temporal convolutional neural networks provide a suitable design for the generator and discriminator.
Bootstrapping dependent data -such as samples from a time series process -has received long-standing attention in the literature, see e.g. the overview of block bootstrapping by Lahiri (1999) and the newer contributions of Paparoditis and Politis (2001) and Shao (2010).
We suggest that GANs provide a novel approach to such bootstrapping, which we call the generative bootstrap (GB). The theoretical properties of GANs, and hence of our suggested generative bootstrap, are an active area of research, see (Biau, Cadre, et al., 2018; Biau, Sangnier, et al., 2020; Haas and Richter, 2020). We instead contribute by analysing the finite sample properties of generative bootstrapping using simulations where we compare the performance against Circular Block Bootstrapping (CBB) (Politis and Romano, 1992) for the AR(1) process. In particular, we recover the parameters of the true data generating process using the simulated data and thereby show that the generative model has learned at least a minimum of characteristics of the process. Further, we find that resampling using the generative model can outperform CBB for dependent data on the empirical coverage of percentile confidence intervals.
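For reference, the CBB benchmark used throughout can be sketched in a few lines. This is a minimal illustration in numpy (function and variable names are ours, not from a particular library), not the implementation used in the simulations:

```python
import numpy as np

def cbb_resample(x, block_size, rng):
    """One circular block bootstrap resample (Politis and Romano, 1992):
    draw block start points uniformly, read blocks off the series treated
    as a circle, concatenate, and truncate to the original length."""
    n = len(x)
    n_blocks = -(-n // block_size)  # ceil(n / block_size)
    starts = rng.integers(0, n, size=n_blocks)
    idx = (starts[:, None] + np.arange(block_size)) % n  # wrap around the end
    return x[idx].ravel()[:n]

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
x_star = cbb_resample(x, block_size=50, rng=rng)
```

The circular wrap is what distinguishes the CBB from the plain moving block bootstrap: every observation appears in the same number of candidate blocks.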
Our main contributions are: (1) we show that GANs can learn the dynamics of stationary autoregressive time series processes, (2) we find that temporal convolutions provide a working architecture for the discriminator and generator networks, and (3) we show that GANs can be used to resample from dependent data with suggestive finite-sample improvements in empirical coverage over CBB. Contributions (1) and (2) are also discussed by Wiese et al. (2020). However, Wiese et al. (2020) do not consider the more general applicability of GANs to time series bootstrapping, which we define and subsequently evaluate in our simulations.
In Section 2 we review two common GANs and discuss how they can be trained to generate samples from a time series based on an initial sample path. Section 3 discusses an algorithm for using the trained GAN to bootstrap dependent data. In Section 4 we provide simulation evidence of the quality of the learned GAN and its performance when used for bootstrapping. Section 5 concludes.

Basic GAN
The concept of GANs can be introduced intuitively as follows. Assume that a real sample of data is drawn from the unknown distribution F_X and assume that we have another arbitrary, but known, distribution F_Z. The generator G is a function that transforms a sample from F_Z into a sample that looks like it is drawn from the real data distribution F_X. The discriminator D is a function that tries to determine if a given sample is drawn from the real data distribution F_X or not. The two models are set to play a game against each other.
The generator tries to fool the discriminator by generating fake samples that look as real as possible, and the discriminator tries to detect the generator's forgery by determining if it got a real or fake sample.
Let G and D be specified up to the finite dimensional parameters θ_G and θ_D respectively. Also, let x_real denote some generic real sample from distribution F_X and x_fake = G(z; θ_G), z ∼ F_Z, a generated sample. Goodfellow, Pouget-Abadie, et al. (2014) suggest solving the minimax problem

min_{θ_G} max_{θ_D} E[log D(x_real; θ_D)] + E[log(1 − D(G(z; θ_G); θ_D))]. (1)

In practice Goodfellow, Pouget-Abadie, et al. (2014) separate the minimisation and maximisation steps into

max_{θ_D} E[log D(x_real; θ_D)] + E[log(1 − D(x_fake; θ_D))], (2)
min_{θ_G} E[log(1 − D(G(z; θ_G); θ_D))],

and iterate between these to learn the discriminator and generator using batching and stochastic gradient descent. We skip the details here, but describe the training in detail for the Wasserstein GAN in the following section. Algorithm 1 provides an overview of the training algorithm. Biau, Cadre, et al. (2018) and Goodfellow, Pouget-Abadie, et al. (2014) argue that, under a set of assumptions, the optimal discriminator in the minimax formulation in Equation 1 is related to the Jensen-Shannon divergence between the distributions of the real and generated data. If F_G is the distribution of the transformation G(Z; θ_G), Z ∼ F_Z, and the optimal discriminator is in the class of functions {D(·; θ_D) : θ_D ∈ Θ} where Θ is some parameter space, then the solution to the maximisation problem in Equation 1 is (up to constants) the Jensen-Shannon divergence JS between F_X and F_G (Biau, Cadre, et al., 2018; Goodfellow, Pouget-Abadie, et al., 2014), and, heuristically, if we assume the discriminator is optimal then the generator is solving the problem

inf_G JS(F_X, F_G). (3)

As noted by Biau, Cadre, et al. (2018), this seems to have motivated work on investigating other divergences/distances in the context of GANs. One such distance is the Wasserstein distance, which we will consider in the following section.
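The separated objectives can be written directly as loss functions. A minimal numpy sketch (the non-saturating generator loss used here is the practical variant suggested by Goodfellow, Pouget-Abadie, et al. (2014); the function names are ours):

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """Negated discriminator objective of Equation 1: the discriminator
    maximises E[log D(x_real)] + E[log(1 - D(x_fake))], so we minimise
    minus that. d_real, d_fake are discriminator outputs in (0, 1)."""
    return -np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Non-saturating generator loss: maximise E[log D(G(z))]."""
    return -np.mean(np.log(d_fake))

# An uninformative discriminator outputting 0.5 everywhere has loss 2*log(2),
# the value attained at the optimum when the generated and real laws coincide.
d_half = np.full(8, 0.5)
```

In a training loop these two losses would be minimised alternately by gradient steps on θ_D and θ_G respectively.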

Wasserstein GAN
Arjovsky et al. (2017) argue that an alternative distance measure in Equation 3 is the order-1 Wasserstein (Earth-Mover) distance, which results in the Wasserstein GAN. Consider informally two probability measures P and Q defined on a suitable common probability space (M, ||·||).

[Algorithm 1: GAN training (Goodfellow, Pouget-Abadie, et al., 2014). For i = 1, 2, ..., N: compute the discriminator loss; compute the generator loss; update the parameters.]

The Wasserstein distance W_1 between P and Q is defined as

W_1(P, Q) = inf_{V ∈ Π} E_V [ ||x − y|| ] (4)

where, with abuse of notation, Π is the set of all joint probability measures V(x, y) with marginal probabilities P(x) and Q(y), and ||·|| is the absolute value norm (Arjovsky et al., 2017). Here E_V denotes expectation under the probability measure V. Arjovsky et al. (2017) argue that Equation 4 is equivalent to

W_1(P, Q) = sup_{D ∈ F} E_P [D(x)] − E_Q [D(y)] (5)

where F is the set of real-valued Lipschitz functions on M with Lipschitz constant 1. In Equation 5 we have conveniently denoted the function to be optimised over by D as we can consider it to play the role of a discriminator. Given the discriminator, the generator would like to minimise the distance between the generated data and real data; if the laws of generated and real data are given by P and Q then the generator is solving the problem inf_G W_1(P, Q). This is analogous to Equation 3 but the Jensen-Shannon divergence JS has been replaced by the Wasserstein distance W_1. A primary issue in operationalising Equation 5 is enforcing the Lipschitz condition on D. For example, say we learn D using a neural network, how do we constrain this network to only learn Lipschitz-1 functions?

Let G and D be specified up to the finite dimensional parameters θ_G and θ_D respectively. Also, let x_real denote some generic real sample from distribution F_X and x_fake = G(z; θ_G), z ∼ F_Z, a generated sample. Now based on Equation 5, Arjovsky et al. (2017) suggest solving the minimax problem

min_{θ_G} max_{θ_D} E [D(x_real; θ_D)] − E [D(G(z; θ_G); θ_D)]. (6)

A simple training algorithm for solving the problem (6) would be splitting it into a min and a max step, and iterating between them (Goodfellow, Pouget-Abadie, et al., 2014):

max_{θ_D} E [D(x_real; θ_D)] − E [D(x_fake; θ_D)], (7)
min_{θ_G} −E [D(G(z; θ_G); θ_D)]. (8)

Gulrajani et al. (2017) recognise that a function is Lipschitz-1 if and only if the norm of its gradient is 1 or less everywhere, so they suggest a gradient penalty to enforce the Lipschitz condition in the discriminator. Here x̂ = a x_real + (1 − a) x_fake is a convex combination of x_real and x_fake with uniform random weight a ∼ U(0, 1), and we let F_x̂ denote the distribution of these convex combinations. This procedure is motivated heuristically in Gulrajani et al. (2017) and is a less computationally intensive way of enforcing the Lipschitz constraint across all possible x. Under the gradient penalty the discriminator objective function is now

max_{θ_D} E [D(x_real; θ_D)] − E [D(x_fake; θ_D)] − λ E_{x̂ ∼ F_x̂} [ (||∇_x̂ D(x̂; θ_D)|| − 1)² ] (9)

where the weight of the gradient penalty is adjusted by the hyper parameter λ.
Let {(z_i, x_i,real)}_{i=1}^{n_b} constitute a (mini) batch of data where z_i is noise sampled from F_Z and x_i,real is a real sample. During training we minimise the batch discriminator loss

L_D^(b) = (1/n_b) Σ_{i=1}^{n_b} [D(x_i,fake; θ_D) − D(x_i,real; θ_D)] + (λ/n_b) Σ_{i=1}^{n_b} (||∇_x̂ D(x̂_i; θ_D)|| − 1)², (D1)

the empirical and batched equivalent of Equation 9, where x_i,fake = G(z_i; θ_G) and x̂_i is the convex combination of x_i,real and x_i,fake. Similarly, for the generator we minimise the batch generator loss

L_G^(b) = −(1/n_b) Σ_{i=1}^{n_b} D(G(z_i; θ_G); θ_D), (G1)

corresponding to Equation 8. By alternating between the objectives (D1) and (G1) we can learn the parameters of G and D. The complete training algorithm of Gulrajani et al. (2017) is given in Algorithm 2 in pseudo-code. Notice that during the discriminator updates the gradient is with respect to θ_D and in the generator updates with respect to θ_G. Algorithm 2 uses Stochastic Gradient Descent (SGD) to update the parameters, but more sophisticated optimisation methods could also be applied, e.g. ADAM (Kingma and Ba, 2014).
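In practice (D1) and (G1) are computed with automatic differentiation. As a hand-checkable sketch, assume a linear critic D(x) = x·w so that the input gradient, and hence the penalty, is available in closed form; the linear critic is our simplifying assumption for illustration, not part of the method:

```python
import numpy as np

def critic_batch_loss(w, x_real, x_fake, lam=10.0, rng=None):
    """Batch loss (D1) for the toy critic D(x) = x @ w: Wasserstein term
    mean D(x_fake) - mean D(x_real) plus the gradient penalty of
    Gulrajani et al. (2017). For a linear critic, grad_x D(x_hat) = w at
    every convex combination x_hat, so the penalty is lam*(||w|| - 1)^2."""
    rng = rng or np.random.default_rng(0)
    a = rng.uniform(size=(len(x_real), 1))
    x_hat = a * x_real + (1.0 - a) * x_fake  # convex combinations; unused
    grad_norm = np.linalg.norm(w)            # by the linear critic's gradient
    penalty = lam * (grad_norm - 1.0) ** 2
    return np.mean(x_fake @ w) - np.mean(x_real @ w) + penalty

def generator_batch_loss(w, x_fake):
    """Batch loss (G1): minus the mean critic score of the fake batch."""
    return -np.mean(x_fake @ w)
```

With a neural-network critic the gradient at each x̂_i is obtained by backpropagation instead of the closed form above.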
The GAN formulation above does not necessarily impose how we should parameterise the discriminator D and generator G. However, in practice, they are commonly learned using neural networks with exact parameterisations depending on the application. Hornik et al. (1990) showed that neural networks with fully-connected layers enjoy universal approximation properties and hence are a natural choice. We do not give an introduction to neural networks and their terminology but refer to the textbook treatment by Goodfellow, Bengio, et al. (2016).
Consider a time series process Y = {Y_t : t ∈ T} indexed by time t. A time series has the defining property that the information flow is unidirectional: the state of the process at time t, Y_t, can only depend on past information (Y_{t−1}, Y_{t−2}, ...) while the future is unknown. This constraint is useful when we choose the parameterisation of G and D.
The GAN in Hyland et al. (2017) relied on recurrent neural networks (RNNs) to model time series. We pursue a different approach and base the generator and discriminator on the stacked dilated temporal convolutions (DTC) used by Oord et al. (2016) for audio generation.
We will refer to this as the TC-architecture. The temporal convolutions are similar to conventional convolutions, see (Goodfellow, Bengio, et al., 2016, Chp. 3), but they enforce the unidirectional flow of information. They were applied to time series forecasting by Borovykh et al. (2017) and Sen et al. (2019). In particular, Borovykh et al. (2017) showed that DTC networks outperform RNNs in several forecasting problems and are easier to train even for long-range dependence. Very recently, Wiese et al. (2020) similarly suggested temporal convolutions based on (Oord et al., 2016) for financial time series modelling with GANs. The dilation of the temporal convolutions increases the receptive field, which in the context of time series is the number of lags that the model can accommodate at once, while limiting the number of parameters (Oord et al., 2016; Borovykh et al., 2017).
In practice, temporal convolutions can be implemented as conventional one-dimensional convolutions with appropriate zero-padding applied to the input, see (Oord et al., 2016). If we stack d DTC layers with kernel size 2, where the dilation for layer i is 2^i, then the total receptive field size at the final layer will be (Yu and Koltun, 2015)

p = 1 + Σ_{i=1}^{d} 2^i = 2^{d+1} − 1. (10)

For illustration, assume that our generator G consists of d DTC layers. To generate a time series of length b we slide the DTC layers over a sequence of (b + p) iid noise terms z_t ∼ F_Z, where F_Z is some arbitrarily chosen distribution and p is the receptive field size given in Equation 10. During the GAN training, the parameters in the DTC layers learn to transform this sequence of iid noise into observations from the time series. This is illustrated in Figure 1 for a generator with two DTC layers. On the other hand, the discriminator D considers sequences of observations from the generated or real time series (y_1, y_2, ..., y_T) and learns the parameters in the DTC layers to distinguish between real and generated samples. We will detail this process in the context of bootstrapping in Section 3, and provide a complete example of the architecture in Section 4.
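A single causal dilated convolution and the receptive field of Equation 10 can be sketched in numpy; the kernel weights and sizes below are illustrative only:

```python
import numpy as np

def causal_dilated_conv(x, kernel, dilation):
    """1-D causal convolution: y_t = sum_j kernel[j] * x_{t - j*dilation},
    with zero left-padding so y_t never looks into the future."""
    pad = dilation * (len(kernel) - 1)
    xp = np.concatenate([np.zeros(pad), np.asarray(x, float)])
    t = np.arange(len(x))
    return sum(kernel[j] * xp[pad - j * dilation + t] for j in range(len(kernel)))

def receptive_field(d, kernel_size=2):
    """Equation 10: kernel size 2 with dilation 2^i at layer i = 1..d
    gives p = 2^(d+1) - 1."""
    return 1 + (kernel_size - 1) * sum(2 ** i for i in range(1, d + 1))

# An impulse exposes the causal taps: with dilation 2, y_t = x_t + x_{t-2}.
y = causal_dilated_conv([1, 0, 0, 0, 0], kernel=[1.0, 1.0], dilation=2)
```

Stacking such layers with doubling dilations is exactly what grows the receptive field exponentially while the parameter count grows only linearly.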

GENERATIVE BOOTSTRAP
We propose the GAN with temporal convolution layers as a method to resample from a time series process. This procedure is called the Generative Bootstrap (GB). The GB is composed of two stages: (1) the GAN is trained on an initial sample from the true DGP using blocking - the training stage.
(2) samples are generated from the generator of the trained GAN -the sampling stage.
The initial sample from the true DGP is resampled using a moving block scheme similar to the Moving Block Bootstrap (MBB) (Kunsch, 1989; Liu, Singh, et al., 1992), and n_b of such blocks constitute a batch of data that is used to perform one iteration over the batch losses in Equations (D1)-(G1). Many iterations are performed until the GAN losses stabilise.
For sampling, the discriminator is discarded and we feed noise into the generator from an arbitrary distribution F Z .The generator output is used as a sample to calculate one set of bootstrap statistics.This procedure is repeated for an arbitrary number of samples and the collection of statistics is used to form the GB estimates.
We will see that this differs from conventional block bootstrapping in two respects.
where p is again given in Equation 10. As in the training stage, F_Z is a multivariate standard normal distribution with an identity variance-covariance matrix. Any distribution could be used; the important point is that the training and sampling stages use the same distribution for F_Z. Next we obtain a generated sample path y_i by passing the noise vectors through the learned generator, y_i = G(z_i; θ̂_G). A single sequence of innovation vectors z_i = (z_{1,i}, ..., z_{b_2+p,i}) generates one sample path y_i of length b_2. We can repeat this process to obtain an arbitrary number of sample paths.
Under the proposed TC-architecture the generator can sample a block of any length from the underlying process, and hence we are not restricted to the block size on which the model was trained, i.e. it is perfectly acceptable if b_1 ≠ b_2. This does not necessarily hold for all choices of architecture; e.g. a fully-connected network would not have this property. This is a very attractive feature of the TC and GAN approach as it alleviates the need to stack individual blocks in a way that might break the dependence structure of the time series. We can simply choose the sampling block size to be equal to the size of the initial sample path, so b_2 = T.
When b 2 < T we refer to it as blocked sampling, while b 2 = T is called complete sampling.
Bootstrap statistics. Let G(·; θ̂_G) be the learned generator from the training stage that has been trained on a single initial sample y*_i from the true DGP. We now discuss how to calculate bootstrap statistics on the GAN samples. Assume that we are interested in a parameter φ which has a suitable estimator φ̂. We use the sampling procedure from the previous section to obtain m samples from the learned generator G(·; θ̂_G); denote these samples by (y_1, y_2, ..., y_m) where y_i = (y_{1,i}, ..., y_{T,i}), i = 1, ..., m. Each y_i is considered a realisation of the DGP that produced the initial training sample for the GAN. We calculate the bootstrap statistics φ̂_i ≡ φ̂(y_i), i = 1, ..., m, resulting in m estimates (φ̂_1, φ̂_2, ..., φ̂_m).
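Given the m generated paths, the bootstrap statistics are computed exactly as in the conventional bootstrap. A numpy sketch for the AR(1) least-squares estimator used later in the simulations (function names are ours; any other statistic φ̂ can be plugged in place of `ar1_ls`):

```python
import numpy as np

def ar1_ls(y):
    """Least-squares estimate of phi in y_t = phi * y_{t-1} + eps_t."""
    return np.dot(y[:-1], y[1:]) / np.dot(y[:-1], y[:-1])

def gb_statistics(paths, alpha=0.05):
    """Variance estimate and (1 - alpha) percentile confidence interval
    (Efron, 1981) over the m bootstrap estimates phi_hat_i = phi_hat(y_i)."""
    phis = np.array([ar1_ls(y) for y in paths])
    lo, hi = np.quantile(phis, [alpha / 2.0, 1.0 - alpha / 2.0])
    return phis.var(ddof=1), (lo, hi)
```

Here `paths` would be the GAN samples; in the test below we stand in for the generator with paths simulated directly from the true DGP.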
Analogously to conventional bootstrapping (Efron, 1981), the GB variance estimate of φ̂ is

Var_GB(φ̂) = (1/(m − 1)) Σ_{i=1}^{m} (φ̂_i − φ̄)², where φ̄ = (1/m) Σ_{i=1}^{m} φ̂_i.

The (1 − α) GB confidence interval (CI) for φ is the (1 − α) percentile CI (Efron, 1981) constructed using the empirical quantiles (α/2, 1 − α/2) of (φ̂_1, ..., φ̂_m).

SIMULATIONS

In this section we illustrate the performance of the GB using simulations and comparisons to the established CBB approach for bootstrapping dependent processes. For simplicity of exposition we base our illustrations on the AR(1) as the data generating process.

AR(1) process
The simulation design is as follows. The true DGP is a zero mean and stable AR(1) process

y_t = φ y_{t−1} + ε_t (13)

with φ = 0.5, 0.8, 0.9. For each replication, a sample path of length T = 1,000 is generated from Equation 13. This sample is used to train the GAN with a training block size of b_1 = 150 and batch size n_b = 64. Once training is complete, we draw 10,000 sample paths from the GAN. These samples are used for two purposes: (a) we compare the samples generated by the proposed GAN to the known properties of the DGP under complete sampling, b_2 = 1,000. We use the generated samples to estimate the autocorrelation (ACF) and partial autocorrelation (PACF) functions over 1,000 replications.
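The simulation ingredients above can be reproduced in a few lines. A numpy sketch with standard-normal innovations (the innovation distribution is our assumption for illustration; the text only specifies a zero-mean stable AR(1)):

```python
import numpy as np

def simulate_ar1(phi, T, rng, burn=100):
    """Zero-mean stable AR(1): y_t = phi * y_{t-1} + eps_t, with
    eps_t iid N(0, 1) assumed here, plus a burn-in to reach stationarity."""
    e = rng.standard_normal(T + burn)
    y = np.zeros(T + burn)
    for t in range(1, T + burn):
        y[t] = phi * y[t - 1] + e[t]
    return y[burn:]

def sample_acf(y, max_lag):
    """Sample autocorrelations (rho_hat(1), ..., rho_hat(max_lag))."""
    yc = y - y.mean()
    denom = np.dot(yc, yc)
    return np.array([np.dot(yc[:-j], yc[j:]) / denom
                     for j in range(1, max_lag + 1)])
```

Averaging `sample_acf` over replications should recover the theoretical ACF φ^j, which is exactly the check applied to the GAN samples below.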
(b) we compare the GB and the CBB for confidence interval estimation, i.e., empirical coverage, of the least-squares estimator φ̂_LS of φ. The GB is run for 1,000 replications and the CBB is run for 5,000 replications. The CBB resamples from the same initial sample as is used to train the GB. We consider CBB block sizes b_1 = 50, 100, 150. The GB training block size equals 150. The number of replications for the GB is lower as the simulation time is considerably higher than for the CBB: a GB replication takes around 20-30 minutes while it is less than a minute for the CBB. It is important to note that, in both the CBB and the GB, we do not specify the dynamics of the true DGP. The GB assumes that the dynamics can be approximated by some functions of the noise vectors but these functions are not fully specified.
The following section details the hyper parameters and training of the GAN.The two succeeding sections discuss the simulation results -first the correlation structure of the generated samples and secondly the higher-level statistics in a bootstrapping context.
GAN implementation details.We discuss the hyper parameters and network design of the GAN.
The discriminator has 6 temporal convolution layers with common kernel size 2 and dilations (1, 2, 4, 8, 16, 32). The filters are (8, 16, 32, 32, 64, 64). The outputs from temporal layers number 1, 2, and 6 are run through adaptive max pooling (AMP) with feature size 16 and concatenated into a feature vector of size 48. This is followed by two fully connected layers that regress into a single output unit. All layers use leaky ReLU activation (Maas et al., 2013) except for the final layer which has no activation function. The leaky ReLU avoids the zero-gradient problem of conventional ReLUs during training (Maas et al., 2013). The generator has 6 temporal convolution layers that directly output a sample path. The filters are (128, 64, 32, 32, 16, 1). Except for the last layer, all layers use the Tanh activation function as it, unlike ReLU, is symmetric. The total number of (trainable) discriminator parameters is 233,609 while the generator has 89,921 (trainable) parameters. The GAN is trained for 5,000 steps based on a single sample from the DGP. We do not employ any (early) stopping criterion, so the GAN is always trained till the final step. The training involves iteratively minimising the batch losses, see Equations (D1)-(G1). Instead of stochastic gradient descent, we use the more sophisticated Adam algorithm as it can accelerate training, see (Kingma and Ba, 2014). Table 1 in the Appendix lists all hyper parameters for the GAN in this paper.
(a) ACF and PACF properties of the GAN samples. We compare the samples produced by the GAN and the CBB against the theoretical properties of the AR(1) process. The theoretical autocorrelation function (ACF) for an AR(1) process is given by γ(j) ≡ Cor(y_t, y_{t−j}) = φ^j for φ = 0.5, 0.8, 0.9. We estimate the ACF using generated samples under the complete sampling scheme. The ACF estimates are averaged over 1,000 replications.
Figure 2 plots the estimated ACF (full line) against the theoretical ACF (dashed line). In Figure 2 we have also included the interquartile range (IQR) for the theoretical ACF and the ACF estimated across the 1,000 replications. (For the implementation of the CBB we have used the Python library by Sheppard (2020).) If the GAN has learned the dynamics of the AR(1) process, then the estimated and the theoretical ACF should be similar and the theoretical and the estimated IQR should be overlapping. Clearly, for higher values of the autoregressive parameter φ the persistence of the process is stronger and challenges the GAN to learn longer-range dependencies.
From Figure 2 the estimated ACFs are close to their theoretical counterparts for all lags when φ = 0.5. For φ = 0.8 there is a small upwards bias in the estimated ACFs that is larger for the CBB, particularly for the intermediate range of lag lengths, i.e., lags 5-20. For φ = 0.9 there is a small but noticeable bias in the estimated ACFs for all lags considered, for both the GB and the CBB. The bias is again uniformly larger for the CBB. Importantly, all theoretical ACFs are well within the IQR of the estimated ACFs.
For φ = 0.5 the estimated IQR of the CBB (green ribbon) is almost identical to the theoretical IQR (blue ribbon). However, for φ = 0.8 and φ = 0.9 the upper limit of the estimated IQRs for the CBB seems to be considerably downward biased for all lags. Noticeably, the estimated IQRs for the GB (red ribbon) appear to be much less sensitive to the value of φ and are only marginally wider than the theoretical IQRs. We find these results very encouraging.
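The estimated autocorrelations above also determine the partial autocorrelations, which we examine next; for reference, the Durbin-Levinson recursion maps the ACF to the PACF (a standard recursion, sketched here in numpy with names of our choosing):

```python
import numpy as np

def pacf_from_acf(rho):
    """Durbin-Levinson recursion: partial autocorrelations from the ACF
    rho = (rho(1), ..., rho(K)). pacf[k-1] is the lag-k partial
    autocorrelation phi_{k,k}."""
    rho = np.asarray(rho, float)
    K = len(rho)
    pacf = np.empty(K)
    phi = np.array([rho[0]])  # AR(k) coefficients at the current order k
    pacf[0] = rho[0]
    for k in range(2, K + 1):
        num = rho[k - 1] - np.dot(phi, rho[k - 2::-1])
        den = 1.0 - np.dot(phi, rho[:k - 1])
        phi_kk = num / den
        phi = np.append(phi - phi_kk * phi[::-1], phi_kk)
        pacf[k - 1] = phi_kk
    return pacf
```

Feeding in the exact AR(1) ACF φ^j reproduces the theoretical cut-off: φ at lag 1 and zero at every higher lag.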
Next we turn to the partial autocorrelation function (PACF). For an AR(p) process the PACF is zero for lags larger than p. The AR(1) process is expected to have PACF equal to φ at lag 1 and zero PACF for all following lags. Figure 3 depicts the estimated PACF using the GAN samples and plots it against the theoretical PACF (horizontal dotted line).
The black horizontal marks denote the estimated IQRs. The estimated PACFs have the expected behaviour for φ = 0.5, 0.8, with values close to 0.5 and 0.8 at lag 1 respectively and values very close to zero for all remaining higher-order lags. For the highly persistent case φ = 0.9 the estimated PACF is slightly more imprecise, with a notable non-zero PACF at lag 2. However, overall, the PACF based on the GAN samples clearly suggests that the underlying time series under consideration is a highly persistent AR(1) process.

(b) Bootstrapping the least-squares estimator. We apply the GAN for resampling and examine if it recovers higher-level statistics, in particular the sampling distribution of the usual least-squares (LS) estimator φ̂_LS of the autoregressive parameter φ.
The simulation design is identical to that initially described. We obtain 10,000 sample paths from the GAN across 1,000 replications with φ = 0.5, 0.8, 0.9. For each replication, the GB variance of φ̂_LS and the GB confidence intervals are constructed as in Section 3.
Using the usual asymptotic approximation, the LS estimator φ̂_LS has asymptotic variance (1 − φ²)/T. Figure 4 contains the main simulation results for the GB using complete sampling (b_2 = 1,000). For values of φ in the range 0 to 0.75 the GB in general produces far better empirical coverage than the CBB for all nominal confidence levels, but particularly for the levels 0.99, 0.95, and 0.9. For the highly persistent processes none of the resampling methods considered has good empirical coverage. In these cases the empirical coverage of the GB is on par with the CBB with a block size equal to 100. Not surprisingly, the CBB with the largest block size (=150) here has the best coverage.
It is very likely that the performance of the GB for highly persistent processes could be improved by increasing the number of layers in the temporal convolution network. Recall that a temporal convolution network with d layers and fixed kernel size 2 accounts for at most 2^{d+1} − 1 lags (the receptive field), hence an increase in d might rectify this problem. How to select the optimal number of layers d in the GB procedure as a function of the persistence of the original process is ongoing work.

CONCLUDING REMARKS
GANs provide a promising approach for simulating time series data. Our results suggest that GANs can accurately learn the dynamics of common autoregressive time series processes using temporal convolutional networks. In addition, it is compelling that the GAN appears to improve empirical coverage in bootstrapping of dependent data when compared to the circular block bootstrap. This gives credibility to the use of GANs on data from time series processes that are unknown.
It is important to note that the various dependent bootstraps have a theoretical justification and their properties have been theoretically derived, see e.g. the overview in (Lahiri, 1999).The GAN and GB currently do not have this theoretical reassurance.
The GAN relies on a large number of hyper parameters and design choices. We have not investigated how sensitive our results are to these, but research on this is ongoing. We have used sensible defaults for batch sizes, learning rates, the gradient penalty, the update iteration scheme, activation functions, and the optimiser, but these are by no means optimal choices. This is a general shortcoming in the GAN literature and little is known about how to optimally choose these values.
Our simulations also rely on a simple and basic time series process. It would be fruitful to consider the performance on more general stationary processes (ARMA) and in settings where the error term has stochastic variance, e.g. GARCH.
Note that x_{i,fake} = G(z_i; θ_G), so that (D1) is the empirical and batched equivalent of Equation 9. As per usual, the batch gradients ∇_{θ_D} L_D^(b) serve as unbiased estimates of ∇_{θ_D} L_D (here L_D is the loss over the entire training sample while L_D^(b) is the loss in the batch only, so for L_D the sums run over (1, ..., n) instead of (1, ..., n_b)), which allows us to do stochastic gradient descent on the parameters (θ_D, θ_G). The first sum in (D1) amounts to the discriminator objective in Arjovsky et al. (2017) while the second sum corresponds to the gradient penalty suggested by Gulrajani et al. (2017).

Figure 1 :
Figure 1: Adapted from Figure 2 in (Oord et al., 2016). An illustration of the DTC network. The output y_t at time t is a function of the present and past noise terms (z_t, z_{t−1}, z_{t−2}, z_{t−3}). We generate the output samples as we slide across the noise terms.

Firstly, a conventional MBB would sample blocks of size b with replacement from the initial sample and stack these into one sample path matching the length of the initial sample. This stacked sample is then used to calculate bootstrap statistics. In the GB these blocks are not stacked, but fed as individual samples to train the GAN. Secondly, sampling the GAN does not have to use stacking and the GAN can provide a sample of any size once it has been trained. The size of the sample is determined by the noise terms supplied to the generator. We proceed to discuss the training and sampling stages below. Algorithm 3 provides an overview.

[Algorithm 3: Generative Bootstrap. For r = 1, 2, ..., R: run the training stage and then the sampling stage.]

Training stage. The training stage mimics Algorithm 2 but uses moving blocks to resample the initial sample, see Line 5 in Algorithm 2. Let y*_i = (y*_{1,i}, y*_{2,i}, ..., y*_{T,i}) be the initial sample from the true DGP. We perform a blocking procedure identical to the moving block bootstrap. Define each of the (T − b_1 + 1) overlapping blocks by B*_j = (y*_{1+j,i}, ..., y*_{b_1+j,i}), j = 0, ..., T − b_1, where the block size is b_1 < T. A batch of training data is given by randomly sampling n_b blocks from {B*_0, B*_1, ..., B*_{T−b_1}} without replacement; denote this batch of true blocks by Y*. Next, we sample the generator noise (see Line 3 of Algorithm 2) from a multivariate standard normal distribution with an identity variance-covariance matrix, but the distribution F_Z could be selected arbitrarily. To generate a sample path of length b_1 we need (b_1 + p) noise terms, where p is the receptive field in the DTC layers, see Equation 10. The noise terms are used to generate a block sample B_j = (G(z_1, ..., z_p; θ_G), G(z_2, ..., z_{p+1}; θ_G), ..., G(z_{b_1}, ..., z_{b_1+p}; θ_G)), see Line 4 of Algorithm 2, and n_b of such blocks are generated to form a batch of fake blocks; denote them by Y. The true and fake samples are fed to the discriminator and it tries to distinguish between them. This requires that the blocks in Y* and Y have the same size. This procedure of sampling true and fake blocks and feeding them to the discriminator constitutes the training stage. In practice, we iterate over Equations (D1)-(G1) until the losses stabilise. Equation (D1) requires both a fake and a true batch per iteration, while Equation (G1) needs only a fake batch.

Sampling stage. Let G(·; θ̂_G) be the learned generator from the training stage. Once trained, the generator should produce samples mimicking the true DGP that generated the initial sample y*_i. To generate a sample of length b_2, we first sample a sequence of noise vectors from F_Z. The generator noise is sampled from a multivariate standard normal distribution with an identity variance-covariance matrix. To generate a sample path of length b we need (b + p) noise terms, where p is the receptive field size in the DTC layers, see Equation 10, and b is either b_1 or b_2 corresponding to the training or sampling stage. The dimension of the noise term is a hyper parameter and can be chosen arbitrarily; in our simulations we use 256. If we stack all the noise terms needed to produce a sample path of size b then we obtain a (b + p) × 256 matrix with iid standard normal distributed entries.
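The blocking step of the training stage can be sketched directly in numpy (names are ours; we use the standard moving-block count of T − b1 + 1 overlapping blocks):

```python
import numpy as np

def moving_blocks(y, b1):
    """All overlapping blocks B*_j = (y*_{1+j}, ..., y*_{b1+j}) of length b1;
    there are T - b1 + 1 of them for an initial sample of length T."""
    T = len(y)
    return np.stack([y[j:j + b1] for j in range(T - b1 + 1)])

def training_batch(y, b1, n_b, rng):
    """One batch of n_b true blocks drawn without replacement, as in the
    GB training stage; the matching fake batch would come from passing
    (b1 + p)-length noise sequences through the generator."""
    blocks = moving_blocks(y, b1)
    idx = rng.choice(len(blocks), size=n_b, replace=False)
    return blocks[idx]
```

Unlike the MBB, these blocks are never stacked back into one path; each block is an individual training example for the discriminator.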

A common discussion is whether the GAN has learned to produce new samples or if it simply reproduces the original samples perfectly. If the generative model learned to perfectly replicate the original sample then the method would perform approximately on par with the CBB. Favourable bootstrapping characteristics of the GAN relative to the CBB could indicate that the GAN has an advantage in capturing the dynamics of the DGP and that it does not simply replicate blocks of the original sample.

Figure 2 :
Figure 2: Theoretical (dashed line, blue ribbon) and sample autocorrelation functions for CBB (green ribbon) and GAN (red ribbon) resamples when φ = 0.5, 0.8, 0.9. Block size for the CBB is 150 and the training block size for the GAN is 150. The ACF estimates and confidence bands are based on 1,000 replications of the generative model with 10,000 samples per replication.

Figure 4 :
Figure 4: Empirical coverage of percentile confidence intervals, with nominal confidence levels (0.99, 0.95, 0.90, 0.80), for the CBB and the GB (using complete sampling) under different choices of the autoregressive parameter φ. The horizontal lines depict the corresponding desired nominal confidence levels.