Multi-level stochastic refinement for complex time series and fields: A Data-Driven Approach

Spatio-temporally extended nonlinear systems often exhibit a remarkable complexity in space and time. In many cases, extensive datasets of such systems are difficult to obtain, yet needed for a range of applications. Here, we present a method to generate synthetic time series or fields that reproduce statistical multi-scale features of complex systems. The method is based on a hierarchical refinement employing transition probability density functions (PDFs) from one scale to another. We address the case in which such PDFs can be obtained from experimental measurements or simulations and then used to generate arbitrarily large synthetic datasets. The validity of our approach is demonstrated at the example of an experimental dataset of high Reynolds number turbulence.


Introduction
Many high-dimensional nonlinear systems display a remarkable degree of complexity in space and time. As a consequence, spatio-temporal records of such systems often exhibit non-trivial statistical features, including multi-scale correlations and scale-dependent deviations from Gaussianity. In principle, a comprehensive statistical characterization of such systems requires obtention of the full multi-time-multi-point statistics -a prohibitive task given the dimensionality of typical systems.
Significant simplifications arise if a hierarchy of scales in space and/or time can be identified, which is statistically characterized by transition probabilities from larger to smaller scales (or vice versa). If the statistics on one scale only depend on the previous one, the system is Markovian in scale. This property has been validated empirically for many complex systems, including hydrodynamic turbulence [1][2][3], solar wind turbulence [4], neuronal spike trains [5] and currency exchange markets [6,7]. For scale-Markovian systems, the transition probabilities contain the full statistical information of the system, leading to a tremendous reduction of complexity. Even if the Markov property is weakly violated or a priori unknown, a Markovianization in scale can still provide a valuable approximation, containing, for example, non-trivial multi-scale correlations.
In many situations, however, a purely statistical characterization is insufficient. Rather, a complete spatio-temporal record, i.e. a realization of a complex system, is needed. For example, machine learning applications require very large training datasets. Providing such datasets from experimental measurements, computer simulations or historical records is often challenging, sometimes even impossible. Even if data is available, it sometimes lacks resolution for a specific target application due to experimental or computational constraints, such that refinement is necessary.
In this work, we introduce a data-driven method, which allows generating statistically well-defined time series or fields with non-trivial features in a hierarchical fashion. The key idea of this multi-level stochastic refinement (MLSR) is to generate a realization of a stochastic process representing the multi-scale features of a generic complex system given the knowledge of transition probability density functions (PDFs) on all relevant scales. Our emphasis is on a refinement of coarse time series or fields in scale, which complements previous approaches that focused on forward-intime construction of synthetic time series by utilizing Markov properties in scale in conjunction with a Fokker-Planck-equation based approach [8,9]. We here focus on the case in which the transition PDFs can be sourced from experimental measurements or simulations.
As one prototypical example of a complex system, we apply our approach to turbulent flows, which are ubiquitous in nature and engineering applications. Turbulent flows feature a broad range of dynamically active scales in space and time. For example, the separation of the large, energy-containing length scales L and the small, dissipative length scales η increases in hydrodynamic turbulence with the Reynolds number like L/η ∼ Re 3/4 . As a consequence, fully resolved simulations are computationally prohibitive for realistic natural and engineering flows. Large-eddy simulations tackle this problem by only resolving the energy-containing scales [10,11]. However, in many applications, resolving small-scale turbulence is key. Examples include cloud microphysics [12][13][14], turbulent combustion [15,16], mixing and transport of atmospheric pollutants [17,18], the assessment of turbulent loads on wind turbines [19], as well as the development of active control strategies for wind farms [20]. Recently, deep learning algorithms for feature identification and extraction have been gaining attention to deal with these challenges. For the required large amounts of training data [21,22], synthetic turbulent fields, like the one presented in the following, could make a valuable contribution to alleviate the challenge of acquiring the databases.
The paper is structured as follows: We start with a description of the algorithm and the theoretical background in section 2. We then demonstrate the data-driven algorithm at the example of state-of-the-art wind tunnel measurements of high-Reynolds number turbulence in section 3. In the conclusions 4 we discuss future applications and generalizations as well as an outlook on model-driven approaches to the MLSR method.

Description of the data-driven MLSR based on three-point PDFs
Let us consider the situation in which we aim to generate a realization of a statistically homogeneous, one-dimensional field of length L with a resolution of ∆ = L/2 N . Given initial field values on scale r 0 = L, we adopt a hierarchical refinement on scale r 1 = L/2, then progressing to scale r 2 = L/4 etc. until scale r N = ∆ = L/2 N is reached after N refinement steps.
On each refinement level, let us assume we know the three-point PDF f (u l , u c , u r ; r i ) from experimental measurements. Here u l , u c and u r denote the field values at the left end, in the center, and at the right end of the interval, respectively, which are separated by the distance r i . Based on the three-point PDF, we can define the conditional PDF Here, f (u l , u r ; 2r i ) denotes the two-point PDF on scale 2r i , which can be obtained from the three-point PDF by integrating out the center point: The conditional PDF (1) can be interpreted as a transition PDF in scale. It describes the probability of drawing u c , which can be used to obtain a field which is sampled on scale r i , given the two values u l and u r at the end points of the interval of length 2r i . In our numerical implementation, we employ a standard rejection method [24] to draw from this conditional PDF. We thereby construct a realization sampled on the scale r i . The three-point statistics on scale r i are consistent with the three-point statistics of the underlying experimental data by construction. Employing this multi-level stochastic  Figure 1: Schematic representation of the data-driven multi-level stochastic refinement at the example of synthetic time series (mapped to a spatial signal using Taylor's hypothesis [23]) of wind tunnel turbulence. The initial boundary values are set on a scale of L = 32λ (λ denotes the Taylor scale in hydrodynamic turbulence). The velocity at L = 16λ, i.e. on half the scale, is then drawn from the the conditional PDF p(u L/2 |u 0L , u L ) which previously has been obtained from experimental data. By iterating this procedure, the synthetic time series (gray) is obtained, which shares statistical features with the original experimental data. refinement algorithm hierarchically starting from r 1 and progressing down to scale r N enforces consistency with the underlying data in each refinement step.
We illustrate the procedure in figure 1 for a time series of flow velocities in turbulence. Here we construct a synthetic turbulent velocity signal at a resolution of L/2 9 = λ/16 over an interval of length L = 32λ, where λ denotes the Taylor scale of hydrodynamic turbulence. As detailed in section 3, the corresponding transition PDFs have been obtained from wind tunnel measurements reaching down to scales L/2 9 = λ/16. Starting from prescribed velocity values at the end points of the interval [0, L], the velocity u L/2 at the mid-point is drawn from the transition PDF p 1 (u c |u l = u 0L , u r = u L ). The two velocities u L/4 and u 3L/4 on the next level of refinement are drawn from p 2 (u c |u l = u 0L , u r = u L/2 ) and p 2 (u c |u l = u L/2 , u r = u L ), respectively. The four velocities necessary to refine the field on scale r 3 = L/8 are then drawn from the corresponding transition PDF p 3 . By iterating this procedure down to level 9, we finally obtain the gray time series shown in figure 1.
Up to now, we have discussed how to generate a field of length L and resolution L/2 N . If the scale L is large enough such that field values separated by L are statistically independent, arbitrarily large fields can be generated by drawing from the single-point PDF f (u) on scale L. Statistical correlations on scales smaller than L are then subsequently introduced by MLSR. Regarding the small-scale resolution, the latter is only limited by the smallest experimentally accessible scale, from which the transition PDFs are estimated.

Multi-point single-time statistics and relation to Markov processes in scale
The MLSR discussed in the previous section allows us to generate a synthetic field of extent L and resolution ∆ = L/2 N sampled on M + 1 points where M = 2 N . Let us denote the field values at these points sequentially by u 0 . . . u M with the corresponding multi-point PDF f (u 0 , u 1 , u 2 , u 3 , . . . , u M ). Owing to the structure of the MLSR algorithm, an ensemble of such synthetically generated fields obeys the multipoint statistics Consistent with the hierarchical construction of the random fields, this expression for the multi-point statistics shows explicitly that random variables can be systematically integrated out level by level. For example, by integrating out all random variables which have been introduced in the N th refinement step resulting in a field resolution of ∆, the joint PDF f (u 0 , u 2 , u 4 , . . . , u M ) corresponding to an ensemble of fields with a resolution of 2∆ is obtained. The hierarchical structure of the multi-point statistics (3), in which the probability of the field value at the center point of an interval is determined by the values at the two end points, indicates that MLSR generates a Markov process in scale. As we show by direct calculation in Appendix A, Markovian properties also follow for related statistical quantities such as field increments. As a further note, if the original time series obeys Markov properties of the form p (u 1 |u 0 , u 2 , u 3 ) = p (u 1 |u 0 , u 2 ), then equation (3) is an exact representation of the multi-point statistics. For turbulent flows, this property is an ongoing topic of investigation [25].
A prototypical example for non-Gaussian, yet Markovian, systems are time series generated by nonlinear Langevin equations. For such time series, the MLSR algorithm is capable of generating synthetic equivalents that reproduce the correct N -point statistics precisely. As a proof on concept, we demonstrate this for the two-point statistics in Appendix B. However, even if Markovian properties are mildly violated, the synthetic time series generated by MLSR still resemble the original statistics closely as illustrated in section 3.

Application: Turbulence Data
Previous works (e.g. [1][2][3]) have empirically established a Markov property in scale for hydrodynamic turbulence, which lends support to the MLSR approach proposed here to capture essential statistical features of turbulent fields. To illustrate the capabilities of the MLSR method for real-world, close-to-Markovian complex systems, we utilize experimental turbulence data from the Variable Density Turbulence Tunnel VDTT at the Max Planck Institute for Dynamics and Self-Organization [26]. This pressurizable wind tunnel uses sulfur hexafluoride at densities up to 15bar to achieve high Reynolds numbers due to the low kinematic viscosity of the pressurized working gas. The data we used to construct the MLSR method is a subset of the data presented in [27]. Turbulence was generated using a passive grid of rectangular grid bars with a mesh spacing of 18cm and was measured about 40 mesh sizes downstream of the grid using a nano-scale thermal anemometry probe (NSTAP) developed at Princeton University [28,29]. The NSTAP was a 60µm long hot-wire that measured one-dimensional velocity signals and was small enough to resolve the smaller length scales in the VDTT. The data was obtained at a pressure of 8.5bar of sulfur hexafluoride, resulting in a Taylor-scale Reynolds number of R λ = 1030. The time series was sampled at a rate of 60kHz with noise being removed at 15kHz using an 8 th -order Butterworth filter and contained O (10 10 ) samples of the turbulent velocity, corresponding to O (10 5 ) large eddy turnover times. Based on the mean speed U = 4.2m/s and the root mean square velocity fluctuation u rms = 0.13m/s, the turbulence intensity u rms /U was 3%. We invoked Taylor's hypothesis [23] to convert temporal information into spatial information. The Kolmogorov length scale, η, was 34µm, the integral length scale, L int , was 12.6cm and the Taylor scale, λ, was 2.2mm.
To obtain the conditional PDFs p (u c |u l , u r , r), we first subtracted the mean of the signal velocity and computed the three-point PDF f (u l , u c , u r ; r) and the two-point PDF f (u l , u r ; 2r), using conventional binning with 100 equidistant bins in all PDF dimensions. Given the substantial length of the velocity time series, these multi-point PDFs are wellconverged. In situations where one potentially only has access to shorter time series, the estimates of the PDFs could be improved using e.g. kernel density estimation [30] at a higher computational cost. To remove spurious noise stemming from divisions of bin elements with insufficient statistics when computing conditional statistics, we empirically set all bins in the three-point PDF that contain less than 10 samples of the turbulent velocity to zero.
Two-dimensional cuts through these conditional PDFs computed from the NSTAP data for u r /u rms = 0 are shown in figure 2. At small scales r, the conditional PDF is noticeably constricted around u l /u rms = u r /u rms = 0 and substantially widens with increasing difference |u l − u r |. As to be expected, the constriction around u l /u rms = u r /u rms = 0 weakens when the separation becomes comparable to the integral length scale such that the shape of the conditional PDFs, which are centered around (u l + u r ) /2, varies less with the difference |u l − u r |. One-dimensional cuts through the conditional PDFs in figure 2 at a scale of r = λ are shown in the bottom right panel. For small differences |u l − u r | the cuts through the conditional PDFs show a noticeably non-Gaussian behavior, while for large differences |u l − u r |, the PDFs become virtually Gaussian. This emphasizes that the conditional PDFs not only widen with increasing differences |u l − u r |, but rather change shape and therefore result in a qualitative change of the underlying dynamics determining u c . This effect can also be seen in the mean and variance of the conditional PDF in figure 3. While the conditional mean shows a nearly linear behavior across all separations with a slope that depends on the scale r, the variance is non-trivially depends on the separation and the difference |u l − u r |. However, the deviations from symmetric behavior are relatively small.
Using these PDFs and the method described in section 2, we can construct a multilevel refined time series reproducing the key properties of a real turbulent time series. To initialize the method, we started with a turbulent time series sampled at r 0 = 32λ and utilized a 9-level MLSR process to reach a final separation r 9 = λ/16. Figure 4 shows a direct comparison of the resulting synthetic time series with an experimentally obtained turbulent time series sampled at the same separation r 9 = λ/16. Both time series show a visually indistinguishable behavior, the intermittent character of turbulence is qualitatively well reproduced by the MLSR time series.
As a quantitative benchmark for the MLSR algorithm, we computed the velocity  for separations λ/4, 2λ and 128λ as well as one-dimensional diagonal cuts through the conditional mean for u l = u r . The conditional mean appears to be a linear function of the mean value (u l + u r ) /2, with a slope that approaches one for decreasing separation. Right column (top to bottom): Conditional variance as well as one-dimensional diagonal cuts for u l = u r . In contrast to the conditional mean, the variance is a slightly asymmetric, non-trivial function of u l and u r . increment PDFs f (v; r), where v = u (x + r) − u (x), at each refinement scale for the respective separation as well as the velocity increment PDFs for the experimental turbulence dataset that was originally used to compute the conditional PDFs. The result is shown in figure 5. At each refinement scale, the refined time series reproduces the original turbulence PDFs including the increasingly heavy tails towards smaller scales. As detailed in [31], the Markovian properties of turbulence start to deteriorate at scales smaller than λ, the Taylor scale. While deviations are small, effects of this violation can supposedly be seen in the tails of the increment PDFs for smaller scales in figure 5. This is in contrast to the findings of the exact reproduction for fully Markovian systems in Appendix B. It is, however, remarkable how closely the MLSR-generated time series quantitatively resemble actual fluid turbulence highlighting its general applicability for systems without ideal Markov properties.

Conclusions and Outlook
We have presented a hierarchical stochastic refinement method that allows to generate synthetic fields and time series. In this data-driven MLSR, the synthetic fields (time series) are reconstructed from multi-point (multi-time) statistics of real input data. The method is able to generate synthetic data which exhibits two-point statistics in good agreement with the original data. We demonstrated the data-driven MLSR at the example of turbulent wind tunnel data as well as for nonlinear Langevin equations. Due to the low requirements of the method, i.e. the determination of the three-point PDF for different refinement levels, as well as due to the fast production of surrogate data, the data-driven MLSR appears to be a promising method for a broad range of applications. Synthetic fields generated by MLSR, for example, might be used as input to simulate power output fluctuations of wind parks, the fatigue loads on single wind turbines [32], to test the stability of power systems [33], as well as to simulate the propagation of cosmic rays [34][35][36].
In certain situations, sufficient statistics to resolve the three-point PDFs required for the data-driven MLSR method might not be available and alternative approaches are necessary. An extension of the approach to the generation of synthetic data based on phenomenological models of turbulence [37,38] without the need for measuring threepoint PDFs is subject of ongoing work.
Organization for this project as well as to Gregory P. Bewley for participating in the data acquisition and for fruitful discussions. We thank Joachim Peinke for valuable discussions.
MW is supported by the Max Planck Society. JF acknowledges funding from the Humboldt Foundation within a Feodor-Lynen fellowship and also benefitted from the financial support of the Project IDEXLYON of the University of Lyon in the framework of the French program "Programme Investissements d'Avenir" (ANR-16-IDEX-0005). MS acknowledges support from the Deutsche Forschungsgemeinschaft under grant no. 396632606.

Appendix A. Relation of MLSR to Markov processes in scale
To elucidate the relation of the MLSR approach to Markov processes in scale in terms of increments, let us consider one particular "branch" of random variables in (3) which connects the largest scales to the smallest scales. Without loss of generality, this can be achieved by setting j = 1 in (3) which results in the reduced multi-point PDF The joint PDF of increments v i = u M/2 i − u 0 on scales r i together with u 0 are obtained from the reduced multi-point PDF (A.1) according to After inserting (A.1) into (A.2) the integration can be carried out and we obtain Similar to the multi-point PDFs, the increment variables can be integrated out scale by scale. For example, the joint PDF which includes all increments but the smallest, can be straightforwardly obtained by integrating over v N . As a consequence, the conditional increment PDF fulfills a Markov property, which can be seen by noticing that (A.4) and (A.5) only differ by the transition PDF p N . As a result we obtain This relation shows that only the transition PDF p N of the N th refinement level is needed to fully specify increment PDF conditional on all other increments and u 0 . In other words, the conditional PDF p(v N |v 0 , v 1 , . . . , v N −1 , u 0 ) indeed is independent of v 0 , . . . , v N −2 such that the generalized Markov property ‡ holds. Of course, this argument holds on all larger scales as well, showing that MLSR generates a Markov process in scale.

Appendix B. Proof-of-Concept: Nonlinear Langevin Equations
The method described in section 2 can, given Markov properties, produces synthetic time series with correct statistics from any experimental or numerical time series with sufficient statistics to resolve the three-point PDFs f (u l , u c , u r ; r i ) at a hierarchy of scales r i . From these three-point PDFs and the corresponding two-point PDFs at each scale, f (u l , u r ; 2r i ), one can then compute the conditional PDFs, p i (u c |u l , u r ), through equation (1) Figure B1: Implementation of the MLSR method to the nonlinear Langevin equation (B.1) with γ = 1, c 1 = 5, σ = 1 and c 2 = 1. The top plot shows a time series of the simulated system, the bottom plot a 11-level refinement from the MLSR method. ‡ The Markov property is generalized in the sense that the conditional PDFs maintain a dependence on u 0 . The necessity of including u 0 has also been discussed in a related approach presented in [3].

11-level MLSR
series as well as on the convergence of the measured PDFs. In general, the further the original time series deviates from exhibiting Markovian properties, the less accurate the reconstruction becomes.
As a proof-of-concept test case, let us consider a prototypical nonlinear Langevin equation of the form du = −γu c 1 dt + (σ + c 2 u) 2 dW t . (B.1) Here, dW t is the increment of a Wiener process in time, γ is a parameter controlling the strength of the (nonlinear) drift term, and σ a diffusion parameter. The parameters c 1 and c 2 determine the degree of nonlinearities of the respective terms and give rise to non-Gaussian behavior. For c 1 = 1 and c 2 = 0 this equation reduces to the standard Ornstein-Uhlenbeck process with linear drift and constant diffusion that results in Gaussian solutions. The system can be directly simulated using a standard Euler-Maruyama method [39]. To obtain sufficient data for the three-point PDFs, we compute a total of 1.2 · 10 10 data points with a time step of ∆t = 10 −3 for various choices of the parameters. We then apply the MLSR method to refine initial conditions at scale τ 0 = 2 11 ∆t down to scale τ 11 = τ 0 /2 11 . Such a refinement is shown in figure B1. The properties of the MLSR reproduction are virtually indistinguishable from the original, simulated time series. This can also be tested quantitatively by the means of the scale-dependent PDFs f (w; τ ) of the increment w = u (t + τ ) − u (t) at a given scale τ . This is shown in figures B2 for two sets of parameters.