A unified and automated approach to attractor reconstruction

We present a fully automated method for optimal state space reconstruction from univariate and multivariate time series. The proposed methodology generalizes the time delay embedding procedure by unifying two promising ideas in a symbiotic fashion. Using non-uniform delays allows the successful reconstruction of systems exhibiting different time scales. In contrast to established methods, the minimization of an appropriate cost function determines the embedding dimension without using a threshold parameter. Moreover, the method is capable of detecting stochastic time series and can thus handle noise-contaminated input without adjusting parameters. The superiority of the proposed method is shown on some paradigmatic models and on experimental data from chaotic chemical oscillators.


Introduction
State space reconstruction from observational time series often marks the first and basic step in nonlinear time series analysis. Several methods have addressed the reconstruction problem, but none of them allows for a fully automated and reliable way of embedding a uni- or multivariate set of observed time series with no, or at least very few, free parameters. The aim of this paper is to provide such a technique. The embedding theorems of Whitney [1], Mañé [2], and Takens [3], along with their extension by Sauer et al [4], allow several approaches to tackle the reconstruction problem. Besides derivative coordinates [5,6], PCA [7] and Legendre coordinates [8], uniform and non-uniform time delay embedding (TDE) [7] is by far the most commonly used technique, due to its appealing simplicity. However, since Takens' theorem [3] is based on noise-free and infinitely long data, it does not give any guidance on choosing the proper time delay(s) τ in practice. Together with the unknown box-counting dimension D_B of the observed, but unknown, system, which is needed to fulfil the embedding dimension criterion m ≥ 2D_B + 1, the majority of published articles propose ideas to infer estimates for τ and the reconstruction dimension m from data, usually a univariate time series (univariate embedding). The reconstruction problem starts with the unknown system u(t) with a mapping f : R^{D_B} → R^{D_B}, which is observed via a measurement function h, leading to M observational time series {s_i(t)|i = 1, . . . , M} (figure 1). There can be different measurement functions h forming the multivariate dataset s_i(t), and the combination of Whitney's and Takens' embedding theorems allows for constructing u(t) from more than one time series (multivariate embedding) [4,9].
One then tries to find optimal embedding parameters m and τ's (the delays can be integer multiples of some constant, uniform time delay embedding, UTDE, or different for each embedding dimension, non-uniform time delay embedding, NUTDE) in order to build reconstruction vectors v(t) and, thus, a mapping F : R^m → R^m. These can furthermore be transformed by Ψ into v'(t) and F', preserving the diffeomorphism to u(t). For a detailed introduction to the reconstruction problem we refer to Casdagli et al [10], Gibson et al [8], Uzal et al [11] or Nichkawde [12].
In TDE the key idea is to use lagged values of the available time series as components of the reconstruction vector

v(t) = (s_{i_1}(t − τ_1), s_{i_2}(t − τ_2), . . . , s_{i_m}(t − τ_m)).    (1)
The delays τ_j and the corresponding time series s_{i_j}, j = 1, . . . , m, should be chosen such that the resulting components of the vectors forming the reconstructed state space v(t) are as independent as possible [4,7], while at the same time preserving the correlation structure of the system to a certain extent. These two competing objectives are also known as the problems of redundancy (delays should not be too small) and irrelevance (delays should not be too large) [10,11,13,14], and their optimization is the goal of any proposed TDE procedure. Despite its lack of a sound theoretical foundation for higher-dimensional reconstructions (m > 2) [15-17], in a univariate scenario [i.e. s_{i_1} = · · · = s_{i_m} = s(t) in equation (1)] the most common approach is to choose τ_2 from the first minimum of the auto-mutual information [18,19]. τ_1 is usually set to zero, i.e. the unlagged time series constitutes the first component of the reconstruction vectors. The embedding dimension m is then determined separately, usually by a false nearest neighbour approach [20-24] or some other neighbourhood-preserving idea [25,26], and all delays up to τ_m are simply integer multiples of τ_2 (UTDE). Other approaches for an appropriate choice of τ_2 are possible [5,13,27-32]. We refer to this as standard TDE in the following. More sophisticated ideas [16,27,33-42], some including non-uniform delays and the extension to multivariate input data [12,43-48], have been presented [48,49], but their use seems rather limited and not very popular. This could be due to their occasionally very complex nature and the lack of high-quality open-source implementations in the most commonly used programming languages.
Another reason could be the fact that standard TDE performs surprisingly well in a range of examples; still, its limitations should not be neglected, in particular when it comes to noisy time series, systems exhibiting multi-time-scale behaviour, or multivariate input data. The latter will become of increasing interest in the near future, since acquisition costs for sensors and data collection are decreasing rapidly. Moreover, the application of complex systems approaches and nonlinear dynamics in different scientific disciplines is becoming increasingly popular.
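For concreteness, the standard TDE recipe just described (delay from the first minimum of the auto-mutual information, then a uniform delay embedding) can be sketched in a few lines. The Python sketch below is our own illustration, not one of the cited implementations; the histogram-based mutual information estimator and all parameter values are illustrative choices.

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram estimate of the mutual information between two series."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz])))

def first_minimum_ami(s, max_tau=100):
    """Delay at the first local minimum of the auto-mutual information."""
    ami = [mutual_information(s[:-t], s[t:]) for t in range(1, max_tau + 1)]
    for i in range(1, len(ami) - 1):
        if ami[i - 1] > ami[i] < ami[i + 1]:
            return i + 1                      # index i corresponds to tau = i + 1
    return int(np.argmin(ami)) + 1            # fallback: global minimum

def delay_embed(s, m, tau):
    """Uniform TDE: rows are v(t) = (s(t), s(t + tau), ..., s(t + (m-1) tau))."""
    n = len(s) - (m - 1) * tau
    return np.column_stack([s[j * tau: j * tau + n] for j in range(m)])

# toy usage on a sine wave (about 200 samples per period)
s = np.sin(np.linspace(0, 40 * np.pi, 4000))
tau = first_minimum_ami(s, max_tau=200)
V = delay_embed(s, m=3, tau=tau)
```

Embedding with s(t + jτ) instead of s(t − jτ), as in equation (1), is an equivalent convention; in practice one would pair this delay estimate with a false nearest neighbour criterion to pick m.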
Here we propose a fully automated method for an appropriate state space reconstruction of uni- or multivariate time series input, which utilizes two concepts, the continuity statistic [46] and the L-statistic [11]. We briefly review the basic ideas we will use in our proposed method (section 2) in order to illustrate their specific utilization in the algorithm (section 3), before applying it to simulated and experimental data (section 4).

[Figure 2. Arrows indicate the mapping f : R^d → R^1 (right panel), which gives the potential (d + 1)th component of each of the points in the left panel (according to a specific delay τ_{d+1} and time series s_{j_{d+1}}), equation (2). To decide whether this ε-interval size accepts or rejects the null hypothesis at a significance level α, the cumulative binomial distribution for getting at least l = 3 points in the ε-interval with probability p is used (modified after Pecora et al [46]).]

Review of used concepts
In order to ensure comprehensibility of our proposed method in section 3, we explain its two main concepts in the following. In section 2.1 we review the continuity statistic [46] in some detail, while the L-statistic is described only briefly in section 2.2 and extensively in appendix F.

Continuity statistic by Pecora et al
In the continuity statistic, the problem of finding an optimal state space reconstruction with respect to redundancy and irrelevance is addressed by a statistical determination of functional independence among the components of the reconstruction vector [46]. Let {s_i(t)|i = 1, . . . , M} be a multivariate dataset consisting of M time series, equally sampled at time instants t = 1, . . . , N. Suppose we have already chosen some delays τ_k to build our temporal reconstruction vector v(t) of dimension d, that is, v(t) = (s_{j_1}(t + τ_1), s_{j_2}(t + τ_2), . . . , s_{j_d}(t + τ_d)), with j_k ∈ {1, . . . , M} being the choices of the different time series and τ_k the corresponding delays, which can be, and most often are, different. For a new potential component of v(t) we then ask whether it can be expressed as a function of the existing components, i.e. whether s_{j_{d+1}}(t + τ_{d+1}) = f(v(t)) holds (equation (2)). This equality needs to be tested in an appropriate way, i.e., a sensible choice for f : R^d → R^1 has to be made, and this choice can be based on the property of continuity [50]. The practical implementation of equation (2) starts by mapping the k nearest neighbours v_k(t) of a fiducial point v_fid(t) from R^d to R^1, as illustrated in figure 2. That is, for each of the (k + 1) d-dimensional points in the left panel a potential (d + 1)th component s_{j_{d+1}}(t + τ_{d+1}) is considered and drawn onto the number line (right panel). The continuity statistic now asks whether these k + 1 points on the one-dimensional number line fall within a certain ε-interval by chance, or due to the fact that there is a functional relationship between the first d components and the (d + 1)th component of a potential reconstruction vector v(t). The corresponding continuity null hypothesis is that l of the k + 1 points landed in the ε-interval by chance, with probability p, on the basis of the binomial distribution.
When the number of observed neighbours which are mapped into the ε-interval is larger than the number expected from the binomial distribution for a selected α, i.e. l points, then the null hypothesis can be rejected and, thus, a functional relationship can be assumed. The number of considered nearest neighbours k (i.e., the size of the δ-ball in figure 2) also determines the acceptable number of k + 1 − l points falling outside the chosen ε-interval for a given probability parameter p of the binomial distribution and a given α. For each candidate delay τ_{d+1} and each time series s_{j_{d+1}}, and for each k (at a given p and α), there is a minimal spatial scale ε for which the null hypothesis can be rejected, i.e., a minimal size of the ε-interval in the right panel of figure 2. For the sake of avoiding redundancy while choosing the right delay, such an ε* has to be found for each potential τ_{d+1}; this is simply the distance from the image of the fiducial point to its lth-nearest neighbour on the number line. By averaging over a range of fiducial points we eventually get the final continuity statistic ⟨ε*⟩(τ) as a function of the considered delay values τ (figure 3).
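The construction just described can be condensed into a short sketch. The following Python code is a simplified illustration of the continuity statistic, with our own naming and simplifications (brute-force neighbour search, a plain binomial tail test); it is not the reference implementation of Pecora et al.

```python
import math
import numpy as np

def minimal_l(k, p=0.5, alpha=0.05):
    """Smallest l such that l or more of k points falling into the
    epsilon-interval by chance (binomial, prob. p) has probability < alpha."""
    for l in range(1, k + 1):
        tail = sum(math.comb(k, j) * p**j * (1 - p)**(k - j)
                   for j in range(l, k + 1))
        if tail < alpha:
            return l
    return k

def continuity_statistic(v, target, taus, n_fid=50, ks=range(8, 15),
                         p=0.5, alpha=0.05, seed=0):
    """<eps*>(tau): minimal interval size for rejecting the continuity null
    hypothesis, minimized over k and averaged over fiducial points."""
    rng = np.random.default_rng(seed)
    N = len(v) - max(taus)
    fids = rng.choice(N, size=min(n_fid, N), replace=False)
    ls = {k: minimal_l(k, p, alpha) for k in ks}
    orders = {}                      # neighbour order does not depend on tau
    for i in fids:
        d = np.linalg.norm(v[:N] - v[i], axis=1)
        d[i] = np.inf                # exclude the fiducial point itself
        orders[i] = np.argsort(d)
    eps_star = []
    for tau in taus:
        eps_fid = []
        for i in fids:
            eps_k = []
            for k in ks:
                # distances of the neighbours' images to the fiducial image
                img = np.abs(target[orders[i][:k] + tau] - target[i + tau])
                eps_k.append(np.sort(img)[ls[k] - 1])   # l-th closest image
            eps_fid.append(min(eps_k))
        eps_star.append(np.mean(eps_fid))
    return np.array(eps_star)

# toy usage: one-dimensional "reconstruction" of a noisy sine wave
rng = np.random.default_rng(1)
s = np.sin(np.linspace(0, 60 * np.pi, 3000)) + 0.01 * rng.standard_normal(3000)
v = s[:2900, None]                  # current reconstruction vectors (d = 1)
eps = continuity_statistic(v, s, list(range(1, 40)))
```

For k = 13, p = 0.5 and α = 0.05, for instance, the tail test requires l = 10 of the 13 neighbours inside the interval, in line with the binomial reasoning sketched in figure 2.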
The final idea for achieving an optimal embedding is a sequential one. For each embedding cycle D_d, i.e. for trying to find an appropriate additional component to build a reconstruction vector v(t) of dimension d + 1, initially starting with a one-dimensional time series v(t) = {s_i(t)|i = 1, . . . , M}, the ⟨ε*⟩(τ) values for a range of possible delay values τ and for each time series s_i(t) are computed. The τ of the highest relative maximum of ⟨ε*⟩(τ) is selected as the optimal delay for this embedding dimension d. This delay is used to build up the temporal reconstruction vectors v(t) with the corresponding time series. From here the next embedding cycle D_{d+1} is executed and the process is repeated until some break criterion terminates the procedure, i.e., when a sufficiently high embedding dimension m is reached.

[Figure 3. Illustration of the proposed embedding procedure for a univariate case, using the y-component of the Lorenz system (appendix E.1). (A) Blue, yellow, and green lines represent the continuity statistics ⟨ε*⟩ for the three embedding cycles algorithm 1 computes. Triangles identify the τ values corresponding to local maxima of ⟨ε*⟩. The local maximum which corresponds to the maximum decrease of the L-statistic with respect to the actual reconstruction vectors v_act, denoted as ΔL (orange triangle), is chosen as the delay time in each embedding cycle. In the third embedding cycle the cost function cannot be minimized any further, i.e. all peaks lead to a non-negative ΔL. In this case no delay value is chosen and the algorithm breaks. (B) We end up with a three-dimensional embedding with lags τ_1 = 0, τ_2 = 10, τ_3 = 5.]
Even though the idea of the continuity statistic is indeed promising, several questions remain unanswered in this approach, making it, as it stands, unsuitable for a fully automated embedding.
(i) The choice of p = 0.5 has been made plausible, and our tests support this choice, while α = 0.05 or α = 0.01 is standard in science (see figures 10-12 in appendix C), so we can safely fix these values. What is not so clear, but highly relevant for the method, is how to choose the optimal delay τ from the continuity statistic. Specifically, we might ask what 'relative maximum' exactly means and whether there is any objective criterion for that choice. Moreover, it is also not clear how to obtain the continuity statistic in the first place with respect to the size of the neighbourhood, i.e. the size of the δ-ball in figure 2. We propose to vary k from k = 8 to some higher value, like k = 14, for each considered delay τ and take the minimum of all trials ε* (and finally average over all fiducial points in order to obtain ⟨ε*⟩). This is allowed because there is no preferred choice of k, only a lower bound (see table 1 in Pecora et al [46]), and generally the choice depends on the amount of available data and its sampling rate.
(ii) In the original study, it was proposed that the continuity statistic on its own provides a break criterion for the method, namely, when '⟨ε*⟩ remains small out to large τ, we do not need to add more components' [46]. However, this is no objective criterion; it introduces a statistic which would quantify small, and also a threshold which determines when small is small enough. Due to folding and stretching of the attractor for large delay values τ, especially in the case of chaotic systems, we expect ⟨ε*⟩ to increase with larger τ anyway. For these cases a (computationally intensive) irrelevance measure, the undersampling statistic, has been proposed [46]. Nevertheless, even though the undersampling statistic prevents the choice of irrelevant delays, it does not tell us which of the local maxima of the continuity statistic we should pick, nor when to stop adding components to the reconstruction vectors. As an alternative to assess the irrelevance, the L-statistic has been suggested [11], which will later be used for our automated approach to attractor reconstruction.

L-statistic by Uzal et al
The L-statistic is an objective cost function which quantifies the goodness of a reconstruction, independently of the reconstruction method [11]. It has two free parameters, k and T_M. The approach uses ideas of noise amplification and minimization of the complexity of the reconstruction, which lead to a variable σ, and combines it with an irrelevance measure α. Specifically, the method estimates the conditional variance of neighbourhoods consisting of k nearest neighbours as the relative dispersal of these neighbourhood points with respect to the centre of mass of that neighbourhood T time steps ahead. Eventually this conditional variance estimate is averaged over a range of prediction horizons T up to a maximum value T_M and normalized with respect to the original neighbourhood size, thus defining σ. The irrelevance measure α is basically the averaged inter-point distance, which depends on the sampling. The final statistic is then defined as L_k = log10(α_k σ_k), where the index k indicates the dependence on the chosen number of nearest neighbours. A detailed description can be found in appendix F. The authors showed that the L-statistic converges for any k ≥ 3. Our analysis (figure 10 in appendix C) supports this assumption and, thus, we can fix k = 3. However, the second free parameter T_M will alter the resulting L-statistic at any value. Particularly in the way we want to utilize the concept of this cost function in our automated embedding scheme, we need to tackle this parameter dependency (see section 3). It is worth mentioning that the L-statistic inherits the minimization of a mean squared prediction error (MSE) (here computed using a local constant model based on the first k neighbours) and the FNN statistic proposed by Kennel et al [20] (when k = 1 and T_M = τ).
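To make the structure of the cost function tangible, here is a schematic Python sketch of L = log10(α_k σ_k). It follows the verbal description above (conditional variance of k-neighbourhoods averaged over prediction horizons and normalized by the neighbourhood size, combined with a size-based irrelevance term), but it simplifies the exact normalizations of Uzal et al, which are given in appendix F.

```python
import numpy as np

def L_statistic(V, T_M, k=3, theiler=1):
    """Schematic L = log10(alpha_k * sigma_k) in the spirit of Uzal et al.
    sigma_k: conditional variance of k-neighbourhoods, averaged over prediction
    horizons T = 1..T_M and normalized by the initial neighbourhood size.
    alpha_k: irrelevance term built from the average neighbourhood size.
    The exact normalizations of the original paper are simplified here."""
    N = len(V) - T_M
    ratios, sizes = [], []
    for i in range(N):
        d = np.linalg.norm(V[:N] - V[i], axis=1)
        d[max(0, i - theiler): i + theiler + 1] = np.inf    # Theiler window
        nbrs = np.argsort(d)[:k]
        idx = np.concatenate(([i], nbrs))                   # point + k neighbours
        eps2 = np.mean(d[nbrs] ** 2)                        # neighbourhood size
        E2 = np.mean([np.mean(np.sum((V[idx + T] - V[idx + T].mean(axis=0)) ** 2,
                                     axis=1))
                      for T in range(1, T_M + 1)])          # spread of the images
        ratios.append(E2 / eps2)
        sizes.append(eps2)
    sigma2 = np.mean(ratios)
    alpha2 = 1.0 / np.mean(1.0 / np.asarray(sizes))
    return 0.5 * np.log10(alpha2 * sigma2)                  # = log10(alpha * sigma)

# toy usage: 2-D embedding of a sine wave with a quarter-period lag
s = np.sin(np.linspace(0, 40 * np.pi, 2000))
V = np.column_stack([s[:-25], s[25:]])
L = L_statistic(V, T_M=20)
```

For a well-unfolded reconstruction the neighbourhoods are small and their images stay coherent, so L is strongly negative; comparing such values across candidate reconstructions (as a function of T_M) is the basic operation used in section 3.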

New reconstruction method
The L-statistic (section 2.2, appendix F) on its own could guide the reconstruction problem, i.e. the search for the optimal delay values and a sufficiently high embedding dimension, when used in a brute-force approach, scanning all possible delay values of all available time series in every possible combination. It is not clear a priori how to set the parameter T_M for obtaining the L-statistic, so the described procedure has to be repeated for a range of values of T_M. In most cases this is not computationally feasible and, therefore, not suitable for a fully automated embedding approach. We propose to combine the continuity statistic (section 2.1) and the L-statistic (section 2.2). The continuity statistic ⟨ε*⟩ guides the preselection of potential delays τ and time series {s_i(t)|i = 1, . . . , M} in each embedding cycle D_d, whereas the L-statistic decides which one to pick. This algorithm works recursively. An embedding cycle D_d determines the optimal time delay and its corresponding time series and enables us to build the actual reconstruction vectors v_act from this, having dimension d + 1. Algorithm 1 and figure 3 explain the method in detail; we refer to it as the 'PECUZAL' algorithm in honour of its roots.
(a) For the actual reconstruction vectors v_act, in each embedding cycle D_d, ⟨ε*⟩_i(τ) is computed for all M available time series s_i(t). We comment on the procedure in the first embedding cycle D_1 further below.
(b) We consider all those delays τ_j for each ⟨ε*⟩_i(τ) which correspond to local maxima τ̂_j of ⟨ε*⟩_i(τ). These delays τ_j (and their corresponding time series s_{i_j}(t)) are used to build temporary reconstruction vectors v_temp(τ_j, s_{i_j}), by concatenating v_act with the τ_j-lagged time series s_{i_j}(t).
(c) For each v_temp(τ_j, s_{i_j}) and the actual reconstruction vectors v_act the L-statistic is computed simultaneously for many parameters T_M (cf figure 4); we denote them as L_{v_temp(τ_j, s_{i_j})}(T_M) and L_{v_act}(T_M). We then define ΔL(τ_j, s_{i_j}) as the minimum over T_M of [L_{v_temp(τ_j, s_{i_j})}(T_M) − L_{v_act}(T_M)]. This way T_M is not a free parameter anymore.
(d) The delay-time series combination (τ_j, s_{i_j}(t)) which yields the minimum ΔL value is picked for this embedding cycle D_d, if ΔL < 0, and is used to construct the actual reconstruction vectors v_act = v_temp(τ_j, s_{i_j}) of dimension d + 1.
(e) v_act is passed into the next embedding cycle D_{d+1}.
(f) We repeat steps (a) to (e) until we cannot minimize the L-statistic any further, i.e. until each considered potential embedding v_temp yields a positive ΔL. In this case the reconstruction cannot get any better and we stop; v_act constitutes the final reconstruction. Thus, the L-statistic provides a break criterion without the introduction of any threshold parameter, and each embedding cycle ensures the maximum possible decrease of the cost function.
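The control flow of steps (a)-(f) is easy to separate from the statistics themselves. In the Python sketch below the continuity and ΔL computations are passed in as callables; the toy scoring functions in the usage example are purely hypothetical stand-ins chosen so that the loop terminates after two embedding cycles, and are not the statistics of sections 2.1 and 2.2.

```python
import numpy as np

def local_maxima(x):
    """Indices of strict local maxima of a 1-D sequence."""
    return [i for i in range(1, len(x) - 1) if x[i - 1] < x[i] > x[i + 1]]

def embedding_cycles(series, taus, continuity_fn, delta_L_fn, max_cycles=10):
    """Steps (a)-(f): in each cycle, candidate delays are the local maxima of
    the continuity statistic; the candidate with the most negative Delta-L is
    appended; the loop stops once no candidate decreases L any further."""
    delays, comps = [0], [0]            # tau_1 = 0, start with time series 0
    for _ in range(max_cycles):
        best = (0.0, None, None)        # (Delta-L, tau, time series index)
        for i, _ in enumerate(series):
            eps = continuity_fn(delays, comps, taus, i)
            for j in local_maxima(eps):
                dL = delta_L_fn(delays + [taus[j]], comps + [i])
                if dL < best[0]:        # only strictly negative Delta-L counts
                    best = (dL, taus[j], i)
        if best[1] is None:             # all Delta-L >= 0: break criterion
            break
        delays.append(best[1])
        comps.append(best[2])
    return delays, comps

# hypothetical stand-ins: a bumpy "continuity curve" (peak at tau = 5) and a
# "Delta-L" that stays negative until the reconstruction is three-dimensional
taus = list(range(1, 30))
toy_continuity = lambda delays, comps, ts, i: np.sin(0.3 * np.array(ts))
toy_delta_L = lambda delays, comps: -1.0 / len(delays) if len(delays) <= 3 else 0.1
delays, comps = embedding_cycles([None], taus, toy_continuity, toy_delta_L)
```

The real algorithm additionally treats the first cycle specially (all M² pairwise continuity statistics) and evaluates ΔL as a minimum over T_M, as described in steps (a)-(c).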
Algorithm 1 (excerpt):

    if first embedding cycle then
        Compute ⟨ε*⟩_ik(τ) for all M² pairwise combinations of s_i, s_k for the given τ's
        for each peak τ̂_j in each ⟨ε*⟩_ik(τ) do
            Create v_temp by appending the time series s_i with the τ̂_j-lagged time series s_k
            Compute the L-statistics for v_temp and s_i for a range of parameter values T_M = 2 : τ_max, denoted L_temp(T_M) and L_{s_i}(T_M)
            Compute ΔL_j = min over T_M of [L_temp(T_M) − L_{s_i}(T_M)]
        end for
        From all ΔL_j take the τ̂_j which corresponds to the peak with minimum ΔL, ΔL_min
        Save ΔL_min and v_temp
    else
        Compute ⟨ε*⟩_i(τ) for v_act and all s_i for the given τ's
        for each peak τ̂_j in each ⟨ε*⟩_i(τ) do
            Create v_temp by appending v_act with the τ̂_j-lagged time series s_i
            Compute the L-statistics for v_temp and v_act for a range of parameter values T_M = 2 : τ_max, denoted L_temp(T_M) and L_act(T_M)
            Compute ΔL_j = min over T_M of [L_temp(T_M) − L_act(T_M)]
        end for
        From all ΔL_j take the τ̂_j which corresponds to the peak with minimum ΔL, ΔL_min
        Save ΔL_min and v_temp
    end if
    if ΔL_min < 0 then
        Set v_act = v_temp and proceed with the next embedding cycle
    else
        v_act constitutes the final reconstruction vectors v_final
    end if

We give some remarks on the proposed idea:
(i) In the very first embedding cycle the actual reconstruction vectors v_act are not yet defined. Roughly speaking, the algorithm needs to find a time series to start with, which can act as the first component of the final reconstruction vectors. As explained in algorithm 1, each of the available time series serves as v_act in the first run and consequently M² continuity statistics are computed in the first step, i.e., one for each combination (i, k) of the given time series s_i(t). Note that we always use the unlagged available time series s_i(t) as the first component of the reconstruction vectors, i.e. we set τ_1 = 0. The continuity statistic reveals the correlation structure of its input, meaning that a lagged initial time series would lead to different consecutive delay values. However, the differences between the finally chosen delay values, as well as the total time window of the reconstruction T_w = max(τ_1, τ_2, . . .), would be identical, because the correlation structure does not change, at least in the case of infinite data. Practically, any τ_1 ≠ 0 only reduces the amount of available data, and searching for an optimal τ_1 in the sense of a minimal ΔL in the first embedding cycle D_1 would increase the computation time tremendously.
(ii) The L-statistic cannot serve as an absolute measure, due to its dependence on the parameter T_M. Consider a time series and a potential two-dimensional embedding consisting of this time series and a τ-lagged version of it. This corresponds to the first embedding cycle D_1 in algorithm 1. Figure 4 shows the L-statistic for a range of parameters T_M for the single time series and for the potential two-dimensional reconstruction for two deterministic systems (panels A, B) and in case of the time series being uniformly distributed random numbers (panel C).
Whenever the L-statistic of the two-dimensional reconstruction (orange graph) is lower than that of the single time series (blue graph), an embedding would be beneficial (grey shaded areas in figure 4). But this is not always the case throughout the course of T_M (see panel B). The conclusion is that it is not meaningful to judge a reconstruction by a single L-value gained from a fixed T_M. A reconstruction is always related to a function L(T_M) of the parameter T_M, and it is sensible to look at relative L-discrepancies between two consecutive embedding cycles, namely ΔL. It turns out that some stochastic signals will yield a negative ΔL for T_M = 1. In panel C of figure 4 this is not the case, but the proximity of the two curves for T_M = 1 is discernible. This is comprehensible by recalling that the L-statistic inherits the MSE [see equation (F5)]: a one-sample prediction horizon is simply too short to properly distinguish deterministic from stochastic sources. As a consequence we compute the L-statistics in each embedding cycle for T_M-values starting with T_M = 2. Thus, for any embedding cycle, each peak of the continuity statistic does not receive a certain absolute L-value, but rather a maximum possible decrease of L, ΔL, with respect to the actual embedding (figure 3). One then simply picks the peak which yields the largest decrease. We cannot rule out the possibility of obtaining a lower overall decrease in L over all embedding cycles by taking a different 'path', i.e. by not going for the maximum decrease of L in each embedding cycle; this would correspond to reaching a local minimum of the cost function in the parameter and embedding cycle space.
(iii) We propose to stop the embedding process when ΔL > 0 for all considered temporary reconstruction vectors v_temp. One could think about incorporating a threshold, a small negative number, e.g. ΔL > −0.05, to avoid embedding cycles that yield only tiny decreases of the cost function. Throughout all our computations this has not been necessary and we therefore dispense with such an additional parameter.
(iv) The continuity statistic ⟨ε*⟩ itself contains information about the correlation structure of the data (cf section 2.1), which makes it valuable in the context of an automated embedding procedure as proposed here, especially for multivariate input. Not only does the first maximum most often coincide with the value we would obtain from the first minimum of the mutual information, but the continuity statistic of two time series also 'levels off' at a certain value range. The absolute value of ⟨ε*⟩ reflects the correlation structure of the data we are looking at and quantifies the independence of the components for a specific time lag. This fact allows our method to pick only those time series from a multivariate set which belong together and, in combination with the corresponding decrease of the L-statistic, ΔL, to avoid stochastic signals (cf figure 4(C), table 1).
(v) The time complexity of the proposed method is O(N log N) as illustrated in figure 9 (appendix A).

Application, comparison & results
We apply the proposed method to a range of different systems, artificial and experimental data, exemplifying its advantage over established methods. Specifically, we compare our method to standard TDE: we estimate the uniform delay τ by means of the first minimum of the auto-mutual information [18] and estimate the appropriate embedding dimension using Cao's method [22], automatically selecting the embedding dimension when the E1-statistic reaches its saturation value within a given threshold of the slope (we picked a slope of < 0.05). We also look at two more sophisticated methods which can also handle multivariate input, namely the method proposed by Garcia & Almeida (G & A) [44,45] (footnote 8) and Nichkawde's method (MDOP) [12]. The latter two methods do not come with a predefined way to terminate the embedding process.

Footnote 8: Note that we were not able to reproduce the results shown in the papers by Garcia & Almeida [44,45]. Two of the authors, KHK and NM, as well as another experienced researcher, independently implemented this method and got the exact same results. An email to Prof. Garcia explaining this issue and seeking help remains unanswered. We improved the method to be at least capable of producing acceptable results. First, we implemented a Theiler window, which has not been discussed by the authors. Second, we introduced the forward time step used to produce the d_E2 statistic as a free parameter; the authors only discuss the case of a forward time step of 1. Throughout all our computations we set this parameter to the same value as the Theiler window, i.e. the value of the first minimum of the auto-mutual information. In the multivariate input cases, we picked the maximum of all first minima of all auto-mutual informations. The reader is welcome to follow the implementation process at https://github.com/JuliaDynamics/DelayEmbeddings.jl/pull/38.
In order to terminate them we use Hegger's method [23] of obtaining the optimal embedding dimension. Specifically, we set an FNN threshold of 0.05, i.e., the algorithm breaks when the normalized fraction of FNNs falls below this threshold, in order to allow the algorithm to give meaningful results in the presence of noise. The threshold for the tolerable relative increase of the distance between nearest neighbours when increasing the embedding dimension is set to 2, as suggested in [20,23]. The threshold defining the factor of tolerable stretching for the d_E1-statistic in the case of G & A's method is set to 10, as suggested by the authors. We estimate the decorrelation time by the first minimum of the auto-mutual information and use it as the Theiler window [51] in all approaches. In the multivariate input cases, we pick the maximum of all first minima of all auto-mutual information statistics. For distance computations the Euclidean norm is used.
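The FNN termination used for the competitor methods can be illustrated compactly. The following is a simplified single-ratio reading of the Kennel/Hegger-style test (tolerance rtol = 2, FNN threshold 0.05, a Theiler window excluding temporal neighbours), written by us for illustration, not the reference implementation.

```python
import numpy as np

def fnn_fraction(s, m, tau, rtol=2.0, theiler=1):
    """Fraction of false nearest neighbours when increasing the embedding
    dimension from m to m + 1 (simplified ratio test)."""
    n = len(s) - m * tau              # rows available for the (m+1)-dim vectors
    def embed(mm):
        return np.column_stack([s[j * tau: j * tau + n] for j in range(mm)])
    Vm, Vm1 = embed(m), embed(m + 1)
    false = 0
    for i in range(n):
        d = np.linalg.norm(Vm - Vm[i], axis=1)
        d[max(0, i - theiler): i + theiler + 1] = np.inf   # Theiler window
        j = int(np.argmin(d))
        # the neighbour is false if the extra coordinate stretches the distance
        if abs(Vm1[i, m] - Vm1[j, m]) / d[j] > rtol:
            false += 1
    return false / n

def optimal_dimension(s, tau, threshold=0.05, m_max=10):
    """Smallest m for which the FNN fraction falls below the threshold."""
    for m in range(1, m_max + 1):
        if fnn_fraction(s, m, tau) < threshold:
            return m
    return m_max

# toy usage: a sine wave (200 samples per period) unfolds in two dimensions
s = np.sin(np.linspace(0, 20 * np.pi, 2000))
m_opt = optimal_dimension(s, tau=50)       # tau = quarter period
```

In one dimension many nearest neighbours of the sine wave come from the wrong branch of the oscillation and are unmasked by the extra coordinate, while in two dimensions the neighbourhoods are already faithful.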

Reconstruction evaluation statistics
In order to compare our approach with the established methods we need to quantify the goodness of an embedding. For this, we consider six statistics. The first is the overall decrease of the L-statistic accumulated over all embedding cycles, ΔL_total = ΔL_1 + · · · + ΔL_{m−1}, where m is the embedding dimension and ΔL_i the corresponding L-decreases in the encountered embedding cycles. We use two further statistics which reflect the neighbourhood relations of the reconstruction compared to the reference. One is the mutual false nearest neighbour (MFNN) statistic [52]. Instead of estimating the coupling strength/degree of synchrony of two coupled oscillators, we use this statistic for assessing the similarity between the reference (time series gained from numerical integration) and the reconstruction. Here u_i are the vectors of the reference/original system, v_i, i = 1, . . . , N, the reconstruction vectors, i_k^ref the indices of the K nearest neighbours of index i in the original system, and i_k^rec, k = 1, . . . , K, the corresponding indices measured in the reconstruction. We propose the comparison of K nearest neighbours instead of just focussing on the first nearest neighbour, in order to obtain more robust results in the presence of noise. By sampling the data sufficiently densely we allow the precise determination of quite large neighbourhoods of a fiducial point, so we set K = 10 in our computations. The results vary in their absolute values with different choices of K, but the rank order of the different test methods and their relative differences remain approximately constant. MFNN = 1 corresponds to an ideal Afraimovich diffeomorphism [50,53]; higher values mark worse reconstructions. The other strict criterion we propose quantifies the degree of neighbourhood-relation conservation: the joint recurrence rate fraction (JRRF). It is based on the concept of a recurrence plot (RP), a two-dimensional representation of a dynamical system as a binary matrix [54,55] (appendix D).
JRRF measures the accordance of the RP of the reference system, R ref , and the RP of the reconstruction, R rec .
We compute both R_ref and R_rec by fixing the recurrence threshold such that it corresponds to a global recurrence rate of 8%. This also ensures comparability of the recurrence quantifiers described below [56]. Results are fairly similar for a wide range of recurrence rates we tried, and the particular choice (in our case 8%) is not crucial, since we apply it to all RPs we compare. It is, of course, also possible to compare different RP quantifiers gained from R_rec to the ones derived from R_ref [55]. We here choose the determinism (DET), the diagonal line length entropy (ENTR) and the recurrence time entropy (RTE) (appendix D). The latter two are related to the Kolmogorov-Sinai entropy [57,58], but do not serve as straightforward estimators when necessary corrections on the RP and its quantifiers are ignored [59].
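A minimal sketch of the JRRF computation may be helpful. The fixed-recurrence-rate thresholding and the overlap fraction below follow the description above, while the toy point sets are hypothetical stand-ins for a reference attractor and two reconstructions.

```python
import numpy as np

def recurrence_plot(X, rate=0.08):
    """Binary recurrence matrix; the threshold is fixed such that the global
    recurrence rate is (approximately) `rate`."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    eps = np.quantile(D, rate)
    return (D <= eps).astype(int)

def jrrf(R_ref, R_rec):
    """Joint recurrence rate fraction: share of the reference recurrences
    that are also recurrences of the reconstruction."""
    return np.sum(R_ref & R_rec) / np.sum(R_ref)

# toy data (hypothetical): a 2-D point set as "reference", a rotated copy as a
# faithful "reconstruction" and a 1-D projection as a poor one
rng = np.random.default_rng(0)
ref = rng.standard_normal((300, 2))
c, sn = np.cos(0.7), np.sin(0.7)
good = ref @ np.array([[c, -sn], [sn, c]])   # isometry: identical RP expected
bad = ref[:, :1]                             # projection destroys neighbourhoods
R_ref, R_good, R_bad = (recurrence_plot(X) for X in (ref, good, bad))
```

Since the rotation is an isometry, its RP coincides with the reference RP and the JRRF is (numerically) one, while the one-dimensional projection misses a substantial share of the reference recurrences.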

Paradigmatic examples
We investigate three paradigmatic systems: the chaotic Rössler system in the funnel regime (appendix E.3), a driven Duffing oscillator in regular motion (appendix E.4) and the Mackey-Glass delay equation (appendix E.5). For all systems we compare the mean values of the evaluation statistics over ensembles of 1000 trajectories with different initial conditions. For a concise visual presentation we use spider plots in the following and plot 1 − rel. deviation, i.e., the closer a value gets to unity, the better the accordance with the reference or the best achieved value.
(i) For the chaotic Rössler system, we reconstruct the state space for the univariate case (using the y-component, in order to give TDE the best chances [60]) and for the multivariate case (using the x- and y-components). An overview of the results is shown in figure 5. For TDE in the multivariate case we take the results from the univariate example, because TDE cannot handle multivariate input. Note that this leads to different relative values for MFNN and ΔL, since we plot the deviation from the best score in these cases. The PECUZAL method performs best in the univariate as well as in the multivariate scenario, with improved outcomes for the multivariate one, as expected. This also holds in the case of applied measurement noise, where our method even extends its lead for most measures (table 2). Surprisingly, TDE also provides very good results in the univariate case, and in the multivariate setup Garcia & Almeida's method yields an overall larger decrease of the L-statistic than PECUZAL, but only in the noise-free case (see table 2 in appendix B). This lower ΔL is not reflected in a better performance of the other evaluation statistics. Specifically, the diagonal line length entropy values for G & A's method differ from the true reference value by a double-digit percentage.
(ii) The overall rating also holds for the driven Duffing oscillator (figure 6), but there are notable differences. In contrast to the chaotic Rössler system, the additional time series given in the multivariate scenario do not improve the reconstructions significantly. The JRRF does not get better and remains on a quite high level of 80% (G & A) up to 84% (PECUZAL) accordance with the reference RP R_ref. This is also reflected in the ΔL statistic, which does not improve in the multivariate case for G & A and MDOP either, and only slightly decreases for PECUZAL. The same picture emerges from the other evaluation statistics summarized in table 2. It is important to note that under noise our proposed method outperforms the others in almost all cases, but in general the signal to noise ratio seems to be very low and biases the results of all methods more than in the Rössler example. Specifically, in case of RTE the deviations from the true reference value increase from the low single-digit percentage range up to 62% in the noisy case. The reason is that in regular motion we expect a near-zero value of RTE, and noise easily blurs the diagonal lines in the RP, leading to a broader distribution of recurrence times and, thus, to randomly elevated RTE values. The very same problem makes the ENTR values deteriorate for all methods, here already apparent in the noise free case. In a system with regular motion, ENTR is biased the most when no correction of the diagonal lines is performed, as shown by Kraemer and Marwan [59]. The proposed skeletonization of the recurrence plots, for the purpose of reducing the bias in the diagonal line based RP quantifiers such as ENTR and RTE, leads to considerably better results in the one-digit percentage range even under noise, as expected.
Due to the computational complexity of the skeletonization algorithm [59] we did not apply this correction scheme to all of the 1000 runs in this experiment, but rather tried it on small samples not shown here, in order to understand the poor performance of all methods in case of ENTR and RTE.
(iii) In contrast to the above systems, the Mackey-Glass system (appendix E.5) is infinite dimensional and we have to deal with a univariate time series from the numerical integration, which is why we do not have a 'reference' value we could base our computations on. Therefore, we can only use the ΔL-statistic (table 2). The proposed PECUZAL method performs significantly better than the other methods, with Garcia & Almeida's method coming second, especially in the presence of noise. Here all methods but the proposed one yield reconstructions with corresponding positive ΔL-values, i.e. the suggested embeddings do not decrease the overall L-statistic. Admittedly, PECUZAL also comes with a very small negative ΔL value, with 0 falling within the 1σ-interval. This finding indicates a too low signal to noise ratio, which we comment on below. Farmer [61] conjectured a linear relation between the delay chosen in the Mackey-Glass equation and the corresponding dimensionality of the attractor (cf table 1 in [61]). A linear fit to the data of that table suggests an attractor dimension of d_A ≈ 5.3, with the 95% confidence interval ranging from ≈2.1773 to ≈8.2888. The studied methods give 5 ± 0 (6.7 ± 0.5) (TDE), 3 ± 0 (4.6 ± 1) (G & A), 4 ± 0 (2 ± 0) (MDOP) and 7 ± 0 (2.2 ± 1.7) (PECUZAL), where the bracketed values correspond to the case of 10% additive noise. While all methods meet Farmer's conjecture in the noise free scenario, this does not hold for the MDOP method and the proposed method in the noisy case. Both methods suggest too low embedding dimensions. This is because the signal to noise ratio is apparently too low, and PECUZAL treats the signal as a stochastic source for some realizations, where it does not embed the data at all, while it did not do so for the Rössler and Duffing systems, despite the same variance of the white Gaussian noise.
Results from G & A and TDE do fall in the 95% confidence interval, which is large because of the weak data basis given in reference [61] and the resulting uncertain fit. Throughout our experiments we find that the time window of the embedding, i.e. the total time span covered by a reconstruction vector, decreases with increasing noise level. This is very much in line with the findings of Ragwitz and Kantz [62].
We finally look at two made-up, ill-defined multivariate datasets, in order to see how the G & A, MDOP, and PECUZAL methods cope with redundant data and with stochastic signals. (i) The first made-up dataset (Fooling dataset I) combines time series from the Lorenz and the Rössler system: the fifth time series is set to x_Rössler + y_Rössler, whereas the sixth time series is set to x_Lorenz · y_Lorenz + y_Lorenz. Our proposed method does not mix time series from both systems and sticks to one system (Lorenz in this case), as shown in table 1. It suggests a three-dimensional embedding and also does not need the redundant information stored in the fifth and sixth time series of the input dataset. In contrast, G & A and MDOP fail here, suggesting a three-dimensional and a six-dimensional embedding, respectively, mixing up the different systems and yielding a useless reconstruction (figure 7).
(ii) The second made-up dataset (Fooling dataset II) consists of three time series of length 5000, with the first one being an auto-regressive process of order 1 with parameters (0, 0.9) and initial condition u_0 = 0.2, to mimic coloured noise. The second and third time series are Gaussian and uniform random numbers, respectively. While G & A and MDOP embed the non-deterministic time series, our proposed algorithm suggests no embedding and throws an error (table 1). The reason is that for stochastic data the L-statistic is a monotonically increasing function of the embedding dimension for any prediction horizon parameter T_M, i.e. the algorithm cannot minimize L already in the first embedding cycle. For the sake of completeness we stress that this particular example should not be read as a claim of generalizable behaviour of our proposed method for auto-regressive processes of arbitrary order p. In the case of higher-order AR processes, PECUZAL often suggests an embedding with a dimension that corresponds to the order of the AR process, as we would expect theoretically. We have noticed, however, that the embedding depends heavily on the length of the time series used, which is in line with the findings of Holstein and Kantz [43], but also on the choice of the particular AR parameters and the order of the AR process under study. A systematic consideration of PECUZAL's embedding suggestions for this class of processes is beyond the scope of this paper.
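The three signals of Fooling dataset II are easy to reproduce. The following minimal Python sketch generates them; note that we read the AR(1) parameters (0, 0.9) as mean zero and coefficient 0.9, which is our assumption about the notation:

```python
import numpy as np

rng = np.random.default_rng(42)
N = 5000

# AR(1) process x_t = 0.9 x_{t-1} + eta_t with x_0 = 0.2 (mimics coloured noise);
# the parameters (0, 0.9) are read as mean 0 and coefficient 0.9 (assumption)
ar1 = np.empty(N)
ar1[0] = 0.2
for t in range(1, N):
    ar1[t] = 0.9 * ar1[t - 1] + rng.normal()

gauss = rng.normal(size=N)    # Gaussian white noise
unif = rng.uniform(size=N)    # uniform white noise

data = np.column_stack([ar1, gauss, unif])   # the three-column input dataset
```

The first column is strongly autocorrelated (coloured), while the other two are white, which is exactly the situation in which an embedding method should refuse to embed.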

Experimental data
We will now apply the PECUZAL embedding method to experimental data. Specifically, we look at a chaotic time series from electrochemical oscillations. The experiment was performed with the chaotic electrodissolution of nickel in sulphuric acid [63]. A standard three-electrode electrochemical cell was used with a 1 mm diameter nickel wire as working electrode, a Pt counter electrode, and a Hg/Hg2SO4/sat. K2SO4 reference electrode. The electrolyte was 4.5 M H2SO4 at 10 °C. The nickel wire was connected to the working point of the potentiostat through an individual resistance (R_ind), and a potentiostat (Gill AC, ACM Instruments) applied a constant circuit potential (V_0, with respect to the reference electrode). At a given circuit potential, the rate of the metal dissolution, measured as the current, can exhibit chaotic oscillations due to the hidden negative differential resistance and additional nonlinear processes related to the passivation kinetics [64]. About 500 current oscillations were recorded at a data acquisition rate of 200 Hz, which corresponds to about 200 points per cycle and a time series length of N = 100 000. There are two primary bifurcation parameters in the experiment: the individual resistance, which affects the charging time constant of the electrical double layer, and the circuit potential, which drives the dissolution. We consider a setting with R_ind = 1.5 kΩ and V_0 = 1360 mV.
We demonstrate the ability of the proposed embedding method to cope with small to intermediate sized experimental datasets. We first downsample the time series to N = 50 000 [figure 8(A)]. Reconstructions for the four methods (TDE, G & A, MDOP, PECUZAL) are then computed, from which we obtain RPs (with a fixed recurrence rate of 8%, in order to guarantee comparability) and the corresponding RQA quantifiers diagonal line length entropy (ENTR), laminarity (LAM), RTE and the recurrence network measure transitivity (T ), see appendix D. We denote each of these values for each reconstruction method as its reference value. We then repeat the described procedure for 1000 sub-samples of length N = 2500 drawn from the time series at random [shown exemplarily in figure 8(B)], i.e. for each of the 1000 sub-samples we compute the reconstruction for each of the four methods, its corresponding RP and the RQA quantifiers. This results in distributions for ENTR, LAM, RTE and T for each reconstruction method. Finally we compare the medians of these distributions to their reference values and plot the relative deviation in figure 8(C). The capability of the four methods to allow for satisfying estimates from short time series samples differs strongly between the RQA quantifiers. The largest discrepancies to the reference can be noted for TDE in case of T and RTE. For LAM all methods estimate the reference very well from the sub-samples. While our proposed method comes last by a small margin in case of ENTR, it yields the best results for T and LAM and performs well in case of RTE. The example shown here cannot be generalized, but it underpins our claim that PECUZAL provides robust state space reconstructions for a very broad range of processes under different conditions, which are often better than, but always at least as good as, the ones obtained from the established methods.
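The sub-sampling protocol itself is independent of the embedding method. A minimal sketch could look as follows, where the generic `stat` callable stands in for the full chain of reconstruction, RP computation at 8% recurrence rate, and RQA quantification (our simplification for illustration):

```python
import numpy as np

def subsample_median(x, stat, n_sub=1000, sub_len=2500, seed=1):
    """Compute a statistic on randomly drawn contiguous sub-samples
    and return the median over all of them."""
    rng = np.random.default_rng(seed)
    starts = rng.integers(0, len(x) - sub_len, size=n_sub)
    return np.median([stat(x[s:s + sub_len]) for s in starts])

def rel_deviation(x, stat, **kw):
    """Relative deviation of the sub-sample median from the
    full-series reference value of the same statistic."""
    ref = stat(x)
    return abs(subsample_median(x, stat, **kw) - ref) / abs(ref)
```

A small relative deviation then indicates that the quantifier can be estimated reliably from short samples, which is what figure 8(C) assesses for each method and RQA measure.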

Conclusions
A fully automated approach for state space reconstruction from uni- or even multivariate time series is proposed and compared to established methods. The algorithm works iteratively and extends the reconstruction vectors in each embedding cycle by an appropriate time delay and the according time series until a cost function cannot be minimized further. Its core functionality is based on identifying potential time delays and their corresponding time series in each embedding cycle by using the continuity statistic. For each of those delays, temporary reconstruction vectors are built and the cost function is computed. The delay value which yields the maximum decrease of the cost function is selected. If none of the considered delay values yields a decrease of the cost function, the reconstruction cannot get any better and the final embedding is obtained, without the need of setting any threshold parameter. Usually the time delays chosen this way are not simply multiples of each other, but rather reflect even complex correlation structures within the data. This is also why the algorithm is able to detect time series stemming from a stochastic source, which it will not embed. Apart from computing the decorrelation time of the data to provide a valid Theiler window for the nearest neighbour search, and providing a range of possible delay values the algorithm shall consider, neither data preprocessing steps are necessary, nor do any free parameters need to be adjusted before using the proposed routine. The approach is demonstrated on a variety of exemplary systems as well as on experimental data stemming from chaotic chemical oscillators. We find that it often provides better, but always at least equally good reconstructions compared to the established methods. It is furthermore capable of providing meaningful reconstructions for rather short time series, which particularly holds for the case of multivariate input.
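The iterative scheme can be condensed into a short skeleton. The sketch below is a simplification under stated assumptions: the candidate delays are passed in directly (in the actual method they come from the continuity statistic, recomputed in every cycle) and `cost` is a stand-in for the L-statistic:

```python
import numpy as np

def embedding_cycles(s, candidate_delays, cost, max_cycles=10):
    """Greedy embedding-cycle loop (schematic): start with the time series
    itself, then repeatedly append the delayed copy that decreases the cost
    function the most; stop as soon as no candidate decreases it further."""
    n = len(s) - max(candidate_delays)
    delays = [0]
    build = lambda ds: np.column_stack([s[d:d + n] for d in ds])
    best = cost(build(delays))
    for _ in range(max_cycles):
        trials = [(cost(build(delays + [tau])), tau) for tau in candidate_delays]
        c, tau = min(trials)
        if c >= best:          # no decrease -> final embedding reached
            break
        best, delays = c, delays + [tau]
    return build(delays), delays
```

If the cost function increases already in the first cycle, as it does for stochastic data and the L-statistic, the loop terminates with the unembedded time series, which is exactly the behaviour described above.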
The additional computational effort in comparison to standard TDE is manageable and justified. Since the proposed method works automatically, is basically parameter free, and can handle multivariate input without mixing data originating from different systems, we can think of a wide range of potential applications. This is especially true for scenarios where multiple sensors or channels of a detector monitor real world processes that are not isolated observables, i.e. in engineering, earth- and life-science contexts. The provided software (appendix A) in three common coding languages will facilitate the use of the presented method.

(Caption) Ensembles of 1000 trajectories using different initial conditions are used. Additionally, we consider additive noise with an amplitude of 10% of the standard deviation of the corresponding signal. All RPs are computed using a fixed recurrence rate of 8% and a minimum line length of 2.

Appendix C. Dependency on parameters
We investigate the impact of different parameter settings on the resulting reconstruction of our proposed method for the x-component of the Lorenz system (appendix E.1). In particular, figure 10 shows the sensitivity of the L-statistic value and the chosen delays with respect to the number of nearest neighbours kNN for a fixed parameter T_M = 20 (panels A, B), and also the dependence on the continuity statistic parameters, namely the δ-neighbourhood size (panels C, D), the (binomial) probability p (panels E, F) and the significance level α (panels G, H). The critical dependence of the L-statistic on the parameter T_M is discussed in section 3 and figure 4. The results show very little impact on the reconstruction quality and, thus, confirm our choice of fixed parameter values for the algorithm. Specifically, the crucial qualitative course of the continuity statistic, i.e. the position of the local maxima, remains unchanged for relevant choices of α and in the vicinity of p = 0.5, which is the proposed fixed value (figures 11 and 12).

Appendix D. The recurrence plot and its quantification measures
A RP is a binary, square matrix R representing the recurrences of states \vec{x}_i (i = 1, ..., N, with N the number of measurement points) in the d-dimensional state space [54,55],

R_{i,j} = \Theta\left(\varepsilon - \|\vec{x}_i - \vec{x}_j\|\right), \quad (D1)

with \|\cdot\| a norm, ε a recurrence threshold, and Θ the Heaviside function. There are several ways to quantify the structures and dynamics encoded therein. Here we look at the determinism [70], which is the fraction of recurrence points that form diagonal lines of length \ell \geq \ell_{\min},

\mathrm{DET} = \sum_{\ell=\ell_{\min}}^{N} \ell P(\ell) \Big/ \sum_{\ell=1}^{N} \ell P(\ell), \quad (D2)

with P(\ell) being the histogram of all diagonal line lengths in the RP and \ell_{\min} the considered minimal line length (here set to 2). Another quantifier is the Shannon entropy of the probability distribution p(\ell) = P(\ell)/\sum_{\ell} P(\ell) of the diagonal line lengths,

\mathrm{ENTR} = -\sum_{\ell=\ell_{\min}}^{N} p(\ell) \ln p(\ell). \quad (D3)

It is also possible to look at the Shannon entropy of the length distribution of the white vertical lines \ell_w, which correspond to recurrence times,

\mathrm{RTE} = -\sum_{\ell_w} p(\ell_w) \ln p(\ell_w). \quad (D4)

Analogous to the definition of the determinism, the laminarity is the fraction of recurrence points that form vertical lines,

\mathrm{LAM} = \sum_{\ell_v=\ell_{v,\min}}^{N} \ell_v P(\ell_v) \Big/ \sum_{\ell_v=1}^{N} \ell_v P(\ell_v), \quad (D5)

with \ell_v the length of a vertical line, P(\ell_v) the histogram of all vertical line lengths in the RP and \ell_{v,\min} the considered minimal line length (here set to 2). Finally, the RP can be considered as the adjacency matrix of an ε-recurrence network [71], A = R - \mathrm{Id}, which allows us to define the transitivity [72]

\mathcal{T} = \sum_{i,j,k} A_{jk} A_{ij} A_{ik} \Big/ \sum_{i,j,k} A_{ij} A_{ik}, \quad (D6)

which is a measure characterising the geometric structure of the state space attractor [73].
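To make the definitions concrete, here is a small brute-force Python sketch of an RP and the determinism. It is illustrative only; actual analyses should use dedicated, efficient RQA software:

```python
import numpy as np

def recurrence_plot(x, eps):
    """Binary recurrence matrix R_ij = Theta(eps - ||x_i - x_j||)."""
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    return (d <= eps).astype(int)

def diagonal_line_histogram(R):
    """Histogram P(l) of diagonal line lengths (line of identity excluded)."""
    N = R.shape[0]
    counts = np.zeros(N + 1, dtype=int)
    for k in range(1, N):                      # upper diagonals; R is symmetric
        run = 0
        for v in np.append(np.diag(R, k), 0):  # trailing 0 flushes the last run
            if v:
                run += 1
            elif run:
                counts[run] += 1
                run = 0
    return counts

def determinism(R, lmin=2):
    """DET: fraction of recurrence points on diagonal lines of length >= lmin."""
    P = diagonal_line_histogram(R)
    l = np.arange(len(P))
    return (l[lmin:] * P[lmin:]).sum() / max((l * P).sum(), 1)
```

For a smooth, strongly periodic signal DET is close to 1, while white noise yields a much smaller value; ENTR, RTE and LAM follow the same pattern of building histograms over line structures.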

Appendix E. Exemplary models
The models described in the following are all integrated using DynamicalSystems.jl and DifferentialEquations.jl [68,74] in the Julia Language.

E.1. Lorenz system
The Lorenz system [75] is defined as

\dot{x} = \sigma(y - x), \quad \dot{y} = x(\rho - z) - y, \quad \dot{z} = xy - \beta z.

We set the initial condition to u 0 =

E.5. Mackey-Glass equation
The Mackey-Glass equation [81] is the nonlinear time delay differential equation

\dot{x} = \beta \frac{x_\tau}{1 + x_\tau^{n}} - \gamma x,

with the lag τ = 44 and the parameters n = 10, β = 0.2, and γ = 0.1; x_\tau represents the value of x at time t − τ. We randomly choose the initial conditions uniformly from the interval [0, 1.5], use a sampling time of Δt = 0.5 and discard the first 2000 points of the integration as transients, resulting in time series of length N = 10 000.
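For readers without a delay-equation solver at hand, the dynamics can be sketched with a plain Euler scheme and a ring buffer for the delayed state. Note that this is not the integrator used in the paper, and the step size and constant initial history below are our own choices:

```python
import numpy as np

def mackey_glass(n_steps, dt=0.1, tau=44.0, beta=0.2, gamma=0.1, n=10, x0=1.2):
    """Euler integration of dx/dt = beta*x_tau/(1 + x_tau**n) - gamma*x,
    where x_tau = x(t - tau); the delayed values live in a ring buffer."""
    lag = int(round(tau / dt))
    hist = np.full(lag, x0)       # constant history on [-tau, 0] (assumption)
    x = np.empty(n_steps)
    x_cur = x0
    for i in range(n_steps):
        x_tau = hist[i % lag]     # value from lag steps (= tau time units) ago
        x_cur += dt * (beta * x_tau / (1.0 + x_tau**n) - gamma * x_cur)
        hist[i % lag] = x_cur     # overwrite slot; read again in lag steps
        x[i] = x_cur
    return x
```

As in the paper, transients should be discarded and the output downsampled to the sampling time Δt = 0.5 before any embedding.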

Appendix F. Details of the L-statistic
The concept of quantifying noise amplification in the context of validating an attractor reconstruction has been proposed in [10]. The reconstruction process in the presence of noise and finite data availability is considered as a modelling problem, introducing a noise amplification and an estimation/prediction error for any measure on the reconstructed attractor. The variance of the conditional probability density function is a 'natural criterion for assessing predictability' [10] and is used as the noise amplification for a fiducial point \vec{v}_{\mathrm{fid}}(t) at a given noise level ε on the reconstructed attractor,

\sigma_\varepsilon(T, \vec{v}_{\mathrm{fid}}(t)) = \frac{1}{\varepsilon} \sqrt{\mathrm{Var}\left[\vec{v}(t+T) \mid B_\varepsilon(\vec{v}_{\mathrm{fid}}(t))\right]}, \quad (F1)

with \mathrm{Var}[\vec{v}(t+T) \mid B_\varepsilon(\vec{v}_{\mathrm{fid}}(t))] being the conditional variance of \vec{v}(t+T) for \vec{v}_{\mathrm{fid}}(t) in an ε-radius ball B_\varepsilon(\vec{v}_{\mathrm{fid}}(t)) for a prediction horizon T. The noise amplification

\sigma(T, \vec{v}_{\mathrm{fid}}(t)) = \lim_{\varepsilon \to 0} \sigma_\varepsilon(T, \vec{v}_{\mathrm{fid}}(t)), \quad (F2)

averaged over all fiducial points on the attractor (and squared), \sigma(T)^2, serves as a measure of the predictive power, with respect to the time horizon T, that the reconstruction vectors \vec{v} allow for (for details see [10]). Broadly speaking, a low conditional variance, and thus a low value of \sigma(T)^2, is achieved for sufficiently unfolded attractors, because in this case noise distortions of the true trajectory are not likely to mix states which are far away from each other in the true state space, and consequently the neighbourhood relations on the reconstructed attractor are preserved. Uzal et al [11] reinterpret equations (F1) and (F2) and give an approximation recipe for the conditional variance. The authors redefine the mentioned equations as an average over the prediction horizons T ∈ [0, T_M],

\sigma_\varepsilon(\vec{v}_{\mathrm{fid}}(t))^2 = \frac{1}{T_M} \int_0^{T_M} \sigma_\varepsilon(T, \vec{v}_{\mathrm{fid}}(t))^2 \, \mathrm{d}T, \quad (F3)

and consider the limit

\sigma(\vec{v}_{\mathrm{fid}}(t)) = \lim_{\varepsilon \to 0} \sigma_\varepsilon(\vec{v}_{\mathrm{fid}}(t)), \quad (F4)

where ε is not related to any observational noise level anymore.
The ε-ball B_\varepsilon(\vec{v}_{\mathrm{fid}}(t)) is simply a tool for determining the neighbourhood relations of a fiducial point \vec{v}_{\mathrm{fid}}(t) and their changes when mapped to future states by the reconstruction function (figure 1). The conditional variance in equation (F1) can be approximated by

E_k^2(T, \vec{v}_{\mathrm{fid}}) = \frac{1}{k+1} \sum_{\vec{v}_j \in B_k(\vec{v}_{\mathrm{fid}})} \left\| \vec{v}_j(t+T) - \vec{u}_k(T, \vec{v}_{\mathrm{fid}}) \right\|^2, \quad (F5)

where B_k(\vec{v}_{\mathrm{fid}}) estimates B_\varepsilon(\vec{v}_{\mathrm{fid}}(t)) by the fiducial point itself and its k nearest neighbours, respecting a Theiler window (i.e. avoiding temporal correlations in the neighbour search) [51]. The centre of mass with respect to the chosen time horizon T and the fiducial point \vec{v}_{\mathrm{fid}} is defined as

\vec{u}_k(T, \vec{v}_{\mathrm{fid}}) = \frac{1}{k+1} \sum_{\vec{v}_j \in B_k(\vec{v}_{\mathrm{fid}})} \vec{v}_j(t+T). \quad (F6)

The size of the k-neighbourhood B_k(\vec{v}_{\mathrm{fid}}) of \vec{v}_{\mathrm{fid}} is estimated as

\varepsilon_k^2(\vec{v}_{\mathrm{fid}}) = \frac{2}{k(k+1)} \sum_{i < j} \left\| \vec{v}_i - \vec{v}_j \right\|^2, \qquad \vec{v}_i, \vec{v}_j \in B_k(\vec{v}_{\mathrm{fid}}), \quad (F7)

where \|\cdot\| is the norm used for the distance computation. Finally, E_k^2(T, \vec{v}_{\mathrm{fid}}) [equation (F5)] is averaged over a range of T's in [0, T_M], yielding \bar{E}_k^2(\vec{v}_{\mathrm{fid}}), and the noise amplification estimated from the k nearest neighbours is

\sigma_k^2(\vec{v}_{\mathrm{fid}}) = \frac{\bar{E}_k^2(\vec{v}_{\mathrm{fid}})}{\varepsilon_k^2(\vec{v}_{\mathrm{fid}})}, \quad (F8)

which needs to be averaged over all N considered fiducial points on the reconstructed attractor to obtain

\sigma_k^2 = \frac{1}{N} \sum_{\mathrm{fid}} \sigma_k^2(\vec{v}_{\mathrm{fid}}). \quad (F9)

Since the reinterpretation of ε with the related k now acts as a neighbourhood size parameter [equations (F3)-(F8)], \sigma_k^2 can be normalized with respect to the averaged inter-point distance, which depends on the sampling rate and the scale of the input data [11]. The normalization factor is

\alpha_k^2 = \left( \frac{1}{N} \sum_{\mathrm{fid}} \varepsilon_k^{-2}(\vec{v}_{\mathrm{fid}}) \right)^{-1}, \quad (F10)

with \varepsilon_k^2 from equation (F7). In this way, the final statistic is able to compare attractor reconstructions stemming from different input data and also serves as an irrelevance measure, because large delays will result in large \varepsilon_k^2's. Finally, equations (F9) and (F10) define the L-statistic L_k = \log_{10}(\alpha_k \sigma_k), which has a free parameter k and another implicit parameter T_M [equation (F3)].
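Putting equations (F5)-(F10) together, a brute-force version of L_k can be written in a few lines of Python. This is a sketch under our own simplifications: discrete horizons T = 1, ..., T_M instead of an integral, and naive all-pairs distances instead of a tree-based neighbour search:

```python
import numpy as np

def L_statistic(Y, k=3, T_M=10, theiler=1):
    """Brute-force sketch of L_k = log10(alpha_k * sigma_k) for reconstruction
    vectors Y (rows = states), following equations (F5)-(F10)."""
    N = Y.shape[0] - T_M                       # fiducial points with a future
    D = np.linalg.norm(Y[:N, None, :] - Y[None, :N, :], axis=-1)
    sigma2 = np.empty(N)
    eps2 = np.empty(N)
    for i in range(N):
        d = D[i].copy()
        lo, hi = max(0, i - theiler), min(N, i + theiler + 1)
        d[lo:hi] = np.inf                      # Theiler window around fiducial
        nbrs = np.append(np.argsort(d)[:k], i) # k neighbours + fiducial point
        # eps_k^2: mean squared pairwise distance in the neighbourhood (F7)
        P = Y[nbrs]
        pd = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1) ** 2
        eps2[i] = pd.sum() / (k * (k + 1))
        # E_k^2 averaged over the horizons T = 1..T_M (F5, F6)
        E2 = []
        for T in range(1, T_M + 1):
            fut = Y[nbrs + T]
            E2.append(((fut - fut.mean(axis=0)) ** 2).sum(axis=1).mean())
        sigma2[i] = np.mean(E2) / eps2[i]      # (F8)
    sigma_k2 = sigma2.mean()                   # (F9)
    alpha_k2 = 1.0 / np.mean(1.0 / eps2)       # (F10)
    return 0.5 * np.log10(alpha_k2 * sigma_k2) # log10(alpha_k * sigma_k)
```

A well-unfolded deterministic reconstruction keeps neighbourhoods coherent under time evolution and therefore receives a lower L than a noise embedding of the same dimension, which is the property the PECUZAL algorithm minimizes.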