Learning from the past: reservoir computing using delayed variables

Reservoir computing is a machine learning method that is closely linked to dynamical systems theory. This connection is highlighted in a brief introduction to the general concept of reservoir computing. We then address a recently suggested approach to improve the performance of reservoir systems by incorporating past values of the input signal or of the reservoir state variables into the readout used to forecast the input or cross-predict other variables of interest. The efficiency of this extension is illustrated by a minimal example in which a three-dimensional reservoir system based on the Lorenz-63 model is used to predict the variables of a chaotic Rössler system.


Introduction
Machine learning methods are becoming increasingly important in science and in everyday life, and new approaches and concepts are being developed all the time.One method for predicting temporal developments that was already proposed at the beginning of the millennium is reservoir computing.In this approach, a reservoir of dynamic elements, often consisting of a recurrent neural network, is driven by a (known) input signal, and the dynamic response of the reservoir system is used to predict the future evolution of the input signal or other relevant quantities, e.g., by the linear superposition of its state variables.This basic concept was developed and publicized independently by Jaeger [4] and Maass et al. [5] in 2001 and 2002, respectively, and represents a special form of random projection networks [6].Jaeger introduced the so-called echo state networks, i.e., recurrent neural networks whose nodes consist of one-dimensional dynamic elements with a sigmoid activation function.The reservoir system presented by Maass et al. consisted of a neural system with spike dynamics and was called liquid state machine [5,7].Only later did the generic term reservoir computing become established for the entire class of machine learning methods that exploit the dynamic response of a driven dynamical system for prediction and classification of input signals [8].
Reservoir computing differs from most other machine learning methods by the nature of the training, which here essentially consists of solving a linear system of equations to determine the weights of the linear superposition of (functions of) the reservoir variables [9].However, this fast learning step usually has to be repeated several times in order to optimize the hyperparameters of the reservoir system, i.e., the parameters that control its dynamical features.Nevertheless, the overall training of a reservoir system is generally faster than, for example, that of a deep neural network.However, as Jaeger pointed out in his foreword to the book Reservoir Computing: Theory, Physical Implementations, and Applications by Nakajima and Fischer [10], a deeper understanding of reservoir computing is still lacking despite extensive research since its invention.
In order to achieve fast and energy-efficient performance, hardware implementations of reservoir systems have been devised, opening a new field called physical reservoir computing [29], often based on optical systems [e.g., [30][31][32][33]] and other hardware platforms [26, [34][35][36][37].A readable introduction to reservoir computation in general and physical reservoir computing in particular can be found in the review article by Cucchi et al. [38].For more details and examples, the book by Nakajima and Fischer [10] is another excellent source.
While reservoir computing has been extensively studied in the machine learning community since the seminal works of Jaeger and Maass et al., it initially received little attention in the field of nonlinear dynamics, with a few exceptions [39][40][41].This changed in the last 7 years, mainly due to the work of Ott and collaborators [42][43][44][45].The main reasons for this are the generally increasing interest in machine learning methods and applications, as well as the fact that reservoir computing is a very interesting and not yet fully explored dynamic process.
A key feature of suitable reservoir systems is their ability to generate a unique response for a given input time series, independent from the reservoir system's initial conditions.In the context of reservoir computing, this feature is called echo state property or fading memory.If the known input signal itself originates from a dynamical system, this system and the reservoir system form a pair of uni-directionally coupled (sub-) systems where generalized synchronization can occur.Generalized synchronization is a necessary condition for reproducible response and thus for the functionality of the reservoir system and closely related to the echo state property.The aim of this article is therefore to look at reservoir computing from a non-linear dynamics perspective in order to deepen the understanding of how it works and to provide a review of recently proposed extensions using delayed variables.To this end, in Sections 2.1-2.3,we first formulate the concept of reservoir computing in general terms, i.e., not limited to the use of recurrent neural networks as reservoir systems.In Section 2.4, the echo state property and its relation to generalized synchronization will be discussed.The classical examples of general-purpose reservoir systems are echo state networks, which are introduced in Section 2.5.In Section 3.1, we present an extension of the output function by adding input and state variables from the past.This promising approach is illustrated in Sections 3.2-3.5 using a minimal reservoir system consisting of a single three-dimensional system based on the Lorenz-63 differential equations, which is used to predict the variables of a chaotic Rössler system.Some other concepts to improve the performance of reservoir computing will be briefly discussed in Section 4.

Fundamentals of reservoir computing . Continuous and discrete reservoir systems
In reservoir computing, an input signal drives a dynamical system, the reservoir, and the state variables of this system (evolving in time) are used for generating some desired output, like a temporal forecast of the input signal or a cross-prediction of some other signal that is assumed to be related to the input signal.
The concept of reservoir computing can be implemented and studied in continuous or discrete time.The discrete time framework is used, for example, if a time series sampled in time is input of a dynamical system simulated on a digital computer (e.g., a recurrent network).With discrete time, the reservoir system reads where r[n] ∈ R N res is the state of the reservoir system and u[n] ∈ R N in the input time series, that is often given by sampling a time continuous signal u[n] = u(nT smp ) where T smp denotes the sampling time.
If reservoir computing is implemented on an analog computer or using any driven real-world (physical, biological, etc.) system, a formal description using continuous time systems [i.e., ordinary differential equations (ODEs)] is in general more appropriate.If the input u(t) ∈ R N in is given by N in signals in time, the reservoir system reads where r ∈ R N res is the state of the reservoir system and N res its dimension.The output v ∈ R N out of the reservoir system is given by a parameterized readout function g(u, r) of the reservoir state r and the input u.
During training (or learning), the parameters of g are the only quantities to be estimated and adapted.Since the functional form of the function g should enable efficient training, functions that are linear in their parameters are preferred for reservoir computing design, such as linear superpositions of K j basis functions b k j to define the component v j of the output v as where the coefficients w kj constitute the parameters to be estimated using training data (see Section 2.3).The basis functions b k j can be any linear or non-linear function [e.g., polynomials [43] and radial basis functions [46]], but in most applications affine linear or quadratic polynomials have been used as readout functions of reservoir systems.Affine linear output functions are given by an N esv = 1 + N in + N res dimensional extended state vector x = (1; u; r) ∈ R N esv and an N esv × N out output matrix W out whose elements are the parameters to be estimated during training [here (•; •; •) denotes concatenation and x is a row vector].Equation ( 4) provides the N out dimensional output signal (or time series) v where the j-th column of W out contains superposition coefficients w kj for computing the element v j of the row vector v and the number of basis functions K j equals N esv for all j = 1, ..., N out components of the output v.
In general, the number of basis functions K j can be different for each component v j and can be larger than the dimension N esv of the extended state vector x, for example, if (additional) non-linear basis functions like polynomials [43] or radial basis functions are used.Such non-linear output functions not only extend the possibilities of approximating complicated functional relations, but can also be used to break unwanted symmetries of the reservoir system [47].
In this and the following sections, the reservoir systems are assumed to be finite dimensional.However, also dynamical systems with an infinite dimensional state space, like delay differential equations or extended systems described by partial differential equations may act as dynamical reservoirs.For example, reservoir computing using feedback loops (e.g., ring lasers) has been demonstrated experimentally and in numerical simulations.The dynamics of these delay-based reservoir systems is governed by delay differential equations ṙ(t) = f (r(t), r(t − τ ), u(t)) where τ equals the time delay due to the internal feedback loop (roundtrip time of the light in case of a ring laser, for example).The Depending on the application of the reservoir system, the output may be defined as a function of the reservoir state r, only.This is often the case, for example, when implementing autonomous reservoir systems whose output signal is fed back into their input, see Section . . .The term delay-based reservoir system is used if the internal dynamics of the reservoir system contains time delays.This is to be distinguished from the use of delayed variables in the output function, which is the main topic of the current review, see Section .
Delay-based reservoir systems are often described by a single variable r.
output v(t) = N n=1 w n r(t − τ n−N N ) is obtained by sampling the solution r(t n ) at discrete times t n = t − τ N−n N along the delay line.For a more detailed description of this type of delay-based reservoir computing, see the pioneering work of Appeltant et al. [30,48], a recent review by Chembo [49], references [31,32,50,51] for implementations and extensions, and Hülser et al. [52] for a detailed discussion of the influence of (multiple) delay times as successfully used by Stelzer et al. [53].We will not go further into delay-based reservoir computing, but most of the aspects and features discussed below also apply to this approach.

. Forecasting, cross-prediction, and classification
Reservoir computing can be used for essentially all machine learning tasks including: (i) forecasting a future value u(t + T prd ) of the input signal u(t) (in the case of discrete systems: ), (ii) cross-predicting another variable y(t) or its future value y(t + T prd ), or (iii) classifying the input by some constant label.
In all these cases, the reservoir system is trained such that it provides a task-dependent target value (the training is described in more detail in Section 2.3).Tasks (i) and (ii) can also be combined to learn and provide a signal, which is then used to control another dynamical process.Using different readout functions, a single reservoir system can perform different tasks like (i), (ii), and (iii) in parallel.
There are two main options for predicting the evolution of a target value over time: (a) performing the desired prediction in a single step or (b) decomposing the prediction interval into many small steps, resulting in an iterated prediction for discrete systems and an autonomous ODE in the case of a continuous reservoir system.Both approaches are now described in more detail.

. . Single-step prediction
With single-step prediction, the reservoir system is trained such that its output v(t) provides (an estimate of) the desired future value u(t + T prd ) or y(t + T prd ) (analogously for discrete systems predicting N prd steps ahead).With this method, the reservoir system continuously generates forecasts of the variables of interest that may be used, for example, in control loops (to compensate latency) or for early warning of upcoming (extreme) events.This approach will be illustrated in Section 3.2.The larger the prediction time T prd , however, the more complicated is the function to be approximated (in particular for signals from chaotic systems).Therefore, an iterated prediction with a sequence of smaller time steps is often more advantageous [54,55] as will be discussed in the next section./fams. .future value of the input u[n + 1], which is then used as input for the next prediction step.With this feedback, the reservoir system becomes an autonomous system

. . Output feedback and iterated prediction
and after N prd iteration steps, the desired prediction of u[n + N prd ] is obtained.If the output of the reservoir system is y = u, then the input u[n + 1] for the next temporal iteration step has to be reconstructed from y[n + 1] to close the feedback loop, e.g., with the help of another reservoir system trained for cross-prediction y → u.In general, however, for predicting another target variable, y[n + N prd ]), in some applications, one would first use iterated prediction to estimate u[n + N prd ] and then use the same or another reservoir system to cross-predict y[n Feedback can also be used to predict the future evolution of the input u(t + T prd ) using a continuous reservoir system (Equation 2).In this case, the output v(t) = g(r(t)) = (1; r(t))W out of the reservoir system is trained to approximate the input u(t), and the result is substituted in Equation ( 2) for u(t).The resulting autonomous reservoir system is integrated up to time t + T prd to obtain u(t Of course, if the input signal originates from a chaotic system and if the reservoir system has learned this dynamics, sensitive dependence on initial conditions will lead to predictions that deviate exponentially from the true input signal as T prd increases, i.e., only short-or medium-term forecasts are possible.This prediction horizon is often compared with the Lyapunov time, which is the inverse of the largest Lyapunov exponent of the dynamical system that generated the input signal.Similar to synchronization with sporadic coupling [56,57], a divergence of the prediction from the true time evolution can be avoided by feeding the given input signal u into the autonomous reservoir system from time to time [58].

. . Short-term prediction and climate
Since training optimizes only the performance of short-term predictions, when applied for many steps, iterated prediction not only deviates from the true evolution of the target signal but may also result in asymptotic behavior with dynamical and statistical properties that are (very) different from the target signal (e.g., periodic or diverging solutions, instead of chaos; different values of characteristics like fractal dimensions or Lyapunov exponents).Such long-term features (corresponding to ergodic properties) are also referred to as climate [42,59], and it depends on the application whether it is more important to optimize the accuracy of the shortor medium-term predictions, or whether a correct representation of the climate is the main goal [44,[60][61][62].
Using input data from different sources (e.g., different chaotic attractors), a reservoir system with feedback can be trained to produce forecasts and the climates of these sources [63][64][65].In this case, the autonomous reservoir system is trained to have coexisting attractors and multifunctionality to predict input signals from different sources, and it is possible to switch between the coexisting attractors using suitable input signals [63].

. Training of the reservoir system
The only parameters to be determined during training are the elements of the output matrix W out .If the output function g is chosen to be linear in these parameters, training consists essentially in solving a linear set of equations as will be shown in the following.and y[m] = y(mT smp ) in the case of a continuous reservoir system sampled with a sampling time T smp .To make sure that the dynamical evolution of the reservoir system is (asymptotically) independent from its (in general unknown) initial conditions, this sampling has to be preceded by a sufficiently long transient (or washout) time T trs = M trs T smp of M trs time steps, a requirement that will be discussed in more detail in Section 2.4.
The output of the reservoir system matches the target if W out can be chosen such that the linear equation holds, where X is an M × N esv matrix whose rows are extended states x[m] and Y is an M × N out matrix whose rows are the corresponding target values y[m].If M > N esv , this linear set of equations has more equations (M • N out ) than unknowns (N esv • N out ), while for M < N esv , it is underdetermined.In the latter case, regularization is required to avoid solutions W out with very large matrix elements, but also for the over-determined case M > N esv , the solution may suffer from instabilities in the case that some columns of X are almost collinear, i.e., that elements of u and/or r are highly correlated.This problem can be addressed by means of Tikhonov regularization (also known as Ridge Regression) where a cost function with is minimized that contains a regularization term punishing too large matrix elements W out ij .The regularization parameter β > 0 controls the impact of the regularization and has to be chosen suitably.The minimum of this cost function (Equation 8) is given by where X T denotes the transposed of X.Alternatively, the solution W out (Equation 9) minimizing the cost function (Equation 8) can be computed using the singular value decomposition (SVD) of X = USV T , where S is a diagonal matrix with S ij ≥ 0 for i = j and S ij = 0 for i = j.In this case, the minimum of the cost function (Equation 8) is given by Instead of Tikhonov regularization, one can also use the Moore-Penrose-pseudo inverse X + of X, such that Using the SVD of X = USV T , the pseudo-inverse X + can be written as X + = VEU T with where ǫ > 0 represents the numerical resolution but can also be chosen larger and then serves as a regularization parameter.The difference between the definitions of the matrices D and E is illustrated in Figure 1.In both cases, the impact of very small or vanishing singular values S ii is limited by bounding their inverse values 1/S ii (see Equations 10 and 11).For sufficiently long training data, i.e., with M > N esv , the pseudo-inverse based on Equation (11) may provide better results than Tikhonov regularization (Equation 10), because the latter has a stronger bias for small but not too small singular values.An alternative to minimizing the quadratic cost function (Equation 8) is L1-optimization, where the Euclidean norm • 2 is replaced by the L1-norm • 1 , and the solution of the minimum of the resulting cost function can be computed using the Lasso algorithm [66][67][68].In this case, the computation of the optimal coefficients is more time consuming, but since many of them are zero after training, the calculation of the readout consists of fewer operations and is therefore more efficient.
An important feature of a trained reservoir system is its capability for generalization on similar but previously unseen data.This is notoriously difficult to determine in advance, but for reservoir systems consisting of (a certain class of) recurrent neural networks, Han et al. [69] derived a bound for the generalization error.
Many programming languages such as Matlab, Julia, or Python provide a function pinv(A, tol) for computing the pseudo-inverse of a matrix A with an optional parameter tol corresponding to ǫ.

. Echo state property and generalized synchronization
To use a trained reservoir system in any application, e.g., prediction, its output has to be reproducible and unique if the same input signal is presented again, independently from the (often unknown) initial state of the reservoir at the beginning.In reservoir computing, this feature is known as echo state property (ESP) or fading memory, and it is formally defined as follows [4,70,71].
For all input sequences u[0], u[1], u[2], ... ∈ U ⊂ R N in from a compact set U and any pair of initial conditions of the reservoir system, r[0] and r[0], the iterated application of Equation ( 1) results in lim n→∞ r[n] − r[n] = 0 (or correspondingly lim t→∞ r(t) − r(t) = 0 for continuous time systems (Equation 2) with input signal u(t)).
If the reservoir system fulfills this criterion, it asymptotically "forgets" its initial state after some transient or washout time T trs , and its (current) state is given by a so-called echo state function ξ of the semi-infinite input sequence r [4].
If the input signal was generated by a dynamical system ż = f src (z) or z[n + 1] = f src (z[n]), this system and the reservoir system can be seen as a pair of uni-directionally coupled dynamical systems (i.e., a drive-response configuration), where the echo state property occurs due to generalized synchronization [72 -80].With generalized synchronization, the state of the driven system, i.e., the reservoir, r(t) is (asymptotically) a function ψ(z(t)) of the state z(t) of the driving system generating the input signal, i.e., lim t→∞ r(t) − ψ(z(t)) = 0 (or lim n→∞ r[n] − ψ(z[n]) = 0 for discrete systems).In this way, each state variable r i (t) of the reservoir system represents after a sufficiently long (synchronization) transient time T trs , a dynamically emerging function ψ i (z) of the state of the driving system that is used in the output Equation ( 4) as a non-linear basis function for approximating the target v [39].This function ψ i (z) corresponds to the echo state function r[n] = ξ (..., u[n − 1], u[n]), if the input u is generated by the dynamical system and given by an observation function u = h(z).In this case, (previous) values of the input time series are given by functions of the current state z[n] of the driving system, because u , where φ −k denotes the inverse flow of the driving system.The same holds for continuous reservoir systems driven by an input signal from a continuous dynamical system.
So in principle, any driven dynamical system may be used as a reservoir system if it exhibits generalized synchronization with the dynamical system generating the input signal.This includes analog computers or even real physical processes.The occurrence of generalized synchronization can by checked, for example, using conditional Lyapunov exponents or the auxiliary system approach [74, 76,79].However, generalized synchronization is only a necessary condition for fulfilling the echo state property but is not sufficient to guarantee the same unique response for all initial conditions of the reservoir system.There may be different coexisting attractors that lead to different reproducible asymptotic responses depending on the initial conditions of the driven system.This form of multistability has to be taken in to account [63][64][65] or excluded to guarantee a unique response and thus the echo state property.
The emerging functions ψ i are continuous, and their smoothness properties depend on the (transversal) contraction properties of the response system.In general, the stronger this contraction, i.e., the faster two neighboring trajectories r(t) and r(t) converge with lim t→∞ r(t) − r(t) = 0, the smoother are the functions ψ i .A mathematically more rigorous treatment of this feature can be found, for example, in a series of publications by Stark et al. on forced dynamical systems [81][82][83].A detailed mathematical analysis of the differentiability properties of the echo state functions and their importance for successful reservoir computing was carried out by Grigoryeva and Ortega [84].Hart et al. [85] proved under some mild assumption that ψ almot surely provides an embedding of the dynamics of the driving system into the state space of the reservoir system, and that there exists a linear readout function ϕ(r) that predicts arbitrarily well the input values provided that the dimension of the reservoir system is high enough.If this prediction is used to substitute the input u of the reservoir system an autonomous system r[n

2).
A well-trained autonomous reservoir system reproduces the Lyapunov exponents of the driving system, i.e., the Lypunov exponents of the input signal are a subset of the Lyapunov exponents of the autonomous reservoir [42].In this context, Platt et al. [86] found that the prediction time for autonomous reservoir systems with a low Kaplan-Yorke dimension is the longest.Carroll [87] obtained a similar result for the Kaplan-Yorke dimension calculated from the full Lyapunov spectrum of the coupled drive-response system.
If the response system exhibits no contraction, generalized synchronization and the echo state property cannot occur.If, on the other hand, contraction is too strong, the emerging functions ψ i are very (too) smooth and, therefore, not very well-suited to be used for approximating functional relations of target signals y, whose dependence on the state z is given by a less smooth and more complicated function.It is therefore important that the contraction properties of the response system can be adapted to generate the appropriate level of smoothness of the output for a given modeling task.In case of echo state networks (see Section 2.5), contraction is controlled by the spectral radius of the reservoir matrix W res , and in many studies, it has been observed that weak contraction provides best performance of the reservoir system.Weak contractions lead to long synchronization transients, and an increased memory capacity of the reservoir system quantified by the ability to recover past input signals u[n − k] from the current state r[n] of the reservoir system [88].A comprehensive discussion of the role of generalized synchronization and the question under which conditions iterated prediction of a reservoir reproduces the attractor (i.e., the climate) of the driving system was given by Lu et al. [44].Mathematical criteria and statements for the latter can also be found in [85].
Dambre et al. [89] pointed out that the information processing capacity of a driven dynamical system with fading memory (i.e., echo state property) and linear readout is bounded by the number of linearly independent functions of the input the system can compute (or generate).In general, good performance of any reservoir system can be expected if the time series r i [n] given by the state variables of the reservoir system are linearly independent, because in this case the dimension of the linear subspace spanned by the columns of the matrix X is large.This feature can be quantified by the rank of the covariance matrix X T X, also called covariance rank [90][91][92][93].A large covariance rank will not guarantee good performance, but in general, a reservoir with a large rank will be superior to a reservoir generating output with a lower covariance rank.
An extension of generalized synchronization is consistency [94], which quantifies the degree of functional dependency of a driven dynamical system on its input in terms of the linear correlation of the reservoir variables when repeatedly driven by the same input.If the echo property is fulfilled, the consistency is equal to one.Consistency has also been used for studying reservoir computing, in particular with experimental set-ups [31,32].Lymburn et al. [95] found that even in cases of non-complete consistency, such inconsistent echo state networks may exhibit very good performance because there are subspaces in the state space of the reservoir system that represent signals with high consistency.This observation may lead to an enhanced understanding of the computational capacity of chaotic or noisy response systems and novel unsupervised optimization procedures.Extending the basic concept of consistency, Jüngling et al. [96] have developed an approach to investigate the propagation and distribution of information-carrying signals in a reservoir system.
Last but not least, reservoir computing using echo state networks has also been used to approximate the emergent functions that arise from the generalized synchronization of different coupled systems [97].

. Echo state networks
An echo state network [4,98] is a recurrent neural network acting as a dynamical reservoir that can be given by a set of ordinary differential equations (ODEs) or a set of updating rules for echo state networks that operate in discrete time In both cases, q(•) denotes an activation function (applied component-wise), which in most applications is a sigmoid function like q(•) = tanh(•).b ∈ R N res is a vector of different bias constants b i ∈ R affecting the dynamics of the state variable r i of the i-th node of the network and may be used to break the symmetry of the dynamical equations [47].The constants γ > 0 and α ≥ 0 are called leaking rates and can be used to adjust the temporal scale of the echo state network to that of the input signal and also to control the (fading) memory of the network dynamics.The parameter σ > 0 controls the amplitude of the input signal.Both matrices, W in and W res , are chosen randomly.Since the impact of the input can be controlled by the parameter σ , the elements of the input matrix W in can, for example, be uniformly distributed in the interval [−1, 1].The internal coupling matrix of the reservoir W res and the leaking rates γ or α have to be chosen in a way that the echo state network possesses the echo state property [99].
Furthermore, in many cases, sparse coupling matrices W res provided better performance.Therefore, a sparseness parameter δ is introduced quantifying the relative number of non-zero elements of W res .With a sparse coupling topology, the echo state network decomposes into many loosely coupled subsystems, and this may result in a more diverse dynamics of its nodes [100].
All network parameters mentioned so far, N res , W res , W in , b, σ , γ , α, δ, and the regularization weight β, are hyperparameters, which have to be chosen by the user when designing the echo state network and may have a major impact on its performance.Optimal values of hyperparameters can be found by grid search or using gradient methods [70,101] or Bayesian optimization [102].Improved predictive performance and enhanced generalization capabilities have been obtained with first projecting the state of the reservoir system to a lower dimensional space before computing the parameters of the readout function [103].An approach for reducing the size of the network using proper orthogonal decomposition (or principle component analysis) of its internal dynamics was suggested by Jordanou et al. [104].
An important feature of reservoir systems is their memory capacity, i.e., their ability to reconstruct and output previous input signals from the past.The memory capacity of an echo state network increases with the network size and is also affected by the other hyperparameters [detailed studies using i.i.d.random input time series can, for example, be found in [88,105,106]].For a given input signal and task, there is an optimal amount of memory of the reservoir system maximizing its performance, but it is difficult to determine it in advance without simulating the full reservoir computer [107].
Reservoir systems do not have to be non-linear, because linear system may also generated non-linear output due to their exponential solution functions [108].Bollt [109] showed that an echo state function with a linear activation function is equivalent to a vector autoregressive (VAR) model and if a quadratic readout function is used this is equivalent to non-linear VAR (NVAR) models.Furthermore, he showed that linear echo state networks are less effective but in combination with quadratic readout prediction results for chaotic time series may be acceptable.Carroll and Pecora also found that linear node dynamics in echo state networks is less effective [90].In their study of the impact of network structure, they considered the rank of the covariance matrix of the output, symmetries, and the resulting memory capacity and found that sparsity of the internal coupling matrix is beneficial, but much more important for improving the performance may be flipped links in the network.To cope with dynamics on multiple time scales, Tanaka et al. [110] considered echo state networks with heterogeneous leaky integrator (LI) neurons, where the parameter α in Equation ( 12) is chosen different for each node.
The performance of echo state networks (and reservoir computing in general) also depends on the available input u, more precisely whether full or only partial information about the state of the driving system is provided (e.g., a single component of the state vector of the dynamical system generating the data).Storm at al. [92] demonstrated that if only partial information is available, the reservoir system needs some memory to generated implicitly a delay embedding.This is not necessary if the full state is presented as input, and in this case, even an echo state network without internal dynamics can provide good predictions [62], because static non-linear maps of the input state (given by the activation functions) are sufficient.However, Lymburn et al. [97] have shown that the inclusion of internal dynamics can still be advantageous compared with memoryless reservoir computing, even if the complete state is available as input, if the input-output function to be learned is itself dynamically generated and has a very complex (e.g., fractal) form.
Gauthier et al.
[111] compared echo state networks with NVAR modeling using polynomial basis functions and found that these types of NVAR models provide similar or even better predictions, while requiring shorter time series for training and less computational resources.Jaurigue and Lüdge [115] discussed this approach in the context of other machine learning methods including delay-based reservoir computing with multiple delay loops [53].An evaluation of the performance of different echo state networks with different network structures compared with other machine learning methods was presented by Shahi et al. [116].They found that reservoir computing and NVAR models achieve similar prediction results.

Reservoir computing using delayed variables . Extending the set of basis functions
The performance of a reservoir computing system can be improved by increasing its state-space dimension N res , which results in a larger number of basis function ψ i used for In [ ], the NVAR approach presented is called "next-generation reservoir computing", a title that is possibly ironic but slightly misleading, since static functions are used for prediction and not the response of a dynamical system.Modeling and approximating the flow in embedding space using static non-linear functions goes back to the s using polynomial generating the output (Equation 4) [89].This increases, however, the computational load for simulating the reservoir system.An alternative is the use of delayed input and state variables, u(t − τ u ) and r(t − τ r ), respectively, because they provide additional basis functions.Let φ t be the flow of the driving dynamical systems such that z(t − τ u ) = φ −τ u (z(t)), and let the input signal of the reservoir system be given by u(t) = h(z(t)) where h denotes the observation function.In this case, after the synchronization transient (washout time) decayed, delayed input and reservoir state variables can be (asymptotically) written as functions of the state z(t) of the driving system at time t Here, the functions h • φ −τ u and ψ • φ −τ r , represented by u(t − τ u ) and r(t − τ r ), respectively, can be used to extend the set of basis functions entering in the computation (Equation 4) of the output of the reservoir system.
The extension of reservoir computing using past states of the reservoir system has been first studied by Marquez et al. [117], Sakemi et al. [118], and Del Frate et al. [119] who all found significant improvement in the performance of the reservoir system, a result that will be confirmed in the following.Later, Carroll and Hart [91] showed for an optoelectronic delay-based reservoir computer with only a small number of virtual nodes that time-shifted readout increases the rank and memory of the reservoir computer.In general, using many different delay times τ u and τ r increases the number of columns of the matrix X and (if suitable) its covariance rank, which leads to an improved performance of the reservoir [91,93].
Duan et al. [120] showed that linear readout functions including delayed variables may also provide an embedding of the input dynamics, and they discuss the trade-off between the number of delays and the state space dimension of the reservoir system (i.e., the number of nodes in case of a recurrent network).
Jaurigue et al. [121] showed that the performance of an unoptimized reservoir system can significantly be improved by adding a time-delayed input signal.In their study, they simulated time multiplexed delay-based reservoir computing with a single non-linear element that can be implemented using optical hardware.In a recent study, Jaurigue and Lüdge [122] showed that the dependance of reservoir performance on the choice of suitable hyperparameter values can be reduced by including delayed variables.

. Predicting Rössler dynamics using a Lorenz-based reservoir system
To demonstrate the effectiveness of reservoir computing using delayed variables, we will consider now a very simple example: a Note that here multiple delays are used in the readout function, only.This is di erent from delay-based reservoir computing using (multiple) delays mentioned in Section . .three-dimensional reservoir system based on the Lorenz-63 system [79,123] This system fulfills the echo state property for all initial conditions r(0) (see Appendix).
The parameter µ is used to adapt the time scale to that of the driving signal u(t) and controls the transversal contraction properties of the response system (see Appendix).Therefore, it is a hyperparameter that has to be carefully chosen.
Using an affine linear readout function, the output component v j (j = 1, 2, 3) sampled at times t n = nT smp is given by The minimal reservoir system (Equation 15) will be used to predict the variables z i of a chaotic Rössler system [79,124] from the observations u[m] = u(t m ) = z 1 (t m ) of the first variable at times t m = mT smp , where T smp denotes the sampling time.
. Cross-prediction of unobserved state variables of the Rössler system In our first example, the reservoir system (Equation 15) is used to cross-predict the variables z 2 (t m ) and z 3 (t m ) of the chaotic Rössler system (Equation 17) from previous observations u This task is known in control theory as state observer.If a training time series u(t k ) of length M is available, the computation of the weights w ij in Equation ( 16) consists of solving the set of linear equations (Equation 7) with where with the columns of Y given by the target values z i [m] = z i (t m ) of the Rössler system.This set of linear equations (Equation 18) is solved for the weights W out = X + Y using the Moore-Penrose pseudo-inverse X + of X (see Section 2.3) with ǫ = 10 −7 σ 1 , where σ 1 denotes the largest singular value of X.
Figure 2 shows predictions of z 2 and z 3 obtained with a training time series of length M = 50, 000 and sampling time T smp = 0.01 corresponding to a time interval of length 500.While the crossprediction of the z 2 component may be acceptable, the peaks of the z 3 variable are not correctly reconstructed.This relatively poor result should be expected given the fact that only time series of u, r 1 , r 2 , and r 3 have been used to predict the target signal.The bottom panels of Figures 2C, F show the dependence of the Normalized Root Mean Squares Error (NRMSE) of the test data on the parameter µ for both target signals y 2 and y 3 , respectively.
Here, ȳj stands for the mean value of y j , σ 2 y j represents its variance, and N test is the length of the time series used to compute the test error.The NRMSE quantifies the performance of the reservoir compared with a prediction of y j by its mean value ȳj .The lowest errors occur for predictions of z 2 at µ = 0.025 and for z 3 near µ = 1.1.This large difference in optimal µ values may be due to different smoothness properties of the functions ψ i required to successfully predict z 2 or z 3 (see Section 2.4 and Appendix).Such differences with respect to optimal (hyper) parameters may complicate the simultaneous prediction of different target variables using output from the same reservoir system.

. Cross-prediction using delayed variables
We will now increase the pool of signals (or basis functions) using past values of the input signal (Equation 13) and the state variables (Equation 14) of the reservoir system (Equation 15) in terms of delay vectors u(t) = [u(t − N u τ u ), ..., u(t − τ u ), u(t)] and r k (t) = [r k (t − N r τ r ), ..., r k (t − τ r ), r k (t)], which can for t = mT smp , Since the target signal y 1 = u = z 1 is known, only the outputs v 2 and v 3 have to be computed to estimate the unobserved variables z 2 = y 2 and z 3 = y 3 for reconstructing the state z of the Rössler system.τ u = L u T smp and τ r = L r T smp be written as To simplify the presentation, we use the same number of N u = N r = N delay delayed variables u and r in the following.Using the row vectors u[m] and r k [m], the linear set of equations for computing the parameters of the affine linear output function reads [1] r 2 [1] r 3 [1] 1 u[2] r 1 [2] r 2 [2] r 3 [2] . . . . . . . . . . . .
with K = 1 + 4(N delay + 1) = 5 + 4N delay basis functions.This set of linear equations can again be solved using the pseudo-inverse of X or any other method for solving a linear set of equations using regularization (see Section 2.3).Note that with delay M + N delay samples at times t 1−N delay , ..., t 1 , ..., t M−1 , t M are required for constructing the matrix X and that the number of basis functions K = 5 + 4N delay used for approximating the target function increases linearly with N delay .Furthermore, the rows of matrix X represent a projection of the state (z, r) of the full coupled system to a K-dimensional space (providing an embedding if K is large enough).
Figure 3 shows prediction results for z 2 and z 3 obtained with N delay = 100 and delay times (τ u = 0.32, τ r = 0.37) for z 2 and (τ u = 0.03, τ r = 0.30) for z 3 , such that the delay vectors for z 2 cover approximately a period of time of 32 < N delay τ u,r < 37 corresponding to about six oscillations of z 1 and z 2 .As can be seen, not only the z 2 -component of the Rössler system is recovered with high precision, but also the peak structure of the input signal and the z 3 -component.
The delay times τ u , τ r and the number of delays N u and N r are additional hyperparameters that have to be carefully chosen for a given prediction problem.One method for optimization is the maximization of the rank correlation [93], which can also be achieved by a targeted use of the zeros or minima of the autocorrelation functions of the input signal and the reservoir states.
. Predicting future evolution So far, only the current values of the target signals z 2 (t) and z 3 (t) have been reconstructed using input signals z 1 (t) until time t.As discussed in Section 2.2, reservoir systems can also be used for predicting the future evolution of the variables of interest.Figure 4 shows prediction results where the output of the reservoir system provides the target signals z 2 (t + T prd ) and z 3 (t + T prd ) In general, di erent numbers of delay terms with di erent delay times could be used for u and each variable r k .This would, however, increase the number of hyperparameters that have to be determined using grid search, for example.Cross-prediction of the (A) z and (C) z variables of a chaotic Rössler system (Equation ) from its first variable z using a Lorenz-based system (Equation ) with delayed readout (Equations and ).Hyperparameters for the prediction of z are µ = ., τ u = ., τ r = ., N delay = and for the prediction of z the values µ = ., τ u = ., τ r = ., N delay = were used.The elements of the output matrix W out were computed using Equations ( ) and ( ) with regularization parameter ǫ = − σ , where σ denotes the largest singular value of X.The diagrams (B, D) show the corresponding prediction errors e j (t) = v j (t) − z j .The mean test errors (Equation ) equal E = .
for a prediction time of T prd = 10 (in units of the time variable of the Rössler system).As indicated by the green horizontal bars in Figures 4A, C, this period of time corresponds to about 1.6 oscillation periods of the z 2 variable.The continuously available future values of the target signals can be used, for example, for implementing a control system or for predicting the occurrence of "extreme events" like the peaks of the z 3 variable.A prediction scheme based on delayed variables can also be implemented using feedback (see Section 2.2.2).In this case, the output which is fed back as input to the reservoir is a function

Extensions and improvements to reservoir computing
The use of delayed variables is not the only way to extend and improve the reservoir computing paradigm.Many studies for echo state networks have been performed to learn more about criteria for choosing the most suitable network architecture [125][126][127][128][129] and alternative activation functions (e.g., relu, radial basis functions, ....).Shahi et al. [130] reported more robust and accurate predictions when reservoir computing is combined with an autoencoder, where an echo state network operates in the latent space of the autoencoder.Nathe et al. [131] investigated the effects of measurement noise and found that the performance of reservoir observers can be significantly improved with low-pass filtering applied to the input signal.
Another extension is knowledge-based reservoir computing [132,133], where valuable information from existing models based on first principles and/or other empirical models are combined with a reservoir system.Existing models may be incomplete and are thus quantitatively not exact, but they still contain the knowledge of the field and provide at least a good first guess for the desired output.In this case, machine learning methods like reservoir computing can be used to add required corrections to the predicted values or to create the conditions under which the model can be applied.For example, to make use of existing knowledge-based dynamical models like ordinary or partial differential equations, one has to know suitable initial conditions, e.g., the full state of a dynamical system, which is in most cases not given by the data available.In this case, a reservoir system (like the example presented in Section 3.3) can be used to reconstruct the full state Reservoir computing has also been successfully applied to spatially extended systems by using several echo state networks operating in parallel to avoid the use of extremely large networks [45,86,[136][137][138].Last but not least, it should be mentioned that the concept of reservoir computing can also be used with quantum systems.Such quantum reservoir computing is currently a very active field [139][140][141][142], but beyond the scope of this review.

Conclusion
Reservoir computing is an efficient method for time series prediction and has mainly been studied using recurrent networks called echo state networks.However, in principle, any driven dynamical system may serve as a reservoir of output signals that can be used, for example, to predict or classify the input signal if the system has a fading memory of its own initial condition.This feature is also called echo state property, and it is closely linked to generalized synchronization.The concept of reservoir computing can be implemented not only on digital computers but also on a hardware basis.Such physical reservoir computing systems may offer advantages like very high speed or low energy consumption.
Of course, useful mathematical models do not only exist in physics, but also in many other fields of science.Recently, it has been shown that the inclusion of past values of input and reservoir state variables in the readout function can significantly improve the performance of reservoir computing.In particular, for echo state networks, it has been shown that this type of extended readout may be used to reduce the number of nodes (i.e., the size of the network) required to achieve a given level of predictive power [117][118][119][120].As this is a simple but very effective method for improving reservoir computing, this approach has been dealt with in more detail in this overview.We demonstrated and confirmed its effectiveness using a minimal three-dimensional reservoir system based on the Lorenz-63 system.Using delayed variables, this low-dimensional reservoir system can reconstruct all state variables of a chaotic Rössler system from an univariate time series of the first Rössler variable and even provide some mediumtime prediction of their future evolution.This example was not only chosen to demonstrate the effectiveness of using delayed variables but also to show that there alternative reservoir systems and that it is not mandatory to use high-dimensional systems such as large networks.Of course, a reservoir system with only three variables cannot compete in performance with high-dimensional systems such as echo state networks with hundreds or thousands of nodes.But it may serve, for example, as a building block of an approach with several similar systems running in parallel.This can be done in continuous time, as here, but also using low-dimensional discrete dynamical systems.Such structures may lead to novel reservoir system designs that can be efficiently implemented in (physical or biological) hardware.

Frontiers
However, the definition (Equation ) is better suited to include in the output Equations ( ) and ( ) the most current information at time n, which is given by r[n] and u[n].
Let us first consider the case of a discrete reservoir system where the output v[n] is a function g(r[n]) = (1; r[n])W out of the state of the reservoir system r[n], only, and provides an estimate of the next Parlitz .

A
reservoir system driven by an input signal (Equations or ) is sometimes called a listening reservoir and autonomous reservoir systems with a feedback loop (Equations and ) are referred to as predicting reservoirs.The same reservoir can, e.g., be trained to model the function u[n − 1] → y[n] by determining a suitable output matrix W out .
First, a training set {(v[m], y[m])} consisting of m = 1, ..., M sampled output values v[m] and target values y[m] ∈ R N out has to be generated, where v[m] = v(mT smp )

FIGURE
FIGUREGraphical illustration of the inversion of the SVD matrix S using Tikhonov regularization (Equation ) or when computing the pseudo-inverse (Equation ) for di erent values of the regularization parameters β and ǫ, respectively.

[
] or radial basis functions [ ], including Tikhonov regularization [ ].In this context, early work on NARMAX models should also be mentioned, see,

FrontiersFIGURE
FIGURECross-prediction of the (A) z (t) and (D) z (t) variables of the chaotic Rössler system (Equation ) from u = z using the Lorenz-based reservoir system (Equation ).The elements of the output matrix W out were computed using Equations ( ) and ( ) with regularization parameter ǫ = − σ , where σ denotes the largest singular value of X.The diagrams (B, E) show the corresponding prediction errors e j (t) = v j (t) − z j (t).In (C, F), the dependences of the test errors E and E (Equation ) on the reservoir parameter µ are shown.The examples (A, B, D, E) have been computed for µ = .and µ = ., respectively, which are close to the minima of the test errors shown in (C, F).

FIGURE
FIGURE

FIGURE
FIGUREPrediction of future values of the (A) z and (C) z variables of a chaotic Rössler system (Equation ) from its first variable z using a Lorenz--based system (Equation ) with delayed readout (Equations and ).The prediction time interval T prd = is indicated by a horizontal bar.Hyperparameters for the prediction of z are µ = .τ r = ., τ u = .N delay =, and for the prediction of z , the values µ = .τ r = ., τ u = ., N delay = were used.The values of the output matrix W out were computed using Equations ( ) and ( ) with regularization parameter ǫ = − σ , where σ denotes the largest singular value of X.The diagrams (B, D) show the corresponding prediction errors.The mean test errors (Equation ) equal E = .andE = . .
vector from available observations, which can then be used as initial condition of the mathematical model to compute its future evolution.Such hybrid configurations are called knowledge-based or physics-informed machine learning and often require (much) less training data compared with purely data-driven algorithms (i.e., without employing a physics-based model).A comparison of different approaches for hybrid reservoir computing was recently provided by Duncan and Räth [134].If no model based on first principles is available, one can also combine reservoir computing with another type of data-driven model [e.g., some ODE model obtained by sparse identification of a polynomial vector field [135]].

Frontiers
Here, we assume that prior to the sampling of the training data a su ciently long transient time T trs has elapsed.