Performance boost of time-delay reservoir computing by non-resonant clock cycle

The time-delay-based reservoir computing setup has seen tremendous success in both experiment and simulation. It allows for the construction of large neuromorphic computing systems with only few components. However, until now the interplay of the different timescales has not been investigated thoroughly. In this manuscript, we investigate the effects of a mismatch between the time-delay and the clock cycle for a general model. Typically, these two time scales are considered to be equal. Here we show that the case of equal or resonant time-delay and clock cycle could be actively detrimental and leads to an increase of the approximation error of the reservoir. In particular, we can show that non-resonant ratios of these time scales have maximal memory capacities. We achieve this by translating the periodically driven delay-dynamical system into an equivalent network. Networks that originate from a system with resonant delay-times and clock cycles fail to utilize all of their degrees of freedom, which causes the degradation of their performance.


Introduction
Reservoir computing is a machine learning method, which was introduced independently by both Jaeger [1] as a mathematical framework and by Maass et al. [2] from a biologically inspired background.It fundamentally differs from many other machine learning concepts and is particularly interesting due to its easy integration into hardware, especially photonics [3,4].With the help of the reservoir computing paradigm, the naturally occurring computational power of almost any dynamical or physical system can be exploited.It is particularly valuable for solving the class of time-dependent problems, which is usually more difficult to address with artificial neural network-based approaches.A time-dependent problem requires to estimate a target signal (y(t)) t∈T which depends non-trivially on an input signal (u(t)) t∈T , the set of times T may be continuous or discrete.This class of problems contains, in particular, speech recognition or time series prediction [5,6,7], and also has great promise for error correction in optical data transmission [8].Furthermore, reservoir computing can be used to study fundamental properties of dynamical systems in a completely novel way [9], enabling new ways of characterizing physical systems.
The main idea of reservoir computing is simple, yet powerful: A dynamical system, the reservoir, is driven by an input u(t).The state of the reservoir is described by a variable x(t), which can be high-or even infinitedimensional.A linear readout mapping x(t) → ŷ(t) provides an output.While the parameters of the reservoir itself remain fixed at all times, the coefficients of the linear mapping x(t) → ŷ(t) are subject to adaptation, i.e. the readout can be trained.
Reservoir computing is a supervised machine learning method.An input u(t) and the corresponding target function y(t) are given as a training example.Then optimal output weights, i.e. coefficients of the linear mapping x(t) → ŷ(t), are determined such that ŷ(t) approximates the target y(t).This is analogous to a conventional artificial neural network where only the last layer is trained.The goal of this procedure is to approximate the mapping u(t) → y(t) via u(t) → x(t) → ŷ(t) such that it not only reproduces the target function for the given training input but also provides meaningful results for other, in certain sense similar, inputs.From a nonlinear dynamics perspective, the readout mapping x(t) → ŷ(t) is a linear combination of different degrees of freedom of the system.
The reservoir must fulfill several criteria to exhibit good computational properties: First, it must be able to carry information of multiple past input states, i.e. have memory.References [10] and [11] study how a reservoir can store information about past inputs.Second, the system should contain a non-linearity to allow for non-trivial data processing.Jaeger [1] proposed to use a recurrent neural network with random connections as reservoir.In this case the reservoir x has a high-dimensional state space by design, as the dimension equals the number of nodes.Such systems are in general able to store a large amount of information.Moreover, the recurrent structure of the network ensures that information about past input states remains for a number of time steps and fades slowly.Jaeger compared the presence of past inputs in the state of x to echoes.For this reason he called his proposed reservoir system echo state network.
In recent years, the field of reservoir computing has profited from experimental approaches that use a continuous time-delay dynamical system as reservoir [12].While hybrid network-delay systems have also been proposed [13], typically, only a single dynamical nonlinear system is employed and connected to a long delay loop; i.e. as opposed to a network-based approach only a single active element is needed for a delay-based reservoir.Here, the complexity is induced by the inherent large phase space dimension of the dynamical system with time-delay [14,15,16].The main advantage of using a delay system over Jaeger's echo state approach [1] is that one can physically implement the delay system with analogue hardware at relatively low costs-either electronically [12] or even optically [17,18].
Two main time scales exist in such delay-based reservoir systems: the delay-time τ given by the physical length of the feedback line, and a clock cycle τ given by the input speed.In this paper, we show that the non-trivial cases of mismatched delay-time and clock cycle possess better reservoir computing properties.We explain this by studying the corresponding equivalent network, where we can show that non-resonant ratios of τ to τ have maximal memory capacities.
The paper is organized as follows: In section 2 the general model of a reservoir computer based on a single delay-differential system is introduced.We refer to this method as time-delay reservoir computing (TDRC).Section 3 shows numerical simulations and the effect of mismatching clock cycles and delay-times.Section 4 derives a representation of the TDRC system with mismatching clock cycle and delay-time as an equivalent echo state network.Section 5 presents a direct calculation of the memory capacity.Section 6 derives a semi-analytic explanation for the observed decreased memory capacity for resonant τ to τ ratios and provides an intuitive interpretation.All results are summarized in section 7.

Time-delay reservoir computing
In this section we describe the reservoir computing system based on a delay equation.Its choice is inspired by the publications [12,17], where it was experimentally implemented using analogue hardware.In comparison to the general reservoir computing scheme u(t) → x(t) → ŷ(t) described above, an additional 'preprocessing' step is added to transforms the input u in an appropriate way before being sent to the reservoir.This is particularly necessary when the input u is discrete and the reservoir x is timecontinuous.In the following, we describe the resulting chain of transformations in detail.

Step (I): preprocessing of the input
Since the reservoir is implemented with the physical experiment in mind (e.g.semiconductor laser), its state variable x(t) is time-continuous.However, the input data is discrete in typical applications of TDRC [12,13,17].For this reason the preprocessing function u → J(t) translates the discrete input u into a continuous function J(t).
We consider a discrete input sequence (u(k)) k∈N0 , where u(k) ∈ R is one-dimensional, however, the method can be extended to multi-dimensional inputs.The important parameters that define the preprocessing are the clock cycle τ > 0, number of virtual nodes N ∈ N and the resulting time per virtual node θ := τ /N .In Ref. [12] the parameter value for the clock cycle was chosen τ = τ and the optimal values for θ are shown to be in the interval [0.1, 1].Therefore, we will consider the parameter τ of the order of the delay τ and θ within the interval [0.1, 1] or even larger.In fact the exact value of θ, or correspondingly, N , does not qualitatively influence the phenomenon of the error increase by rational τ /τ (as we show in the appendix).
First, a function ū(t) is defined as step function with step length τ .Using the indicator function the definition of ū can be equivalently written as Secondly, ū(t) is multiplied by the τ -periodic mask which is piecewise constant with step length θ = τ /N and values w n .Multiple options for the choice of a mask function are compared in reference [19].The final preprocessed input signal J(t) is u( 2)

u(t) J(t)
Figure 1: Schematic representation of the preprocessing step: the discrete input sequence u(0), u(1), . . .defines the function ū(t) (blue), which is multiplied by a τ -periodic mask function M(t) to obtain the preprocessed input J(t) (red).Here the length of the mask vector is N = 4.The resulting function J(t) enters the reservoir equation (9).

It is a piecewise constant function with values
on the intervals [kτ + (n − 1)θ, kτ + nθ).The details of the preprocessing are illustrated in figure 1.
For further analysis, it is convenient to denote the 'mask'-vector W mask and the input vector J k as follows:

Step (II): reservoir
Inspired by the previous works [12], we study the reservoir given by the delay-diffential equation where τ > 0 is the delay-time, γ > 0 is the input strength, f : R → R the activation function, and J(t) the preprocessed input function.For a given preprocessed input J(t), the reservoir variable x(t) is computed by solving the delay differential equation ( 9) with an initial history function x 0 (s), s ∈ [−τ, 0].In order for the reservoir to be consistent, the reservoir state x(t) for sufficiently large t > 0 should depend only on the input J and be independent on the initial state x 0 (s).In other words, identical reservoirs which are driven by the same input but have different initial states will approximate each other asymptotically.
In the literature about reservoir computing this property is often referred to as echo state property or fading memory property [1].In the literature about dynamical systems, the phenomenon is called generalized synchronization [20,21] or asymptotic stability [22,23].

Step (III): readout
The continuous reservoir variable x(t) needs to be discretized for the output.For this, the dynamical system is read out every θ time units.Because every small time window [kτ + (n − 1)θ, kτ + nθ) is fed with its own input J k,n , these time windows are often seen as 'virtual nodes', and the whole delay system as a 'virtual network' [12,24].We discretize the reservoir variable correspondingly: . . .
In fact, X(k) is the vector containing the N -point discretization of the variable x(t) on the interval ((k−1)τ , kτ ].
The output ŷ = (ŷ(k)) k∈N0 of the machine learning system is defined as where W out is an N -dimensional row vector and c ∈ R is a scalar bias.The output weight variables W out and c are to be adjusted in the training process and are chosen by linear regression for reservoir computers [1].

Effect of the mismatch between delay and clock cycle times
When TDRC was first introduced, the clock cycle τ for the preprocessing of the input mask was chosen to be equal to the delay τ [12,17].In this case one can easily find an 'equivalent network' which is a discrete approximation of the reservoir system.See the suplementery material of [12] or reference [25] for an example.However, recent numerical observations show, that the performance may be improved if one sets τ = τ .The earliest example of this can be arguably found in reference [26].
We use the NARMA-10 task [27], the memory capacity (MC) [10], the time series prediction task for the chaotic Lorenz system and the Santa Fe time-series prediction task [28] to measure the performance of a simple TDRC to illustrate the role of the clock cycle τ .These are typical benchmark tasks and we refer to Appendix C-E for a detailed explanation.
Moreover, we use four different functions for the activation function f in equation ( 9), linear: Mackey-Glass: hyperbolic tangent: quadratic sin: where we use the parameter values p = 1 and ϕ = 0.5.The activation factor is chosen α = 0.9 for all cases.In    9) with linear activation function Eq. (12).For all tasks error peaks (resp.memory capacity drops) are visible for values of τ that are close to integer multiples of τ and low-order resonances.Vertical lines denote the resonant values of τ to τ as indicated on the top axis.The parameters for the simulations are listed in table fact, the precise choice of the parameter α does not influence qualitatively the phenomenon which we consider, see Appendix F for more details.The other parameters are set to τ = 80, N = 50, and γ = 0.02.For our tests we choose 100 different input weight vectors W mask with independently U(−1, 1)-distributed entries w n .For each input weight vector we train the system with 50000 training time steps and a Tikhonov regularization with parameter β = 10 −8 .The training starts after an initial period of 200000 inputs which is necessary to ensure that the reservoirs initial state does not influence the results.Only then do we record the system state for the next 20000 inputs, with which we facilitate the training.The only exception are the simulations for the Santa Fe time-series prediction task.Since the Santa Fe dataset contains only about 9000 data points, we choose aperiod of 4000 initial inputs, 4000 training steps and 1000 test steps.All parameters for the simulations are summarized in table 1.The memory capacity is evaluated up to length 300.Values that are close to 0 are not included in the sum.We estimate this overfitting threshold dynamically during the run, by using uncorrelated random target variables.The top panel of figure 2 shows the results of simulations for the NARMA-10 task for the four different activation functions ( 12)- (15).The figure shows clear peaks of the error for certain values of the clock cycle τ .These peaks are located close to low-order resonances with the delay-time τ = 80 and fulfill the relation aτ ≈ bτ for small a, b ∈ N. In fact, the peaks are located slightly above the resonant τ values.Note, that tanh and the linear function have very similar values.The lower panel of figure 2 depicts the memory capacity and reveals at least part of the reason for this: The total memory capacity for these resonant clock cycles decreases dramatically.
To explore if the observed effect is related to the choice of task, we plot the performance for four different tasks in Figure 3.In each case the linear activation (12) was used.One can see that the error peaks (resp.performance drops) are visible in all tested tasks.
The rest of this paper is devoted to the explanation of this phenomenon.Thereby we focus on the linear activation (12).Lacking a nonlinearity, this is not the optimal choice for the TDRC, the resonance effects that we are interested in seem to be general and independent of f as shown in figure 2, where the performance for the activation functions ( 12)-( 15) is computed.Furthermore, this simplification will allow us to deduce analytical results.In particular, we will be able to show the reason for the drop of the linear memory capacity explicitly.

Approximation by a network
In order to explain the degradation of the memory capacity for the resonant clock cycles, we present the time delay-system reservoir as an equivalent network.Similar procedure was done in [12] for a TDRC with τ = τ and arbitrary activation function.Here we present a derivation of an equivalent network for the case τ = τ , where the results of [12] cannot be applied.However, we have to restrict it to a linear activation function in order to obtain an explicit network representation.
An alternative way to describe the dependence of X(k+ 1) on X(k) and the input for the case τ = τ is presented in [29], where the authors use an integral formula instead of a network formulation.This integral formula includes the case of a nonlinear activation function.For our explanation of the observed memory capacity drops, the network representation presented in this section is, however, indispensable.
Since a detailed derivation is technical, we move it to Appendix A and present the main results in this section.
As follows from Appendix A, the TDRC dynamics can be approximated by the discrete mapping where and X(k) is the discretized vector of the reservoir defined in equation (10).Let us define and explain further nota-tions used in the mapping (16).The matrix is the classical coupling matrix of an equivalent network for TDRC with τ = τ [12].Moreover, where • and • denote the floor and the ceiling function, which we need to employ to allow delay-times τ that are not an integer multiple of the time per virtual node θ.These quantities can be interpreted as follows: m is the number of virtual nodes that are needed to cover a τinterval, q is a measure of the misalignment between τ and τ , and is roughly the ratio between the delay-time τ and the clock cycle τ .For = 0, the delay τ is shorter than the clock cycle τ , and it is similar to or larger than the clock cycle for ≥ 1.The matrices A q and A −(N −q) are shifted versions of the matrix A 0 .They are defined as follows: is obtained by a downwards shift of A 0 by q rows and is obtained by the upwards shift of A 0 by N − q rows.Furthermore and J k is defined as the input vector (8).The mapping (16) generalizes previous results from [12,24].If the clock cycle τ satisfies τ ∈ [τ, τ + θ), the description coincides with the classical case τ = τ and the approximate equation ( 16) yields the same mapping as presented in [12] because then = 1, q = 0, and A −(N −q) = 0.
Analytical approaches for the nonlinear system ( 16) are challenging.To simplify, we will study the effect of different clock cycles τ with the help of a linear activation function f (x) = αx, where α is a scalar.Then equation ( 16) can be written as by plugging in W mask u(k) for J k and by writing ν := 1 − e −θ for the sake of shortness.System ( 16) possesses the following properties: in the case τ ≥ τ + θ, we have = 0, and hence, equation ( 16) is in general an implicit map.This is the physical case of a delay shorter than the clock cycle, which means that the feedback from some of the virtual nodes will act on other virtual nodes within the same cycle.However, for the linear activation function in equation (25) and by (17) we obtain for the case = 0 the explicit map where is a matrix that describes the coupling and local dynamics of the virtual network and is the generalized input matrix.Equation ( 26) is the main result of this section.It shows that the TDRC system can be modeled by an equivalent network for τ ≥ τ + θ if its activation function f is linear.In previous publications [12] such a network representation was derived for the case τ = τ , however, in that case also for nonlinear activation functions.
An important observation from Eq. ( 26) is that the spectral radius of the matrix A must be smaller than one, otherwise the system (26) will not be asymptotically stable.We achieve this by choosing appropriate parameter α = 0.9 (see figure 7 in Appendix F).
Equation (26) allows us to calculate directly some figures of merit in the following.We first use it to explain the drops in the memory capacity in figure 2 for resonant delays.One important aspect to note, is that the basic shape of equation ( 26) does not change with τ .Rather, a changing of the clock cycle leads to a change of the evolution matrices A and W in of the equivalent network.The obtained system (26) can be equivalently considered as a specific echo state network.

Direct calculation of memory capacity
One can find an estimation for the memory capacity of a reservoir computing system by solving the system numerically and let it perform the memory task.But there are also analytic methods for some cases.In this section we explain how to calculate analytically the memory capacity of the linear echo state network (26) which corresponds to the case τ ≥ τ + θ.
Memory capacity was originally defined by Jaeger in [10].In the following, we use a slightly modified formulation.Let the elements u(k) of the input sequence be independently N (0, 1)-distributed.Jaeger introduced the quantity MC d which indicates how well the output ŷ(k) of an ESN may approximate the input value u(k − d) which was fed into the reservoir d time steps earlier.The memory capacity for a recall of d time steps in the past is defined by where E denotes the expectation value and we require the initial state X(0) of the reservoir to be stationary distributed in order to ensure that this definition is consistent, i.e. that the distribution of X(k) does not depend on k.Since the spectral radius of A is less than one, the stationary distribution exists.In such a case, the memory capacity (29) with stationary distributed X(0) can be equivalently written as Note that u(0) ∼ N (0, 1) means that we can drop the term var(u(0)) in (30).The total memory capacity MC is defined as the sum of all d-step memory capacities In the following we denote the optimal output weight vector for (30) by W out d .Let Σ be the covariance matrix of the stationary distribution of the reservoir.Jaeger [10] noted that, if Σ is invertible, one can apply the Wiener-Hopf equation [30] to find For details we refer to Appendix B. Using this optimal value W out d , the memory capacity (30) can be calculated as where we have used the relations and So once the covariance matrix of the reservoir X is invertible, one can directly calculate the memory capacity.
The stationary distribution of system (26) with standard normal distributed input elements u(k) is a multivariate normal distribution with mean zero and covariance matrix We refer to Appendix B for a derivation.It is worth to comment on the structure of the matrix Σ.We note that the summands A j W in (W in ) T A jT in (36) are rank one symmetric matrices with the norm A j W in 2 .Since the spectral radius of A is less than one, this norm converges to zero as j → ∞ and there is only a finite number of terms in (36) which can make numerically significant contribution to the rank of Σ.In addition, as we will see in the following, these rank-one matrices may have almost coinciding eigenspaces.As a result, the matrix Σ is in general numerically not invertible.Since our approach for the derivation of equation (33) relies on the invertibility of Σ, we cannot simply replace Σ −1 by a pseudo-inverse.In order to obtain an invertible covariance matrix, we need to perturb the stochastic process (26).We choose a small number σ η > 0 and let η(k) ∼ N (0, Id) be a sequence of independent multivariate normal distributed random variables.The stochastic process has the stationary distribution N (0, Σ η ) where the covariance matrix given by is invertible.

Explanation for memory capacity gaps
Using the expressions ( 31) and (33) for the memory capacity obtained in section 5, we provide an explanation for the loss of the memory capacity when τ /τ is close to rational numbers with small denominator.The explanation is based on the structure of the covariance matrix Σ η given by equation (38) and the corresponding expression for the memory capacity, which we repeat here for convenience where Our further strategy is as follows: (i) Firstly, we remark that the norms of the individual terms in the sum (40) are converging to zero due to the convergence of the series.Hence, only the first finitely many terms play an important role.For instance, for our previously chosen parameters in figure 2, the terms with j 30 do not make a large contribution and can be neglected.In the following we denote the approximate number of significant terms by j n .
(ii) We show that the largest eigenvalue of the j-th term in (40) can be approximated by A j W in 2 with the corresponding eigenvector A j W in .
(iii) We show that the memory capacity is high, i.e.MC d ≈ 1 for d ≤ j n , when the eigenvectors A j W in corresponding to the first relevant terms in the sum (40) are orthogonal.
(iv) Using our setup, we show numerically that the lower order resonances τ /τ ≈ a/b, where a, b ∈ N and b is small, lead to the alignment of the eigenvectors A j W in , and hence, to the loss of the memory capacity.The small shift from the exact resonance values is explained by the standard drift property of delay systems.
(v) Finally we give an intuitive explanation of the obtained orthogonality conditions.
(i) Convergence of the series (40).The series (40) can be considered as the Neumann series (Id − T ) −1 = ∞ j=0 T j , where T X := AX A T , applied to the matrix W in (W in ) T + σ η Id.A sufficient condition for the convergence of such a series is that T k op := sup X =1 T k X < 1 for some k > 0, where • is some matrix norm.Moreover, Since the spectral radius of A is smaller than one, Gelfand's formula implies that there is a number k > 0 such that A k < 1, and hence, the sufficient condition T k op < 1 for the convergence of the series is satisfied.
(ii) Estimating the largest eigenvalues and eigenvectors of the j-th term in (40).Consider at first the term with j = 0: W in (W in ) T +σ η Id.The largest eigenvalue of this matrix is W in 2 2 + σ η and the corresponding eigenvector is W in as can be easily checked by the direct calculation For all other eigenvectors v, which are orthogonal to W in due to the symmetry of the matrix, the corresponding eigenvalues are σ η because These eigenvalues are by definition small, since σ η is a small perturbation.We can also find approximations of the eigenvectors and eigenvalues for the higher order terms Π j + σ η A j A jT , j > 0. Namely, for the unperturbed matrix Π j , the largest eigenvalue is A j W in 2 and the corresponding eigenvector is All other eigenvalues are zero.Since the largest eigenvalue of Π j is geometrically and algebraically simple, it is continuous under the perturbation by σ η Id.Hence, the largest eigenvalue and the eigenvector of Π j +σ η A j A jT are approximated by A j W in 2 and A j W in with an error of order σ η .All other eigenvalues are correspondingly small of order σ η .
(iii) The orthogonality of A j W in leads to the high memory capacity.Let j n be the number of terms in (40) that are significant (see (i)), and let us assume that the eigenvectors A j W in , j = 0, 1, . . ., j n are close to be orthogonal, i.e.
As we will see in (iv), such an assumption is indeed reasonable in our setup.More precisely, one could consider (44) as A j W in , A i W in < ε introducing another small parameter ε 1.In case, when the orthogonality (44) holds, the largest eigenvalues of Σ η and their corresponding eigenvectors can be approximated by A j W in 2 and A j W in , j = 0, . . ., j n .Indeed In this case, the memory capacity can be calculated as The former is off-resonant, while the later is close to the 3/2 resonance, in particular when the drift property of the DDE system is taken into account.In (b) the vectors A j W in point into the same direction for j, j + 2, j + 4, etc., i.e. after two time steps the input values u(k) overlap in the state space of X and the memory capacity drops.In contrast, in the case τ /τ = 1.06(a) it takes almost 30 time steps before the input overlaps with past inputs in the network state.This explains the high memory capacity in this case, which is illustrated in figure 2. follows: for d ≤ j n .Hence, the orthogonality of the vectors A j W in with A i W in , i = j guarantees a high memory capacity.We will present an intuitive explanation for this shortly.
(iv) Resonances of τ and τ lead to lower memory capacity.
The plots in figure 4 show A j W in , A i W in for different ratios τ /τ .White to light blue off-diagonal squares indicate that assumption (44) is satisfied, i.e. orthogonal or almost orthogonal vectors.Dark blue indicates a strong parallelism of the vectors.As can bee seen in the top panel of figure 4, the assumption (44) holds indeed for ratios τ /τ which yield a good memory performance.Conversely, it is strongly violated for critical ratios τ /τ a/b with small denominator b, e.g. the center panel of figure 4.
We note that the critical ratios τ /τ are slightly shifted from the exact resonant values 1, 3/2, etc.This shift is a manifestation of the 'drift' property of delay systems [16], which is caused by the fact that the effective round-trip of a signal in a delay system equals to the delay τ plus a finite processing time δ due to the integration (filtering).Therefore, the small shift of the error peaks is actually due to a resonance between the clock cycle τ and τ + δ.This fact can be seen also in the structure of the coupling matrix A of the equivalent network.
(v) Intuitive explanations.There is an additional intuitive understanding of the above derived formulas.Recall that the original system of the reservoir of equation ( 9) combines the delay term x(t−τ ) and the input J(t) additively.The approximated network formula for an equivalent network translated this into the matrix A, which describes the free dynamics of the network, and the driving term defined by W in .The state of the network is given by an N -dimensional system, and thus can at most hold N orthogonal dimensions [31].Each summand of Σ η can now be understood as an imprint of the driving term on the system after j time steps.For j = 0 the matrix A 0 = Id, and thus the imprint is given by W in , i.e. the information of the current step is stored in the nodes as given by the weights of the effective input weight vector W in .In the next step, the system will get an additional input, but also evolve according to its local dynamics A. Thus, after one time step, the imprint has transformed into AW in , i.e. the summand for j = 1 and σ η → 0. Now in every step, the information that is currently present in the network will be 'rotated' in the phase space of the network according to A, while a new input will be projected onto the direction of W in .This holds in general, so that the j-th summand of Σ η of equation (38) A j W in describes the linear imprint of the input j steps in the past.
The orthogonality condition of equation ( 44) then is the same concept as demanding that new information from the inputs should not overwrite the already present information.If A r ≈ sId for some s ∈ R, then the information that was stored from r steps in the past will be partially overwritten by the currently injected step and lost.Hence, ensuring that the orthogonality between A j W in is fulfilled as much as possible will maximize the linear memory.For the case of resonant feedback, i.e. τ = τ , this condition is not fulfilled.This is due to the fact, that A has a strong diagonal component for the resonant cases, i.e. virtual nodes are most strongly coupled to themselves.This is a simple consequence of the fact that for τ = τ , virtual nodes return to the single real node at the same time that they are updated.Similarly, for higher resonant cases bτ = aτ , A b will in general have a strong diagonal part and thus the eigenvector A b W in will not be orthogonal to A 0 W in , and the information will be overwritten.

Discussion
In this paper we have shown a generalization of the frequently used time-delay reservoir computing for cases other than τ = τ .We observed that a sudden increase in the computing error (NARMA-10, Lorenz, and Santa-Fe NRMSE) and a drop in the linear memory capacity (MC) can be seen for resonant cases of bτ ≈ aτ with a, b ∈ N where b is small for different activation functions including the linear, tanh, sin 2 , and the Mackey-Glass function.We derived an equivalent network for the case τ ≥ τ + θ which extends the previously studied case τ = τ .Assuming a linear activation function f (x) = αx, we can ana-lytically solve the resulting implicit equations and obtain an expression for the total memory capacity MC.Here we find that the resulting memory capacity will be small for cases where τ and τ are resonant because the information within the equivalent network will be overwritten by new inputs very quickly.Even though our analytics so far are only derived for the linear case, we expect these results to hold in more general situations, as numerical simulations with several different nonlinear activation functions in figures 2, 7, 8 indicate.More detailed analysis can be performed in future studies.49) is highlighted in blue.As stated is equation (50), the point k τ + n θ must be chosen such that it lies within this interval.In equation ( 51) the value of x on the integration interval is approximated by the value of x(k τ + n θ).If the endpoints of the interval are grid points, k τ + n θ is chosen to be the right endpoint.
It follows that for t ≥ t 0 .Set t 0 = kτ + (n − 1)θ and t = kτ + nθ.Then where J k,n is defined in (7).One option to discretize the system, is to approximate the function x by a step function with step length θ which is constant on the integration interval.One can find an appropriate step function by choosing k (k, n) and n (n) such that and defining x(t) ≈ x(t) := x(k τ + n θ) for t ∈ (kτ + (n − 1)θ − τ, kτ + nθ − τ ].Then, one can replace x by x in the integrand in equation (49).This yields (51)

A.2. The choice of k and n
The floor and the ceiling function are denoted by • and • , respectively.One can choose k and n in the following way: First, let m ∈ Z, m ≥ 1 be the unique number such that τ ∈ ((m − 1)θ, mθ], i.e. m = τ /θ .Then as illustrated in figure 5. Now, the choice of n follows directly from the restriction n ∈ {1, . . ., N }.It holds that From this result follows that Hence, equation (52) implies Note that one has k = k as long as A.3.Vectorization of the state space and a matrix equation for the discretized system Define and f ≡ (1 − e −θ )f .From (51) follows and repeated application of Eq. (51) yields for n ∈ {2, . . ., N }.These equations can by rewritten as a matrix equation.Let and Then Let := m/N and q := m mod N , as defined in (19), i.e m = N + q.By plugging this into equation ( 53) and noting that 1 ≤ n ≤ N and 0 ≤ q ≤ N −1, one obtains and by replacing m by N + q equation (55) follows (63) Hence, the vector X n (n) (k (k, n) + 1) n=1,...,N can be written as follows: Thus, the map (61) can be written as where the matrices M q = (δ i,j+q ) 1≤i,j≤N and M −(N −q) = (δ i,j−(N −q) ) 1≤i,j≤N are shift matrices.
The matrix A 0 is invertible and can be used to transform the system.Let X := A −1 0 X.Then where the matrix is obtained by a q rows downwards shift of A 0 and the matrix is obtained by an N − q rows upwards shift of A 0 and The equation (69) for matrix B follows from equation (65).It must hold that A.4.An ESN representation of TDRC systems with suitable parameters If τ ≤ τ − θ, then = 0.This follows from the definitions := m/N and m = τ /θ .Equation (66) is in this case an implicit map: However, for a linear activation function f (x) = αx, where α is a scalar, holds f (x) = (1 − e −θ )αx and hence one obtains the explicit linear map where ν := 1 − e −θ .Since X = A −1 0 X and J k = W mask u k , one can write this map in the original coordinates and in terms of the original input sequence where and The network matrix A is plotted in figure 6 for different parameters.

A.5. The ESN representation of classical TDRC systems
The article [12] contains a description of an equivalent echo state network for TDRC systems with τ = τ .This description is consistent with the case τ ∈ (τ − θ, τ ] in the framework of our discretization.In this case, and therefore, A q = A 0 and A −(N −q) is the zero matrix.Thus, equation (66) simplifies to For a linear activation function f (x) = αx and τ ≤ τ − θ, i.e. = 0, the equivalent network written in the original coordinates is where and

B. Derivation of the memory capacity formula
We consider the linear echo state network where the input elements u(k) are independently N (0, 1)distributed.In section 5 we defined and we claimed that is the optimal argument for (83).In the following we show that W out d is indeed the optimal argument for (83).In order to maximize (83), we need to minimize the mean square error We know that X(k) ∼ N (0, Σ) and hence W out X(k) ∼ N (0, W out Σ(W out ) T ). (86) Note that W out Σ(W out ) T is a scalar because W out is a row vector.Since the mean of W out X(k + d) is zero and u(k) ∼ N (0, 1), we have For a quadratic form where v ∈ R N and M is a symmetric matrix, the vector of the partial derivatives is given by Therefore, and hence This formula is called Wiener-Hopf equation [30].It follows that C. The NARMA-10 benchmark The 10th-order nonlinear autoregressive moving average (NARMA-10) task was introduced in [27] to evaluate the performance of machine learning methods on time series estimation.The NARMA-10 sequence (y(k)) k≥0 is defined as follows: for an input sequence with independently U(0, 0.  12) on the linear memory capacity.All other parameters are as in table 1.The numerical simulations in the main part of this report were obtained with α = 0.9.
We obtain a three-dimensional input sequence u(k) of the reservoir by sampling with period 0.1 and normalization of all components, i.e.
For the evaluation we use the NMSE (101).

E. The Santa Fe benchmark
For the Santa Fe time-series prediction task we use a normalized version of the Santa Fe laser series [28] as input.The target is to predict the next value of the series, i.e. as the Lorenz task, the Santa Fe task is a one-timestep-prediction task.For the evaluation we use the NMSE (101).

F. Parameters
Our choice of parameters does not significantly influence our results.In this section we present the numerical simulations that we used to verify this for α, θ resp.N .
Figure 7 shows the memory capacity as a function of the clock cycle τ for different α.The effect is visible for a large range of values.In the paper, we have used α = 0.9, corresponding to the highest capacity out of the tested ones.
Figure 8 shows the memory capacity as a function of the clock cycle τ for two different numbers of virtual nodes N = 50 and N = 100.Since the time per virtual node is θ = τ /N , it ranges between 0.6 and 5.0 in the case N = 50 and between 0.3 and 2.5 in the case N = 100.In both cases The effect is visible.In the paper, we have used N = 50.

Figure 2 :
Figure 2: Normalized mean square error (NMSE) of four different TDRCs given in Eq. (9) with nonlinearities of Eqs.(12)-(15) for the NARMA-10 task (top) and the total memory capacity MCtot (bottom).Clearly visible are error peaks for values of τ that are close to integer multiples of τ and low-order resonances.The presence of the peaks does not depend on the type of the activation function.Vertical lines denote the resonant values of τ to τ as indicated on the top axis.The parameters for the simulations are listed in table1.

Figure 3 :
Figure 3: Normalized mean square error (NMSE) for the NARMA-10 task, the Santa Fe time-series task and the Lorenz task (left axis) and the total memory capacity MCtot (right axis) of the TDRC of Eq. (9) with linear activation function Eq.(12).For all tasks error peaks (resp.memory capacity drops) are visible for values of τ that are close to integer multiples of τ and low-order resonances.Vertical lines denote the resonant values of τ to τ as indicated on the top axis.The parameters for the simulations are listed in table 1.

igure 4 :
The angles between A j W in and A i W in , i, j = 1, . . ., 50 are plotted in color, measuring, in particular, the orthogonality of the vectors A j W in with different j.Panels (a) and (b) correspond to different ratios of τ /τ : (a) τ /τ = 1.06;(b) τ /τ = 1.52 3/2.

Figure 5 :
Figure 5: The time interval over which the function x is integrated in equation (49) is highlighted in blue.As stated is equation (50), the point k τ + n θ must be chosen such that it lies within this interval.In equation (51) the value of x on the integration interval is approximated by the value of x(k τ + n θ).If the endpoints of the interval are grid points, k τ + n θ is chosen to be the right endpoint.

Figure 6 :
Figure 6: Plot of the network matrix A, given by (75) resp.(80), for N 50 and diverse values of τ and τ .The connection weights are truncated at 0.3.The panels (a) and (c) show matrices for the classical case τ = τ .(See subsection A.5.)The panels (b) and (d) show matrices for the case τ ≤ τ − θ (subsection A.4).These particular examples have resonant values of τ and τ , in fact we have (b) τ /τ = 1.5 and (d) τ /τ = 2.The strongest connection weights lie on diagonal lines which are shifted as the ratio of τ and τ changes.The weights below these lines scale with the factor e −nθ , where n is the distance to the line.Since θ = τ /N , the (off-diagonal) weights are larger in panels (c) and (d), were τ is smaller.

E 1 j=0A
[(W out X(k + d)) 2 ] = var(W out X(k + d)) = W out Σ(W out ) T , out X(k + d)u(k)] = cov(u(k), W out X(k + d)).(89) Moreover, W out X(k + d) = W out   A d X(k) + d−j W in u(k + d − 1 − j) k) is independent of X(k).Therefore, cov(u(k), W out X(k + d)) = cov(u(k), W out A d−1 W in u(k)) = W out A d−1 W in .(91)Thus, we obtainMSE = out Σ(W out ) T + 1 − 2W out A d−1 W in .(92)Since the mean square error is quadratic in the argument W out = (w out 1 , . . ., w out N ), it has exactly one local minimum, which is the global minimum.A row vector W out d is the minimum argument if and only if ∂ ∂w out n MSE(W out d ) = 0, n = 1, . . ., N.

3 Figure 7 :
Figure 7: Influence of the parameter α of the linear model of Eq. (12) on the linear memory capacity.All other parameters are as in table 1.The numerical simulations in the main part of this report were obtained with α = 0.9.

Table 1 :
1. Parameters for the numerical computation of the NMSE for the NARMA-10, Lorenz and Santa Fe task and the memory capacity of the system.