Memory and forecasting capacities of nonlinear recurrent networks

The notion of memory capacity, originally introduced for echo state and linear networks with independent inputs, is generalized to nonlinear recurrent networks with stationary but dependent inputs. The presence of dependence in the inputs makes it natural to introduce the network forecasting capacity, which measures the possibility of forecasting time series values using the network states. Generic bounds for memory and forecasting capacities are formulated in terms of the number of neurons of the network and the autocovariance function of the input. These bounds generalize well-known estimates in the literature to a dependent-inputs setup. Finally, for linear recurrent networks with independent inputs, it is proved that the memory capacity is given by the rank of the associated controllability matrix.


Introduction
Memory capacities have been introduced in [Jaeg 02] in the context of recurrent neural networks in general and of echo state networks (ESNs) [Matt 92, Matt 94, Jaeg 04] in particular, as a way to quantify the amount of information contained in the states of a state-space system in relation with past inputs.
In the original definition, the memory capacity was defined as the sum of the coefficients of determination of the different linear regressions that take the state of the system at a given time as covariates and values of the input at a given point in the past as dependent variables. This notion has been the subject of much research in the reservoir computing literature [Whit 04, Gang 08, Herm 10, Damb 12, Bara 14, Coui 16, Fark 16, Goud 16, Char 17, Xue 17, Verz 19], where most of the efforts have been concentrated on linear and echo state systems. Analytical expressions for the capacity of time-delay reservoirs have been formulated in [Grig 15, Grig 16a], and various proposals for optimized reservoir architectures can be obtained by maximizing this parameter [Orti 12, Grig 14, Orti 19, Orti 20]. Additionally, memory capacities have been extensively compared with other related concepts like Fisher information-based criteria [Tino 13, Livi 16, Tino 18].
All the above-mentioned works consider exclusively white noise as input signals and it is, to our knowledge, only in [Grig 16b] (under strong high-order stationarity assumptions) and in [Marz 17] (for linear recurrent networks) that the case involving dependent inputs has been treated. Since most signals that one encounters in applications exhibit some sort of temporal dependence, studying this case in detail is of obvious practical importance. Moreover, the presence of dependence makes it pertinent to consider not only memory capacities, but also the possibility of forecasting time series values using network states, that is, forecasting capacities. This notion was introduced for the first time in [Marz 17] and studied in detail for linear recurrent networks. That work shows, in particular, that linear networks optimized for memory capacity do not necessarily have a good forecasting capacity and vice versa.
The main contribution of this paper is the extension of the results on memory and forecasting capacities available in the literature for either linear recurrent or echo state networks, or for independent inputs, to general non-linear systems and to dependent inputs that are only assumed to be stationary. More explicitly, it is known since [Jaeg 02] that the memory capacity of an ESN or a linear recurrent network defined using independent inputs is bounded above by its number of output neurons or, equivalently, by the dimensionality of the corresponding state-space representation. Moreover, in the linear case, it can be shown [Jaeg 02] that this capacity is maximal if and only if Kalman's controllability rank condition [Kalm 10, Sont 91, Sont 98] is satisfied. This paper extends those results as follows: • Theorem 3.4 states generic bounds for the memory and the forecasting capacities of a generic recurrent network with linear readouts and stationary inputs, in terms of the dimensionality of the corresponding state-space representation and the autocovariance function of the input, which is assumed to be second-order stationary. These bounds reduce to those in [Jaeg 02] when the inputs are independent.
• Section 4 is exclusively devoted to the linear case. In that simpler situation, explicit expressions for the memory and forecasting capacities can be stated (see Proposition 4.1) in terms of the matrix parameters of the network and the autocorrelation properties of the input.
• In the classical linear case with independent inputs, the expressions in the previous point yield interesting relations (see Proposition 4.3) between maximum memory capacity, spectral properties of the connectivity matrix of the network, and the so-called Kalman's characterization of the controllability of a linear system [Kalm 10]. This last condition has been already mentioned in [Jaeg 02] in relation with maximum capacity.
• Theorem 4.4 proves that the memory capacity of a linear recurrent network with independent inputs is given by the rank of its controllability matrix. The relation between maximum capacity of a system and the full rank of the controllability matrix established in [Jaeg 02] is obviously a particular case of this statement. Even though, to our knowledge, this is the first rigorous proof of the relation between network memory and the rank of the controllability matrix, that link has long been part of the reservoir computing folklore. In particular, recent contributions are dedicated to the design of ingenious configurations that maximize that rank [Acei 17, Verz 20].

Recurrent neural networks with stationary inputs
The results in this paper apply to recurrent neural networks determined by state-space equations of the form:

x_t = F(x_{t−1}, z_t), (2.1)
y_t = h(x_t). (2.2)

These two relations form a state-space system, where the map F : D_N × D_d −→ D_N is called the state map and h : R^N −→ R^m the readout or observation map that, all along this paper, will be assumed to be affine, that is, determined just by a matrix W ∈ M_{N,m} and a vector a ∈ R^m, m ∈ N, via h(x) = W^⊤x + a. The inputs {z_t}_{t∈Z} of the system, with z_t ∈ R^d, will be in most cases infinite paths of a discrete-time stochastic process. We shall focus on state-space systems of the type (2.1)-(2.2) that determine an input/output system. This happens in the presence of the so-called echo state property (ESP), that is, when for any z ∈ (D_d)^Z there exists a unique y ∈ (R^m)^Z such that (2.1)-(2.2) hold. In that case, we talk about the state-space filter that maps each input path to the corresponding output path.

Proposition 2.1 Let F : D_N × D_d −→ D_N be a continuous state map such that D_N is a compact subset of R^N and F is a contraction on the first entry with constant 0 < c < 1, that is,

‖F(x_1, z) − F(x_2, z)‖ ≤ c ‖x_1 − x_2‖, for all x_1, x_2 ∈ D_N, z ∈ D_d.

Then, the associated system has the echo state property for any input in (D_d)^Z.

State-space morphisms. In this paragraph we discuss how maps between state spaces can produce different state-space representations for a given filter. Consider the state-space systems determined by the two pairs (F_1, h_1) and (F_2, h_2).

Definition 2.2 A map f : D_{N_1} −→ D_{N_2} is a morphism between the systems (F_1, h_1) and (F_2, h_2) whenever it satisfies the following two properties:

(i) f(F_1(x, z)) = F_2(f(x), z), for all x ∈ D_{N_1}, z ∈ D_d (system equivariance);
(ii) h_2 ∘ f = h_1 (readout invariance).

When the map f has an inverse f^{−1} and this inverse is also a morphism between the systems determined by the pairs (F_1, h_1) and (F_2, h_2), we say that f is a system isomorphism and that the systems (F_1, h_1) and (F_2, h_2) are isomorphic.
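The contraction condition in Proposition 2.1 can be checked numerically. The following sketch (the tanh state map, the matrix sizes, and all variable names are our own illustrative choices, not taken from the paper) drives a contractive state map with the same input path from two different initial conditions; the trajectories become indistinguishable, which is the forgetting of initial conditions behind the echo state property.

```python
import numpy as np

def state_map(x, z, A, C):
    # F(x, z) = tanh(A x + C z); tanh is 1-Lipschitz, so sigma_max(A) < 1
    # makes F a contraction in its first argument with constant sigma_max(A).
    return np.tanh(A @ x + C * z)

rng = np.random.default_rng(0)
N = 5
A = rng.standard_normal((N, N))
A *= 0.9 / np.linalg.norm(A, 2)        # enforce sigma_max(A) = 0.9 < 1
C = rng.standard_normal(N)

z = rng.uniform(-1.0, 1.0, size=2000)  # one input path
x1, x2 = np.ones(N), -np.ones(N)       # two different initial states
for zt in z:
    x1 = state_map(x1, zt, A, C)
    x2 = state_map(x2, zt, A, C)

# Echo state property in action: the state forgets its initial condition.
print(np.linalg.norm(x1 - x2))
```

The gap between the two trajectories is bounded by 0.9^2000 times the initial distance, i.e. it is numerically zero.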
Given a system F_1 : D_{N_1} × D_d −→ D_{N_1}, h_1 : D_{N_1} −→ R^m, and a bijection f : D_{N_1} −→ D_{N_2}, the map f is a system isomorphism with respect to the system

F_2(x, z) := f(F_1(f^{−1}(x), z)), (2.3)
h_2 := h_1 ∘ f^{−1}. (2.4)

Proposition 2.3 Let (F_1, h_1) and (F_2, h_2) be the two systems introduced above. Let f : D_{N_1} −→ D_{N_2} be a map. Then: (i) If f is system equivariant and x^1 ∈ (D_{N_1})^{Z_−} is a solution for the state system associated to F_1 and the input z ∈ (D_d)^{Z_−}, then so is (f(x^1_t))_{t∈Z_−} ∈ (D_{N_2})^{Z_−} for the system associated to F_2 and the same input.
(ii) Suppose that the system determined by (F_2, h_2) has the echo state property and assume that the state system determined by F_1 has at least one solution for each element z ∈ (D_d)^{Z_−}. If f is a morphism between (F_1, h_1) and (F_2, h_2), then (F_1, h_1) has the echo state property and, moreover,

U^{F_1}_{h_1} = U^{F_2}_{h_2}. (2.5)

(iii) If f is a system isomorphism, then the implications in the previous two points are reversible.
Proof. (i) By hypothesis, f(x^1_t) = f(F_1(x^1_{t−1}, z_t)) = F_2(f(x^1_{t−1}), z_t), as required.
(ii) In order to show that (2.5) holds, it suffices to prove that given any z ∈ (D_d)^{Z_−}, any solution (y^1, x^1) ∈ (R^m)^{Z_−} × (D_{N_1})^{Z_−} of (F_1, h_1) associated to z, with y^1 := (h_1(x^1_t))_{t∈Z_−}, which exists by the hypothesis on F_1, coincides with the unique solution U^{F_2}_{h_2}(z) for the system (F_2, h_2). Indeed, for any t ∈ Z_−, y^1_t = h_1(x^1_t) = h_2(f(x^1_t)). Here, the second equality follows from readout invariance and, by the system equivariance and part (i), (f(x^1_t))_{t∈Z_−} is a solution of the state system determined by F_2. This implies that (y^1, (f(x^1_t))_{t∈Z_−}) is a solution of the system determined by (F_2, h_2) for the input z. By hypothesis, (F_2, h_2) has the echo state property and hence y^1 = U^{F_2}_{h_2}(z) and, since z ∈ (D_d)^{Z_−} is arbitrary, the result follows. Part (iii) is straightforward.
Input and output stochastic processes. We now fix a probability space (Ω, A, P) on which all random variables are defined. This triple consists of the sample space Ω, the set of possible outcomes; the σ-algebra A, a collection of subsets of Ω called events; and a probability measure P : A −→ [0, 1]. The input and target signals are modeled by discrete-time stochastic processes Z = (Z_t)_{t∈Z_−} and Y = (Y_t)_{t∈Z_−} taking values in D_d ⊂ R^d and R^m, respectively. Moreover, we write Z(ω) = (Z_t(ω))_{t∈Z_−} and Y(ω) = (Y_t(ω))_{t∈Z_−} for each outcome ω ∈ Ω to denote the realizations or sample paths of Z and Y, respectively. Since Z can be seen as a random sequence in D_d ⊂ R^d, we write interchangeably Z : Z_− × Ω −→ D_d and Z : Ω −→ (D_d)^{Z_−}. The latter is necessarily measurable with respect to the Borel σ-algebra induced by the product topology in (D_d)^{Z_−}. The same applies to the analogous assignments involving Y.
We will most of the time work under stationarity hypotheses. We recall that the discrete-time process Z : Ω −→ (D_d)^{Z_−} is stationary whenever T_{−τ}(Z) =_d Z, for any τ ∈ Z_−, where the symbol =_d stands for equality in distribution and T_{−τ} : (D_d)^{Z_−} −→ (D_d)^{Z_−} denotes the corresponding time shift, (T_{−τ}(z))_t := z_{t+τ}. The definition of stationarity that we just formulated is usually known in the time series literature as strict stationarity [Broc 06]. When a process Z has second-order moments, that is, Z_t ∈ L²(Ω, R^d), t ∈ Z_−, then strict stationarity implies the so-called second-order stationarity. We recall that a square-summable process Z is second-order stationary when (i) the mean E[Z_t] is constant for all t ∈ Z_− (mean stationarity) and (ii) the autocovariance matrices Cov(Z_t, Z_{t+h}) depend only on h ∈ Z and not on t ∈ Z_− (autocovariance stationarity); we can hence define the autocovariance function γ : Z −→ R as γ(h) := Cov(Z_t, Z_{t+h}), which is necessarily even [Broc 06, Proposition 1.5.1]. If Z is mean stationary and condition (ii) only holds for h = 0, we say that Z is covariance stationary. Second-order stationarity and stationarity are equivalent only for Gaussian processes. If Z is autocovariance stationary and γ(h) = 0 for any non-zero h ∈ Z, then we say that Z is a white noise.
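As a quick illustration of the notions just recalled, the following sketch (an AR(1) input and a standard biased estimator, both our own choices) estimates the autocovariance function γ from a simulated stationary path and recovers the theoretical lag-one autocorrelation.

```python
import numpy as np

def sample_autocovariance(z, h):
    # Biased sample estimate of gamma(h) = Cov(Z_t, Z_{t+h}) for a scalar,
    # second-order stationary series z (dividing by len(z), as is customary).
    z = np.asarray(z) - np.mean(z)
    n, h = len(z), abs(h)
    return np.dot(z[:n - h], z[h:]) / n

rng = np.random.default_rng(1)
# AR(1): Z_t = phi Z_{t-1} + eps_t has gamma(h) = phi^|h| / (1 - phi^2)
phi, n = 0.6, 200_000
eps = rng.standard_normal(n)
z = np.zeros(n)
for t in range(1, n):
    z[t] = phi * z[t - 1] + eps[t]

gamma0 = sample_autocovariance(z, 0)
gamma1 = sample_autocovariance(z, 1)
print(gamma1 / gamma0)   # lag-one autocorrelation, close to phi = 0.6
```

For a white noise input, the same estimator would return values near zero at every non-zero lag.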
Corollary 2.4 Let F : D_N × D_d −→ D_N be a state map that satisfies the hypotheses of Proposition 2.1 or that, more generally, has the echo state property and is such that the associated filter U^F : (D_d)^{Z_−} −→ (D_N)^{Z_−} is continuous. If the input process Z is stationary, then so are the state process X := U^F(Z) and the joint processes (T_{−τ}(X), Z) and (X, T_{−τ}(Z)), for any τ ∈ Z_−.

Proof. The continuity (and hence the measurability) hypothesis on U^F proves that X := U^F(Z) is stationary (see [Kall 02, page 157]). The joint processes (T_{−τ}(X), Z) and (X, T_{−τ}(Z)) are also stationary as they are the images of the stationary process Z by the measurable maps z ↦ (T_{−τ}(U^F(z)), z) and z ↦ (U^F(z), T_{−τ}(z)), respectively.

The next result shows that if a state-space system has covariance stationary states and the corresponding covariance matrix is non-singular, then there exists an isomorphic state-space representation whose states have mean zero and orthonormal components. The stationarity condition in this result can be obtained, for example, by using Corollary 2.4.

Proposition 2.5 (Standardization of state-space realizations) Consider a state-space system as in (2.1)-(2.2) and suppose that the input process Z : Ω −→ (D_d)^{Z_−} is such that the associated state process X : Ω −→ (D_N)^{Z_−} is covariance stationary. Let μ := E[X_t] and suppose that the covariance matrix Γ_X := Cov(X_t, X_t) is non-singular. Then, the map f : D_N −→ R^N given by f(x) := Γ_X^{−1/2}(x − μ) is a system isomorphism between the system (2.1)-(2.2) and the one with state map F̃(x, z) := Γ_X^{−1/2}(F(Γ_X^{1/2}x + μ, z) − μ). Moreover, the state process X̃ associated to the system F̃ and the input Z is variance stationary and

E[X̃_t] = 0 and Cov(X̃_t, X̃_t) = I_N. (2.8)

Proof.
Since by hypothesis the matrix Γ_X is invertible, then so is its square root Γ_X^{1/2}, as well as the map f, whose inverse f^{−1} is given by f^{−1}(x) = Γ_X^{1/2}x + μ. The fact that f is a system isomorphism between (F, h) and (F̃, h̃) is a consequence of the equalities (2.3)-(2.4). Parts (i) and (iii) of Proposition 2.3 guarantee that if X is the state process associated to F and input Z, then the process X̃ defined by X̃_t := Γ_X^{−1/2}(X_t − μ), t ∈ Z_−, is the state process associated to F̃. The equalities (2.8) immediately follow.
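The standardization map f(x) = Γ_X^{−1/2}(x − μ) of Proposition 2.5 has an obvious empirical analogue. In the following sketch (sample size and distribution are our own choices), whitening a sample with its own mean and covariance produces, by construction, standardized vectors with zero mean and identity covariance.

```python
import numpy as np

rng = np.random.default_rng(2)
# A correlated, non-centered sample standing in for the state process X.
X = rng.multivariate_normal(mean=[1.0, -2.0],
                            cov=[[2.0, 0.5], [0.5, 1.0]], size=100_000)
mu = X.mean(axis=0)
Gamma = np.cov(X.T, bias=True)

# Inverse symmetric square root Gamma^{-1/2} via an eigendecomposition.
w, V = np.linalg.eigh(Gamma)
Gamma_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T

X_std = (X - mu) @ Gamma_inv_sqrt   # empirical f(x) = Gamma^{-1/2}(x - mu)
print(np.cov(X_std.T, bias=True))   # identity matrix, up to round-off
```

Since the same empirical mean and covariance are used for the whitening, the standardized covariance is the identity exactly, not merely in the large-sample limit.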

Memory and forecasting capacity bounds for stationary inputs
The following definition extends the notion of memory capacity introduced in [Jaeg 02] to general nonlinear systems and to input signals that are stationary but not necessarily time-decorrelated.
Definition 3.1 Let Z : Ω −→ (D_d)^{Z_−} be a variance-stationary input and let F be a state map that has the echo state property with respect to the paths of Z. Assume, moreover, that the associated state process X : Ω −→ (D_N)^{Z_−} defined by X_t := U^F(Z)_t is covariance stationary, as well as the joint processes (T_{−τ}(X), Z) and (X, T_{−τ}(Z)), for any τ ∈ Z_−. We define the τ-lag memory capacity MC_τ (respectively, forecasting capacity FC_τ) of F with respect to Z as:

MC_τ := 1 − min_{W∈R^N, a∈R} E[(Z_{t+τ} − (W^⊤X_t + a))²] / Var(Z_t), (3.1)
FC_τ := 1 − min_{W∈R^N, a∈R} E[(Z_t − (W^⊤X_{t+τ} + a))²] / Var(Z_t). (3.2)

The total memory capacity MC (respectively, total forecasting capacity FC) of F with respect to Z is defined as:

MC := Σ_{τ∈Z_−} MC_τ, FC := Σ_{τ∈Z_−} FC_τ. (3.3)

Note that, by Corollary 2.4, the conditions of this definition are met when, for instance, U^F is continuous with respect to the product topologies and the input process Z is stationary. The proof of the following result is straightforward.
Lemma 3.2 In the conditions of Definition 3.1, if the covariance matrix Γ_X := Cov(X_t, X_t) is non-singular then, for any τ ∈ Z_−:

MC_τ = Cov(X_t, Z_{t+τ})^⊤ Γ_X^{−1} Cov(X_t, Z_{t+τ}) / Var(Z_t),
FC_τ = Cov(X_{t+τ}, Z_t)^⊤ Γ_X^{−1} Cov(X_{t+τ}, Z_t) / Var(Z_t). (3.4)

The next result shows that the memory capacity of a state-space system with linear readouts is invariant with respect to linear system morphisms. This result will be used later on in Section 4 when we study the memory and forecasting capacities of linear systems. The proof uses the following elementary definition and linear algebraic fact: let (V, ⟨·,·⟩_V) and (W, ⟨·,·⟩_W) be two inner product spaces and let f : V −→ W be a linear map between them. The dual map f* : W −→ V is determined by the relation ⟨f(v), w⟩_W = ⟨v, f*(w)⟩_V, for all v ∈ V, w ∈ W. If the map f is injective, then the inner product ⟨·,·⟩_W in W induces an inner product in V via

⟨v_1, v_2⟩_V := ⟨f(v_1), f(v_2)⟩_W, v_1, v_2 ∈ V. (3.5)

It is easy to see that in that case the dual map f*, with respect to the inner product (3.5) in V and ⟨·,·⟩_W in W, satisfies

f* ∘ f = id_V, (3.6)

and, in particular, f* : W −→ V is surjective.
Lemma 3.3 Let Z be a variance-stationary input and let F 2 : D N2 × D −→ D N2 be a state map that satisfies the conditions of Definition 3.1. Let F 1 : D N1 × D −→ D N1 be another state map and let f : R N1 −→ R N2 be an injective linear system equivariant map between F 1 and F 2 . Then, the memory and forecasting capacities of F 1 with respect to Z are well-defined and coincide with those of F 2 with respect to Z.
Proof. First of all, note that the echo state property hypothesis on F_2 and part (ii) of Proposition 2.3 imply that the filter U^{F_1} is well-defined and, moreover, for any z ∈ (D_d)^{Z_−} and t ∈ Z_−,

f(U^{F_1}(z)_t) = U^{F_2}(z)_t. (3.7)

Let ⟨·,·⟩_{R^{N_2}} be the Euclidean inner product in R^{N_2}, let ⟨·,·⟩_{R^{N_1}} be the inner product induced in R^{N_1} by ⟨·,·⟩_{R^{N_2}} and the injective map f using (3.5), and let f* be the corresponding dual map. It is easy to see that the equality (3.7) can be rewritten using f* as

U^{F_1}(z)_t = f*(U^{F_2}(z)_t). (3.8)

Let now τ ∈ Z_− and let MC_τ be the τ-lag memory capacity of F_2 with respect to Z. By definition and (3.8), the optimal linear recovery of Z_{t+τ} out of the states of F_2 coincides with the one out of the states of F_1, and hence MC_τ coincides with the τ-lag memory capacity of F_1. Notice that in the last step we used the surjectivity of f*, which is a consequence of the injectivity of f (see (3.6)). A similar statement can be written for the forecasting capacities.
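Definition 3.1 can be probed numerically. In the sketch below (the linear state map, all sizes, and the truncation of the total capacity at 60 lags are our own illustrative choices), MC_τ is estimated as the coefficient of determination of the affine regression of the input at lag τ on the current state; for i.i.d. inputs the estimated total memory capacity comes out close to the number of neurons N.

```python
import numpy as np

def r_squared(X, y):
    # Coefficient of determination of the affine regression of y on the
    # columns of X: the building block of the capacities in Definition 3.1.
    Xa = np.column_stack([X, np.ones(len(X))])
    beta, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    return 1.0 - (y - Xa @ beta).var() / y.var()

rng = np.random.default_rng(3)
N, T = 6, 50_000
A = rng.standard_normal((N, N))
A *= 0.9 / np.linalg.norm(A, 2)        # sigma_max(A) = 0.9 < 1
C = rng.standard_normal(N)

z = rng.uniform(-1.0, 1.0, size=T)     # i.i.d. (white noise) input
X = np.zeros((T, N)); x = np.zeros(N)
for t in range(T):
    x = A @ x + C * z[t]               # linear state map of Section 4
    X[t] = x

# Estimated tau-lag memory capacities MC_tau for tau = 0, -1, ..., -59;
# np.roll(z, k)[t] = z[t-k], and the burn-in discards the wrapped entries.
burn = 100
mc = [r_squared(X[burn:], np.roll(z, k)[burn:]) for k in range(60)]
print(round(sum(mc), 2))               # close to N = 6 for white noise
```

Each summand lies in [0, 1], in line with part (i) of the theorem that follows, and the truncated total is close to N, in line with Corollary 4.2.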
The next result is the main contribution of this paper and generalizes the bounds formulated in [Jaeg 02] for the total memory capacity of an echo state network in the presence of independent inputs to general state systems with second-order stationary inputs.
Theorem 3.4 Suppose that we are in the conditions of Definition 3.1 and that the covariance matrix Γ_X := Cov(X_t, X_t) is non-singular. Then,

(i) If the input process Z is mean-stationary then, for any τ ∈ Z_−: 0 ≤ MC_τ ≤ 1 and 0 ≤ FC_τ ≤ 1. (3.9)

(ii) Suppose that, additionally, the input process Z is second-order stationary with mean zero and autocovariance function γ : Z −→ R, and that for any L ∈ Z_− the symmetric matrices H^L ∈ M_{−L+1} determined by H^L_{ij} := γ(|i − j|) are invertible. Then, if we use the symbol C to denote both MC and FC, we have:

C ≤ (N/γ(0)) sup_{L∈Z_−} ρ(H^L) ≤ (N/γ(0)) (γ(0) + 2 Σ_{j=1}^{∞} |γ(j)|). (3.10)

(iii) In the same conditions as in part (ii), suppose that, additionally, the autocovariance function γ is absolutely summable, that is, Σ_{j=−∞}^{∞} |γ(j)| < +∞. In that case, the spectral density f : [−π, π] → R of Z is well-defined and given by

f(λ) = (1/2π) Σ_{j=−∞}^{∞} γ(j) e^{−ijλ}, (3.11)

and we have that

C ≤ (2πN/γ(0)) max_{λ∈[−π,π]} f(λ). (3.12)

Proof.
First of all, since by hypothesis Γ_X is non-singular, Proposition 2.5 and the second part of Proposition 2.3 allow us to replace the system (2.1)-(2.2) in the definitions (3.1) and (3.2) by its standardized counterpart, whose states X̃_t are such that E[X̃_t] = 0 and Cov(X̃_t, X̃_t) = I_N. More specifically, if we denote σ² := Var(Z_t) = γ(0), then, using Lemma 3.2 and the fact that Γ_X̃ = I_N, we can write

MC_τ = (1/σ²) Σ_{i=1}^{N} E[X̃^i_t Z_{t+τ}]². (3.13)

Analogously, it is easy to show that

FC_τ = (1/σ²) Σ_{i=1}^{N} E[X̃^i_{t+τ} Z_t]². (3.14)

If we now define Z̃_t := Z_t/σ ∈ L²(Ω, R), for any t ∈ Z_−, it is clear that ‖Z̃_t‖²_{L²} = E[Z̃²_t] = 1. Moreover, the relation Cov(X̃_t, X̃_t) = I_N implies that the components X̃^i_t ∈ L²(Ω, R), i ∈ {1, . . . , N}, form an orthonormal set, that is, ⟨X̃^i_t, X̃^j_t⟩_{L²} = δ_{ij}, where δ_{ij} stands for Kronecker's delta. We now prove the different parts of the theorem.
(i) The fact that MC_τ, FC_τ ≥ 0 is obvious from (3.13) and (3.14). Let now S_t := span{X̃^1_t, . . . , X̃^N_t} ⊂ L²(Ω, R) and let P_{S_t} : L²(Ω, R) −→ S_t be the corresponding orthogonal projection. Then,

MC_τ = Σ_{i=1}^{N} ⟨X̃^i_t, Z̃_{t+τ}⟩²_{L²} = ‖P_{S_t}(Z̃_{t+τ})‖²_{L²} ≤ ‖Z̃_{t+τ}‖²_{L²} = 1.

The inequality FC_τ ≤ 1 can be established analogously by considering the projection onto S_{t+τ} of the vector Z̃_t.
(ii) Define first, for any L ∈ Z_−, the vector Z^L := (Z_0, Z_{−1}, . . . , Z_L), and (3.15). In these equalities we used (3.13) and (3.14) as well as the stationarity hypothesis. Now, using the invertibility hypothesis on the matrices H^L, we define the random vector Z̃^L := (H^L)^{−1/2} Z^L, whose components form an orthonormal set in L²(Ω, R). Indeed, for any i, j ∈ {1, . . . , −L + 1},

⟨Z̃^L_i, Z̃^L_j⟩_{L²} = ((H^L)^{−1/2} H^L (H^L)^{−1/2})_{ij} = δ_{ij}, (3.16)

where we used that E[Z^L_k Z^L_l] = γ(|k − l|) = H^L_{kl}. Let S_L := span{Z̃^L_1, . . . , Z̃^L_{−L+1}} ⊂ L²(Ω, R) and let P_{S_L} : L²(Ω, R) −→ S_L be the corresponding orthogonal projection. By (3.15) we have that (3.17). Analogously, (3.18). Now, by (3.16) and (3.17) we can write the corresponding bound for the partial sums of the memory capacities; an identical inequality can be shown for the forecasting part using (3.18), which proves the first inequality in (3.10). The second inequality can be obtained by bounding ρ(H^L) using Geršgorin's Disks Theorem (see [Horn 13, Theorem 6.1.1 and Corollary 6.1.5]). Indeed, due to this result,

ρ(H^L) ≤ max_i Σ_{j=1}^{−L+1} |γ(|i − j|)| ≤ γ(0) + 2 Σ_{j=1}^{∞} |γ(j)|.

This result and its proof are used in the next corollary to recover the memory capacity bounds proposed in [Jaeg 02] when using independent inputs and to show that, in that case, the forecasting capacity is zero.

Corollary 3.5 In the conditions of Theorem 3.4, if the input process Z is made of independent random variables, then MC ≤ N and FC = 0.

Proof. The first inequality is a straightforward consequence of (3.10) and the fact that for independent inputs γ(h) = 0, for all h ≠ 0. The second one can be easily obtained from (3.14). Indeed, by the causality and the time-invariance [Grig 18, Proposition 2.1] of any filter induced by a state-space system of the type (2.1)-(2.2), for any τ ≤ −1, the random variables X̃^i_{t+τ} and Z_t in (3.14) are independent, and hence FC_τ = 0.
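The invertibility hypothesis on the Toeplitz matrices H^L in part (ii) of Theorem 3.4 is easy to verify in concrete cases. The sketch below (an AR(1) autocovariance with parameter φ = 0.6, our own choice) checks that the matrices with entries γ(|i − j|) are positive definite, and hence invertible, for several truncation sizes.

```python
import numpy as np

phi = 0.6
gamma = lambda h: phi ** abs(h) / (1.0 - phi ** 2)   # AR(1) autocovariance

for L in (1, 5, 20):
    # Symmetric Toeplitz matrix H^L with entries gamma(|i - j|).
    H = np.array([[gamma(i - j) for j in range(L)] for i in range(L)])
    # Positive definiteness gives the invertibility required in the theorem.
    print(L, np.linalg.eigvalsh(H).min() > 0.0)
```

The smallest eigenvalue also stays bounded away from zero as L grows, consistent with the spectral density of this process being strictly positive.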

The memory and forecasting capacities of linear systems
When the state equation (2.1) is linear and has the echo state property, both the memory and forecasting capacities in (3.3) can be explicitly written down in terms of the equation parameters, provided that the invertibility hypothesis on the covariance matrix of the states holds. This case has been studied for independent inputs and randomly generated linear systems in [Jaeg 02, Coui 16] and, more recently, in [Marz 17] for more general correlated inputs and diagonalizable linear systems. It is in that paper that it was pointed out for the first time how different the linear systems that maximize forecasting and memory capacities may be. We split this section in two parts. In the first one we handle what we call the regular case, in which we assume the invertibility of the covariance matrix of the states. In the second one we see how, using Lemma 3.3, we can reduce the more general singular case to the regular one. This approach allows us to prove that, when the inputs are independent, the memory capacity of a linear system coincides with the rank of its associated controllability or reachability matrix. This is a generalization of the fact, already established in [Jaeg 02], that when the rank of the controllability matrix is maximal, then the linear system has maximal capacity, that is, its capacity coincides with the dimensionality of its state space. Different configurations that maximize the rank of the controllability matrix have been recently studied in [Acei 17, Verz 20].

The regular case
The capacity formulas that we state in the next result only require the stationarity of the input and the invertibility of the covariance matrix of the states. In this case we consider fully infinite inputs, that is, we shall be working with stationary input processes Z : Ω −→ R^Z that have second-order moments and an autocovariance function γ : Z −→ R, for which we shall construct the semi-infinite (respectively, doubly infinite) symmetric Toeplitz matrix H_{ij} := γ(|i − j|), i, j ∈ N_+ (respectively, H_{ij} := γ(|i − j|), i, j ∈ Z).

Proposition 4.1 Consider the linear state system determined by the linear state map F : R^N × R −→ R^N given by

F(x, z) := Ax + Cz, with A ∈ M_N such that σ_max(A) < 1 and C ∈ R^N. (4.1)

Consider a zero-mean stationary input process Z : Ω −→ D^Z, with D ⊂ R compact, that has second-order moments and an absolutely summable autocovariance function γ : Z −→ R. Suppose also that the associated spectral density f satisfies f(λ) ≥ 0, λ ∈ [−π, π], with f(λ) = 0 in at most a countable number of points. Then F has the echo state property and the associated filter U_{A,C} is such that its output X := U_{A,C}(Z), as well as the joint processes (T_{−τ}(X), Z) and (X, T_{−τ}(Z)), for any τ ∈ Z_−, are stationary. Suppose that the covariance matrix Γ_X := Cov(X_t, X_t) is non-singular. Then: These vectors form an orthonormal set in ℓ²_+(R) and the memory capacity MC can be written as (4.3). These vectors form an orthonormal set in ℓ²(R) and the forecasting capacity FC can be written as (4.5), where P_{Z_+} : ℓ²(R) −→ ℓ²(R) is the projection that sets to zero all the entries with non-positive index.
Proof. First of all, the condition σ_max(A) < 1 in (4.1) implies that the state map is a contraction on the first entry. Second, as the input process takes values on a compact set, there exists a compact subset D_N ⊂ R^N (see [Gono 19, Remark 2]) such that the restriction F : D_N × D −→ D_N satisfies the hypotheses of Proposition 2.1 and of Corollary 2.4. This implies that the system associated to F has the echo state property, as well as the stationarity of the filter output X = U_{A,C}(Z) and of the joint processes in the statement. We recall that, in this case, X_t = Σ_{j=0}^{∞} A^j C Z_{t−j}, for any t ∈ Z_−, and hence, using the notation introduced in Proposition 2.5 and the invertibility hypothesis on Γ_X, the standardized states X̃_t are given by

X̃_t = Γ_X^{−1/2} Σ_{j=0}^{∞} A^j C Z_{t−j}. (4.6)

Additionally, when the autocovariance function is absolutely summable, then the spectral density f of Z defined in (3.11) belongs to the so-called Wiener class and, moreover, if the hypothesis on it in the statement is satisfied, then the two matrices H (semi-infinite) and H (doubly infinite) are invertible (see [Gray 06, Theorem 11]).
(i) Let Z := (Z_0, Z_1, . . .) and let Z̃ := H^{−1/2}Z. An argument similar to (3.16) shows that ⟨Z̃_i, Z̃_j⟩_{L²} = δ_{ij}. Moreover, (4.6) implies a relation which, using the definition in (4.2), can be rewritten componentwise. The L²-orthonormality of the components X̃^i_0 of X̃_0 implies the ℓ²-orthonormality of the vectors {B_1, . . . , B_N} ⊂ ℓ²_+(R), as can be checked directly for any i, j ∈ {1, . . . , N}. We now prove (4.3): first, taking the limit L → ∞ in (3.17), we obtain the claimed expression.

(ii) Let now Z := (. . . , Z_{−1}, Z_0, Z_1, . . .) and let Z̃ := H^{−1/2}Z. An argument similar to (3.16) shows that ⟨Z̃_i, Z̃_j⟩_{L²} = δ_{ij} for any i, j ∈ Z. Also, in this case, (4.6) implies a relation which, using the definition in (4.4), can be rewritten componentwise. As in (4.7), the L²-orthonormality of the components X̃^i_0 of X̃_0 implies the ℓ²-orthonormality of the vectors {B_1, . . . , B_N} ⊂ ℓ²(R). Finally, in order to establish (4.5), we use the stationarity hypothesis to rewrite the expression of the forecasting capacity in (3.14).

When the inputs are second-order stationary and not autocorrelated (Z is a white noise), the formulas in the previous result can be used to give, for the linear case, a more informative version of Corollary 3.5, without the need to invoke input independence.
Corollary 4.2 Suppose that we are in the hypotheses of Proposition 4.1 and that, additionally, the input process Z : Ω −→ D^Z is a white noise, that is, the autocovariance function γ satisfies γ(h) = 0 for any non-zero h ∈ Z. Then:

MC = N and FC = 0. (4.8)

Proof. The first equality in (4.8) is a straightforward consequence of (4.3) and of the fact that for white noise inputs H = γ(0)I_{ℓ²_+(R)}. The equality FC = 0 follows from the fact that H = γ(0)I_{ℓ²(R)} and that B^j_i = 0 for any i ∈ {1, . . . , N} and any j ∈ Z_+, by (4.4); then (4.5) yields FC = 0.

An important conclusion of this corollary is that linear systems with white noise inputs that have a non-singular state covariance matrix Γ_X automatically have maximal memory capacity. This makes the characterization of the invertibility of Γ_X in terms of the parameters A ∈ M_N (usually referred to as the connectivity matrix) and C ∈ R^N in (4.1) an important question. In particular, the following proposition establishes a connection between the invertibility of Γ_X and Kalman's characterization of the controllability of a linear system [Kalm 10]. The relation between this condition and maximum capacity is already mentioned in [Jaeg 02].
Proposition 4.3 Consider the linear state system introduced in (4.1). Suppose that the input process Z : Ω −→ D^Z is a white noise and that the connectivity matrix A ∈ M_N is diagonalizable. Let σ(A) = {λ_1, . . . , λ_N} be the spectrum of A and let {v_1, . . . , v_N} be a basis of eigenvectors. The following statements are equivalent: Proof. (i)⇔(ii) Using the notation introduced in the statement, notice that, since by hypothesis |||A|||_2 = σ_max(A) < 1, the spectral radius ρ(A) of A satisfies ρ(A) ≤ σ_max(A) < 1. It is easy to see that B̃ = P^⊤BP and hence the invertibility of B (and hence of Γ_X) is equivalent to the invertibility of B̃ in (4.10), which we now characterize.
In order to provide an alternative expression for B̃, recall that for any two vectors v, w ∈ R^N we can write v ⊙ w = diag(v)w, where diag(v) ∈ M_N is the diagonal matrix that has the entries of the vector v on the diagonal. Using this fact and the trace property of the Hadamard product (see [Horn 13]), and since v, w ∈ R^N are arbitrary, this equality allows us to conclude that B̃ = diag(c)D diag(c), where c ∈ R^N contains the coordinates of C in the eigenvector basis, and hence B̃ is invertible if and only if both diag(c) and D are. The regularity of diag(c) is equivalent to requiring that all the entries c_i are non-zero. Regarding D, it can be shown by induction on the matrix dimension N that

det(D) = ∏_{1≤i<j≤N} (λ_i − λ_j)² / ∏_{i,j=1}^{N} (1 − λ_iλ_j).
Consequently, D is invertible if and only if det(D) ≠ 0, which is equivalent to all the elements in the spectrum σ(A) being distinct. Here P is the invertible change-of-basis matrix in the previous point and R̃(A, C) is given by (4.12). Indeed, checking the equality entrywise for any i, j ∈ {1, . . . , N} proves (4.11). Now, using induction on the matrix dimension N, it can be shown that the invertibility of R̃(A, C) is equivalent to all the coefficients c_i and all the eigenvalues λ_i being non-zero (so that ∏_{i=1}^{N} c_iλ_i is non-zero) and all the elements in σ(A) being distinct (so that ∏_{1≤i<j≤N} (λ_i − λ_j) is non-zero). (iii)⇔(iv) The Kalman controllability condition on the vectors C, AC, . . . , A^{N−1}C forming a basis of R^N is equivalent to the invertibility of the controllability or reachability matrix R(A, C) := [C | AC | · · · | A^{N−1}C] (see [Sont 98] for this terminology). Following the same strategy that we used to prove (4.11), it is easy to see that R(A, C) = P^⊤ R(A, C), where the matrix R(A, C) has the same rank as R̃(A, C), since it can be obtained from R̃(A, C) via elementary matrix operations, namely, by dividing each row i of R̃(A, C) by the corresponding eigenvalue λ_i.
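Kalman's rank condition is straightforward to test numerically. The following sketch (a diagonal connectivity matrix and the specific vectors C are our own toy choices) builds the controllability matrix R(A, C) = [C | AC | · · · | A^{N−1}C] and shows how zeroing a coordinate of C in the eigenbasis makes one mode unreachable and drops the rank.

```python
import numpy as np

def controllability_matrix(A, C):
    # Kalman controllability matrix R(A, C) = [C | AC | ... | A^{N-1} C].
    N = A.shape[0]
    cols, v = [], C.copy()
    for _ in range(N):
        cols.append(v.copy())
        v = A @ v
    return np.column_stack(cols)

# Diagonalizable A with distinct eigenvalues; C has no zero coordinate in
# the eigenbasis, so the rank is maximal and, by Proposition 4.3, the state
# covariance of the white-noise-driven system is invertible.
A = np.diag([0.5, -0.3, 0.8])
C = np.array([1.0, 1.0, 1.0])
print(np.linalg.matrix_rank(controllability_matrix(A, C)))   # 3

# Zeroing one coordinate of C makes that eigenmode unreachable.
C2 = np.array([1.0, 0.0, 1.0])
print(np.linalg.matrix_rank(controllability_matrix(A, C2)))  # 2
```

For a diagonal A, R(A, C) is a Vandermonde-type matrix scaled by the entries of C, which makes the role of distinct eigenvalues and non-zero coordinates transparent.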

The singular case
In the following paragraphs we study the situation in which the covariance matrix Γ_X of the states in the presence of white noise inputs is not invertible. The results formulated in the paper so far, in particular the capacity formulas in (3.4), are no longer valid in this case. The strategy that we adopt to tackle this problem uses Lemma 3.3. More specifically, we show that whenever we are given a linear system whose covariance matrix Γ_X is not invertible, there exists another linear system, defined on a lower-dimensional state space, that generates the same filter and hence has the same capacities but, unlike the original system, has an invertible state covariance matrix. This feature allows us to use some of the results in the previous sections for this smaller system and, in particular, to compute its memory capacity in the presence of independent inputs which, as we establish in the next result, coincides with the rank of the controllability matrix R(A, C).
Theorem 4.4 Consider the linear system F (x, z) := Ax + Cz introduced in (4.1) and suppose that the input process Z : Ω −→ D Z is a white noise.
(i) Let R(A, C) be the controllability matrix of the linear system. Then:

ker Γ_X = ker R(A, C)^⊤. (4.13)

(ii) Let X := span{C, AC, . . . , A^{N−1}C} and let r := dim(X) = rank R(A, C). Let V ⊂ R^N be such that R^N = X ⊕ V and let i_X : X −→ R^N and π_X : R^N −→ X be the injection and the projection associated to this splitting, respectively. Then, the linear system F̃ : X × D −→ X defined by F̃(x, z) := Ãx + C̃z and determined by

Ã := π_X A i_X and C̃ := π_X(C), (4.14)

is well-defined and has the echo state property.
(iii) The inclusion i_X : X −→ R^N is an injective linear system equivariant map between F̃ and F.
(iv) Let X̃ : Ω −→ X^{Z_−} be the output of the filter determined by the state system F̃. Then rank R̃(Ã, C̃) = r and the covariance matrix Γ_X̃ is invertible.

(v) The memory and forecasting capacities of F with respect to Z are MC = rank R(A, C) and FC = 0.

Proof. (i) Let v ∈ ker R(A, C)^⊤, that is, C^⊤(A^j)^⊤v = 0 for all j ∈ {0, . . . , N − 1}. By the Cayley–Hamilton Theorem, each power A^j with j ≥ N can be written as A^j = Σ_{i=0}^{N−1} β^j_i A^i which, together with (4.17), implies that the equality C^⊤(A^j)^⊤v = 0 holds for all j ∈ N. Recall now that, by Proposition 4.3 part (i), Γ_X = γ(0) Σ_{j=0}^{∞} A^j C C^⊤ (A^j)^⊤, and hence we can conclude that Γ_X(v) = γ(0) Σ_{j=0}^{∞} A^j C C^⊤ (A^j)^⊤ v = 0, that is, v ∈ ker Γ_X. Conversely, if v ∈ ker Γ_X, we have that 0 = ⟨v, Γ_X(v)⟩ = γ(0) Σ_{j=0}^{∞} (C^⊤(A^j)^⊤v)², which implies that C^⊤(A^j)^⊤v = 0, necessarily, for any j ∈ N and hence v ∈ ker R(A, C)^⊤.
(ii) The system associated to F̃ has the echo state property because, for any x ∈ X, Ãx = π_X(A i_X(x)), which implies that |||Ã|||_2 ≤ |||A|||_2 = σ_max(A) < 1.
(iii) We first show that for any x ∈ X and z ∈ D we have that A i_X(x) + Cz ∈ X. Indeed, as x ∈ X, there exist constants {α_0, . . . , α_{N−1}} such that x = Σ_{i=0}^{N−1} α_i A^iC and hence

A i_X(x) + Cz = Σ_{i=0}^{N−2} α_i A^{i+1}C + α_{N−1} Σ_{i=0}^{N−1} β^N_i A^iC + Cz ∈ X, (4.18)

where the constants β^N_0, . . . , β^N_{N−1} satisfy A^N = Σ_{i=0}^{N−1} β^N_i A^i and, as above, are a byproduct of the Cayley–Hamilton Theorem [Horn 13, Theorem 2.4.3.2]. We now show that i_X is a system equivariant map between F̃ and F. For any x ∈ X and z ∈ D,

i_X(F̃(x, z)) = i_X π_X(A i_X(x) + Cz) = A i_X(x) + Cz = F(i_X(x), z),

where the second equality holds because i_X ∘ π_X|_X = I_X and, by (4.18), A i_X(x) + Cz ∈ X.
(iv) We prove the identity rank R̃(Ã, C̃) = r by showing that i_X(span{C̃, ÃC̃, . . . , Ã^{r−1}C̃}) = X.
First, using that i_X ∘ π_X|_X = I_X and that, by the Cayley–Hamilton Theorem, A^jC ∈ X for all j ∈ N, it is easy to conclude that i_X(Ã^jC̃) = A^jC for any j ∈ N, which shows that i_X(span{C̃, ÃC̃, . . . , Ã^{r−1}C̃}) ⊂ X. Conversely, let v = Σ_{i=1}^{N} α_i A^{i−1}C ∈ X. Using again that i_X ∘ π_X|_X = I_X, we can write v as a linear combination of the vectors i_X(Ã^jC̃), which clearly belongs to i_X(span{C̃, ÃC̃, . . . , Ã^{r−1}C̃}). The constants β^j_k involved are obtained, again, by using the Cayley–Hamilton Theorem. Finally, the invertibility of Γ_X̃ is a consequence of Proposition 4.3.
(v) The statement in part (iii) that we just proved and Lemma 3.3 imply that the memory MC and forecasting FC capacities of F with respect to Z coincide with those of F̃ with respect to Z. Now, the statement in part (iv), Proposition 4.3, and Corollary 4.2 imply that those capacities coincide with r and 0, respectively. Finally, as r = rank R(A, C), the claim follows.
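Theorem 4.4 can be illustrated numerically even in the singular case. In the sketch below (the diagonal system, the sample size, and the lag truncation are our own choices), the third eigenmode is unreachable, so Γ_X is singular, and the estimated total memory capacity for white noise inputs matches the rank of R(A, C).

```python
import numpy as np

def r_squared(X, y):
    # R^2 of the affine regression of y on the columns of X; lstsq returns
    # a minimum-norm solution, so singular state covariances are handled.
    Xa = np.column_stack([X, np.ones(len(X))])
    beta, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    return 1.0 - (y - Xa @ beta).var() / y.var()

rng = np.random.default_rng(4)
A = np.diag([0.6, -0.4, 0.3])
C = np.array([1.0, 1.0, 0.0])      # third eigenmode is unreachable

N, T = 3, 100_000
z = rng.standard_normal(T)          # white noise input
X = np.zeros((T, N)); x = np.zeros(N)
for t in range(T):
    x = A @ x + C * z[t]
    X[t] = x                        # note: X[:, 2] stays identically zero

burn = 100
mc_total = sum(r_squared(X[burn:], np.roll(z, k)[burn:]) for k in range(60))
R = np.column_stack([np.linalg.matrix_power(A, j) @ C for j in range(N)])
print(round(mc_total, 2), np.linalg.matrix_rank(R))   # MC close to rank = 2
```

The reduced system of Theorem 4.4(ii) would simply discard the identically zero third state component here, which is why the capacity estimate already agrees with the rank.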

Conclusions
In this paper we have studied memory and forecasting capacities of generic nonlinear recurrent networks with respect to arbitrary stationary inputs that are not necessarily independent. In particular, we have stated upper bounds for those capacities in terms of the dimensionality of the network and the autocovariance of the inputs that generalize those formulated in [Jaeg 02] for independent inputs.
The approach followed in the paper is particularly advantageous for linear networks, for which explicit expressions can be formulated for the memory and the forecasting capacities. In the classical linear case with independent inputs, we have proved that the memory capacity of a linear recurrent network is given by the rank of its controllability matrix. This explicit and readily computable characterization of the memory capacity of those networks generalizes a well-known relation between maximal capacity and Kalman's controllability condition, formulated for the first time in [Jaeg 02]. This is, to our knowledge, the first rigorous proof of the relation between network memory and the rank of the controllability matrix, which has long been part of the reservoir computing folklore.
The results in this paper suggest links between controllability and memory capacity for nonlinear recurrent systems that will be explored in forthcoming works.