Global optimization of hyper-parameters in reservoir computing

: Reservoir computing has emerged as a powerful and efﬁcient machine learning tool especially in the reconstruction of many complex systems even for chaotic systems only based on the observational data. Though fruitful advances have been extensively studied, how to capture the art of hyper-parameter settings to construct efﬁcient RC is still a long-standing and urgent problem. In contrast to the local manner of many works which aim to optimize one hyper-parameter while keeping others constant, in this work, we propose a global optimization framework using simulated annealing technique to ﬁnd the optimal architecture of the randomly generated networks for a successful RC. Based on the optimized results, we further study several important properties of some hyper-parameters. Particularly, we ﬁnd that the globally optimized reservoir network has a largest singular value signiﬁcantly larger than one, which is contrary to the sufﬁcient condition reported in the literature to guarantee the echo state property. We further reveal the mechanism of this phenomenon with a simpliﬁed model and the theory of nonlinear dynamical systems.


Introduction
Recently, reservoir computing (RC) [1], also known as a generalization of echo-state network (ESN) [2] or liquid state machine (LSM) [3], has emerged as a powerful and efficient machine learning tool in reconstruction or/and prediction of many complex physical systems even for chaotic systems only based on the observational data of time series data [4][5][6]. In sharp contrast to its great efficacy, as a special variant of a recurrent neural network (RNN), RC has a surprisingly contracted architecture, where only three weight matrices are involved: the input weight matrix and a reservoir network matrix are randomly generated and fixed, leaving only one output weight matrix for training, as shown in Figure 1. As such, simple and efficient least squares optimization methods rather than the resourceconsuming back propagation algorithm are good enough for the training process [7]. Thus, a question arises naturally: how to capture the art of hyper-parameter settings for RC's networks? As a matter of fact, this is a long-standing and urgent problem and great attentions have been attracted to carry out various discussions, i.e., from the topology and distribution of the random connections [8,9] to the spectral radius and singular value of the random network [10,11]. Generally, the existing studies are mainly based on a variable control experimentation manner, i.e., optimizing one hyper-parameter while leaving all the other hyper-parameters constant. In such a way, the obtained results are mainly local with specific settings but sometimes cannot be generalized to global hyper-parameters space.
In this work, we propose a global optimization framework using simulated annealing technique to find the optimal architecture of the randomly generated networks for a successful RC. With the optimized results, we further study several important properties of some hyper-parameters, i.e., the sparsity and distribution of the networks, the spectral radius and the largest singular value of the reservoir networks. Interestingly, we find that the globally optimized reservoir network has a spectral radius near one and a largest singular value significantly larger than one, which is contrary to the sufficient condition reported in the literature to guarantee the echo state property. We further study the mechanism of this phenomenon with a simplified model and the theory of nonlinear dynamical systems.  Figure 1. Reservoir computing framework. The three layers included in the RC framework are sketched and three weight matrices W in , W res , and W out are highlighted.

Reservoir computing
For the job of nonlinear dynamics reconstruction based on time series data, a general framework of RC could be sketched in Figure 1. Here, the input data x k ∈ R n represents the observation vector of a dynamical system sampled at time step k with a specific smooth observe function such that x k = h(z k ) where z k is the state vector. The underlying dynamical system is assumed to be evolving on a compact manifold M with the evolution operator ϕ ∈ Diff 2 (M) : z k+1 = ϕ(z k ). The reservoir network is composed of m reservoir neurons whose connections are represented by the reservoir network matrix W res and the vector r k ∈ R m represents the state of reservoir neurons at time step k. The input layer weight matrix W in and the reservoir network matrix W res are, respectively, m × n and m × m random matrices generated according to certain distribution laws.
The dynamical evolution of the reservoir neurons is governed by: where α is the leakage factor, and tanh ∈ C 2 (R, (−1, 1)) is a sigmoid function. The reservoir vector and the output vector y k ∈ R l is connected by the output weight matrix W out ∈ R l×m such that y k = W out r k . Given the time series, denoted by x k , k = 1, · · · , N, as training data, the target is to train the output weight matrix W out so as to approximate the one-step dynamics prediction, i.e., y k ≈ x k+1 . To this, a loss function is designed as where β > 0, the L 2 -regularization coefficient, is introduced to make optimization robust. The output weight matrix W out is thus generally obtained by minimizing the loss function (2.2) over the training data set. After training, one can fix the output weight matrix W out and redirect the output y k = W out r k as an approximation of x k+1 into the input layer of the network and thus the autonomous dynamics for x k with k > N could be generated.

Key hyper-parameters
The RC framework is distinguished from usual RNNs due to the fact that the input weight matrix W in and the reservoir network matrix W res are randomly generated rather than being trained. Therefore several properties e.g., the sparsity, the distribution, the spectral radius, and the largest singular value of the randomly generated weight matrices will undoubtedly affect the performance and careful choice a priori is required, i.e., they are hyper-parameters in a RC framework. Among all the hyper-parameters, the spectral radius ρ, defined as the largest absolute eigenvalue, of W res , is generally believed as a key to the success of reservoir computing. The seminal works [1] and [10] conclude that spectral radius is related to the echo state property(ESP), a necessary condition for a RC to work properly, and the memory capacity, a capacity evaluating time series reconstruction ability, and therefore spectral radius is suggested to be less than 1. However, several other works [8,11,12] also imply that the optimal spectral radius varies case from case and spectral radius larger than 1 sometimes shows best performance. To further theoretically study the echo state property, the largest singular value σ is introduced in [2] and σ(W res ) < 1 is adopted as a sufficient condition to ensure ESN. Besides ρ and σ, the sparsity, topology, and distribution of the randomly generated matrices are also usually considered as key hyper-parameters of a RC. It is suggested in the seminal work [7] that in order to generate rich variety of dynamics, the reservoir network should be sparse and recent works also confirms that low connectivity is beneficial for forecasting chaotic systems [9]. As for the topology and distribution of the randomly generated matrices, besides the commonly used Erdös-Rényi random network, both smallworld and scale free networks have been discussed in [13], and analogously, other than the commonly adopted uniform distribution, both Gaussian and even Bernoulli distribution have been studied in [14]. Though many efforts have been made to understand and reveal the optimal choice of these hyperparameters, the existing studies are mainly reported in a variable control experimentation manner, i.e., optimizing one hyper-parameter while leaving all the other hyper-parameters constant. In such a way, the obtained optimal results may be local with specific settings. As a matter of fact, in [8,13], it is reported that different topologies of reservoir networks may yield significantly different conclusions for optimal spectral radiuses. Therefore, a global optimization study is urgent and necessary.

Global optimization using simulated annealing
Simulated annealing (SA) is a probabilistic technique for approximating global optimization in a large search space for a given function L. The basic idea of SA is that, in each iteration step, the SA heuristically considers some neighboring state s * of the current state s and probabilistically decides whether moving the system to the new state s * or staying in the current state s. Such probability is decided by both the improvement L(s * ) − L(s) and a decaying temperature T , and typically the iterated step is repeated until the system reaches a state good enough for the application.
In this work, we take the chaotic Lorenz system as a benchmark: where a = 10, b = 28, and c = 8/3. We consider the job of dynamics reconstruction using RC, i.e., after training period, the autonomous dynamics x t = [x(t), y(t), z(t)] T is generated by the trained RC and the reconstruction is evaluated by the forecasting horizon (the length of precise predictions with error less than a threshold). To find the global optimization for the key hyper-parameters of RC, we take all the weights in the two matrices W in and W res as state variables s = [W in , W res ] in the SA framework and design the target function as L = h − α W res 1 where h is the forecasting horizon expressed in Lyapunov times and α > 0 is a regularization coefficient to make W res sparse as generally required for RC [7,9]. Thus the SA algorithm adopted in this work could be summarized in Algorithm 1 as follow. Repeat the iteration until k k MAX or forecasting horizon h keeps unchanged for R MAX iterations: 1) Generate a random neighboring state s * based on the current s, train the RC and evaluate the target function L * = L(s * ). 2) IF L * > L, P t = 1. Random(0, 1), accept the new state, i.e., let s = s * and L = L * .
Here k stands for the kth iteration, k MAX is the largest number of iterations.
In this work, we set the size of W in and W res as 100 × 3 and 100 × 100 respectively and thus there are in total 100, 300 parameters in the SA state s. To accelerate the optimization, inspired by the drop out strategy in deep learning, we only update a set of randomly picked 1000 parameters in each iteration and update the decaying temperature T after each 100 iterations, i.e., T (k) = 1 * 0.95 k|100 .
In order to avoid unbounded states, inspired by the work [15], we also restrict the update of states in the following way where the upper bound and lower bound are ub = 0.8, lb = −0.8 and β is a uniformly random number between 0 and 1.
A typical process of SA and the optimized results are illustrated in Figure 2, and in order to guarantee convergence, we set k MAX = 20, 000 and R MAX = 5000 empirically.

Results
To make the results robust, we carry out the SA global optimization algorithm 50 independent times to find the optimal hyper-parameters for the RC and obtain 50 sets of optimized W in and W res respectively. Based on such a result we discuss the choice of several key hyper-parameters, i.e., the topology of the reservoir matrix, the distribution of the weight, the spectral radius ρ, and the largest singular value σ.

Topology and weight distribution
Before we discuss the topology of the optimized random network, we note that though the L1 regulation has been introduced into the SA algorithm, there are still several very small non-zeros weights which will contaminate the topology analysis. Therefore, we run the sparsity test as illustrated in Figure 3(a) and set 10 −3.5 as the threshold for non-zero weight in the random network. The degree distributions of such pre-trimmed W res are shown in Figure 3(b), where the majority shows a symmetric style, and this character implies that the optimized networks W res in this job are generally Erdös-Rényi random networks since the degree distribution of an ER random network obeys Poisson distribution while Poisson distribution with high mean values appears in a symmetric style [16].
Additionally, we also carry out a test to study whether there is any small-world property in the optimized W res . In Figure 4(a), we show the clustering coefficients for the 50 independent optimized W res as well as the ones for the corresponding random networks. Here the corresponding random network refers to the random network that has the same numbers of nodes and same number of edges per node as this network. It is clear that the clustering coefficients of W res are not significantly larger than the ones for the corresponding random networks, implying that the optimized W res does not have small-world property.  When generating random matrices, the weight distribution is another essential hyper-parameter. To test whether the weight distribution obeys normal distribution for the optimized networks, we use the Quantile-Quantile Plots (QQ Plots) as a test. To be concrete, we first linearly scale the weight distribution into zero mean and one deviation, and then plot the quantile against the standard normal distribution. If more than 80% points falls in the confident interval, i.e., ± 0.3 of the fitted line, we regard the weight distribution as normal, and two typical QQ Plots are illustrated in Figure 3(c),(d).
In order to avoid the affection of initial conditions and the random neighborhood used in the SA optimization algorithm, we have tried both normal distribution and uniform distribution to generate initial states as well as the random neighborhood, and we come to the conclusion that the weights of all the optimized W res obey normal distribution while the weights of W in highly depends on the choice of initial state distribution, as shown in Table 1. Actually, this observation coincides with the existing results where many works prefer normal distribution for W res while both normal distribution and uniform distribution are adopted for W in [5,9,12,[17][18][19].  Based on the 50 independent runs for optimized W res , the respective values for spectral radius ρ and the largest singular value σ are illustrated in Figure 4(b). It is clear that the optimized values of ρ are all less than one and clustered around the mean value 0.87, which coincides with the conclusion suggested by the seminal works [1, 10] that a spectral radius less than one but close to one is a good choice for both echo state property and the memory capacity.
While for the largest singular values, we find that the optimized values of σ are all larger than one with a mean value as large as 1.71. Since the theoretical analysis in [2] shows σ < 1 is a rigorous sufficient condition to ensure echo state property, we further check that whether the echo state property holds for the RC using our optimized σ which is larger than one. The echo state property means that the effect of initial conditions should vanish as time passes, i.e., starting from different initial values, the dynamics of internal neurons should synchronize with each other rapidly [10]. In Figure 4(c), a typical result is shown that no matter what the neurons' initial values are, their dynamics quickly synchronize with each other under the same input, i.e., the echo state property still holds. In order to study the mechanism of this phenomenon, we first carry out a simple analysis analogous to the work in [2]. Starting with arbitrary two different initial conditions r 1 0 and r 2 0 , consider two trajectories r 1 k and r 2 k of RC neurons generated by Eq (2.1) under the same input x k . The evolution of difference between two trajectories ∆r k = r 1 k − r 2 k could be estimated as ∆r k = (1 − α)∆r k−1 + α tanh(W res r 1 k−1 + W in x k ) − tanh(W res r 2 k−1 + W in x k ) (1 − α) ∆r k−1 + α W res ∆r k−1 (1 − α + ασ) ∆r k−1 , thus as long as σ < 1 we have ∆r k β ∆r k−1 where β = 1 − (1 − σ)α < 1, and consequently ∆r k tends to zero exponentially, i.e., r 1 k and r 2 k synchronizes with each other quickly. However when σ > 1, the above estimation no longer holds. To further reveal the mechanism why echo state property still generally holds when σ > 1, we consider a simple model as an illustration. We consider the 1-D situation, i.e., there is only one neuron in the reservoir, and let the leaky coefficient α = 1 which is also adopted in many works [5]. Then the dynamics of the single neuron could be described by The map r k+1 = tanh(σr k ) has three fixed points when σ > 1, as illustrated in Figure 5, where c 0 = 0 is repelling while c + > 0 and c − < 0 are attractive. It is clear from Figure 5 that (0, +∞) is the attractive basin for c + while (−∞, 0) is the attractive basin for c − . Thus, as long as the initial values r 1 0 and r 2 0 fall in the same attractive basin, two trajectories r 1 k and r 2 k under map r k+1 = tanh(σr k ) will soon converge to the same fixed point c + or c − . Note the fact W in x k 1 and it could be noted as a perturbation γ k . Thus the two trajectories r 1 k and r 2 k under map r k+1 = tanh(σr k + γ k ) will generically oscillate around the same fixed point c + or c − . Further note that the derivative of the map around c + or c − is less than one, i.e., the map is contractive around c + or c − as highlighted in the pink region of Figure 5, thus the two trajectories will synchronize with each other soon, and therefore the echo state property still holds.

Discussion and conclusions
In this paper, we carry out an optimization schema using simulated annealing (SA) technique to find the globally optimized hyper-parameters for reservoir computing in the job of chaotic dynamics reconstruction. Specifically, we discuss the choice of random network topology for the reservoir network W res , the type of weight distribution for both the input layer network W in and W res , and we further study the spectral radius and largest singular value of the reservoir network W res which are closely related to the echo state property and memory capacity of the RC. Most of the globally optimized hyper-parameters coincide with the main stream of the existing results, i.e., the topology of the random network W res satisfy ER random network, the weight distributions for W res and W in are mainly normal distribution, and the spectral radius ρ is less than one but close to one. On one side, such results confirm the effectiveness of our proposed SA schema as a global optimization method, and on the other side, these results also provide a way to choose/initialize W in and W res when considering these hyper-parameters.
However, we also find the optimized largest singular value σ is significantly larger than one, which does not satisfy the theoretical sufficient condition to ensure the echo state property. To study this phenomenon, we further confirm that the echo state property still holds even with σ > 1 and RC works well. With a simple illustrative model, we reveal the mechanism of RC with σ > 1 using the theory of nonlinear dynamical systems. Indeed, such analysis is simple and heuristic, it is a challenging work to consider more complicated situations, e.g., if the initial values r 1 0 and r 2 0 fall in different attractive basins, or even high-dimensional cases with saddles. Some tools in nonlinear dynamical systems such as noise induced synchronization may be introduced to understand the mechanism and some recently developed methods [20,21] may be helpful for further applications. However this is beyond the topic in this paper and the quantitative analysis for these problems is still open and forms the direction for the future works.