Nonlinear Dynamical System Modeling Via Recurrent Neural Networks and A Weighted State Space Search Algorithm

Given the task of tracking a trajectory, a recurrent neural network may be regarded as a black-box nonlinear regression model for tracking unknown dynamical systems. An error function measuring the difference between the system outputs and the desired trajectory formulates a nonlinear least-squares problem with dynamical constraints. Under these constraints, classical gradient-type methods are difficult and time-consuming because they require computing partial derivatives along the trajectory. We develop an alternative learning algorithm, the weighted state space search algorithm (WSSSA), which searches a neighborhood of the target trajectory in the state space instead of the parameter space. Since no partial derivatives are computed, the algorithm is simple and fast. We demonstrate the approach by modeling short-term foreign exchange rates. The empirical results show that the weighted state space search method is promising and effective for solving least-squares problems with dynamical constraints. A comparison of the numerical costs of the gradient method and the proposed method is provided.


1. Introduction. Artificial neural networks are trainable analytic tools that attempt to mimic the information-processing patterns of the brain. Recurrent neural networks (RNNs) have rich dynamical structures through which they can learn extremely complex temporal patterns. RNNs have therefore been studied by many researchers over the last few decades and have become popular in both implementation and applications.
The proposed algorithm is developed for recurrent neural networks, but it can possibly be extended to other networks, for example the B-spline network, so that large-scale problems can be solved more efficiently.
The organization of this paper is as follows. In section 2, we highlight the leaky integrator RNN model and present some characteristics of the solutions and the stability properties of the RNN system. The WSSSA is presented in section 3. A comparison of the proposed method and gradient-type methods is given in section 4. In section 5, modeling short-term foreign exchange rates is used as an example to illustrate our approach. Final remarks and future research directions are given in section 6.
2. System dynamics of the leaky integrator model. Consider the continuous-time leaky integrator RNN model with n neurons, represented by the system of nonlinear equations

dx_i/dt = -a_i x_i + b_i σ( Σ_{j=1}^n w_ij x_j + θ_i ),  i = 1, 2, ..., n,  (2.1)

where x_i represents the internal state of the i-th neuron, and σ is a neuronal activation function that is bounded, differentiable, and monotonically increasing onto [-1, 1]. We further assume that σ(z) = tanh(z), the symmetric sigmoid function; θ = [θ_1, θ_2, ..., θ_n]^T is the input bias or threshold vector of the system; W = [w_ij]_{n×n} is the synaptic connection weight matrix, with w_ij the weight of the connection from neuron j to neuron i; and A = diag[a_1, a_2, ..., a_n] and B = diag[b_1, b_2, ..., b_n] are diagonal matrices with positive diagonal entries. System (2.1) can then be rewritten in the matrix form

dx/dt = -Ax + Bσ(Wx + θ),  (2.2)

where σ is applied componentwise, i.e., σ(x) = [σ_1(x), ..., σ_n(x)]^T with σ_i(x) = σ( Σ_{j=1}^n w_ij x_j + θ_i ), i = 1, 2, ..., n.
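The leaky integrator dynamics (2.1)-(2.2) can be simulated with a simple forward-Euler loop. A minimal sketch (all variable names, sizes, and parameter values here are ours, chosen only for illustration) that also exhibits the boundedness discussed below:

```python
import numpy as np

def leaky_integrator_step(x, W, theta, A, B, dt):
    """One forward-Euler step of dx/dt = -A x + B sigma(W x + theta)."""
    return x + dt * (-A @ x + B @ np.tanh(W @ x + theta))

# Small illustration: 3 neurons, random weights.
rng = np.random.default_rng(0)
n = 3
W = 0.5 * rng.standard_normal((n, n))
theta = np.zeros(n)
A = np.eye(n)          # a_i = 1
B = np.eye(n)          # b_i = 1
x = rng.standard_normal(n)
for _ in range(5000):
    x = leaky_integrator_step(x, W, theta, A, B, dt=0.01)
# With theta = 0 and a_i = b_i = 1, the trajectory is drawn into |x_i| <= b_i/a_i = 1.
print(np.abs(x).max())
```

The printed value illustrates the global bound of Theorem 1: each component settles into the interval determined by b_i/a_i and θ_i.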
System (2.2) is a continuous-time model of the leaky integrator RNN (we simply refer to it as the RNN for the rest of the paper), which has been studied in many applied areas of science (see, for example, [5], [6], [7]). In the spirit of [11], we show in the following theorems some solution behaviors and the stability properties of system (2.2).
By Peano's local existence theorem for ordinary differential equations, given any x_0 ∈ R^n there exists a positive number t*(x_0) such that system (2.2) has a solution x(t; x_0) ∈ R^n for t ∈ [0, t*(x_0)), the maximal right existence interval of the solution satisfying x(0; x_0) = x_0. With this in mind, we introduce Theorem 1 as follows. Theorem 1. Given an initial state x(0) = x_0, there exists a positive number t*(x_0) such that the solution x(t) ∈ R^n of model (2.2) is unique and bounded for all t ∈ [0, t*(x_0)). In addition, the solution x(t) exists globally; in other words, the solution of system (2.2) is globally bounded.
Proof. Let us consider the linear systems of differential equations

dy/dt = -Ay + Bv_p + θ  and  dz/dt = -Az + Bv_q + θ,  (2.4)

where the matrices A and B are defined as in (2.2), v_p = (p, p, ..., p)^T and v_q = (q, q, ..., q)^T are n-dimensional constant vectors to be determined later, and θ = [θ_1, θ_2, ..., θ_n]^T is the system input. We set y(0) = z(0) = x(0) = x_0. We claim that y_i(t) ≤ x_i(t) ≤ z_i(t) for all i = 1, ..., n and all t ≥ 0. In fact, let r = x_i − y_i; then r(0) = 0 and dr/dt = −a_i r + g with g = b_i[σ( Σ_j w_ij x_j + θ_i ) − p], where we choose p = −1 so that g ≥ 0, since σ takes values in [−1, 1]. Then r(t) = e^{−a_i t} ∫_0^t g e^{a_i s} ds ≥ 0 for each i = 1, 2, ..., n, and it follows that x_i(t) ≥ y_i(t). Similarly, choosing q = 1, we can show that z_i(t) ≥ x_i(t). Hence, for any given initial state x_0, the solution x(t) of system (2.2) is bounded between the solutions of the two linear systems. Moreover, solving (2.4) explicitly, y_i(t) → (θ_i − b_i)/a_i and z_i(t) → (θ_i + b_i)/a_i as t → ∞. Thus, as time t tends to infinity, each component of the solution enters the interval [(θ_i − b_i)/a_i, (θ_i + b_i)/a_i]. This implies that system (2.2) is globally bounded.
For the uniqueness of the solution x(t) of system (2.2), define f: R^n → R^n by f(x) = −Ax + Bσ(Wx + θ), and note that σ is continuously differentiable. Suppose σ' is bounded by a positive constant k. Then, by the mean-value theorem, ||f(x) − f(y)|| ≤ (||A|| + k||B|| ||W||) ||x − y||. Therefore f is Lipschitz over R^n with Lipschitz constant ||A|| + k||B|| ||W||, and the uniqueness of the solution is guaranteed by the standard theory of ordinary differential equations. Hence, for a fixed W and an arbitrary initial state x_0, there exists a unique solution of system (2.2).
Note that if the input θ(t) is a bounded function of time, the global boundedness of the system is not affected. The equilibrium state x*, satisfying Ax* = Bσ(Wx* + θ), does however depend on the input. For a dynamical system, even if all the real parts of the eigenvalues of the Jacobian are negative for every x ∈ R^n, the system need not be globally asymptotically stable. We are interested in finding a condition under which the system is strictly globally asymptotically stable, so that the output state x(t) reaches the same steady state as t tends to ∞ for any given initial state.
One standard approach is to consider the eigenvalues of the Jacobian matrix J of the right-hand side of (2.2), whose entries are J_ij = −a_i δ_ij + b_i σ'( Σ_k w_ik x_k + θ_i ) w_ij, where δ_ij is the Kronecker delta. Hence J = −A + BΛW, where Λ is a diagonal matrix with Λ_jj = σ'( Σ_{k=1}^n w_jk x_k + θ_j ). Now let Ω be the set of solutions of (2.2). Since Ω is bounded in R^n, we assume that there exist constants c_i, d_i ∈ R such that c_i ≤ x_i ≤ d_i on Ω, (2.7) with c_i ≤ b_i/a_i for all i = 1, 2, ..., n. We shall show in Theorem 2 that Ω is a positive invariant and attractive set of the RNN model (2.2). That is, any solution trajectory of the RNN model (2.2) starting in Ω cannot escape from Ω, and any solution trajectory starting outside Ω converges to Ω. First, we rewrite (2.1) componentwise as dx_i/dt = −a_i x_i + b_i σ_i(x), noticing that all the b_i's are positive numbers. We now introduce the following definitions.
Definition. Given any ε > 0, for i = 1, 2, ..., n, we define Ω_i = {x ∈ R^n : c_i − ε ≤ x_i ≤ d_i + ε}, where the c_i's and d_i's are defined as in (2.7), and Ω^ε = ∩_{i=1}^n Ω_i. Before we proceed to prove Theorem 2, we need the following two lemmas. Lemma 1. For i = 1, 2, ..., n, Ω_i is positive invariant.
Proof. We need to show that if the initial condition x_0 ∈ R^n satisfies x_{0i} ∈ Ω_i, then the i-th component of the solution x(t; x_0) stays in Ω_i for all t ≥ 0. Suppose the contrary, namely that the first exit time t̃_i is finite; then there exists a positive number ε as in (2.11). Without loss of generality, we assume the trajectory exits through the upper boundary of Ω_i; by symmetry, the same conclusion holds for the lower boundary. From (2.15) we obtain an inequality that contradicts condition (2.16). Therefore t̃_i = ∞, and Ω_i is positive invariant.
Lemma 2. For i = 1, 2, ..., n, if the initial condition x_0 ∈ R^n satisfies x_{0i} ∉ Ω_i, then the solution trajectory x_i(t; x_0) will enter and stay in Ω_i after a finite period of time. That is, there exists a positive number t̃ such that x_i(t; x_0) ∈ Ω_i for all t ≥ t̃, which implies that Ω_i is an attractive set.
Proof. Since x_{0i} ∉ Ω_i, let t̃_i be the first time at which x_i(t; x_0) reaches the boundary of Ω_i. From equation (2.14) and the theory of differential inequalities, we obtain a bound on x_i(t; x_0); condition (2.20), together with the definition of t̃_i, then gives x_i(t; x_0) ∈ Ω_i for all t ≥ t̃_i. This means that Ω_i is an attractive set.

Proof of Theorem 2. By Lemma 1 and Lemma 2, for any positive number ε > 0, each Ω_i is a positive invariant and attractive set. Moreover, any solution x(t; x_0) with initial point x_0 ∉ Ω^ε enters Ω^ε in finite time, and by positive invariance x_i(t; x_0) stays in Ω_i thereafter. Letting ε → 0+, we have lim_{ε→0+} Ω^ε = Ω. Therefore, we can conclude that Ω is a positive invariant and attractive set of the RNN model (2.2).
Let the matrix A = diag[a_1, a_2, ..., a_n] = I_n in system (2.2), where I_n is the n × n identity matrix; then system (2.2) becomes

dx/dt = −x + Bσ(Wx + θ).  (2.23)

We cite without proof another result from [8] as Theorem 3 for the stability property of system (2.23). Theorem 3. If W is invertible and the matrix WB is semi-negative definite, then the RNN model (2.23) is globally exponentially stable. That is, there exist two positive constants p ≥ 1 and q > 0 such that ||x(t; x_0) − x*|| ≤ p ||x_0 − x*|| e^{−qt} for any x_0 ∈ R^n and t ∈ [0, ∞), where x(t; x_0) is the solution of (2.23) with initial condition x(0; x_0) = x_0 and x* is an equilibrium point of (2.23), defined by x* = Bσ(Wx* + θ). Remarks. We notice that the inequality above gives the global exponential convergence rate of the continuous-time RNN model. If instead A = aI_n for some positive constant a, the convergence rate scales with a; if a is small, the RNN model (2.1) will be more sensitive to input noise and round-off errors in numerical simulations and implementation.
3. Weighted state space search learning algorithm. We have presented several important characteristics of the solutions of system (2.1) in Section 2. It is known that the continuous-time model (2.1) (and its equivalent form (2.2)) and the discrete-time system (3.1) need not share the same dynamical behavior; however, the discretized variant (3.1) inherits the dynamics of system (2.1) when the step size is "small" (see [14]). In practice, we approximate system (2.1) by Euler's method and obtain the discrete-time RNN model

x(t + 1) = x(t) + H[−Ax(t) + Bσ(Wx(t) + θ)],  (3.1)

where H = diag[h_1, ..., h_n] is a diagonal matrix with positive entries h_i, and h_i is the step size of Euler's discretization. The two systems (3.1) and (2.2) share the same dynamical behavior as h_i tends to 0. For simplicity, we let h_i = h for all i. Given a trajectory y(t) ∈ R^n for t = 1, ..., m, we use (3.1) to approximate the target trajectory y(t) with the error function

E(W) = Σ_{t=1}^m ||x(t) − y(t)||²,

for some positive integer m. To simplify the notation and analysis further, we fix h and let A = B = I_n. We extend the trajectory dimension by one, setting y_{n+1}(t) = 1 for all t, in order to absorb the variable θ into the last column of W. Without loss of generality, we still consider an n²-dimensional constrained least-squares problem,

min_W E(W),  (3.3)

subject to the RNN dynamics x(t + 1) = x(t) + h[−x(t) + σ(Wx(t))]. As mentioned in Section 1, gradient descent, conjugate gradient, and quasi-Newton methods have been applied to such least-squares problems for many decades. In recent years, the growing demand for sophisticated derivative-free optimization methods has triggered the development of a relatively wide range of approaches (see [3]). Note that in neural networks, learning is a process of changing the network parameters W so that the system outputs x(t) approach the target trajectory y(t).
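The discrete-time rollout and the error function E(W) above can be sketched directly; a minimal illustration with A = B = I and the bias absorbed into W (function names are ours):

```python
import numpy as np

def rnn_trajectory(W, x0, h, m):
    """Roll out x(t+1) = x(t) + h[-x(t) + tanh(W x(t))] for m steps (A = B = I)."""
    X = [np.asarray(x0, float)]
    for _ in range(m - 1):
        x = X[-1]
        X.append(x + h * (-x + np.tanh(W @ x)))
    return np.array(X)          # shape (m, n): rows are x(1), ..., x(m)

def error(W, Y, h):
    """E(W) = sum_t ||x(t) - y(t)||^2, starting the rollout at x(1) = y(1)."""
    X = rnn_trajectory(W, Y[0], h, len(Y))
    return float(np.sum((X - Y) ** 2))
```

By construction, a trajectory Y generated by some W yields error(W, Y, h) = 0, the "exactly capable" case discussed next.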
For a fixed h, in the ideal case where the network is exactly capable, that is, if E(W_0) = 0 for some W_0, then the optimal solution can be obtained directly by solving the linear system

W_0 y(t) = σ^{−1}( (y(t + 1) − (1 − h) y(t)) / h ),  t = 1, ..., m − 1,  (3.5)

in the least-squares sense. In the more general case, after obtaining W_0 from (3.5), we generate the whole trajectory set X(W_0) by running the RNN dynamics forward from y(1). The basic idea of the state space search algorithm (SSSA) is that, in each iteration, we search the class of x-convex sets C_x(A^+) in the state space R^{m×n} instead of the parameter space of W, where A^+ is the set of attainable points of x in R^n. In the next iteration, instead of approximating the whole desired trajectory Y, we approximate the new trajectory set X_1 defined by

X_1 = X(W_0) + α(Y − X(W_0)),  0 < α < 1.

This new trajectory X_1 is a convex combination of the RNN system output and the desired trajectory Y. If α is close to 1, the desired trajectory Y is considered to be massaged by the RNN system output; if α is close to 0, Y is just a deterministic perturbation of the system output. Geometrically, α(Y − X(W_0)) defines the error direction. Hence, we may vary α as α_i in the i-th learning iteration. By the continuity of W, there exists some α* with 0 < α* < 1 such that the error decreases along this direction. In practice, we only need to store the best solution for each α. Notice that X_1 may not be attainable even though X(W_0) is attainable, but we may repeat the process and obtain W_1 by solving (3.5) with X_1 in place of Y. The rule (3.8) assumes a uniform weight for every t while searching for a reachable solution in the neighborhood of Y. In many circumstances we anticipate that the system output x(t) of the RNN becomes closer to y(t) as time goes by; that is, if z(t) = ||x(t) − y(t)||², then z(t) tends to 0 as t tends to m. In view of this, we define the WSSSA by changing the k-th iteration weight α_k to a time-dependent weight α_{k,t} that increases with t. In our numerical example, we use (3.9) as the weighting rule. Of course, we may take α_{k,t} = α_k(1 + 2t/m) or other compositions such as α_{k,t} = α_k f(t), where f(t) is an increasing function of t with the above consideration.
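The linear solve (3.5) for W_0 can be sketched as follows; since σ = tanh, its inverse is arctanh, and a clip keeps the inverse finite (the clipping threshold and function names are our additions for illustration):

```python
import numpy as np

def simulate(W, x0, h, m):
    """Roll out x(t+1) = x(t) + h[-x(t) + tanh(W x(t))]."""
    X = [np.asarray(x0, float)]
    for _ in range(m - 1):
        x = X[-1]
        X.append(x + h * (-x + np.tanh(W @ x)))
    return np.array(X)

def initial_weights(Y, h):
    """W0 from (3.5): if E(W0) = 0 then y(t+1) = (1-h)y(t) + h*tanh(W0 y(t)),
       so W0 y(t) = arctanh((y(t+1) - (1-h)y(t)) / h), a linear
       least-squares problem in the rows of W0."""
    target = (Y[1:] - (1 - h) * Y[:-1]) / h
    Z = np.arctanh(np.clip(target, -0.999, 0.999))   # keep arctanh finite
    # Solve Y[:-1] @ W0.T ~ Z in the least-squares sense.
    S, *_ = np.linalg.lstsq(Y[:-1], Z, rcond=None)
    return S.T
```

When the target trajectory was itself generated by the RNN dynamics (the exactly capable case), the recovered W_0 reproduces the trajectory.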
The choice of α_{k,t} is an art in the WSSSA and a topic for future research; the optimal choice of α*_{k,t} is problem dependent in nature and may be approximated by a line search algorithm in each iteration. Now, repeating the above procedure, we obtain a sequence of attainable system outputs {X_k} = {X(W_{k−1})} in the state space, defined by

X_{k+1} = X(W_k) + α_{k+1}(Y − X(W_k)),  k = 0, 1, 2, ...,

where W_k is obtained by solving (3.5) with X_k in place of Y. Consequently, we obtain, for each k, an attainable trajectory and its associated weight matrix. The learning process stops when we obtain a solution W_N such that the error E(W_N) is less than a prescribed tolerance, say E(W_N) < 10^{−4}. Nevertheless, in practice, we store the best results over all iterations. For the convergence and the rate of convergence of the SSSA, we cite without proof the following theorem from [9].
Convergence Theorem. There exists a limit point X* such that the attainable sequence {X_k} converges, that is, lim_{k→∞} X_k = X*.

The above Convergence Theorem can be extended to the WSSSA, since Σ_t α_{k,t} < 2mα_k is finite and bounded in our construction. The WSSSA performs the nonlinear optimization learning process and provides the best feasible solution of the nonlinear optimization problem (3.3). The convergence analysis shows that the network converges to the desired solution and that the stability of the algorithm depends on the {α_i}'s. Geometrically, in each iteration X_k defines a homotopy with respect to α between the desired trajectory Y and the system output X(W_{k−1}). A comparison of the computational cost with classic gradient-type methods is given in the next section. The limit point X* of {X_k} leads us to the corresponding best feasible solution W*; meanwhile, the error sequence E(W_k) decreases to its infimum as X_k → X*. Recall that our task is to minimize ||x(W) − y||² subject to the RNN dynamics. In practice, we may vary h within a small range in order to find the best initial state with respect to h, and then improve it further by a line search algorithm in each iteration.
Furthermore, we may make a local approximation of E(W_{k+1}) with respect to α, since W_{k+1} and E(W_{k+1}) are relatively easy to compute. In any case, it is important to have an efficient, effective, and flexible learning algorithm for optimization with dynamical constraints. In our experimental examples, the WSSSA has proven to be a very effective tool, providing extremely promising results for learning the RNN dynamics. The RNN serves as a good example of a nonlinear discrete-time dynamical system that is extremely hard to handle with classical gradient methods.

4. A comparison study of the SSSA and the gradient-based learning algorithm. Our results demonstrate the efficiency of our learning algorithm over gradient-type methods. In this section, we compare the computational operations needed in one iteration of the SSSA and of a gradient-based learning algorithm for the discrete-time RNN (4.1); the complexity of the WSSSA is the same as that of the SSSA. Given a sequence S(t) ∈ (−1, 1)^n for 1 ≤ t ≤ m, we consider a discrete-time RNN similar to (3.1) with a constant step size h. Letting z(t) = Wx(t) (the two systems are equivalent if W is nonsingular), we obtain

z(t + 1) = (1 − h)z(t) + hWσ(z(t)).  (4.1)

As before, we absorb θ into W for simplicity in the following discussion. For the k-th component of z(t + 1),

z_k(t + 1) = (1 − h)z_k(t) + h Σ_{l=1}^n w_kl σ(z_l(t)).

Recall that the error function E measures the squared difference between the system output and S(t) along the trajectory. For a gradient-based learning algorithm, we need to compute ∂E/∂w_ij, which requires the partial derivatives ∂z_k(t + 1)/∂w_ij for all 1 ≤ k ≤ n and 1 ≤ i, j ≤ n. Given a connection weight matrix W, before applying any learning algorithm we must compute the entire trajectory z(t) and σ(z). If we store the intermediate partial derivatives in order to achieve a speedy computation, it requires an extra nm function evaluations to approximate σ'(z). For each w_ij, it requires (2n + 1) multiplications per step to estimate ∂z_k(t + 1)/∂w_ij. Neglecting the computational cost of the addition operations, it takes nm(2n + 1) multiplications to estimate the partial derivatives ∂E/∂w_ij along the trajectory z(t). Hence, the total computational cost of one gradient iteration is n²[2n²m + 2n] = 2n³(nm + 1). (4.4) In other words, the order of multiplications is 2n⁴m, plus nm functional evaluations of σ'(z), together with the trajectory information z(t) and σ(z).
On the other hand, for the WSSSA approach, besides the trajectory information z(t) and σ(z), it takes n²m multiplications to form the matrix equation and only about 5n³/6 multiplications to solve it for the connection weight matrix W of the next iteration, which is far more cost-effective than the gradient-based learning algorithms.

5. The empirical example. We assume that there is an implicit dynamics between the exchange rates, so that a short-term trend exists in our foreign exchange series, and we try to model it with neural network techniques. Our assumption is based on the result of [17], in which it is shown that, statistically, foreign exchange series do not support the random walk hypothesis. Historical foreign exchange rates are available from many web sites in the form of daily averages; we chose to retrieve the inter-bank rate data from OANDA.com. In this example, the data consist of the daily exchange rates of seven major currencies, the Australian dollar (AUD), Canadian dollar (CAD), Swiss franc (CHF), euro (EUR), sterling (GBP), Russian rouble (RUB), and Japanese yen (JPY), against the U.S. dollar from January 2 to October 2009. We collected 294 daily observations for each currency.
For simplicity, let the matrices A and B of (3.1) be identity matrices, so that

x(t + 1) = x(t) + h[−x(t) + σ(Wx(t) + θ)].  (5.2)

We may consider the RNN constraint as a time series model. With this setting, if we set h = 0, the equation becomes x(t + 1) = x(t), which represents the last-value forecast of x(t + 1), that is, the naive forecasting model. When h = 1, we have x(t + 1) = σ(Wx(t) + θ), which is a nonlinear time series model; in particular, if σ is the identity mapping, it becomes a multi-linear time series model. Such models can be used to capture the behavior of the internal mechanism of a financial market such as foreign exchange rates; since σ is nonlinear, we may regard the RNN as a nonlinear time series model. Therefore, for 0 < h < 1, we may consider equation (5.2) as a convex combination of the two forecasting models. We use the first 284 daily observations to train and validate the RNN model. After training, we use the resulting neural network parameters to make out-of-sample forecasts for the last 10 observations. Out-of-sample forecast errors are measured in order to judge the prediction ability of our model. The learning dynamics used for the discrete-time RNN is equation (3.3).
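The one-step forecaster in (5.2) and its two limiting cases can be written in a few lines (a sketch; the function name is ours):

```python
import numpy as np

def one_step_forecast(W, y, h, theta=None):
    """x_e(t+1) = y(t) + h[-y(t) + tanh(W y(t) + theta)], as in (5.2):
       h = 0 reduces to the naive last-value forecast x_e(t+1) = y(t),
       h = 1 to the purely nonlinear model x_e(t+1) = tanh(W y(t) + theta),
       and 0 < h < 1 is a convex combination of the two."""
    if theta is None:
        theta = np.zeros(len(y))
    return (1 - h) * y + h * np.tanh(W @ y + theta)
```

Sweeping h between 0 and 1 thus interpolates continuously between the naive forecast and the nonlinear autoregressive forecast.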
Before the training process, the data need to be transformed into a form appropriate for the network. We employ a normalization of each component series of x(t_i). Since tanh is chosen as the sigmoid activation function, giving outputs in the interval [-1, 1], it is necessary to normalize the data into this interval to avoid working near the asymptotes of the sigmoid. This normalization, given by formula (5.3) as in [5], maps each series into [-0.95, 0.95]. The data are then denormalized using the inverse of formula (5.3).
As noted, this normalization generally smooths out extreme outliers and guarantees that x(t_i) lies between −0.95 and +0.95, inside the range of the neural activation function σ. In addition, the normalization process facilitates the computational work in the state space learning process.
Instead of feeding the normalized raw data directly to the system for learning, we use the moving average series z(t) of orders 5, 10, 20, 50, and 100, respectively, obtained from x(t). For example, the moving average of 100 terms is z_i(t) = (1/100) Σ_{j=t}^{t+99} x_i(j); in general, a moving average of order k averages k consecutive terms of the series. Also, since the range of the rates varies across currencies, we normalize the range of the exchange rates to ±0.95. The advantages of using moving averages to model the short-term trend in foreign exchange rates are discussed in [17]. We apply the same learning process to the trajectories of the 5-, 10-, 20-, 50-, and 100-day moving averages of the exchange rate sequences. For simplicity, we set the external force J to the zero vector.
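The two preprocessing steps above, order-k moving averages and the ±0.95 range normalization, can be sketched as follows (the min-max form of the normalization is one plausible reading of (5.3), which is not reproduced in our copy; function names are ours):

```python
import numpy as np

def moving_average(x, k):
    """Order-k moving average: each output is the mean of k consecutive values."""
    return np.convolve(x, np.ones(k) / k, mode='valid')

def normalize_range(x, limit=0.95):
    """Map a series linearly onto [-limit, limit] -- an assumed min-max form
       of the +/-0.95 normalization described in the text."""
    lo, hi = x.min(), x.max()
    return limit * (2.0 * (x - lo) / (hi - lo) - 1.0)
```

Each currency series would be smoothed first and then normalized, with the inverse map applied to denormalize the forecasts.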
In this experiment, with the 100-day moving average, the optimal connection weight matrix W is an 8 × 8 matrix generated by the WSSSA in the in-sample training process, with the threshold θ as the 8th input, a constant vector of ones. We used 4000 iterations, and the optimal step sizes h range from 0.1 to 0.15 across the currencies. MATLAB was used to conduct all computations and plot the graphs. The total forecasting errors for each of the 7 currencies over the last 10 days are displayed in the table below. The linear time series model x_e(t + 1) = Sy(t) is used for comparison in this study, where x_e(t + 1) is the estimate at time t + 1 and S is a 7 × 7 matrix. The RNN time series forecast is x_e(t + 1) = y(t) + h[−y(t) + σ(Wy(t))].
From the above table, the results of the RNN time series model are usually better than those of the linear time series model because the RNN is a nonlinear regression model. On the other hand, it is to be expected that the RNN dynamics model has the largest prediction error, because errors accumulate over time when the information is not updated as in the regression models. In the given figures, the solid lines are the 100-day moving average currency series, while the dotted lines are the system outputs of the RNN dynamics.

6. Concluding remarks and further research. In this paper, we study nonlinear dynamical system modeling via a recurrent neural network with dynamical constraints. We propose a derivative-free and non-random learning algorithm, the WSSSA, and show that it is stable and convergent. The complexity analysis presented here shows that our algorithm is much simpler than gradient-based methods. We employed the proposed method to learn short-term exchange rates and demonstrated its usefulness for short-term forecasting in the currency markets.
There are several important directions for future research in this area. Better results are expected if we use a diagonal matrix H instead of a constant step size h. Generally speaking, for a nonlinear least-squares optimization problem with desired trajectory y(t), if the task is to solve min_W Σ_t ||x(t) − y(t)||² subject to the dynamical constraints x(t + 1) = F(W, x(t)), where F(W, x(t)) is the discrete-time RNN dynamics, our approach can easily be applied to other discrete dynamics F, provided that the parameters W of the connection weight matrix can be obtained in a simple way, as W_0 in (3.5). As other extensions of the proposed method, it will certainly be of interest to study asynchronous recurrent networks and other neural network structures.