CONVERGENCE ANALYSIS OF THE WEIGHTED STATE SPACE SEARCH ALGORITHM FOR RECURRENT NEURAL NETWORKS



(Communicated by Xiaoling Sun)
Abstract. Recurrent neural networks (RNNs) have emerged as a promising tool for modeling nonlinear dynamical systems. Convergence is one of the most important dynamical properties of RNNs in practical applications, since the viability of many applications of RNNs depends on their convergence properties. In this paper we study the convergence properties of the weighted state space search algorithm (WSSSA), a derivative-free and non-random learning algorithm which searches the neighborhood of the target trajectory in the state space instead of the parameter space. Because no computation of partial derivatives is involved, the WSSSA has several salient features: it is simple, fast and cost-effective. We provide a necessary and sufficient condition for the convergence of the WSSSA and offer restrictions that may help assure its convergence to the desired solution. The asymptotic rate of convergence is also analyzed. Our study gives insights into the problem and provides useful information for the actual design of RNNs. A numerical example is given to support the theoretical analysis and to demonstrate its applicability.
1. Introduction. A recurrent neural network (RNN), which has at least one internal or external feedback loop, is modeled by recurrently connected neurons that attempt to mimic information processing patterns in the brain. RNNs are described mathematically by dynamical systems in which the states evolve according to certain nonlinear differential equations ([1]). The presence of feedback loops has a profound impact on the network's learning capacity and its performance in modeling nonlinear dynamical phenomena ([12]). The dynamical properties of RNNs play an important role in their applications, since their nonlinear dynamics can implement rich and powerful computations, allowing the RNN to perform modeling, system identification and prediction tasks for sequences with highly complex structure. Despite the great capability of RNNs, the main problems are the difficulty of training them and the complexity of rigorously analyzing the convergence properties of the learning algorithms. The adjustment of network parameters can affect the entire set of RNN state variables during the network evolution ([12]), and training is usually complex and might diverge ([1]). Moreover, when the network becomes large, each iteration of the training process may become prohibitively costly even with the growing sophistication of computer hardware and software as well as mathematical algorithms.
Yiu, Wang, Teo and Tsoi in 2001 developed a novel knot-optimizing B-spline network to approximate general nonlinear system behavior ([16]). They showed that a simulated annealing algorithm with an appropriate search strategy, used as the optimization algorithm in the training process, could avoid possible local minima. Their work is insightful and inspiring. Li, Shao and Yiu ([7]) developed a novel learning algorithm, the WSSSA, a derivative-free and non-random learning algorithm, to solve least square problems with recurrent neural dynamic constraints. The method is based on approximating a new trajectory which is a convex combination of the system output of the RNN and the desired trajectory with a ratio α ([7]). This approach provides the best feasible solution for the nonlinear optimization problem. Unlike conventional gradient methods, no computation of partial derivatives along the target trajectory is involved in the WSSSA approach, so it has lower computational complexity in computing the weighted updates than competing techniques for most typical problems.
The goal of this paper is to provide a rigorous analysis of the convergence properties of the WSSSA, which shows that the algorithm has several salient features such as simplicity, fast learning rates and finite-time convergence. Recent evidence has shown that convergence is one of the most important dynamical properties of RNNs in practical applications, since the viability of many applications of RNNs depends on their convergence properties. Without a proper understanding of the convergence properties of RNNs, many of these applications would not be possible ([17]). We provide the necessary and sufficient conditions required for the convergence of the WSSSA, and we offer restrictions that may help assure convergence of the WSSSA to the desired solution. The asymptotic rate of convergence is also analyzed. Our result shows that the choice of the sequence of parameters depends on the choice of neural network, and it can possibly be extended to other networks, such as the B-spline network, and to large-scale problems. We also report the results of a computational experiment performed with the WSSSA, which has good exploration and exploitation capabilities in searching for the optimal weights when training RNNs; the experiment supports the theoretical analysis.
The organization of this paper is as follows. In Section 2, we highlight the WSSSA for the discrete-time RNN model and present some stability properties of the discrete-time RNN system. The convergence analysis for the WSSSA is provided in Section 3, which gives the conditions required for convergence of the WSSSA. An empirical example is given in Section 4. Some final remarks and future research directions are given in Section 5.

2. The WSSSA for the discrete-time RNN model. Consider the discrete-time leaky integrator model of the RNN with n neurons described by a nonlinear system of the form

x(t + 1) = (I − HA)x(t) + HBσ(W x(t) + θ), (1)

where x = [x_1, x_2, ..., x_n]^T is a column vector with x_i representing the internal state of the i-th neuron, and σ is a neuronal activation function that is bounded, differentiable and monotonically increasing on [−1, 1]. Here H = diag[h_1, h_2, ..., h_n], A = diag[a_1, a_2, ..., a_n] and B = diag[b_1, b_2, ..., b_n] are diagonal matrices with positive entries h_i, a_i and b_i for each i = 1, 2, ..., n, where a_i represents the inverse of the leakage time constant τ_i of the i-th neuron, b_i represents the neuron's resistance, and h_i is the step size of Euler's discretization. We assume further that σ(z) = tanh(z), the symmetric sigmoid logistic function; θ = [θ_1, θ_2, ..., θ_n]^T is the input bias or threshold vector of the system; and W = [w_ij]_{n×n} is the synaptic connection weight matrix, with w_ij being the synaptic weight of the connection from neuron n_j to neuron n_i. The network is fully recurrently connected.
It is known that the discretized variant (1) of the continuous-time leaky integrator model of the RNN with n neurons, dx/dt = −Ax + Bσ(W x + θ) (2), inherits the dynamics of system (2) when the step size is "small" (see [11]). In other words, the two systems (1) and (2) share the same dynamical behavior as h_i tends to 0. The positive entries of the diagonal matrix H in (1) can also be used to represent different economic cycles. This implies that if we choose appropriate h_i, the discrete-time RNN models (1) and (2) should be at least as good as any time series model ([6]). Without loss of generality, we assume that 0 < a_i h_i < 1 for i = 1, 2, ..., n. Since the discrete-time RNN model is easily implemented in digital hardware and easily simulated on computers, it presents advantages over the continuous-time model in practice. We focus, in this section, on the discrete-time model (1) to introduce the robust learning algorithm, the WSSSA, its stability properties, and the convergence analysis of the WSSSA. Before introducing the state space search algorithm, we first discuss some stability properties of the discrete-time RNN model (1) and provide the convergence analysis afterward. The stability and bifurcation properties of discrete-time RNNs were analyzed by Wang and Blum [11], Li [6], and Jin, Nikiforuk and Gupta [3]. We introduce here some useful definitions and theorems. All norms used in this paper are either L2 or l2 norms. All the results can be found in ([6]).
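As a concrete illustration, the discrete-time dynamics (1) can be simulated directly. The following sketch uses our own illustrative parameter values (network size, weights, and step sizes are not taken from the paper):

```python
import numpy as np

def rnn_step(x, W, theta, H, A, B):
    """One step of the discrete-time leaky-integrator RNN (1):
    x(t+1) = (I - H A) x(t) + H B tanh(W x(t) + theta)."""
    n = len(x)
    return (np.eye(n) - H @ A) @ x + H @ B @ np.tanh(W @ x + theta)

# Toy 3-neuron network (illustrative values only).
n = 3
rng = np.random.default_rng(0)
H = 0.1 * np.eye(n)      # Euler step sizes h_i, chosen so 0 < a_i h_i < 1
A = np.eye(n)            # inverse leakage time constants a_i
B = np.eye(n)            # neuron resistances b_i
W = 0.5 * rng.standard_normal((n, n))   # synaptic weights
theta = np.zeros(n)                     # input bias

x = rng.standard_normal(n)
traj = [x]
for _ in range(200):
    x = rnn_step(x, W, theta, H, A, B)
    traj.append(x)
```

Because tanh is bounded, the state stays bounded for any weight matrix; with 0 < a_i h_i < 1 the iteration typically settles toward an equilibrium, as the stability discussion above suggests.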
(ii) The function f(x) = (I − HA)x + HBσ(W x + θ) has only asymptotically stable equilibrium points if all the eigenvalues of the Jacobian are inside the unit circle for all states x, a given connection weight matrix W, and the input θ.
(iii) If the discrete-time RNN system (1) has only asymptotically stable equilibrium points for a given connection weight matrix W and any input θ, then the system (1) is said to be absolutely stable.
(iv) If a sequence of vectors {x_k}_k converges to a limit point x*, then the asymptotic convergence rate (order of convergence) of {x_k}_k is defined as the supremum of the nonnegative numbers p satisfying

lim sup_{k→∞} ||x_{k+1} − x*|| / ||x_k − x*||^p < ∞.

Remark 1. It is important to notice that asymptotic stability may depend upon the input θ, whereas absolute stability does not. This is the fundamental difference between asymptotic stability and absolute stability.
Remark 2. For the asymptotic convergence rate, larger values of the order p imply more rapid convergence. We now recall the following two theorems from ([6]) on the absolute stability properties of system (1).
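Definition (iv) can be checked numerically: given successive error norms e_k = ||x_k − x*||, the order p can be estimated from the ratio of consecutive log-error decrements. A small sketch (our own illustration, not from the paper):

```python
import numpy as np

def estimate_order(errors):
    """Estimate the order of convergence p from successive error norms
    e_k = ||x_k - x*||, via p ~ log(e_{k+1}/e_k) / log(e_k/e_{k-1})."""
    e = np.asarray(errors, dtype=float)
    return np.log(e[2:] / e[1:-1]) / np.log(e[1:-1] / e[:-2])

# Linearly convergent sequence e_k = 0.5^k: estimated order tends to 1.
lin = [0.5**k for k in range(1, 12)]

# Quadratically convergent sequence e_{k+1} = e_k^2: estimated order tends to 2.
quad = [0.5]
for _ in range(5):
    quad.append(quad[-1] ** 2)
```

For the linear sequence the estimator returns 1, and for the quadratic one it returns 2, matching Remark 2's reading of the order p.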
Remark 3. From Theorem 2.2, we notice that inequality (5) implies that the solution space of the connection weight matrix W forms n open convex hypercones in n-dimensional space. Moreover, since the discrete-time model (1) and the continuous-time model (2) share the same dynamical behavior as h_i → 0, we can conclude that as h_i → 0 both (1) and (2) are absolutely stable. It is known that for an absolutely stable neural network model, the system state will converge to one of the asymptotically stable equilibrium points regardless of the initial state.

Now, given a trajectory y(t) ∈ R^n, where y(t) is a column vector and t = 1, ..., m for some positive integer m, we use (1) to approximate the target trajectory y(t) with the error function E defined by

E(W) = Σ_{t=1}^{m} ||x(t) − y(t)||^2.

To further simplify the notation and analysis, we fix h and let A = B = I_{n×n}. We extend the trajectory dimension by one and let y_{n+1}(t) = 1 for all t in order to absorb the variable θ into the last column of W. Without loss of generality, we still consider an n^2-dimensional constrained least square problem: minimize E(W) subject to the RNN dynamics (1). It is well known that the gradient descent, conjugate gradient and quasi-Newton methods have been applied to solve least square problems for many decades. In recent years, the growing demand for sophisticated derivative-free optimization methods has triggered the development of a relatively wide range of approaches (see [2]). Note that in neural networks, learning is a process of changing the network parameters W so that the system outputs x(t) approach the target trajectory y(t).
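The least-squares objective above is straightforward to evaluate; a minimal sketch (the array layout, with rows indexed by time, is our own convention):

```python
import numpy as np

def trajectory_error(X, Y):
    """Least-squares error E = sum over t of ||x(t) - y(t)||^2,
    where row t of X and Y holds x(t) and y(t), t = 1..m."""
    return float(np.sum((np.asarray(X) - np.asarray(Y)) ** 2))

# Tiny example: m = 2 time steps, n = 2 neurons.
X = [[1.0, 0.0], [0.0, 1.0]]   # network output trajectory
Y = [[0.0, 0.0], [0.0, 0.0]]   # target trajectory
err = trajectory_error(X, Y)   # 1 + 1 = 2
```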
For a fixed h, in the ideal case where the network is exactly capable, that is, if E(W_0) = 0, then the optimal solution is simply W_0. For more general cases, after we obtain W_0, we generate the whole trajectory set X(W_0) by iterating (1).

The basic idea of the WSSSA is that in each iteration we search the class of x-homotopy in the state space R^{m×n} instead of the parameter space of W, where A^+ is the set of attainable points of x in R^n. In other words, we search for a reachable solution in the neighborhood of Y instead of approximating the whole desired trajectory Y in the next iteration. In many circumstances we anticipate that the system output x(t) of the RNN will become closer to y(t) as time goes by; that is, if z(t) = ||x(t) − y(t)||^2, then z(t) tends to 0 as t tends to m. In view of this, we approximate the new trajectory set X_1 defined by

X_1 = X(W_0) + α_{1,t}(Y − X(W_0)),

with the weighting rule α_{1,t} = α_1(1 + t/m), where 0 < α_1 < 4m/(3m+1). Geometrically, α_{1,t}(Y − X(W_0)) defines the error direction. Hence, we may vary α_{1,t} as α_{i,t} in the i-th learning iteration. By the continuity of W, there exists some α*_{1,t} = α*_1(1 + t/m) that yields the best feasible approximation. We suppress the notation α*_{k,t} as α_{k,t} in the rest of the paper. In practice, we only need to store the best solution for each α_{1,t}. Notice that X_1 may not be attainable even though X(W_0) is attainable. However, we may repeat the process and obtain W_1 by solving the corresponding least square problem. Repeating the above procedure, we obtain a sequence of attainable system outputs {X(W_k)} and the learning sequence {X_k} in the state space, with X_0 = X(W_0), where C_k = [X_k(m), ..., X_k(2)]^T and D_k = [X_k(m − 1), ..., X_k(1)]^T.
To summarize the above procedure, we obtain, for each k,

X_k = X(W_{k−1}) + α_{k,t}(Y − X(W_{k−1})), (16)

with 0 < α_{k,t} < 4m/(3m+1). The learning process stops when we obtain a solution W_N such that the error E(W_N) is less than a prescribed tolerance. Nevertheless, in practice, we store the best results over all iterations.
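The iteration (16) can be sketched as follows. This simplified illustration uses a fixed α per iteration and omits the re-fitting of W_k (the least-squares recovery of the weights from C_k and D_k described in the text), so the update reduces to a pure convex combination; the names and data below are our own:

```python
import numpy as np

def wsssa_step(X_prev, Y, alpha):
    """One weighted state-space update as in (16):
    X_k = X_prev + alpha (Y - X_prev) = (1 - alpha) X_prev + alpha Y."""
    return X_prev + alpha * (Y - X_prev)

m, n = 50, 3
rng = np.random.default_rng(1)
Y = rng.standard_normal((m, n))   # target trajectory (illustrative data)
X = np.zeros((m, n))              # initial attainable output X(W_0)

alpha = 2 * m / (3 * m + 1)       # the limiting weight from the analysis
errors = []
for k in range(20):
    X = wsssa_step(X, Y, alpha)
    errors.append(np.linalg.norm(X - Y))
```

With the refit step omitted, the error contracts by the constant factor 1 − α at every iteration, consistent with the linear rate established later in the convergence analysis.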
Remark 4. We may use other weighting rules, as long as α_{k,t} is an increasing function of t ([7]).
In the next section, we will present the convergence analysis of the WSSSA.
3. Convergence Studies of the WSSSA. As we mentioned in Section 1, the convergence properties of RNNs are very important for applications in optimization, image processing, and pattern recognition, to name a few ([14]). The convergence of a learning algorithm is a very important issue for RNNs because of their inherent signal feedback structure ([10]). For the WSSSA defined in (16), the choice of α_{k,t} is an art at the center of this study. Before introducing our convergence theorem, let us state a useful lemma.

Proof. For all positive integers k, set γ_1 = β_1. It follows that we have (18). To obtain the results in (i) and (ii), we divide the proof into two cases. (A) Case 0 < β_k < 1: we notice that 0 < 1 − β_k < e^{−β_k} for all 0 < β_k < 1. Hence, Lemma 3.1 holds for the case 0 < β_k < 1.
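The key estimate in Case (A) of the proof can be written out as follows (our reconstruction of the standard product bound underlying the step 0 < 1 − β_k < e^{−β_k}):

```latex
0 < \beta_j < 1 \;\Rightarrow\; 0 < 1-\beta_j < e^{-\beta_j}
\quad\Longrightarrow\quad
0 < \prod_{j=1}^{k}\bigl(1-\beta_j\bigr)
  < \exp\!\Bigl(-\sum_{j=1}^{k}\beta_j\Bigr),
```

so the product of the factors 1 − β_j tends to 0 as k → ∞ whenever the series Σ_j β_j diverges.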
We are now ready to prove our convergence theorem.

Theorem 3.2. The sequence {X_k}_k defined in (16) is convergent if and only if 0 < α_k < 4m/(3m+1) for all k.

Proof. By some simple algebra we obtain that the error ||Y − X_k|| decreases if and only if 0 < α_k < 4m/(3m+1) for all k.
Since the sequence {||Y − X_k||}_k is monotonically decreasing and bounded below, it converges, which implies that {X_k}_k is convergent.
(ii) The convergence analysis of Theorem 3.2 shows that {X_k}_k defined in (16) is convergent if and only if 0 < α_k < 4m/(3m+1) for all k. The network (9) converges to the desired solution when lim_{k→∞} α_k = 2m/(3m+1) with 0 < α_k < 4m/(3m+1), and the stability of the algorithm depends on {α_k}. Geometrically, in each iteration, X_k defines a homotopy with respect to α between the desired trajectory Y and the system output X(W_{k−1}), and α(Y − X(W_k)) defines the error direction.
From our construction, {X_k}_k and {E(W_k)}_k are bounded sequences with the error sequence E(W_k) ↓ 0. Therefore, the WSSSA learning algorithm defined by (16) converges with lim_{k→∞} ||X_k − Y|| = 0 and lim_{k→∞} E(W_k) = 0, and hence Theorem 3.3 holds.
Remark 6. The WSSSA for the discrete-time RNN is a fast learning algorithm that provides the best feasible solution for the least square problem of (1). This follows from the fact that the limit point of {X_k}_k leads us to the corresponding best feasible solution W of (1). Meanwhile, the error sequence E(W_k) ↓ 0 when {X_k}_k converges.
In the next theorem, we discuss the asymptotic rate of convergence of {X_k}_k. Our result shows that the rate of convergence of {X_k} depends on {α_k}.
Proof of Theorem 3.4. Notice that for 0 < α_k < 4m/(3m+1), the ratio ||X_{k+1} − Y|| / ||X_k − Y|| is bounded above by a constant less than 1. Therefore, the order of convergence of {X_k} is linear, which finishes the proof.
Remark 7. Theorem 3.2 provides the conditions required for convergence of the WSSSA for the discrete-time RNN model, and Theorem 3.4 gives the asymptotic rate of convergence of {X_k} for the WSSSA. Furthermore, X_k → Y as lim_{k→∞} α_k = 2m/(3m+1) with 0 < α_k < 4m/(3m+1).
4. An empirical example. We have tested the WSSSA learning algorithm on the least square problem of compressing infra-red signals, that is, minimizing the mean square error with the WSSSA. Given a signal z(t) of length k (i.e., with k data points), it is first partitioned into n equal segments of the same length p; the set of neurons {x_1(t), x_2(t), ..., x_n(t)} of the RNN system (1) with network size n is fixed. We assume that the given signals are finite and continuous. We use a noise-free data sequence of the Cl3pheno infra-red spectrum, and the size of the RNN used is 9. After the processes of segmentation, cycling and smoothing of the data, we apply the WSSSA learning method in the approximation process to generate the compressed data, then perform the optimization process to minimize the mean square error. We found the empirical results from the different samples to be extremely promising. To avoid overstuffing the paper, we show only one of the reconstructed signals, Cl3pheno, in Figure 1 to illustrate our results. The top diagram of Figure 1 is the original signal, and the bottom diagram is the reconstructed signal. The graph in Figure 2 demonstrates the linear rate of convergence of the algorithm established in Theorem 3.4: we plot ||X_{k+1} − Y|| / ||X_k − Y|| against the number of iterations k and observe that the ratio settles as α_k → 2m/(3m+1). In these experiments, we took the first 1800 points, chose m = 1798, used fewer than 500 iterations, and used no averaging technique. The maximum relative error for the testing infra-red sample Cl3pheno is 3.3 × 10^{−3}. The total number of parameters stored during the entire compression process is n^2 + n + 1 plus the 4 neglected points, where n is the network size; hence, in our case, the total number of parameters stored is 9^2 + 9 + 1 = 91 plus the 4 neglected points.
These experiments, in which the WSSSA and the RNN system perform data compression, demonstrate some very promising results that support our theoretical results in Section 3. The runs converge very fast and effectively: less than 3 seconds with MATLAB on a reasonably current PC system.

5. Concluding remarks and further research. In this paper, we investigate the derivative-free and non-random learning algorithm WSSSA, a learning algorithm for nonlinear dynamical system modeling via a recurrent neural network. We provide the conditions required for convergence of the WSSSA. The convergence analysis shows that the algorithm guarantees convergence to the desired solution (see Theorems 3.3 and 3.4). The asymptotic rates of convergence are analyzed and the stability properties are discussed. Our studies give insights into the problem and can potentially be beneficial in deriving new algorithms for training RNNs. The complexity analysis presented here shows that our algorithm is simple and effective. We employed the proposed method to compress signals and demonstrated its usefulness in performing least square optimization for the RNN model. As an extension of the proposed method, it will certainly be of interest to study asynchronous recurrent networks and other neural network structures.