Neural Networks

Volume 16, Issue 2, March 2003, Pages 223-239

Dual extended Kalman filtering in recurrent neural networks

https://doi.org/10.1016/S0893-6080(02)00230-7

Abstract

In the classical deterministic Elman model, the parameter estimates must be very accurate; otherwise, the system performance is poor. To improve the performance, a Kalman filtering algorithm can be used to guide the operation of a trained recurrent neural network (RNN). In this case, during training, we need to estimate the state of the hidden layer as well as the weights of the RNN. This paper discusses how to use dual extended Kalman filtering (DEKF) for this dual estimation and how to use the proposed DEKF to remove unimportant weights from a trained RNN. In our approach, one Kalman algorithm estimates the state of the hidden layer, and one recursive least squares (RLS) algorithm estimates the weights. After training, we use the error covariance matrix of the RLS algorithm to remove unimportant weights. Simulations show that our approach is an effective joint-learning–pruning method for RNNs under online operation.

Introduction

The most well-known training approach for recurrent neural networks (RNNs) is real-time recurrent learning (RTRL) (Schmidhuber, 1992, Williams and Zipser, 1989, Zipser, 1990). However, it is a first-order stochastic gradient descent method (Robbins & Monro, 1951) and hence its learning speed can be very slow. Recently, extended Kalman filtering (EKF) (Haykin, 1991) based algorithms have been introduced to train feedforward neural networks (FNNs) (Scalero and Tepedelenlioglu, 1992, Shah et al., 1992, Singhal and Wu, 1989) and RNNs (Puskorius and Feldkamp, 1994, Williams, 1992). With the EKF approach, the learning speed is improved. In some real-time applications, such as neural network controllers, the number of training iterations required for convergence is of critical importance, and the EKF approach is very useful for such applications. From the analysis in Williams (1992), the computational complexity of the EKF algorithm at each iteration is similar to that of RTRL; however, the EKF algorithm converges in fewer iterations than RTRL.

Another issue in neural networks is the removal of unimportant weights from a trained network. Hessian-based approaches, such as the optimal brain damage (OBD) method (Le Cun et al., 1989, Reed, 1993), are among the most efficient ways to remove unimportant weights. To identify unimportant weights, one must estimate the Hessian matrix, and hence the importance of each weight, by feeding the training set into the trained network. The computational complexity of obtaining the importance of every weight is O(M²p) (Le Cun et al., 1989, Pearlmutter, 1994), where M is the number of weights and p is the number of training patterns. To avoid serious overfitting, the number of training patterns is usually much greater than the number of weights, i.e. p ≫ M. In the online situation, the Hessian matrix is usually unavailable since the training patterns are not retained after training. In Leung et al. (1996, 2001), a joint-learning–pruning algorithm for FNNs was proposed, based on the error covariance matrix of the recursive least squares (RLS) approach. The computational complexity for pruning is O(M³), which is much smaller than O(M²p) when p ≫ M. The stopping criterion of pruning is based on the estimated training error and the estimated change in the training error. The advantage of this RLS pruning approach is that the training set is not required during pruning. Hence, this approach is suitable for the online situation.
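
For illustration only, the following sketch shows how OBD-style weight saliencies could be computed from a Gauss-Newton approximation of the Hessian accumulated over the p training patterns, keeping only the diagonal for simplicity. The function and variable names are ours, not from the cited works; the full (non-diagonal) computation is what incurs the O(M²p) cost mentioned above.

```python
import numpy as np

def obd_saliencies(jacobians, weights):
    """OBD-style saliency s_i = 0.5 * H_ii * w_i**2, with the Hessian H
    approximated in Gauss-Newton form, H ~ sum_p J_p.T @ J_p, and only its
    diagonal retained here for simplicity.

    jacobians : iterable of (Lo x M) output-vs-weight Jacobians, one per pattern
    weights   : (M,) weight vector of the trained network
    """
    h_diag = np.zeros_like(weights)
    for J in jacobians:
        # the diagonal of J.T @ J is the column-wise sum of squares;
        # forming the full M x M Hessian instead would cost O(M^2 p) overall
        h_diag += np.sum(J * J, axis=0)
    return 0.5 * h_diag * weights ** 2   # small saliency => candidate for pruning

# usage: prune in ascending order of saliency
# order = np.argsort(obd_saliencies(jacobian_list, w_trained))
```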

Williams (1992) proposed a global EKF algorithm to estimate the weights and the hidden state of RNNs. In this approach, the goal is to maximize the a posteriori probability rather than to minimize the training error. In Sum, Leung, Chan, Kan, and Young (1999), we used the error covariance matrix of Williams's algorithm to estimate the a posteriori probability of each weight. Afterwards, the pruning order of the weights can be obtained from the estimated probabilities. However, during pruning, we still need the training set (or test set) to determine the change in the training error (or test error). In this method, the stopping criterion of pruning is based on the actual training error (or test error) and the actual change in the training error (or test error). This is because the error covariance matrix of the EKF algorithm is related to the a posteriori probability rather than to the training error (or test error).

Wan and Nelson (1997a, 1997b) used the dual EKF (DEKF) approach to process noisy time series. However, they did not consider external inputs, recurrent connections, or pruning. In their approach, the error covariance matrix of the weight estimation EKF algorithm is related to a posteriori probability maximization rather than to the training error.

In this paper, we consider the operation, training and pruning of a stochastic RNN model with external input. In this model, during operation, an EKF algorithm estimates the hidden state of the RNN. Hence, the performance of this approach is better than that of the classical Elman model. During training, we use the DEKF approach to train and operate an RNN, wherein an EKF algorithm estimates the state of the hidden nodes and an RLS algorithm estimates the weights. Since the objective of the RLS algorithm is to minimize the training error, we can use its error covariance matrix to estimate the importance of each weight and to prune unimportant weights. The advantage of the proposed pruning approach is that the training set is not needed to prune a trained network.
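
As a rough, self-contained illustration of this dual estimation (not the paper's exact recursions or notation), the following toy example interleaves a scalar state-estimation EKF with a weight-estimation RLS filter for a one-node recurrent model; all symbols (w1–w3, q, r, lam) are our own assumptions.

```python
import numpy as np

# Toy one-node recurrent model:  a_t = tanh(w1*a_{t-1} + w2*u_t) + v_t,  d_t = w3*a_t + n_t.
# One EKF tracks the hidden state; one RLS filter tracks the weights (w1, w2, w3).

rng = np.random.default_rng(0)
w_true = np.array([0.7, 0.5, 1.2])
T = 200
u = rng.standard_normal(T)

# generate noisy observations from the "true" system
a, d = 0.0, np.zeros(T)
for t in range(T):
    a = np.tanh(w_true[0] * a + w_true[1] * u[t]) + 0.01 * rng.standard_normal()
    d[t] = w_true[2] * a + 0.05 * rng.standard_normal()

w = np.array([0.3, 0.3, 0.5])   # weight estimate (RLS state)
P_w = 10.0 * np.eye(3)          # weight error covariance (later reusable for pruning)
a_h, p_a = 0.0, 1.0             # hidden-state estimate and its variance (EKF)
q, r, lam = 1e-4, 0.05 ** 2, 1.0

for t in range(T):
    a_prev = a_h

    # --- state-estimation EKF step (weights frozen at their current estimate) ---
    a_pred = np.tanh(w[0] * a_prev + w[1] * u[t])
    F = (1.0 - a_pred ** 2) * w[0]            # d a_pred / d a_prev
    p_pred = F * p_a * F + q
    H = w[2]                                  # d d_t / d a_t
    k = p_pred * H / (H * p_pred * H + r)
    a_h = a_pred + k * (d[t] - H * a_pred)
    p_a = (1.0 - k * H) * p_pred

    # --- weight-estimation RLS step (hidden state frozen at its previous estimate) ---
    s = w[0] * a_prev + w[1] * u[t]
    g = np.array([(1.0 - np.tanh(s) ** 2) * w[2] * a_prev,
                  (1.0 - np.tanh(s) ** 2) * w[2] * u[t],
                  np.tanh(s)])                # gradient of the predicted output w.r.t. w
    e = d[t] - w[2] * np.tanh(s)
    k_w = P_w @ g / (lam + g @ P_w @ g)
    w = w + k_w * e
    P_w = (P_w - np.outer(k_w, g) @ P_w) / lam

print("estimated weights:", w)                # should move toward w_true
```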

In the rest of the paper, the stochastic RNN model is discussed in Section 2. Section 3 presents our DEKF approach for stochastic RNNs. The pruning method is discussed in Section 4. Section 5 discusses the complexities of the proposed approach. Five simulation examples are then given in Section 6 to illustrate the effectiveness of our joint-learning–pruning scheme. Section 7 concludes the paper.


EKF for nonlinear systems

Without loss of generality, we consider a stochastic nonlinear system given by
$$a_{t+1} = \kappa(a_t, u_{t+1}) + v_t, \qquad d_t = \hbar(a_t) + w_t,$$
where $\kappa(\cdot,\cdot)$ is a function of the $L_h$-dimensional hidden state $a_t$ and the $L_i$-dimensional input $u_{t+1}$, $\hbar(\cdot)$ is a function of the hidden state, and $d_t$ is the $L_o$-dimensional observation. If we define $\kappa_t(a_t)=\kappa(a_t,u_{t+1})$, $\kappa_t(a_t)$ becomes a time-varying function. In the above system, $y_t=\hbar(a_t)$ is the actual system output; $w_t$ is the measurement noise; and $v_t$ is the process noise which
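
For reference, one generic prediction-correction cycle of the EKF for this kind of model might be sketched as follows; kappa, h and the Jacobian functions are assumed to be supplied by the caller, and all names are illustrative rather than the paper's notation.

```python
import numpy as np

def ekf_step(a_est, P, u_next, d, kappa, h, F_jac, H_jac, Q, R):
    """One prediction-correction cycle for
        a_{t+1} = kappa(a_t, u_{t+1}) + v_t,   d = h(a) + w,
    with process-noise covariance Q and measurement-noise covariance R.
    kappa, h, F_jac (Jacobian of kappa w.r.t. a) and H_jac (Jacobian of h)
    are supplied by the caller."""
    # prediction through the state transition
    a_pred = kappa(a_est, u_next)
    F = F_jac(a_est, u_next)
    P_pred = F @ P @ F.T + Q

    # correction with the new observation d
    H = H_jac(a_pred)
    S = H @ P_pred @ H.T + R              # innovation covariance
    K = np.linalg.solve(S, H @ P_pred).T  # Kalman gain  P_pred H^T S^{-1}
    a_new = a_pred + K @ (d - h(a_pred))
    P_new = (np.eye(len(a_est)) - K @ H) @ P_pred
    return a_new, P_new
```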

DEKF based training for RNNs

In this section, we first review Williams's global EKF approach (Williams, 1992) for RNNs and Wan's DEKF approach (Wan & Nelson, 1997a) for FNNs. Afterwards, we point out the weakness of these two approaches in pruning RNNs. To overcome this weakness, a new DEKF approach for training RNNs is introduced.

Pruning RNN

This section discusses the connection between pruning and the weight estimation RLS algorithm, that is, the relationship between the error covariance matrix $P_t$ and the Hessian matrix of the training error. The pruning method for a trained RNN is also presented.

From the earlier equations, the Hessian matrix of the energy function $J(\theta)$ is
$$\frac{\partial^2 J(\theta)}{\partial\theta^2} \approx 2P_0^{*\,-1} + \sum_{t'=1}^{t} H_{t'} H_{t'}^{T}, \qquad \frac{\partial^2 J(\theta)}{\partial\theta^2} = 2P_t^{*\,-1}.$$
Define the training error $J'(\theta)$ as
$$J'(\theta) = \sum_{t'=1}^{t} \lVert d_{t'} - \hat{y}_{t'} \rVert^2.$$
Since the difference between $J(\theta)$ and $J'(\theta)$ is $(\theta-\hat{\theta}_0)$
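
As one possible reading of this relation (not the paper's exact procedure), the sketch below turns the RLS error covariance matrix into a pruning order: with the Hessian approximated by $2P_t^{-1}$, deleting weight $i$ is estimated to increase the training error by roughly $\hat{\theta}_i^2\,[P_t^{-1}]_{ii}$, and no training data are needed. The function and variable names are ours.

```python
import numpy as np

def rls_pruning_order(theta_hat, P_t):
    """Rank weights for pruning using only the RLS error covariance matrix P_t.

    With Hessian(J') ~ 2 * inv(P_t), removing weight i (others fixed) is
    estimated to raise the training error by about
        delta_i = theta_hat[i]**2 * inv(P_t)[i, i],
    so the least important weights are those with the smallest delta_i.
    No training patterns are needed; the inversion costs O(M^3)."""
    P_inv = np.linalg.inv(P_t)                  # approximately half the Hessian
    delta = theta_hat ** 2 * np.diag(P_inv)     # estimated error increase per weight
    order = np.argsort(delta)                   # prune in ascending order of delta
    return order, delta

# usage sketch: delete weights following `order`, stopping once the running sum of
# the corresponding `delta` values pushes the estimated training error past a threshold.
```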

Training phase

For standard EKF or RLS algorithms, it is well known that both the computational and space complexities are equal to $O(K^2)$, where $K$ is the dimension of the state vector.

For the state estimation EKF algorithm of the DEKF approach, the dimension of the state vector is equal to the number of hidden nodes, i.e. $K=L_h$. Therefore, both the computational and space complexities of the state estimation EKF algorithm are equal to $O(L_h^2)$. For the weight estimation RLS algorithm of the DEKF approach, the

Simulations

In this section, we demonstrate the effectiveness of the proposed joint-learning–pruning scheme through five examples. The first two examples are system identification problems (Tsoi & Tan, 1997). The purpose of these two examples is to demonstrate that our approach (without the training set and test set) can produce a good pruning order and a good estimate of the training error of a pruned RNN. The last three examples are time series prediction problems (Weigend et al., 1991, Mackey

Conclusion

In this paper, we have introduced a joint-learning–pruning scheme for online learning and pruning of RNNs. The DEKF approach consists of two algorithms, namely, the state estimation EKF algorithm and the weight estimation RLS algorithm. During training, they run concurrently over the data until convergence. The EKF algorithm uses the last estimate of the weight vector to estimate the hidden state, as well as to predict the system output. The RLS algorithm uses the current estimated hidden state

Acknowledgements

The work described in this paper was supported by the Strategic Grant, City University of Hong Kong, Hong Kong (Project No. 7001218).

References (27)

  • Moody, J. E. (1991). Note on generalization, regularization and architecture selection in nonlinear systems. Proceedings...
  • Pearlmutter, B. A. (1994). Fast exact multiplication by the Hessian. Neural Computation.
  • Puskorius, G. V., & Feldkamp, L. A. (1994). Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks. IEEE Transactions on Neural Networks.