Dual extended Kalman filtering in recurrent neural networks
Introduction
The best-known training approach for recurrent neural networks (RNNs) is real-time recurrent learning (RTRL) (Schmidhuber, 1992, Williams and Zipser, 1989, Zipser, 1990). However, it is a first-order stochastic gradient descent method (Robbins & Monro, 1951), so its learning speed can be very slow. Recently, extended Kalman filtering (EKF) (Haykin, 1991) based algorithms have been introduced to train feedforward neural networks (FNNs) (Scalero and Tepedelenlioglu, 1992, Shah et al., 1992, Singhal and Wu, 1989) and RNNs (Puskorius and Feldkamp, 1994, Williams, 1992). The EKF approach improves the learning speed. In some real-time applications, such as neural network controllers, the number of training iterations required for convergence is critical, and the EKF approach is particularly useful there. From the analysis in Williams (1992), the per-iteration computational complexity of the EKF algorithm is similar to that of RTRL; however, the EKF algorithm converges in fewer iterations than RTRL.
Another issue in neural networks is removing unimportant weights from a trained network. Hessian-based approaches, such as the optimal brain damage (OBD) method (Le Cun et al., 1989, Reed, 1993), are among the most efficient ways to do so. To identify unimportant weights, we must estimate the Hessian matrix, and hence the importance of each weight, by feeding the training set into the trained network. The computational complexity of obtaining the importance of every weight is O(M²p) (Le Cun et al., 1989, Pearlmutter, 1994), where M is the number of weights and p is the number of training patterns. To avoid serious overfitting, the number of training patterns is usually much greater than the number of weights, i.e. p≫M. In the online situation, the Hessian matrix is usually unavailable because training patterns are not retained after training. In Leung et al., 1996, Leung et al., 2001, Leung proposed a joint-learning–pruning algorithm for FNNs based on the error covariance matrix of the recursive least squares (RLS) approach. The computational complexity of pruning is O(M³), which is much smaller than O(M²p) when p≫M. The stopping criterion of pruning is based on the estimated training error and the estimated change in the training error. The advantage of this RLS pruning approach is that the training set is not required during pruning; hence, it is suitable for the online situation.
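To make the Hessian-based idea concrete, the following is a minimal sketch of OBD-style saliency ranking under the diagonal Hessian approximation used by OBD; the function names are ours, and the diagonal entries would in practice be accumulated over the p training patterns (which is where the O(M²p) cost arises).

```python
import numpy as np

def obd_saliencies(weights, hessian_diag):
    """OBD saliency s_k = h_kk * w_k^2 / 2 under the diagonal
    Hessian approximation; a smaller saliency marks a less
    important weight."""
    return 0.5 * hessian_diag * weights ** 2

def pruning_order(weights, hessian_diag):
    """Indices of the weights sorted from least to most important."""
    return np.argsort(obd_saliencies(weights, hessian_diag))
```

For example, with a unit diagonal Hessian, `pruning_order` ranks the weights purely by magnitude, so the smallest weight is pruned first.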
Williams (1992) proposed a global EKF algorithm to estimate the weights and the hidden state of RNNs. In this approach, the goal is to maximize the a posteriori probability, rather than to minimize the training error. In Sum, Leung, Chan, Kan, and Young (1999), we used the error covariance matrix of Williams's algorithm to estimate the a posteriori probability of each weight, from which a pruning order of the weights can be obtained. However, during pruning, we still need the training set (or test set) to determine the change in the training error (or test error). In this method, the stopping criterion of pruning is based on the actual training error (or test error) and its actual change, because the error covariance matrix of the EKF algorithm is related to the a posteriori probability, rather than to the training error (or test error).
Wan and Nelson, 1997a, Wan and Nelson, 1997b used the dual EKF (DEKF) approach for processing noisy time series. However, they did not consider external inputs, recurrent connections, or pruning. In their approach, the error covariance matrix of the weight estimation EKF algorithm is related to a posteriori probability maximization rather than to the training error.
In this paper, we consider the operation, training and pruning of a stochastic RNN model with external input. During operation, an EKF algorithm estimates the hidden state of the RNN model; hence, the performance of this approach is better than that of the classical Elman model. During training, we use the DEKF approach to train and operate an RNN, wherein an EKF algorithm estimates the state of the hidden nodes and an RLS algorithm estimates the weights. Since the objective of the RLS algorithm is to minimize the training error, we can use its error covariance matrix to estimate the importance of each weight and to prune unimportant weights. The advantage of the proposed pruning approach is that the training set is not needed to prune a trained network.
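The two coupled updates can be sketched as follows. This is an illustrative toy, not the paper's exact formulation: a scalar hidden state with hypothetical weights a, b, c for the model x' = tanh(a·x + b·u), y = c·x'. The state EKF and weight RLS steps run concurrently over the data stream, as described above.

```python
import numpy as np

def ekf_state_step(x, P, u, y, theta, Q=0.01, R=0.1):
    """One EKF step for the hidden state of the toy RNN
    x' = tanh(a*x + b*u), y = c*x', with weights theta held fixed."""
    a, b, c = theta
    x_pred = np.tanh(a * x + b * u)           # predicted state
    F = a * (1.0 - x_pred ** 2)               # state Jacobian
    P_pred = F * P * F + Q
    S = c * P_pred * c + R                    # innovation variance
    K = P_pred * c / S                        # Kalman gain
    x_new = x_pred + K * (y - c * x_pred)     # corrected state
    P_new = (1.0 - K * c) * P_pred
    return x_new, P_new

def rls_weight_step(theta, Pw, phi, y):
    """One RLS step: update the weights to minimize the accumulated
    squared training error, given regressor phi and target y."""
    k = Pw @ phi / (1.0 + phi @ Pw @ phi)     # gain vector
    theta = theta + k * (y - phi @ theta)     # weight update
    Pw = Pw - np.outer(k, phi @ Pw)           # error covariance update
    return theta, Pw
```

During training the two updates alternate over the data; because the RLS objective is the training error, its error covariance `Pw` is what is later reused for pruning.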
In the rest of the paper, the stochastic RNN model is discussed in Section 2. Section 3 presents our DEKF approach for stochastic RNNs. The pruning method is discussed in Section 4. Section 5 discusses the complexities of the proposed approach. Five simulation examples are then given in Section 6 to illustrate the effectiveness of our joint-learning–pruning scheme. Section 7 closes with concluding remarks.
Section snippets
EKF for nonlinear systems
Without loss of generality, we consider a stochastic nonlinear system given by
x(t+1) = κ(x(t), u(t)) + ω(t),
y(t) = ℏ(x(t)) + ν(t),
where κ(·,·) is a function of the L_h-dimensional hidden state x(t) and the L_i-dimensional input u(t), ℏ(·) is a function of the hidden state, and y(t) is the L_o-dimensional observation. If we define κ_t(·) = κ(·, u(t)), then κ_t becomes a time-varying function. In the above system, ℏ(x(t)) is the actual system output; ν(t) is the measurement noise; and ω(t) is the process noise, which …
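For reference, the standard EKF recursion for this class of systems is sketched below in our notation, consistent with the definitions above; F_t and H_{t+1} denote the Jacobians of κ and ℏ evaluated at the current estimates, and Q and R the covariances of the process and measurement noise.

```latex
\begin{aligned}
\text{Prediction:}\quad
\hat{x}_{t+1|t} &= \kappa(\hat{x}_{t|t}, u_t), &
P_{t+1|t} &= F_t P_{t|t} F_t^{\top} + Q,\\
\text{Correction:}\quad
K_{t+1} &= P_{t+1|t} H_{t+1}^{\top}
  \bigl(H_{t+1} P_{t+1|t} H_{t+1}^{\top} + R\bigr)^{-1}, &&\\
\hat{x}_{t+1|t+1} &= \hat{x}_{t+1|t}
  + K_{t+1}\bigl(y_{t+1} - \hbar(\hat{x}_{t+1|t})\bigr), &
P_{t+1|t+1} &= (I - K_{t+1} H_{t+1})\, P_{t+1|t}.
\end{aligned}
```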
DEKF based training for RNNs
In this section, we first review Williams's global EKF approach (Williams, 1992) for RNNs and Wan's DEKF approach (Wan & Nelson, 1997a) for FNNs. Afterwards, we point out the weakness of these two approaches for pruning RNNs. To overcome this weakness, a new DEKF approach for training RNNs is introduced.
Pruning RNN
This section discusses the connection between pruning and the weight estimation RLS algorithm, that is, the relationship between the error covariance matrix and the Hessian matrix of the training error. The pruning method for a trained RNN is also presented.
From Eqs. … and …, the Hessian matrix of the energy function is … . Define the training error as … . Since the difference between … and … is …
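Because the inverse of the RLS error covariance matrix accumulates the regressor outer products, it approximates the Hessian of the training error, so a weight's importance can be read off without revisiting the training set. A minimal sketch of this idea follows (our function names; an OBS-style saliency under this approximation, not necessarily the paper's exact formula):

```python
import numpy as np

def rls_saliencies(theta, P):
    """Estimated increase in training error if weight q is deleted,
    using only the RLS error covariance P (no training data needed):
    s_q = theta_q^2 / (2 * P[q, q]), since P^{-1} approximates the
    Hessian of the training error."""
    return theta ** 2 / (2.0 * np.diag(P))

def prune_least_important(theta, P):
    """Zero out the weight with the smallest saliency (a hypothetical
    single prune step; repeated pruning would re-check the estimated
    training error before removing the next weight)."""
    q = int(np.argmin(rls_saliencies(theta, P)))
    theta = theta.copy()
    theta[q] = 0.0
    return theta, q
```

The key practical point matches the text: both functions consume only the weight vector and the covariance matrix maintained by RLS, which is what makes online pruning possible.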
Training phase
For standard EKF or RLS algorithms, it is well known that both the computational and space complexities are equal to O(K²), where K is the dimension of the state vector.
For the state estimation EKF algorithm of the DEKF approach, the dimension of the state vector equals the number of hidden nodes, i.e. K = L_h. Therefore, both the computational and space complexities of the state estimation EKF algorithm are O(L_h²). For the weight estimation RLS algorithm of the DEKF approach, the …
Simulations
In this section, we demonstrate the effectiveness of the proposed joint-learning–pruning scheme through five examples. The first two are system identification problems (Tsoi & Tan, 1997); their purpose is to demonstrate that our approach (without the training set and test set) can produce a good pruning order and a good estimate of the training error of a pruned RNN. The last three are time series prediction problems (Weigend et al., 1991, Mackey …
Conclusion
In this paper, we have introduced a joint-learning–pruning scheme for online learning and pruning of RNNs. The DEKF approach consists of two algorithms, namely, the state estimation EKF algorithm and the weight estimation RLS algorithm. During training, they run concurrently over the data until convergence. The EKF algorithm uses the last estimate of the weight vector to estimate the hidden state and to predict the system output. The RLS algorithm uses the current estimated hidden state …
Acknowledgements
The work described in this paper was supported by the Strategic Grant, City University of Hong Kong, Hong Kong (Project No. 7001218).
References (27)
- Leung et al. (2001). A pruning method for recursive least square algorithm. Neural Networks.
- Shah et al. (1992). Optimal filtering algorithm for fast learning in feedforward neural networks. Neural Networks.
- et al. (1997). Recurrent neural networks: a constructive algorithm, and its properties. Neurocomputing.
- et al. (1979). Optimal filtering.
- Haykin (1991). Adaptive filter theory.
- et al. (1992). A real-time learning algorithm for a multilayered neural network based on the extended Kalman filter. IEEE Transactions on Signal Processing.
- Larsen, J. (1993). Design of neural network filters. PhD Thesis, Electronics Institute, Technical University of...
- Le Cun et al. (1989). Optimal brain damage.
- Leung et al. (1996). On-line training and pruning for RLS algorithms. Electronics Letters.
- Mackey and Glass (1977). Oscillation and chaos in physiological control systems. Science.
- Pearlmutter (1994). Fast exact multiplication by the Hessian. Neural Computation.
- Puskorius and Feldkamp (1994). Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks. IEEE Transactions on Neural Networks.