Learning long-term dependencies in segmented-memory recurrent neural networks with backpropagation of error
Introduction
Conventional recurrent neural networks (RNNs) have difficulties modelling so-called long-term dependencies, i.e., learning a relationship between inputs that are separated by several time steps. Since the mid-1990s, considerable research effort has been devoted to this problem, e.g., [1], [2], [3], [4], [5], [6]. Usually, recurrent networks are trained with a gradient-based learning algorithm such as backpropagation through time (BPTT) [7] or real-time recurrent learning (RTRL) [8]. Bengio et al. [1] and Hochreiter [9] found that the error gradient vanishes as it is propagated back through time and back through the network, respectively. There are basically two ways to circumvent this vanishing gradient problem. One possibility is to use learning algorithms that do not rely on gradient information, e.g., simulated annealing [1], cellular genetic algorithms [10], and the expectation-maximisation algorithm [11]. Alternatively, a variety of network architectures have been suggested to tackle the vanishing gradient problem, e.g., second-order recurrent neural networks [12], the non-linear autoregressive model with exogenous inputs (NARX) recurrent neural network [3], [13], hierarchical recurrent neural networks [2], the long short-term memory (LSTM) network [14], the anticipation model [15], echo state networks [16], [17], the latched recurrent neural network [5], recurrent multiscale networks [18], [19], the modified distributed adaptive control (DAC) architecture [20], and the segmented-memory recurrent neural network (SMRNN) [6], [21].
Encouraging results with SMRNNs have been reported on emotion recognition from speech [22] and protein secondary structure (PSS) prediction [6], [21]. In [6] it was shown that the SMRNN performs competitively with LSTM on an artificial benchmark problem (the two-sequence problem). Further, bidirectional SMRNNs outperform bidirectional LSTM networks on PSS prediction. SMRNN training is essentially gradient descent; hence, it does not eliminate vanishing gradients, but attenuates the problem. A comprehensive discussion of the effect of the segmented memory is given in [6].
This paper addresses the gradient-based training of SMRNNs. Basically, the architecture fractionates long sequences into segments, which, connected in series, form the final sequence. A similar procedure can be observed in human memorisation of long sequences, e.g., of phone numbers.
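The chunking procedure described above can be illustrated with a minimal sketch; the function name `segment` and the segment length `d` are illustrative choices, not notation from the paper.

```python
# Hypothetical illustration: fractionating a long sequence into fixed-length
# segments, as in human memorisation of phone numbers. The segment length d
# is a free parameter of the architecture.
def segment(sequence, d):
    """Split a sequence into consecutive segments of length at most d."""
    return [sequence[i:i + d] for i in range(0, len(sequence), d)]

digits = "01715553962"
print(segment(digits, 4))  # ['0171', '5553', '962']
```

Connected in series, the segments reproduce the original sequence, which is exactly the property the SMRNN exploits.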
So far, SMRNNs have been trained with an extended real-time recurrent learning (eRTRL) algorithm [6]. The underlying RTRL algorithm has an average time complexity of order O(n⁴), with n denoting the number of network units in a fully connected network [23]. Because of this complexity, the algorithm is often impractical in applications that require fairly large networks, as the time-consuming training prohibits a complete search over parameters such as the number of hidden units, the learning rate, and so forth.
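To get a feeling for why the O(n⁴) cost of RTRL matters in practice, a rough back-of-the-envelope comparison with the O(n²) cost of BPTT (constants ignored) can be sketched as follows:

```python
# Asymptotic per-step cost comparison of RTRL (O(n^4)) and BPTT (O(n^2))
# for a fully connected network with n units. Constant factors are ignored;
# only the growth of the ratio with the network size matters here.
for n in (10, 50, 100):
    rtrl, bptt = n ** 4, n ** 2
    print(f"n={n:4d}  RTRL ~ {rtrl:>10d}  BPTT ~ {bptt:>6d}  ratio = {rtrl // bptt}")
```

The ratio grows as n², so doubling the number of hidden units makes RTRL four times more expensive relative to BPTT.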
In this paper we adapt BPTT for SMRNNs, calling the result extended backpropagation through time (eBPTT). Compared to RTRL, the underlying BPTT algorithm has a much smaller time complexity of O(n²) [23]. We compared both algorithms on a benchmark problem designed to test the network's ability to store information for a certain period of time. Compared to eRTRL, we found eBPTT less capable of learning to latch information over longer periods of time. However, networks that were trained successfully with eBPTT generalised better than those trained with eRTRL, i.e., achieved higher accuracy on the test set. In a second step, we show that an unsupervised, layer-local pre-training significantly improves eBPTT's ability to learn long-term dependencies while preserving its good generalisation performance. This is accompanied by experimental results and a discussion of the effect of the pre-training and of different weight initialisation techniques.
The remainder of the paper is organised as follows. Section 2 introduces the SMRNN architecture together with the eBPTT training algorithm. Further, the layer-local pre-training procedure is described and the information latching benchmark problem is introduced. Following this, Section 3 provides experimental results on the benchmark problem for eRTRL and eBPTT training. Further, randomly initialised and pre-trained networks with subsequent supervised eBPTT training are tested. Additionally, we investigate the effect of the pre-training and alternative weight initialisation procedures. Finally, the results are discussed in Section 4 and some concluding remarks on future work are given in Section 5.
Section snippets
Segmented-memory recurrent neural network
The basic limitation of gradient descent learning for weight optimisation in recurrent networks led to the development of alternative network architectures. One particular approach is the segmented-memory recurrent neural network (SMRNN) architecture proposed in [21]. From a cognitive science perspective, the idea is appealing because it is inspired by the memorisation process for long sequences, as observed in humans. Usually people fractionate sequences into segments to
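The two-level recurrence of the SMRNN can be sketched as follows; this is a minimal illustration under stated assumptions, with all weight names (`W_xu`, `W_xx`, `W_yx`, `W_yy`) and the reset of the symbol-level state at segment boundaries chosen for the sketch, not taken from the paper's notation.

```python
import numpy as np

# Minimal sketch of a segmented-memory forward pass: a symbol-level hidden
# state x is updated at every time step, while a segment-level state y is
# updated only at segment boundaries (every d steps), so y accumulates one
# summary per segment instead of one per symbol.
def smrnn_forward(inputs, d, W_xu, W_xx, W_yx, W_yy):
    n_x, n_y = W_xx.shape[0], W_yy.shape[0]
    x = np.zeros(n_x)  # symbol-level (intra-segment) state
    y = np.zeros(n_y)  # segment-level (inter-segment) state
    for t, u in enumerate(inputs, start=1):
        x = np.tanh(W_xu @ u + W_xx @ x)      # update at every symbol
        if t % d == 0:                        # segment boundary reached
            y = np.tanh(W_yx @ x + W_yy @ y)  # update once per segment
            x = np.zeros(n_x)                 # assumption: restart per segment
    return y
```

Because y is updated only once per segment, the gradient path between distant segments is shortened by a factor of roughly the segment length d, which is the intuition behind the attenuated vanishing gradient.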
eBPTT Versus eRTRL on the information latching problem
From the literature we know that BPTT is far cheaper than RTRL in terms of computational cost for network training, i.e., O(n²) compared to O(n⁴), with n denoting the number of network units in a fully connected network [23]. As these algorithms underlie their extended equivalents for SMRNN training, it is reasonable to expect that eBPTT requires far less computation than eRTRL. In the following, we compare their ability to deal with the learning of long-term
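An information latching task in the spirit of Bengio et al. [1] can be generated as sketched below; the two fixed key prefixes and all parameter names (`T`, `L`, `alphabet`) are illustrative choices, not the exact setup used in the paper. The class label depends only on the first L symbols, which the network must latch across the remaining T − L irrelevant symbols.

```python
import random

# Hypothetical generator for an information latching task: the label is
# determined solely by the first L symbols of the sequence; everything
# after them is random noise the network must ignore while keeping the
# latched class information in memory.
def latching_example(T, L, alphabet="abcd", seed=None):
    rng = random.Random(seed)
    label = rng.randint(0, 1)
    # One fixed key prefix per class (illustrative choice).
    prefix = (alphabet[0] * L) if label == 0 else (alphabet[1] * L)
    noise = "".join(rng.choice(alphabet) for _ in range(T - L))
    return prefix + noise, label

seq, y = latching_example(T=30, L=3, seed=42)
print(len(seq), y)
```

Increasing T while keeping L fixed lengthens the gap the gradient must bridge, which is what makes the task a probe for long-term dependencies.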
Discussion
The SMRNN architecture implements the concept of a segmented memory to attenuate the vanishing gradient problem. Together with this architecture, eRTRL was proposed for the network training. Unfortunately, its computational complexity is so high that it is impractical for the training of large SMRNNs (cf. Fig. 5). As an alternative, eBPTT was introduced (Section 2.2). It does not suffer from the high computational costs of eRTRL, but the comparison on the information latching problem
Future work
For the future, pre-trained SMRNNs with a supervised eBPTT training should be applied to real world problems like handwriting recognition or speech processing. The UCI Machine Learning Repository [41] provides a variety of sequential and time series data for classification and regression tasks. It is a good starting point to study the assets and drawbacks of SMRNNs, because the classification/regression results for some of the datasets are already published. Therefore, those datasets may
References (41)
- et al., How embedded memory in recurrent neural network architectures helps learning long-term temporal dependencies, Neural Netw. (1998)
- et al., A real-world rational agent: unifying old and new AI, Cogn. Sci. (2003)
- et al., Capturing long-term dependencies for protein secondary structure prediction
- Finding structure in time, Cogn. Sci. (1990)
- A scaled conjugate gradient algorithm for fast supervised learning, Neural Netw. (1993)
- et al., Improving model selection by nonconvergent methods, Neural Netw. (1993)
- et al., Learning long-term dependencies with gradient descent is difficult, IEEE Trans. Neural Netw. (1994)
- et al., Hierarchical recurrent neural networks for long-term dependencies
- et al., Learning long-term dependencies in NARX recurrent neural networks, IEEE Trans. Neural Netw. (1996)
- et al., Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, in: A Field Guide to Dynamical Recurrent Networks, Wiley-IEEE Press (2001)
- Latched recurrent neural network, Elektroteh. Vestn./Electrotech. Rev.
- Segmented-memory recurrent neural networks, IEEE Trans. Neural Netw.
- Backpropagation through time: what it does and how to do it, Proc. IEEE
- A learning algorithm for continually running fully recurrent neural networks, Neural Comput.
- Adding learning to cellular genetic algorithms for training recurrent neural networks, IEEE Trans. Neural Netw.
- Fast training of recurrent networks based on the EM algorithm, IEEE Trans. Neural Netw.
- Long short-term memory, Neural Comput.
- Anticipation model for sequential learning of complex sequences
Stefan Glüge received the diploma degree in Information Technology, in 2008, from the Otto-von-Guericke University Magdeburg, Germany. Since 2009, he has been a Ph.D. student in the Cognitive Systems group at the Otto-von-Guericke University's Institute for Information Technology and Communications. Further, since 2013, he has been an assistant researcher at the Institute of Applied Simulation at the Zurich University of Applied Sciences, Switzerland. He has published several peer-reviewed papers and articles in the fields of recurrent neural networks and automatic speech processing.
Ronald Böck received the diploma degree in Computer Science, in 2007, from the Ilmenau University of Technology, Germany. Since 2007, he has been a research assistant in the Cognitive Systems group, Institute for Information Technology and Communications, Otto-von-Guericke-University Magdeburg, Germany. In 2012 he was a guest scientist at the School of Linguistic, Speech and Communication Sciences, Trinity College Dublin, Ireland. He has organised several international workshops in the field of emotion modelling in human–computer interaction. He is a (guest) editor of diverse book series and has published several peer-reviewed papers and articles in the fields of speech processing, human–computer interaction, and neural networks.
Günther Palm studied Mathematics at the Universities of Hamburg and Tübingen. After his graduation in mathematics he worked at the Max-Planck-Institute for Biological Cybernetics in Tübingen on the topics of nonlinear systems, associative memory and brain theory. In 1983/84 he was a fellow at the Wissenschaftskolleg in Berlin. From 1988 to 1991 he was a professor for theoretical brain research at the University of Düsseldorf. Since then he has been a professor for computer science and director of the Institute of Neural Information Processing at the University of Ulm. He works on information theory, neural networks, associative memory and Hebbian cell assemblies.
Andreas Wendemuth received the Master of Science degree (1988) from the University of Miami, Florida, and the diploma degrees in physics (1991) and electrical engineering (1994) from the Universities of Giessen, Germany, and Hagen, Germany. He received the Doctor of Philosophy degree (1994) from the University of Oxford, United Kingdom, for his work on "Optimisation in Neural Networks." After a postdoctoral stay in Oxford (1994–1995), he worked from 1995 to 2000 at the Philips Research Labs in Aachen, Germany, on algorithms and data structures in automatic speech recognition and man–machine interfaces. Since 2001, he has been a Professor for Cognitive Systems at the Otto-von-Guericke-University Magdeburg, Germany, at the Institute for Information Technology and Communications. Since 2009, he has been a co-speaker of the Transregional Collaborative Research Centre SFB/TRR 62 "Companion-Technology for Cognitive Technical Systems" funded by the German Research Foundation (DFG). He has published three books on signal and speech processing, as well as numerous peer-reviewed papers and articles in these fields.