Multiple Time Scales Recurrent Neural Network for Complex Action Acquisition

Martin Peniak∗, Davide Marocco∗, Jun Tani†, Yuichi Yamashita†, Kerstin Fischer‡ and Angelo Cangelosi∗
∗The University of Plymouth, Drake Circus, Plymouth, PL4 8AA, United Kingdom. Tel: +44-1752-586217, Fax: +44-1752-586300
†Brain Science Institute, RIKEN, 2-1 Hirosawa, Wako-shi, Saitama, 351-0198, Japan. Tel: +81-48-467-1111 (ext. 7415), Fax: +81-48-467-7248
‡University of Southern Denmark, DK-6400 Sønderborg, Denmark. Tel: +45-6550-1220, Fax: +45 655


I. INTRODUCTION
Humans are able to acquire many skilled behaviours during their lifetimes. Learning complex behaviours is achieved through constant repetition of the same movements, during which certain components are segmented into reusable elements known as motor primitives. These motor primitives are then flexibly reused and dynamically integrated into novel sequences of actions. Arbib proposed a schema theory that provides the theoretical foundations underlying this process [1]. The schema theory has been adopted in many studies, for example in [2]-[4].
For example, the action of lifting an object can be broken down into a combination of multiple motor primitives: some would be responsible for reaching towards the object, some for grasping it and some for lifting it. These primitives are represented in a general manner and should therefore be applicable to objects with different properties. This capacity is known as generalisation, which also refers to the ability to acquire motor tasks in different ways. This means that new motor tasks can be learned using any body effector, or simply by imagining the task itself (see for example [5]). In addition, one might want to reach for the object and throw it away instead of lifting it up; these motor primitives therefore need to be flexible in terms of their order within a particular action sequence. The number of combinations of motor primitives grows exponentially with their number, and the ability to exploit this repertoire of possible combinations of multiple motor primitives is known as compositionality. The hierarchically organised human motor control system is known to implement motor primitives as low as the spinal cord level, whereas high-level planning and execution of motor actions takes place in the primary motor cortex (area M1). The human brain implements this hierarchy by exploiting muscle synergies and parallel controllers, which have various degrees of complexity and sophistication and are able to address both the global aspects of motor tasks and the fine control necessary for tool use [6].
The flexibility of the motor control system allows humans to execute behavioural actions, dynamically set the end point and degrees of freedom used for the next task, and quickly adapt to various disturbances. Fogassi et al. argue that the flexibility of choosing different effectors is crucial to adaptability and related to the existence of peripersonal space [7]. Pioneering experiments on adaptation to rotating artificial gravity environments led to the general belief that humans would not be able to adapt to rotating environments with angular velocities above around 3 to 4 rpm (see [8]). An important study conducted by Lackner and DiZio showed that this sensorimotor adaptation is possible even at angular velocities reaching 10 rpm [9]. The experimental results showed that this can be achieved by making the same movement repeatedly, which allows the neural system to estimate and compensate for the Coriolis forces generated by a rotating reference frame. These studies clearly demonstrate the robustness and flexibility of the human motor control system, which is capable of exploiting motor primitives in order to reach higher-level goals.
The existence of motor primitives and their recombination into sequences of actions is supported by biological observations of humans and animals. Sakai et al. conducted experiments on visuomotor sequential learning and demonstrated that their subjects spontaneously segmented motor sequences into elementary movements [10]. Thoroughman and Shadmehr showed that the complex dynamics of reaching movements is achieved by flexibly combining motor primitives [11]. d'Avella et al. analysed electromyographic activity recorded from 19 shoulder and arm muscles and concluded that "the complex spatiotemporal characteristics of the muscle patterns for reaching were captured by the combinations of a small number of components, suggesting that the mechanisms involved in the generation of the muscle patterns exploit this low dimensionality to simplify control" ([12], p. 7791). Experiments conducted on animals are consistent with these findings. For example, it has been shown that electrical stimulation of the primary motor and premotor cortex in monkeys triggers coordinated movements such as reaching and grasping [13]. Similarly, it has been found that a frog's leg contains a finite number of modules organised as linearly combinable muscle synergies [14].
Several action learning models, for example MOSAIC [15] or the mixture of multiple RNN experts [16], implemented functional hierarchies via an explicit hierarchical structure in which the motor primitives are represented by local low-level modules, whereas higher-level modules are in charge of recombining these primitives using extra mechanisms such as gate selection systems. These systems, based on predefined hierarchical structures, were appealing because of their potential benefits: for example, learning in one module does not interfere with learning in other modules, and it would seem that adding extra low-level modules should increase the number of acquirable motor primitives. However, it has been demonstrated that similarities between various sensorimotor sequences result in competition between the modules that represent them. This leads to a conflict between generalisation and segmentation, since generalisation requires that motor primitives be represented by many similar patterns within the same module, whereas different primitives need to be represented in different modules to achieve a good segmentation of sensorimotor patterns. Because of the conflict that arises when there is overlap between different sensorimotor sequences, it is not possible to increase the number of motor primitives by simply adding extra low-level modules [17]. The learning of motor primitives (low-level modules) and of sequences of these primitives (high-level modules) needs to be explicitly separated through subgoals [16], [18].
Inspired by recent biological observations of the brain, Yamashita and Tani developed a new model known as the multiple timescales recurrent neural network (MTRNN). The MTRNN attempts to overcome the generalisation-segmentation problem through the realisation of a functional hierarchy that is based neither on separate modules nor on a structural hierarchy, but rather on multiple time scales of neural activities, which seem to be responsible for the process of motor skill acquisition and adaptation, as well as for perceptual auditory differences between the formant-transition and syllable levels [19]-[23].
This paper presents preliminary results on complex action learning based on an MTRNN model embodied in the iCub humanoid robot (see Section II-A). The model was implemented as part of the Aquila cognitive robotics toolkit [24] and accelerated through the CUDA architecture, making use of massively parallel GPU devices that significantly outperform standard CPU processors on parallel tasks.

II. METHOD
The following experiment was designed to test the capability of the MTRNN system to learn multiple sensorimotor sequences in an object manipulation scenario. There are three semantically different classes of actions (see Table II, Section III) that are expected to exhibit similar sensorimotor patterns (e.g. push or pull the block). The choice of these semantically similar behaviours was influenced by our next planned experiments and will facilitate the investigation of the verb island hypothesis (Section IV).

A. iCub Humanoid Robot Platform
The iCub (www.icub.org) is a small humanoid robot, approximately 105 cm tall and weighing around 20.3 kg, whose design was inspired by the embodied cognition hypothesis. This unique robotic platform with 53 degrees of freedom (12 for the legs, 3 for the torso, 32 for the arms and 6 for the head) was designed by the RobotCub Consortium [25], which involves several European universities, and is now widely used by the iTalk project and a few others. The iCub platform strictly follows an open-source philosophy: its hardware design, software and documentation are released under the General Public License (GPL). The RobotCub name is partially an acronym, where Cub stands for Cognitive Universal Body; the initial funding for this project was €8.5 million from Unit E5 (Cognitive Systems and Robotics) of the European Commission's Seventh Framework Programme. Tikhanoff et al. have developed an open-source simulated model of the iCub platform [26]. This simulator has been widely adopted as a functional tool within the developmental robotics community, as it allows researchers to develop, test and evaluate their models and theories without requiring access to a physical robot.
The iCub was used in the current study, where the MTRNN system controlled four joints of each arm. Each of these joints has a different range of movement, constrained by the actual design of the iCub's body and partly by the software for safety reasons. The sensorimotor states of the iCub were sampled every 100 ms and were used for training the self-organising map. These sequences were further down-sampled to 500 ms for the initial training, to simplify the learning process of the backpropagation through time (BPTT) algorithm and to examine the precision of the learned sensorimotor patterns.
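As a minimal illustration of this preprocessing step (assuming the raw recordings are stored as per-time-step joint-angle vectors; the function and variable names are ours, not Aquila's), the down-sampling from the 100 ms recording rate to the 500 ms training rate can be sketched as follows:

```python
import numpy as np

def downsample(sequence_100ms, factor=5):
    """Keep every `factor`-th sample of a (T, n_joints) joint-angle sequence.

    Recordings sampled every 100 ms with factor=5 give the 500 ms rate used
    for the initial BPTT training.
    """
    return sequence_100ms[::factor]

# Example: a hypothetical 10 s recording of the 8 controlled joints (4 per arm)
raw = np.random.uniform(-90.0, 90.0, size=(100, 8))   # 100 steps at 100 ms
train_seq = downsample(raw)                            # 20 steps at 500 ms
```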

B. Self Organising Maps for Input Sparse Encoding
The MTRNN system used self-organising maps as a means of preserving the topological relations in the multidimensional input space, in order to reduce the possible overlap between various sensorimotor sequences and to aid the learning process (see Figures 1 and 3).
The self-organising map was trained prior to the MTRNN's BPTT training, using a slight variation of the standard unsupervised learning algorithm [27]. The data set consisted of all the sequences used for the MTRNN training as well as additional sequences, which included variations, in order to achieve a smoother representation of the input space and to minimise the data loss incurred during the vector transformation. (1) describes these vectors, where l(i) defines their dimensionality.
The transformation of a vector into self-organising map (SOM) activity is given by (2), where v_sample is the input vector of dimension l(i), σ defines the shape of the distribution p_i,t, and N represents the overall size of the self-organising map.
The neural activations on the output layer are assumed to correspond to an activation probability distribution over the self-organising map, whose inverse transformation generates a multidimensional vector that directly sets the target joint angles of the iCub. (3) describes this transformation, where v_i represents the target position of the i-th joint, y_j,t is the MTRNN's j-th output activity, and s_ij is the i-th component of the reference vector corresponding to SOM node j.
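As a rough sketch of how this encode/decode step might look in practice (our own simplified reconstruction following the topology-preserving-map scheme, not the Aquila implementation; the Gaussian kernel, the normalisation and all names are assumptions), the transformation and its inverse could be implemented along these lines:

```python
import numpy as np

def som_encode(v, reference_vectors, sigma):
    """Transform a joint-angle vector v into a probability distribution over
    the N nodes of a trained SOM (sparse encoding of the proprioceptive input)."""
    d2 = np.sum((reference_vectors - v) ** 2, axis=1)   # squared distance to each node
    p = np.exp(-d2 / sigma)                             # Gaussian kernel of width sigma
    return p / p.sum()                                  # normalise to a distribution

def som_decode(y, reference_vectors):
    """Inverse transformation: map output activations y (one per SOM node)
    back to a joint-angle vector, i.e. the target positions v_i."""
    y = y / y.sum()
    return y @ reference_vectors                        # activation-weighted average of node vectors

# Usage with a hypothetical 256-node SOM over 4 joint angles
nodes = np.random.uniform(-90.0, 90.0, size=(256, 4))
p = som_encode(np.array([10.0, -25.0, 40.0, 5.0]), nodes, sigma=1.0)  # sigma for illustration only
v = som_decode(p, nodes)
```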

C. Online Control
The core of the MTRNN is a continuous time recurrent neural network, characterised by the ability to preserve its internal state and hence exhibit complex dynamics. The system receives sparsely encoded proprioceptive input from the robot (see Section II-B), which is used to predict the next sensorimotor state; the network therefore acts as a forward kinematics model (e.g. [28]).
The neural activities were calculated following the classical firing rate model, in which each neuron's activity is given by the average firing rate of the connected neurons. In addition, the MTRNN model implements a leaky integrator, so the state of every neuron is defined not only by the current synaptic inputs but also by its previous activations. The differential equation (4) describes the calculation of neural activities over time, where u_i,t is the membrane potential of the i-th neuron, x_j,t is the activity of the j-th neuron, w_ij is the synaptic weight from the j-th to the i-th neuron, and the parameter τ defines the decay rate of the i-th neuron.
The decay rate parameter τ modifies the extent to which the previous activities of the neuron affect its current state.
Fig. 1. The system receives proprioceptive information as a multidimensional vector m_t, which subsequently activates a self-organising map whose activity is associated with the network's input. The neural network then predicts the next sensorimotor state m_t+1 based on its current state and input. At this stage, the neural activations on the output layer are assumed to correspond to the activity of the self-organising map, whose inverse transformation generates a multidimensional vector that directly sets the target joint angles of the iCub.
Therefore, when neurons are set with large τ values their activities change more slowly over time compared to neurons with smaller τ values. In this experiment, the 256 input-output neurons were set to τ = 2, while the hidden neurons were divided into two categories with different time integration constants: the first comprised 60 fast neurons with τ = 5 and the second 20 slow neurons with τ = 70. These two categories attempt to capture the dynamics of complex behavioural patterns through the flexible recombination of motor primitives into novel sequences of actions. As described in Section I, multiple timescale systems have been suggested as the underlying mechanism that facilitates this behavioural compositionality.
The network is fully connected, so every neuron is connected to every other neuron, including itself. The one exception is that the slow neurons are not directly connected to the input-output layer, but only indirectly via the fast neurons.
The continuous time integration model of the MTRNN's neurons is defined by the differential equation (4), while the actual membrane potentials are calculated with its numerical approximation given in (8).
The activity of a neuron is calculated in one of two ways (see (6)), depending on whether the neuron belongs to the input-output layer (i ∈ Z) or to the hidden layer.
The input-output neuron activations are calculated using the Softmax function (the top part of (6)), while the hidden neurons use the conventional sigmoid function (see (7)).
The Softmax function was used to achieve an activation distribution that is consistent with that of the self-organising map.
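A compact sketch of this update (our own reconstruction of the dynamics described by (4)-(8), not the Aquila implementation; the Euler step, the single softmax over all input-output units and the array layout are simplifying assumptions) could look as follows:

```python
import numpy as np

# Layer sizes and time constants described in Section II-C
N_IO, N_FAST, N_SLOW = 256, 60, 20
N = N_IO + N_FAST + N_SLOW
tau = np.concatenate([np.full(N_IO, 2.0),      # input-output units, tau = 2
                      np.full(N_FAST, 5.0),    # fast context units, tau = 5
                      np.full(N_SLOW, 70.0)])  # slow context units, tau = 70

rng = np.random.default_rng(0)
W = rng.uniform(-0.025, 0.025, size=(N, N))    # initial synaptic weights (see Section II-D)
W[-N_SLOW:, :N_IO] = 0.0                       # slow units receive no direct input-output input
W[:N_IO, -N_SLOW:] = 0.0                       # and send none back directly

def step(u, x):
    """One leaky-integrator update: u holds membrane potentials, x activities."""
    u_new = (1.0 - 1.0 / tau) * u + (1.0 / tau) * (W @ x)    # Euler approximation of (4)
    x_new = np.empty(N)
    z = np.exp(u_new[:N_IO] - u_new[:N_IO].max())
    x_new[:N_IO] = z / z.sum()                                # Softmax over input-output units
    x_new[N_IO:] = 1.0 / (1.0 + np.exp(-u_new[N_IO:]))        # Sigmoid for hidden units
    return u_new, x_new

# Example: run a few steps from zero initial states
u, x = np.zeros(N), np.zeros(N)
for _ in range(10):
    u, x = step(u, x)
```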
As illustrated in Fig. 1, the proprioceptive information m_t is fed through the self-organising map into the network, which predicts the next sensorimotor state m_t+1; the inverse SOM transformation of the output activations generates the multidimensional vector that directly sets the target joint angles of the iCub. The iCub then updates the positions of its joints, which are again fed back through the SOM into the MTRNN system as x_i,t+1. Hidden neuron activations are simply copied as the recurrent states for the next time step, see (8). A sketch of one such cycle is given below.
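The sketch reuses `som_encode`, `som_decode` and `step` from the illustrations above; `read_joint_positions` and `set_joint_positions` are hypothetical stand-ins for the actual YARP/iCub motor interface and treat the proprioceptive input as a single SOM for simplicity:

```python
def control_step(u, x_hidden, read_joint_positions, set_joint_positions,
                 nodes, sigma):
    """One closed-loop cycle: encode proprioception, predict the next
    sensorimotor state, decode it into joint targets, and carry the hidden
    activations over as the recurrent state for the next time step."""
    m_t = read_joint_positions()                  # current joint angles from the iCub
    x_in = som_encode(m_t, nodes, sigma)          # sparse SOM encoding of the input
    u, x = step(u, np.concatenate([x_in, x_hidden]))
    m_next = som_decode(x[:N_IO], nodes)          # inverse SOM transform -> target angles
    set_joint_positions(m_next)                   # command the robot towards m_t+1
    return u, x[N_IO:]                            # hidden states carried over
```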

D. Back Propagation Through Time
The MTRNN needs to be trained with an algorithm that takes into account its complex dynamics unfolding through time; for this reason we used the BPTT algorithm, which has previously been demonstrated to be effective [2].
The learning process consists of finding suitable values for the synaptic connections by minimising the global error E, which measures the discrepancy between the training sequences and those generated by the MTRNN. The error E is calculated using the Kullback-Leibler divergence, as described in (9), where y*_i,t is the desired activation of the i-th output neuron at time t and y_i,t is its actual output.
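Based on this description, and on the standard Kullback-Leibler formulation used with Softmax outputs, (9) presumably takes a form along the following lines (O denotes the set of output neurons; this is our reconstruction, not a verbatim copy of the original equation):

```latex
E = \sum_{t} \sum_{i \in O} y^{*}_{i,t} \, \log\!\left(\frac{y^{*}_{i,t}}{y_{i,t}}\right)
```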
The synaptic weights are updated according to (10), approaching their optimal values by descending the gradient ∂E/∂w. The learning rate is given by the parameter α and n represents the learning iteration step.
The gradient ∂E/∂w is defined by (11), while a recurrence equation is used to recursively calculate ∂E/∂u_i,t.
The term f'(·) appearing in this recurrence is the derivative of the sigmoid function defined by (7). The δ_i,k is Kronecker's delta, which is 1 when i = k and 0 otherwise. The initial values of the synaptic connections were randomly generated between -0.025 and 0.025, and the initial states of the first five slow neurons were set to different values for different behavioural sequences to allow their learning, exploiting the initial-state sensitivity characteristics of continuous time recurrent neural networks [29].
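A minimal sketch of the resulting training loop follows (plain gradient descent as in (10); `gradient_fn` is a hypothetical placeholder for the BPTT gradient computation of (11) and is not part of the original implementation):

```python
def train_bptt(W, gradient_fn, sequences, alpha=0.015, iterations=1_000_000):
    """Plain gradient descent over the synaptic weights, cf. (10).

    gradient_fn(W, sequences) must return dE/dW computed by backpropagation
    through time (the recursion referred to in (11)); it is a placeholder here.
    """
    for n in range(iterations):
        W = W - alpha * gradient_fn(W, sequences)   # w(n+1) = w(n) - alpha * dE/dw
    return W
```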

III. EXPERIMENTS AND RESULTS
This section presents preliminary results of the initial testing of the MTRNN and BPTT systems on the iCub robot. The experimental task required the MTRNN system to learn 8 different behavioural patterns (see Table II). The Sequence Recorder module of Aquila was used to record these sensorimotor patterns while the experimenter guided the robot by holding its arms and performing the above-mentioned actions. Every behaviour was recorded three times with slight variations, involving 5 cm offsets with respect to the centre of the object, to achieve a smooth representation of the input space and to reduce the errors incurred during the SOM transformations. This generated thousands of sensorimotor samples, all of which were used to train the SOM; the subsequent MTRNN training only used the original sequence (without offsets) for each behaviour.
The self-organising map, consisting of 256 nodes, was trained (see Figure 3) using Aquila's SOM module on all the data collected during the tutoring session, sampled at 100 ms. In order to achieve a good precision of the SOM, it was necessary to run its training for 160,000 iterations using the initial learning rate η = 0.05.
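For reference, a standard Kohonen training loop with these parameters might be sketched as follows (the paper uses a slight variation of the standard algorithm [27]; the decay schedules, grid layout and initialisation below are illustrative assumptions, not the Aquila SOM module):

```python
import numpy as np

def train_som(data, n_nodes=256, iterations=160_000, eta0=0.05, grid_side=16):
    """Standard Kohonen learning: pull the best-matching node and its grid
    neighbours towards each randomly drawn training sample."""
    rng = np.random.default_rng(0)
    nodes = rng.uniform(data.min(0), data.max(0), size=(n_nodes, data.shape[1]))
    grid = np.array([(i // grid_side, i % grid_side) for i in range(n_nodes)])
    for n in range(iterations):
        eta = eta0 * (1.0 - n / iterations)                   # decaying learning rate
        radius = grid_side / 2.0 * (1.0 - n / iterations) + 1.0
        v = data[rng.integers(len(data))]                     # random training sample
        bmu = np.argmin(np.sum((nodes - v) ** 2, axis=1))     # best matching unit
        h = np.exp(-np.sum((grid - grid[bmu]) ** 2, axis=1) / (2 * radius ** 2))
        nodes += eta * h[:, None] * (v - nodes)               # neighbourhood update
    return nodes
```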
Five different trials were conducted, each initialised with a different random seed for generating the synaptic weights. The BPTT algorithm was set to run for one million iterations, with the learning rate set to 0.015 and the sigma parameter set to 0.0045. At the end of the training, the learned neural network was tested on the iCub in the same setup as during the tutoring part. The results from the first three trials showed that the MTRNN system was able to replicate all eight sequences while successfully manipulating the object. The last two trials were not equally successful: while the fourth trial produced an MTRNN capable of performing the first five behaviours, the last trial showed only hints of learning and was not able to replicate any action satisfactorily. This can also be seen from the error, which was significantly higher than in the other runs (see Table III). These preliminary experiments revealed interesting dynamics of the system, which would, for example, change if the iCub's interaction with the object had not been previously experienced. For instance, the behaviour of pushing the block involved a complex sensorimotor flow that is naturally constrained by the actual interaction with the object. This means that in many cases this interaction would be significantly different from the learned interaction and thus, in several cases, the dynamics was very different from the original one. Interestingly, when this was the case, the iCub would spend a little more time correcting its positions and only then push the block forward.
The results presented here demonstrate that the MTRNN system is able to learn at least eight different behavioural sequences. This is a significant step, since the number of learned behaviours already exceeds that of the Yamashita and Tani experiments [2], where the computational power required for the training and processing of the SOMs was saved by using small input sizes, which might consequently have limited the number of learnable sensorimotor patterns: "If the sizes of the TPMs (SOMs) are set to larger value, representations in the TPMs become smoother and data loss in the vector transformation decreases. For the current experiment, however, in order to reduce time spent on computation, sizes of the TPMs were selected such that they were the minimum value large enough to allow the TPMs to reproduce, in real time, sensori-motor sequences through the process of vector transformation of the teaching sequences and output sequences." This was not a limitation in our case, since both the SOM and the MTRNN are massively parallelised and processed on GPU devices, which allowed us to experiment with larger network sizes. In fact, it was found that 64 neurons used to represent the proprioceptive input space would not have been enough in our experimental scenario. There seem to be three primary reasons for this. The first is that the number of sequences was higher in our case, and therefore more nodes were needed to represent the input space smoothly. The second is the higher complexity of the learned sensorimotor sequences, which is particularly true for the pushing and pulling behaviours. Finally, the iCub's joint angle ranges are significantly wider than those of the Sony QRIO used in the Yamashita and Tani experiments [2].

IV. FUTURE WORK
We have shown that the MTRNN model is able to learn eight different behavioural sequences. The system is likely to be able to learn additional sequences; however, to date, we have not conducted experiments with additional sensorimotor patterns.
Three additional self-organising maps were linked to the MTRNN system and trained to represent simple linguistic inputs as well as objects' shapes and colours, obtained from images fed through a log-polar transform inspired by human visual processing. This extension will facilitate our investigation of action-language integration and grounding, as well as of the role of embodiment during the developmental stages.
In particular, our next experiments will address a specific linguistic hypothesis first proposed by the cognitive psychologist and linguist Michael Tomasello. The hypothesis, also known as the verb island theory, predicts that verbal argument structures are learned on a purely item-specific basis [30], [31]. In other words, children do not learn that verbs can be combined with certain types of nominals and clauses to form a transitive direct-object structure; rather, they develop this on an item-by-item basis, where the understanding of verbs is at first limited to the contexts in which these verbs appeared. Consequently, the general notion of a transitive construction and a direct object is an abstraction that emerges only during later developmental stages, when a critical mass of these verb islands has been attained and recognised as instances of the same general underlying structure through the process of semantic analogy [32]. During the later developmental stages children are exposed to more and more construction types in which different semantic roles are linked in a similar way, and as a result the syntactic categories involved are abstracted into subjects and objects [33].
The first planned experiment will investigate the role of semantic similarities between different words during early language acquisition. In particular, the hypothesis addressed by this experiment is whether generalisation to unheard sentences is easier in a condition where all learned events are of the same semantic type. Though conceptually simple, this experiment will constitute the first viable extension of the research already conducted within the iTalk Project. In addition, these problems are also discussed in child development research, and therefore this work could provide useful insights.
An extension of this experiment will investigate the effects of using different learning techniques, such as holistic, scaffolded and parallel learning. There are several other possibilities for further experiments on which we have yet to agree; however, the experiments outlined in this section present an important step towards expanding our current knowledge of action-language integration, as well as of the acquisition of more complex grammatical constructions.

Fig. 3. 3D visualisation of the trained self-organising map. The left image shows the visualisation of the left arm's input space and the right image shows that of the right arm. The input space visualisation of each arm was done via Aquila, where the second, third and fourth joints were assigned the x, y and z dimensions respectively.

TABLE III
Errors at the end of each trial