1 Introduction

In the design of social robots (Breazeal 2004; Dautenhahn 2007), besides building robots with a human-like external morphology, the ability to process, understand and generate language is one of the key factors supporting human-robot interaction. However, designing a robot’s abilities to understand, generate and generalise natural language is still an open challenge. In particular, natural language understanding plays an essential role in a social robotic system, as it maps the vocal commands of human users onto an internal representation in the robot’s own cognitive system. In this study, we apply a developmental robotics approach to the design of language and communication abilities in robots, following an incremental and interactive process of language learning inspired by language development in infants.

1.1 Language understanding for robot systems

Important recent developments in social robotics, such as robots performing human-like emotion expression (Zhong and Canamero 2014) and social attention for autonomous movement (Novianto 2014), have been accompanied by language understanding approaches focusing on the grounding of natural language into the agent’s sensorimotor experience and its situated interaction (Cangelosi 2010a; Steels and Hild 2012).

For instance, in Tellex et al. (2011) and Matuszek et al. (2013), syntactic parsing techniques are used to ground language into primitive motor actions (e.g., pickup, move, place), which can be inferred within graph models. Similarly, Misra et al. (2014) developed a system for mobile robots that is able to learn to ground language instructions from a corpus of natural language pairs including both verbs and spatial information. Yürüten et al. (2013) proposed that, in order to understand the object affordances that can be described by adjectives, the most crucial properties are the shape-related ones.

Besides the direct modelling methods for robot language learning, an alternative approach to building a learning model for language is based on developmental robotics (Weng 2001; Asada 2009; Cangelosi and Schlesinger 2015). Taking inspiration from developmental psychology and developmental neuroscience studies, this approach emphasises the role of the environment and of the interactions that occur during learning, over a progression of learning stages. In the context of language understanding, the core of developmental robotics approaches to language learning is to follow a developmental pathway similar to that of infants, who acquire grounded representations of natural language and form a symbol system through embodied interaction with the physical environment (Cangelosi 2010b). Furthermore, via language learning an agent should also be able to generalise by inferring untrained combinations of words within the lexical constructions acquired. One way to accomplish such generalisation is to exploit semantic compositionality.

Various developmental robotics models have been developed that incrementally model the successive stages of language acquisition in infants, from phoneme acquisition, to object and action names, to word combinations. For example, the cognitive model presented in Guenther (2006) outlines the cortical interactions in the syllable generation process which result in different developmental phenomena. This mimics the first stage of language development. The Elija model (Howard and Messum 2011) is a vocal apparatus which strictly follows detailed developmental stages. Working as an articulatory synthesizer, it first learns the production of sounds on its own. A caregiver is then involved so that it learns to produce speech, using speech sounds for object names, through reinforcement learning, where the reward is again given by the response of the caregiver. Likewise, a self-organizing map together with reinforcement learning was proposed in Warlaumont (2013), which demonstrated that reinforcement learning based on the similarity of vocalization can improve the post-learning production of the sounds of one’s language.

From the models mentioned above, we can see that most of the methods for modelling the first stages of phonetic production do not tend to use robotic platforms. On the other hand, for modelling the later stages of lexical development, once phonetic skills are assumed to be mastered, robotic systems are usually employed to establish the meta-knowledge about the association between vocal speech and the referents or the actions. Therefore, except for studies focusing on the mental imagination of actions, as in Golosio (2015), the mechanical morphology of a robot is particularly important when modelling the acquisition of words, especially those used to name motor actions. For instance, the model from Mangin and Oudeyer (2012) takes as input dance-like combinations of human movement primitives plus ambiguous labels associated with these movements. Concentrating on the second and third stages, i.e. associating lexicon, words and motor actions, the robot in Dominey et al. (2009) is able to acquire new motor behaviours in an on-line fashion by grounding the vocal commands on pre-defined control motor primitives. Similarly, Siskind (2001) proposed a model which uses visual primitives to encode notions of different actions to ground the semantics of events for verb learning. Using structured connectionist models (SCMs), Chang et al. (2005) built a layered connectionist model to connect embodied representations and simulative inference for verbs. In Cangelosi and Parisi (2004), the emergence of verb-noun separation is learned while the agents are interacting with and manipulating the objects. Meanwhile, the tasks performed during such interaction may be essential during learning too (Goodman and Frank 2016). Recent experiments (Rohlfing 2016; Andreas and Klein 2016) have also proposed that language learning should be placed in the context of task-directed behaviours.

In terms of the learning structure, Stramandinoli et al. (2012) developed a model of the grounding hierarchy of verbs with more complex meanings (such as “keep”, “reject”, “accept” and “give”), which relate to the internal states of the caregivers and which were used to build a robotic model for the grounding of increasingly abstract motor concepts and words. As follow-up studies of Dominey et al. (2009), Dominey (2013) and Hinaut and Dominey (2013) focused on the understanding of grammatical complexity. They used recurrent neural networks (RNNs) to learn grammatical structure based on temporal series learning in artificial neural networks.

Also using an RNN, Sugita and Tani (2005) reported experiments with a mobile robot implementing a two-level RNN architecture called Recurrent Neural Network with Parametric Bias Units (RNNPB). This allows the robot to map a linguistic command containing verbs and nouns into context-dependent behaviours corresponding to the verb and noun descriptions respectively. It was among the first robotic models of semantic compositionality based on sensorimotor combinatoriality. In a cognitive robot experiment, the recurrent network models the emergence of compositional meanings and lexicons with no a priori knowledge of any lexical or formal syntactic representation.

Compared to RNNPB, another kind of RNN architecture, the Multiple Timescale Recurrent Neural Network (MTRNN), is able to ground different scales of sensorimotor information into the hierarchical structure of sentences, such as the spelling of words (Ogata and Okuno 2013) and words and sentences (Hinoshita 2011). This kind of recurrent model provides a memory to store the spatial and temporal structure of the environment and the lexical structures. Given that RNNs can learn dependencies of arbitrary length in statistical structures and their context, the storage ability of RNNs outperforms that of most language learning models.

1.2 Embodied symbolic emergence in a hierarchical structure

In developmental psychology studies focusing specifically on the emergence of nouns and verbs, there is still an open debate about the learning stages and their relative temporal acquisition order. For the early stages of verb and noun learning, it is widely accepted that most common nouns are generally learned before verbs (Gentner 1982), by first connecting speech sounds (labels, nouns) to physical objects in view. However, some nouns which relate to context, such as “passenger”, are learnt at a relatively later stage, only after “an extensive range of situations” (contexts or life phases) have been encountered (Hall and Waxman 1993), during which verbs may play a crucial role. The embodied learning of verbs and nouns is not correlated with one single modality of sensory percepts: experiments in Kersten (1998) suggest that nouns are grounded in the intrinsic properties of an object, even under different movements and orientations, while verbs are accounted for by the movement path of an object. This distinction may be associated with the neuroanatomical distinction between the ventral and dorsal (what/where) visual streams, involved in the generation of nouns and verbs respectively. As Maguire et al. (2006) suggested, some nouns and verbs are more straightforward to learn because they can be accessed perceptually. On the other hand, some abstract words, either verbs or nouns, can only be learnt from a social and linguistic context.

For instance, while infants learn word-gesture combinations at the age of two, they associate the meaning of verbs with the meanings of higher-order nouns (Bates and Dick 2002). Such verbs with complex meanings are obtained from both motor action and visual percepts (Longobardi et al. 2015). As summarised in Cangelosi and Parisi (2001) and Cangelosi and Parisi (2004), compared to the static object perception associated with simple nouns, early verb learning involves temporal dynamics from motion perception. Indeed, we assert that the learning processes of nouns and verbs (especially those with complex meanings) are not separate; there is a close relation between verb and noun development, during which embodied sensorimotor information plays a crucial role.

During this embodied development, both the perceptual system and the motor system contribute to language comprehension (e.g. Pulvermüller 2002; Kaschak 2005; Pecher et al. 2003; Saygin 2010). This embodied development may help to explain how the compositional semantics of a sentence can be acquired by a language acquisition system without any a priori explicit representation of either word meanings or motor behaviours. In this way the system can infer semantic compositionality from sensorimotor combinatoriality. It also extends Piaget’s proposal that language learning is a symbolised understanding process for dynamic actions, which is “a situated process, function of the content, the context, the activity and the goal of the learner” (Holzer 1994).

Sensorimotor information is not the only mechanism acting as a learning tool for language acquisition. Conversely, recent research also proposes that language is a flexible and efficient system for symbolic manipulation which is more than a communication tool for our thoughts (e.g. Landy et al. 2014; Mirolli and Parisi 2009, 2011). As for the predictive effect of language on sensorimotor behaviours, vocal communication can be one of the sources that drive the visual attention to become predictive, by making inferences as to the source-inferences (Tomasello and Farrar 1986). In this process, language can trigger a predictive inference about the appearance of a visual percept, driving a predictive saccade (Eberhard 1995). Therefore, the sensorimotor system is affected by inferences from the auditory modality or even from higher-level cognitive processes.

We summarised this bidirectional relationship between language learning and the sensorimotor system in the hierarchical cognitive framework proposed in Zhong (2015), in which language understanding and grounding occur hierarchically during a dynamical process, from the neural processes on the (lower) receptor level to the higher-level understanding which happens in the (higher) prefrontal cortex. As reviewed by Tenenbaum (2011), the hierarchical framework can be formulated in detail in a probabilistic way, in which abstract knowledge also acts as a prior to guide our learning and reasoning. Probabilistic models have also been applied to acquiring abstract knowledge from robot-environment interaction (Konidaris et al. 2015), human-robot interaction (Iwahashi 2008) and multi-modal living environments (Attamimi 2016). Additionally, the hierarchical architecture can also be implemented as connectionist models. For example, hierarchical recurrent neural architectures can be found in Zhong et al. (2011), Zhong et al. (2012a) and Zhong et al. (2012b): because the learning modalities of visual perception and motor actions can be represented as both spatial and temporal sequences, the recurrent connections provide possibilities to intertwine these two modalities.

In this paper, because we are interested in the non-linear dynamics of the system and its contribution to the generalisation abilities, recurrent neural models are a suitable choice for modelling this process. Although similar RNNPB (Sugita and Tani 2005) or MTRNN (Heinrich et al. 2015) networks have been used to learn verb and noun features with motor actions and visual features, the model we use is a single MTRNN that learns both the sensory and motor information in a single set of sequences, because we regard perception and action as having inseparable links (e.g. Wolpert et al. 1995; Noë 2001) that should be encoded as similar data structures. Moreover, since training such a large MTRNN has become increasingly feasible in recent years thanks to the accessibility and affordability of GPU computing, we attempt to conceptualise a large data-set from robotic experiments into abstract representations on the higher level of this hierarchy, similar to the developmental processes of language conceptualisation and categorisation.

To summarise, compared with the connectionist models on semantic compositionality (Sugita and Tani 2005; Heinrich et al. 2015), the novelties of our model and experiments are:

  • Instead of using the neural binding methods on multiple RNNs, the hierarchical MTRNN provides another perspective to model the emergence of semantic compositionality over multi-modal data, which may be more parallel to the perception-action coupling of different levels of the nervous system (Sperry 1952): perception and action processes are functionally intertwined, which we represent in the recurrent connections from the low to the top layer in our hierarchical network.

  • Technically, in our model, the multi-modal data (language, visual and proprioceptive) was implemented into a single hierarchical network. This uniformity can be discovered in the higher-level heteromodal representation in the multisensory neurons with continuous feedback and feed-forward connectivities (Ghazanfar and Schroeder 2006; Macaluso and Driver 2005). That is similar to the recurrent neural architecture we use. Furthermore, a single RNN network that incorporates multi-modal signals would be beneficial to improve the generalisation ability.

  • Using a humanoid robot and a large-scale dataset, we can observe how the semantic dynamics emerge on different levels through a learning process similar to that of the human morphology. The later experiments will also show how the semantic structures of verbs are self-organised on the higher level of neurons, suggesting that a similar neural representation may exist in human brain activity.

Fig. 1 Architecture of multiple time-scale recurrent neural network

2 The multiple timescale recurrent neural network model

Briefly, our motivations for employing recurrent neural models, specifically the MTRNN, to model language learning from sensorimotor interaction are:

  • The hierarchical neuron distribution in a single MTRNN with multi-modal inputs is able to mimic the dynamical and bidirectional processes of the heteromodal neurons when a human is learning multi-sensory knowledge;

  • Furthermore, such a dynamical process in the RNNs is able to form the bifurcation functions in which the functional hierarchy is formed in a self-organized way in one network (Tani 2014);

  • The MTRNN can be stacked in a hierarchical way, which is also similar to the hierarchical organization of the brain areas (Zhong 2015).

Our language learning model is based on the combination of an MTRNN network with Self-Organizing Maps (SOMs) to control the humanoid robot iCub, being trained on the understanding of a set of noun-verb combinations to perform a variety of actions with different objects. Figure 1 shows the learning architecture incorporating a Multiple Timescale Recurrent Neural Network (MTRNN) (Yamashita and Tani 2008) and the self-organizing maps. The core module of the system is the MTRNN, which will learn sequences of verb-noun instructions and will control the movement of the robot in response to such instructions. The inputs to the MTRNN correspond to the language command inputs, to the visual inputs as well as the proprioceptive inputs. We regard these three modalities as a whole sensorimotor input because the MTRNN model is able to learn the relation between the verbs and nouns and seen objects within the context of the non-linearity of the sensorimotor sequences in a hierarchical manner. This network will learn this non-linearity in the functional hierarchy in which the neural activities are self-organised, exploiting the spatiotemporal variations.

2.1 Using a self-organizing map as a sparse structure

The initial input data sets, consisting of speech, camera images and proprioceptive (kinesthetic) states, are pre-processed (see Eqs. 1–4) using three SOMs, one each for the linguistic, visual and motor input modalities.

Although the MTRNN could be trained with the original data representation, we usually employ pre-processing modules for the MTRNN inputs, which result in a sparse structure of the weighting matrices in the network. The MTRNN outputs are likewise decoded back into the original data structures. The sparseness of the weighting matrices is conceptually similar to sparse coding in computational neuroscience (Olshausen and Field 1997): the weighting matrices are sparsely distributed, in a form analogous to the sparse distributed representations used in our neural activities, such as in the visual (Essen 1985) and auditory cortex (Reale and Imig 1980). Previous research on language learning in RNNs (Awano et al. 2011) also showed that a sparse encoding results in more robust training, better generalisation and improved robustness to noisy inputs.

Here the sparse structure in the weight matrices is given by the SOMs (Kohonen 1998). During this process, the SOM acts as a dimensional mapping function whose output space has more dimensions than the input space. With a discretised and distributed neural encoding in the output space, the SOM pre-processing modules are able to reduce the possible overlap of the original data within the original input space. Therefore, the topological homomorphism produced by the SOM guarantees that the raw training vectors and the resulting input vectors are topologically similar to each other.

In the SOM training here, assuming the input vectors are

$$\begin{aligned} x = [x^1 , x ^2 , \ldots , x ^m ]^\intercal \end{aligned}$$
(1)

where m is the number of dimensions of the input vectors. These input vectors are mapped to an output space whose coordinates define the output topology of the SOM. Connecting between the input and output spaces, the weight vector is defined as

$$\begin{aligned} w_j = \left[ w_j^1 , w_j^2, \cdots , w_j^m \right] ^\intercal , j = 1, 2, 3, \cdots , n \end{aligned}$$
(2)

where j indexes the neurons of the output map and n is the total number of those neurons. When a self-organising map receives an input vector, the algorithm finds the neuron whose weights are most similar to the input vector. Similarity is usually measured using the Euclidean distance, which, when the weight vectors share the same norm, is mathematically equivalent to finding the neuron with the largest inner product \(w^\intercal _j x\). The neuron that best matches the input vector is referred to as the best matching unit (BMU) and is defined as:

$$\begin{aligned} c = \arg \min _j \Vert x-w_j \Vert \end{aligned}$$
(3)
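To see when the distance criterion and the inner-product criterion coincide, the squared distance in Eq. 3 can be expanded as a short worked step. The equal-norm condition on the weight vectors is an assumption introduced here for illustration:

$$\begin{aligned} \Vert x - w_j \Vert ^2 = \Vert x \Vert ^2 - 2\, w_j^\intercal x + \Vert w_j \Vert ^2 \end{aligned}$$

Since \(\Vert x \Vert ^2\) does not depend on j, minimising the distance over j is the same as maximising \(w_j^\intercal x\) whenever \(\Vert w_j \Vert \) is identical for all neurons; otherwise the two criteria may select different units.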

The dimensionality mapping is achieved when the BMU coordinates are used to update the weights of the neighbourhood neurons around neuron c by driving them closer to the input vector at iteration t:

$$\begin{aligned} w_j(t+1) = w_j(t) + \delta \, (x - w_j(t)) \end{aligned}$$
(4)

\(\delta \) is a Gaussian neighbourhood function, which determines the adjusting rate for the weights.

Therefore, the output of the SOM, which is encoded in a high-dimensional output space, is still able to preserve the topological properties of the input space owing to the use of the neighbourhood function.
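The BMU search of Eq. 3 and the neighbourhood update of Eq. 4 can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the grid coordinates `coords`, the learning rate `lr`, the neighbourhood width `sigma` and the one-hot `sparse_code` helper are all assumptions made for the example.

```python
import math
import random


def best_matching_unit(x, weights):
    """Return the index c of the weight vector closest to x (Eq. 3)."""
    distances = [math.dist(x, w) for w in weights]
    return distances.index(min(distances))


def som_step(x, weights, coords, lr=0.5, sigma=1.0):
    """Apply one update of Eq. 4 in place and return the BMU index.

    coords[j] is the assumed grid position of neuron j on the output map.
    The neighbourhood term delta is a Gaussian of the grid distance to the
    BMU, scaled by an assumed learning rate lr.
    """
    bmu = best_matching_unit(x, weights)
    for j, w in enumerate(weights):
        grid_dist_sq = sum((a - b) ** 2 for a, b in zip(coords[j], coords[bmu]))
        delta = lr * math.exp(-grid_dist_sq / (2.0 * sigma ** 2))
        weights[j] = [wk + delta * (xk - wk) for wk, xk in zip(w, x)]
    return bmu


def sparse_code(x, weights):
    """One possible sparse output code (an assumption): 1 at the BMU, 0 elsewhere."""
    bmu = best_matching_unit(x, weights)
    return [1.0 if j == bmu else 0.0 for j in range(len(weights))]


if __name__ == "__main__":
    random.seed(0)
    # A 3 x 3 map (n = 9 neurons) over 2-dimensional inputs (m = 2).
    coords = [(row, col) for row in range(3) for col in range(3)]
    weights = [[random.random(), random.random()] for _ in coords]
    x = [0.2, 0.8]
    bmu = som_step(x, weights, coords)
    print(bmu, sparse_code(x, weights))
```

Note that the output code has n entries for an m-dimensional input, which is how a map with n > m yields a higher-dimensional but sparse representation.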

2.2 Multiple timescale recurrent network (MTRNN)

As shown in Fig. 2, the neurons in the MTRNN form three layers: an input-output layer (IO) and two context layers called Context fast (\(C_{f}\)) and Context slow (\(C_s\)). In the following text, we denote the indices of these neurons as:

$$\begin{aligned} I_{all} = I_{IO} \cup I_{C_{f}} \cup I_{C_{s}} \end{aligned}$$
(5)

where \(I_{IO}\) represents the indices of the neurons in the input-output layer, \(I_{C_{f}}\) those of the neurons in the context fast layer and \(I_{C_s}\) those of the neurons in the context slow layer. The neurons in a layer are fully connected to all neurons within the same and adjacent layers, as shown in Fig. 1. The fast and slow context layers and the input-output layer differ in their time constants \(\tau \), which determine the speed of adaptation, given a time sequence of a specific length, when the neural activity is updated. The larger the value of \(\tau \), the slower the neuron adapts. These different adaptation rates in turn assemble features of the input sequences at various timescales. Therefore, given the previous states \(S(0), S(1), \ldots , S(t)\), their spatiotemporal features are self-organised on different levels of the network. The MTRNN is thus not only a continuous time recurrent neural network that can predict the next state \(S(t+1)\) of the time sequence; its internal state also acts as a hierarchical memory that preserves the temporal features of the non-linear dynamics at different timescales. In the embodied learning case, such memories, mostly in the form of a set of oscillatory patterns, represent the verb/noun semantics during the robot interaction. Such patterns are learnt by self-organising as fixed-point and limit-cycle non-linear dynamics.

Fig. 2 Language learning model based on MTRNN

2.2.1 Learning

In general, the training of the MTRNN follows the updating rule of classical firing rate models, in which the activity of a neuron is determined by the average firing rate of all the connected neurons. Additionally, the neuronal activity is also decaying over time following an updating rule of the leaky integrator model. Therefore, when time-step \(t>0\), the current membrane potential status of a neuron is determined both by the previous activation as well as the current synaptic inputs, as shown in Eq. 6:

$$\begin{aligned} \tau _i u^{'}_{i,t} = -\,u_{i,t} + \sum _j w_{i,j} x_{j,t} \end{aligned}$$
(6)

where \(u_{i,t}\) is the membrane potential, \(x_{j,t}\) is the activity of j-th neuron at t-th time-step, \(w_{i,j}\) represents the synaptic weight from the j-th neuron to the i-th neuron and \(\tau \) is the time scale parameter which determines the decay rate of this neuron. One of the features that is similar to the generic continuous time recurrent neural networks (CTRNN) model is that a parameter \(\tau \) is used to determine the decay rate of the neural activity; a larger \(\tau \) means their activities change slowly over time compared with those with a smaller \(\tau \).

Assuming the i-th neuron has N connections (i.e. the total number of neurons in the network is N), Eq. 6 can be transformed into

$$\begin{aligned} u_{i, t+1} = \left( 1 - \frac{1}{\tau _i}\right) u_{i,t} + \frac{1}{\tau _i}\left[ \sum _{j \in N} w_{i,j} x_{j,t} \right] \quad (\text{if } t>0) \end{aligned}$$
(7)
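Eq. 7 is what Eq. 6 becomes under a forward Euler step of unit length: \(u_{i,t+1} = u_{i,t} + \frac{1}{\tau _i}(-u_{i,t} + \sum _j w_{i,j} x_{j,t})\). The following sketch, with illustrative time constants and a hypothetical helper name `leaky_step`, shows how a larger \(\tau \) slows the response to the same input.

```python
def leaky_step(u, x, w, tau):
    """One step of Eq. 7 for every neuron i.

    u[i]    : membrane potential of neuron i at time t
    x[j]    : activity of neuron j at time t
    w[i][j] : weight from neuron j to neuron i
    tau[i]  : time constant of neuron i
    """
    new_u = []
    for i in range(len(u)):
        synaptic = sum(w[i][j] * x[j] for j in range(len(x)))
        new_u.append((1.0 - 1.0 / tau[i]) * u[i] + synaptic / tau[i])
    return new_u


if __name__ == "__main__":
    # Two uncoupled neurons driven by the same constant input of 1.0;
    # the time constants 2 and 50 are illustrative values only.
    tau = [2.0, 50.0]
    w = [[1.0], [1.0]]
    u = [0.0, 0.0]
    for _ in range(10):
        u = leaky_step(u, [1.0], w, tau)
    # The neuron with tau = 2 has almost reached the input level,
    # whereas the neuron with tau = 50 has barely moved.
    print(u)
```

With \(\tau _i = 1\) the previous potential is discarded entirely, so the neuron simply follows its current synaptic input.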

When the time-step \(t = 0\), the membrane potential of the IO neurons is set to 0 and the context neurons are set to initial states \(C_{s_c}({i,0})\):

$$\begin{aligned} u_{i,0} = \left\{ \begin{array}{ll} 0,&{} \text{ if }\quad t =0 \text{ and } i \in I_{IO}, \\ C_{{s_c}({i,0})},&{} \text{ if }\quad t = 0 \text{ and } i \notin I_{IO} \end{array} \right. \end{aligned}$$
(8)

The neural activity of a neuron is calculated in one of two ways (the sigmoid function or the soft-max function), depending on which layer the neuron belongs to:

$$\begin{aligned} y_{i,t} = \left\{ \begin{aligned} \frac{e^{u_{i,t}}}{\sum _{j \in Z} e^{u_{j,t}}},&\text{ if } i \in I_{IO}, \\ \frac{1}{1+e^{-u_{i,t}}},&\text{ otherwise. } \end{aligned} \right. \end{aligned}$$
(9)

In particular, the soft-max activation function recovers a probability distribution similar to that of the SOM pre-processing modules. This activation function therefore leads to faster convergence of the MTRNN training.
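A minimal sketch of the two activation rules in Eq. 9 is given below. It assumes that the normalisation set Z is exactly the group of IO neurons passed in as `io_indices`; the function names are hypothetical.

```python
import math


def softmax(values):
    """Normalised exponentials; the outputs are positive and sum to 1."""
    shift = max(values)  # subtracting the maximum avoids overflow
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def activate(u, io_indices):
    """Eq. 9: soft-max over the IO neurons, sigmoid for the context neurons."""
    io_indices = list(io_indices)
    io_out = softmax([u[i] for i in io_indices])
    y = [sigmoid(v) for v in u]
    for i, value in zip(io_indices, io_out):
        y[i] = value
    return y


if __name__ == "__main__":
    # Neurons 0-2 play the role of IO neurons, neurons 3-4 of context neurons.
    print(activate([1.0, 2.0, 0.5, -1.0, 3.0], io_indices=range(3)))
```

Because the IO activities sum to 1, they can be compared directly with a target distribution, which is what the error measure below relies on.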

During training, the objective is to minimise the error E defined by the Kullback-Leibler divergence:

$$\begin{aligned} E = \sum _t \sum _{i \in O} y^{*}_{i,t} \log \left( \frac{y^*_{i,t}}{y_{i,t}}\right) \end{aligned}$$
(10)

where \(y^*_{i,t}\) is the desired neural activation of the i-th neuron at the t-th time-step, which acts as the target value for the actual output \(y_{i,t}\). The target of the training is to minimize E by back-propagation through time (BPTT).
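The error of Eq. 10 can be computed as in the following sketch. The guard value `eps` and the convention that terms with \(y^* = 0\) contribute nothing are assumptions added to keep the example numerically safe.

```python
import math


def kl_error(target, output, eps=1e-12):
    """Eq. 10: sum over time steps t and output neurons i of y* log(y*/y).

    target[t][i] is the desired activation y*, output[t][i] the actual y.
    """
    total = 0.0
    for y_star_t, y_t in zip(target, output):
        for y_star, y in zip(y_star_t, y_t):
            if y_star > 0.0:  # terms with y* = 0 are taken to be 0
                total += y_star * math.log(y_star / max(y, eps))
    return total


if __name__ == "__main__":
    target = [[1.0, 0.0], [0.5, 0.5]]
    output = [[0.5, 0.5], [0.5, 0.5]]
    # Only the first time step differs from its target.
    print(kl_error(target, output))
```

The error is zero only when the output matches the target at every time step and grows as the two distributions diverge.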

In the BPTT algorithm, the input of each IO neuron is calculated as a mixture, weighted by a value r (called the feedback rate), of the previous output value y and the desired value \(y^*\) (Eq. 11):

$$\begin{aligned} x_{j,t+1} = (1-r) \times y_{j,t} + r \times y_{j,t} ^ * \end{aligned}$$
(11)

where we will use \(r=0.1\) during training, and \(r=0\) during generation, which means that the network is used to generate the sequences autonomously.
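The mixing rule of Eq. 11 is small enough to state directly in code. The sketch below uses a hypothetical helper name and made-up activation values; only the values of r are taken from the text.

```python
def next_io_input(y_prev, y_target, r):
    """Eq. 11: x_{j,t+1} = (1 - r) * y_{j,t} + r * y*_{j,t}."""
    return [(1.0 - r) * y + r * y_star for y, y_star in zip(y_prev, y_target)]


if __name__ == "__main__":
    y_prev = [0.2, 0.8]    # network output at time t (illustrative values)
    y_target = [1.0, 0.0]  # desired output at time t (illustrative values)
    print(next_io_input(y_prev, y_target, r=0.1))  # training: small pull towards the target
    print(next_io_input(y_prev, y_target, r=0.0))  # generation: the output is fed back unchanged
```

With r = 0 the network input at each step is its own previous prediction, which corresponds to the autonomous, closed-loop generation mode described above.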

At the n-th iteration of training, the synaptic weights and the biases of neuron i are updated according to Eqs. 12 and 13:

$$\begin{aligned} w_{i,j}^{n+1} &= w_{i,j}^n - \eta _{i,j} \frac{\partial E}{\partial w_{i,j}} \\ &= w_{i,j}^n - \frac{\eta _{i,j}}{\tau _i} \sum _t x_{j,t} \frac{\partial E}{\partial u_{i,t}} \end{aligned}$$
(12)
$$\begin{aligned} b_i^{n+1} = b_i^n - \beta _i \frac{\partial E}{\partial b_i} = b_i^n - \beta _i \sum _t \frac{\partial E}{\partial u_{i,t}} \end{aligned}$$
(13)
$$\begin{aligned} \frac{\partial E}{\partial u_{i,t}} = \left\{ \begin{array}{ll} y_{i,t+1}-y^*_{i,t+1}+\left( 1-\frac{1}{\tau _i}\right) \frac{\partial E}{\partial u_{i,t+1}}, & \text{if } i \in I_{IO}, \\ \sum _{k \in I_{all}} \frac{\partial E}{\partial u_{k,t+1}} \left[ \lambda _{i,k}\left( 1-\frac{1}{\tau _i}\right) + \frac{1}{\tau _k} w_{k,i} f'(u_{i,t}) \right] , & \text{otherwise}. \end{array}\right. \end{aligned}$$
(14)

In Eqs. 12 and 13, the partial derivatives with respect to w and b are summed over the whole sequence to determine the weight and bias changes respectively, and \(\eta \) and \(\beta \) denote the learning rates for the weight and bias changes. In particular, the term \( {\partial E}/{\partial u_{i,t}}\) can be calculated recursively as in Eq. 14, where \(f'()\) is the derivative of the sigmoid function defined in Eq. 9. The term \(\lambda _{i,k}\) is the Kronecker delta, which is 1 when \(i = k\) and 0 otherwise.
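Once the terms \(\partial E/\partial u_{i,t}\) have been obtained from the recursion of Eq. 14, the parameter updates of Eqs. 12 and 13 reduce to sums over the sequence. The sketch below illustrates only this final step, under several assumptions: the gradient of Eq. 12 is read as \(\frac{1}{\tau _i}\sum _t x_{j,t}\, \partial E/\partial u_{i,t}\), scalar learning rates `eta` and `beta` replace \(\eta _{i,j}\) and \(\beta _i\), and the array `dE_du` is supplied by the caller rather than computed here.

```python
def update_parameters(w, b, x, dE_du, tau, eta, beta):
    """One gradient step on weights and biases (Eqs. 12 and 13).

    w[i][j]     : weight from neuron j to neuron i
    b[i]        : bias of neuron i
    x[t][j]     : activity of neuron j at time step t
    dE_du[t][i] : dE/du_{i,t}, assumed to come from the recursion of Eq. 14
    tau[i]      : time constant of neuron i
    eta, beta   : scalar learning rates (a simplifying assumption)
    """
    steps = len(x)
    new_w = [row[:] for row in w]
    new_b = b[:]
    for i in range(len(w)):
        for j in range(len(w[i])):
            grad = sum(x[t][j] * dE_du[t][i] for t in range(steps)) / tau[i]
            new_w[i][j] = w[i][j] - eta * grad
        new_b[i] = b[i] - beta * sum(dE_du[t][i] for t in range(steps))
    return new_w, new_b


if __name__ == "__main__":
    # One neuron with one incoming connection, observed over two time steps.
    w, b = update_parameters(
        w=[[0.0]], b=[0.0],
        x=[[1.0], [1.0]], dE_du=[[0.5], [0.5]],
        tau=[2.0], eta=0.1, beta=0.1,
    )
    print(w, b)
```

The factor \(1/\tau _i\) means that, for the same error signal, the incoming weights of slow neurons change by less per iteration than those of fast neurons.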

3 Experiments

To examine the network performance, we recorded real-world training data from object manipulation experiments with an iCub robot (Metta et al. 2008). This is a child-sized humanoid robot built as a testing platform for theories and models of cognitive science and neuroscience. Mimicking a two-year-old infant, this unique robotic platform has 53 degrees of freedom. Using the iCub, we set up a learning scenario in which a human instructor taught the robotic learner a set of language commands whilst providing kinaesthetic demonstrations of the named actions. This setting is similar to the infant-directed action or motionese scenario (e.g. Brand et al. 2002; Brand 2007), where mothers modify their actions when demonstrating objects to infants in order to assist the infants’ processing of human action. Replicating the learning environment of the developmental process, the aim of these experiments was to evaluate verb-noun generalisation with a large data-set using the MTRNN. We were also interested in how the mechanisms, especially the neural activities in the hierarchical architecture, result in such generalisation.

Fig. 3 Experimental scenario. a iCub manipulation setting. b Objects used in the experiment. There are eight different objects shown in this image. The last object that is not present is a green ball, which is shown in Fig. 3c. c Example of a complex lifting action involving the coordination of the entire upper body actuated by 41 motors

3.1 Experimental setup

Figure 3a shows the setup used in our experiments. During the training process, the data set was obtained using the following steps:

  1. Objects with significantly different colours and shapes were placed at 6 different locations along the same line in front of the iCub (i.e. the objects from perception).

  2. A vocal command was spoken by an instructor according to the visual scene perceived by the iCub. A complete sentence of the vocal command is composed of a verb and a noun, such as “lift [the] ball”. This was recognised by the speech recognition software Dragon Dictate, with which the corresponding verb and noun were recognised and then translated into two dedicated discrete values based on the verb and noun look-up table (Table 1) (i.e. a sentence includes a verb and a noun).

  3. Following the command “lift [the] ball”, the built-in vision tracker of the iCub searches for a ball-shaped object and automatically locates it in the middle of the receptive field; in this way, the joint angles of the head and neck measure the position of the object (for the purpose of generalisation over different locations).

  4. Joint positions of the head and neck are recorded. The sequence recorder module of the iCub was used to record the sensorimotor trajectories while the instructor was guiding the robot by holding its arms to perform a certain action for each object (i.e. the motor actions).

During the testing process, all the objects are placed on the table. The vocal commands from the instructor are given before the action is executed. The whole experimental setup used combinations of 9 actions and 9 objects. The objects and one example of an action can be found in Fig. 3b and 3c. From these combinations, both the vocal commands (i.e. complete sentences including a verb and a noun) and the sensorimotor sequences can be created. To the best of our knowledge, this \(9 \times 9\) noun-verb scenario is one of the setups with the largest number of verb-noun combinations in grounded robot language experiments (e.g. Tani et al. 2004; Yamashita and Tani 2008). We used such a large amount of data to test the combinatorial complexity and mechanical feasibility of this model, as well as to evaluate its generalisation ability and internal non-linear dynamics. From an engineering point of view, once the feasibility of generalisation has been tested, it is also possible to apply this model in a real-world robot application.

Table 1 Look-up table of verbs and nouns for the data sets: the instructor showed the robot different combinations of the 9 actions and 9 objects

As mentioned before, each speech command was recognised and translated into two semantic command units. Using 9 discretised values for verbs and 9 for nouns, the semantic commands thus have 81 possible combinations. This translation was done according to the verb and noun look-up table, as shown in Table 1. Since we used the visual object tracker of the iCub, the joints of the neck and eyes automatically represent the location of the particular object named in the vocal command. The movements of the joint angles in the torso are also recorded as the sequences of the motor actions. During the data recording, each sequence lasted 5 seconds and the encoder values of 41 joints were sampled at 50 ms intervals. Thus, each sequence in the data set contains 100 steps of the discrete semantic command, the location of visual attention and the joint movement of the torso, as shown in Table 2.
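The sizes quoted in the text follow from simple arithmetic, which the short Python sketch below reproduces. The function `encode_command` is a hypothetical illustration: it assumes that the nine discretised values are evenly spaced from 0.0 to 0.8, which is consistent with the verb-noun values quoted in the figure captions but is not copied from Table 1, and it uses integer indices in place of the actual word lists.

```python
# Sketch of the data-set dimensions described in the text.
# The evenly spaced 0.0-0.8 coding is an assumption, not a copy of Table 1.
from typing import Tuple

N_VERBS = 9
N_NOUNS = 9
N_LOCATIONS = 6      # object locations used in Sect. 3.2
SEQUENCE_MS = 5000   # each recording lasted 5 seconds
SAMPLE_MS = 50       # encoder values were sampled every 50 ms


def encode_command(verb_index: int, noun_index: int) -> Tuple[float, float]:
    """Map verb/noun indices (0-8) to two discrete semantic values (hypothetical coding)."""
    if not (0 <= verb_index < N_VERBS and 0 <= noun_index < N_NOUNS):
        raise ValueError("index outside the look-up table")
    return (verb_index / 10, noun_index / 10)


steps_per_sequence = SEQUENCE_MS // SAMPLE_MS   # 100 time steps
n_commands = N_VERBS * N_NOUNS                  # 81 verb-noun combinations
n_sequences = n_commands * N_LOCATIONS          # 486 recorded sequences
```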

Table 2 Structure of the training data

Three experiments were carried out and are described in the next subsections. In the first experiment, given the data set of 9 actions and 9 objects, we search the parameter space to find the best parameters for the network training. In the second experiment, we show the training and generalisation performance for different types of manipulated data sets. In the third experiment, we further analyse the generalisation ability of the MTRNN network. All these experiments were run using a modified version of the Aquila software (Peniak et al. 2011) on a GPU computer with one Tesla C2050 and two GeForce GTX 580 graphics cards.

Table 3 Training error with different parameter settings \((C_s, C_{f}, N_{C_s}, N_{C_{f}})\)
Table 4 Some of the sequences containing particular semantic combinations of verbs and nouns were removed during training

3.2 Training performance

In this experiment, we used the data set consisting of the complete \(9 \times 9\) combinations (i.e. number of verbs: \(N_v = 9\), number of nouns: \(N_n=9\)), which include information about 6 different object locations. The 6 locations were placed along a straight line on the table, as shown in Fig. 3a. Thus the whole data-set contains \(9 \times 9 \times 6 = 486\) sequences (teaching took less than 1 hour in total), which were all used for training the network.

After a brief hyper-parameter search, shown in Table 3, we selected (70, 3, 50, 120) as the best parameters for this data-set in the parameter space \(( \tau _s, \tau _f, N_{C_s}, N_{C_{f}} )\). We then examined the training performance of the network under this parameter setting using different data-sets. To test the generalisation ability, these data-sets were manipulated: a subset of the combinations of actions and objects was removed from the training set and used as a validation set when testing the generalisation ability of the network. Detailed information about the manipulated data-sets is shown in Table 4, where the coloured numbers N indicate the specific verb-noun combinations removed in the N-th data-set. The number of removed combinations increases from the first to the third data-set, so the difficulty of generalisation increases accordingly. In the second and third data-sets, some of the removed combinations are also adjacent to each other, which further increases the difficulty of generalisation.
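The manipulation of the data-sets can be sketched as a split over verb-noun combinations rather than over individual sequences: every location of a removed combination is held out together. The following Python sketch illustrates this under stated assumptions; the function name and the removed pairs are hypothetical and do not reproduce the actual entries of Table 4.

```python
# Sketch of a combination-wise hold-out split (removed pairs are hypothetical).
from itertools import product
from typing import List, Set, Tuple

Sample = Tuple[int, int, int]  # (verb index, noun index, location index)


def split_by_combination(removed: Set[Tuple[int, int]], n_verbs: int = 9,
                         n_nouns: int = 9, n_locations: int = 6
                         ) -> Tuple[List[Sample], List[Sample]]:
    """Hold out every location of each removed verb-noun pair."""
    train: List[Sample] = []
    held_out: List[Sample] = []
    for verb, noun, loc in product(range(n_verbs), range(n_nouns), range(n_locations)):
        target = held_out if (verb, noun) in removed else train
        target.append((verb, noun, loc))
    return train, held_out


# Illustrative choice only: three removed pairs on the diagonal of the 9 x 9 grid.
removed_pairs = {(0, 0), (4, 4), (8, 8)}
train_set, validation_set = split_by_combination(removed_pairs)
```

With three removed pairs, 18 of the 486 sequences are held out and the remaining 468 are used for training; the network therefore never sees the held-out verb together with the held-out noun.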

Table 5 RMS error of the generalisation tests

We used the parameter set of (50, 5, 70, 100). To further demonstrate the robustness of the generalisation ability given un-trained sensorimotor sequences, the validation sets, which were not included in the training, were fed into the network. In this way, we aimed to test how the network responds to noun-verb combinations not used during training. Using the three MTRNNs trained on the three data-sets, we performed three generalisation experiments using the missing verb-noun combinations. In these experiments, only the first time step of each sequence was provided (i.e. \(r=0\) in Eq. 11), which includes the initial position of the torso, head, and eye motors, as well as the vocal command. The network prediction was then used as the input of the next time step, forming a closed loop that generated the complete 100-step sequence. The errors of the three whole training-sets, as well as those at different steps, are shown in Table 5. A more direct visualisation of the network performance can be found in Fig. 4, which displays three examples of generated time sequences for motor actions from the three MTRNNs. As Table 5 shows, the training error became larger when the number of training samples was smaller. In particular, a larger error could be found at the beginning of each time sequence, but the network then stabilised and generated a stable motor trajectory with less error as time elapsed. Because of the errors in the generated trajectories, the resulting robot behaviours sometimes deviated from the original ones. However, in most cases, the generated robot behaviours correctly followed the semantic commands (Footnote 2).
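The closed-loop procedure and the RMS error can be sketched in a few lines of Python. This is a minimal illustration, not the Aquila implementation: the `step` callable is a hypothetical stand-in for the trained MTRNN (assumed to keep its own context state internally), and the RMS error is taken here over all time steps and dimensions.

```python
# Sketch of closed-loop sequence generation and RMS error (names are hypothetical).
import math
from typing import Callable, List

Vector = List[float]


def generate_closed_loop(step: Callable[[Vector], Vector], first_input: Vector,
                         n_steps: int = 100) -> List[Vector]:
    """Start from the first time step and feed each prediction back as the next input."""
    sequence = [list(first_input)]
    while len(sequence) < n_steps:
        sequence.append(step(sequence[-1]))
    return sequence


def rms_error(generated: List[Vector], target: List[Vector]) -> float:
    """Root-mean-square error over all time steps and dimensions."""
    squared = [(g - t) ** 2
               for g_row, t_row in zip(generated, target)
               for g, t in zip(g_row, t_row)]
    return math.sqrt(sum(squared) / len(squared))


def toy_step(x: Vector) -> Vector:
    """Toy one-step predictor used only to exercise the loop."""
    return [0.5 * value for value in x]


toy_sequence = generate_closed_loop(toy_step, [1.0, 2.0], n_steps=4)
```

Because every step after the first depends only on the network's own output, early prediction errors are carried forward, which is one reason the error is reported per step as well as over the whole sequence.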

Fig. 4 Trajectory generation. The generated trajectories (dotted) with 41 dimensions were plotted and compared with the original trajectories. Three test-sets were selected to validate the training performance with different training sets. Consistent with the RMS errors shown in Table 5, larger errors can be found at the beginning of the sequences. a Generated trajectory from MTRNN 1, Test-set 61 (v.-n.: 0.1–0.1). b Generated trajectory from MTRNN 2, Test-set 231 (v.-n.: 0.4–0.2). c Generated trajectory from MTRNN 3, Test-set 484 (v.-n.: 0.8–0.8)

4 Generalisation analyses

In this section, we focus on how the verb-noun generalisation ability of the MTRNN network is achieved. In the experiments shown in the previous section, although only part of the verb-noun combinations were presented during the training of the network, it was able to “understand” the un-trained verb-noun semantic compositions. During the training and execution phases, the iCub learnt and reproduced the action named by the verb that the instructor spoke, applied to the object specified by the noun. Meanwhile, since we trained each object at 6 different locations on the table, the robot can “adjust its attention” toward the intended object at different random locations on the table during execution. In an experiment with a similar aim of generalisation, Sugita and Tani (2005) reported combining two hierarchical recurrent neural networks which can also accomplish verb-noun generalisation for understanding semantic compositionality in a situated environment. The model they used, called recurrent neural networks with parametric bias units (RNNPB), has non-linear dynamics similar to those of the MTRNN: the non-linear dynamics are determined by a small number of neural units which act as bifurcation parameters for the whole system.

However, in our case, the learning sequences contain motor joint angles of a much larger dimension (35) for the iCub movements, compared with the motor sequences trained in Sugita and Tani (2005). Furthermore, while the object appeared at one location in Sugita and Tani (2005), the variation in object location in our work also increases the complexity of learning. On the other hand, this complex setting results in bifurcation that occurs hierarchically in the MTRNN structure, which has not yet been reported for the RNNPB.

From this point, we hypothesise that, after training, the MTRNN, or any other hierarchical RNN, separates its network dynamics by modality in a self-organised way, associating the semantics with the robot behaviours and the object categories. This type of separation should depend on the organisation of the training data structures, and occurs on different levels of the hierarchical architecture using different strategies. For instance, in Sugita and Tani (2005), such association learning occurs on the PB level and binds the semantic and the behaviour representations. Similar association learning can also be found in Heinrich and Wermter (2018). On the other hand, the single RNN we use, although more complex to train, allows a higher generalisation ability because all the modalities are learnt in a single dynamical system. In our experimental setting, after enough training, the synaptic weights between a basic motor behaviour (e.g. the concept of “lift”) (Footnote 3) and the verb input are strengthened. Due to the complexity of the iCub’s (as well as the human’s) morphology, controlling its behaviours is difficult, so motor control dominates a large portion of the spatio-temporal space in the sensorimotor sequences as well as in the neural dynamics. This is similar to the mechanism by which hearing a verb causes neural firing in the primary motor and pre-motor cortices: on the \(C_s\) layer, the neurons corresponding to a certain motor action fire when a particular verb is heard or said. In contrast, the noun affects part of the sensorimotor outputs by offsetting the motor actions toward the object being interacted with, resulting in a specific goal-directed action. This appears to depend on somatotopically mapped parietal regions, parallel to our \(C_{f}\) layer.

Table 6 Removal of data in the \(3 \times 3\) data-set
Table 7 Errors: removal part of input (3 verbs and 3 nouns)
Table 8 Errors: removal part of input (9 verbs and 9 nouns)
Fig. 5 Weight visualization by input removal: different colours along the axis represent different layers (red: IO, green: \(C_{f}\), blue: \(C_s\)). Without the verb input, a large number of weights from the IO layer to \(C_{f}\) remain un-trained in Fig. 5b, whereas no big differences can be observed between Fig. 5a, c and d. a Weight matrix of normal training (base-line). b Weight matrix without verb input. c Weight matrix without noun input. d Weight matrix without visual input (Color figure online)

In the following experiments, we will examine this hypothesis by means of manipulating data and visualising the training results.

4.1 Generalisation with partial inputs

In this subsection, we concentrate on the comparisons of the results after the removal of different modalities. These comparisons included two parts: i) Error of generalisation after removals; ii) Visualisation of weights after removals.

For the first part of the analysis, in order to obtain a more conclusive statement, we used two data-sets, with \(9 \times 9\) and \(3 \times 3\) verb-noun combinations. The \(3 \times 3\) data-set (Table 6) is a subset of the data-set from the previous experiment; it contains the combinations of three actions and three objects, which were placed in 6 different locations. We used a look-up table similar to Table 1, except that only 3 nouns and 3 verbs were used for the vocal command discretisation. For the second part of the experiment, the visualisation of weights was done only with the \(3 \times 3\) data-set, since its features are easier to observe and its basic principle can be easily extended to the \(9 \times 9\) data-set.

For both parts of the experiment, in order to observe how different lexical categories and visual input affected the training results, especially within the output of the sequences of the motor behaviours, different parts of the input data were removed:

  1.

    No modification (base-line)

  2.

    Remove the noun input (i.e. the first input unit was reset to zero.)

  3.

    Remove the verb input (i.e. the second input unit was reset to zero.)

  4.

    Remove the location of the visual object (i.e. the third to eighth input units were reset to zero.)
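The removals listed above amount to zeroing selected input units at every time step. A minimal Python sketch follows, using the unit positions stated in the list (first unit: noun, second unit: verb, third to eighth units: object location); the function and constant names are hypothetical, and the toy sequence is for illustration only.

```python
# Sketch of input removal by zeroing units (helper names are hypothetical).
from typing import List

NOUN_UNITS = [0]                     # first input unit
VERB_UNITS = [1]                     # second input unit
LOCATION_UNITS = list(range(2, 8))   # third to eighth input units


def remove_inputs(sequence: List[List[float]], units: List[int]) -> List[List[float]]:
    """Return a copy of the sequence with the chosen input units reset to zero."""
    ablated = [list(step) for step in sequence]
    for step in ablated:
        for unit in units:
            step[unit] = 0.0
    return ablated


# Toy two-step sequence: noun, verb, six location units, one motor unit.
toy_input = [[0.1, 0.2] + [0.5] * 6 + [0.9] for _ in range(2)]
without_verb = remove_inputs(toy_input, VERB_UNITS)
without_location = remove_inputs(toy_input, LOCATION_UNITS)
```

Returning a copy keeps the base-line data intact, so the same recorded sequences can be reused for all four conditions.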

During the generalisation tests, the full \(3 \times 3\) or \(9 \times 9\) data-sets were fed into the network. The training error and generalisation error of the motor output are compared in Tables 7 and 8. From these two tables, we can see that the removal of the verb resulted in a larger generalisation error than the other two removals, while the removal of the object location resulted in the lowest generalisation error.

For the second part of the experiment, the main aim was to understand the effect of a particular input modality (presented as semantic structures or visual input) on the whole network activity by visualising the weights. We conducted this experiment with the smaller data-set (\(3 \times 3\)), because a smaller number of weights is easier to visualise; a similar conclusion should extend to the larger \(9 \times 9\) data-set. Figure 5 visualises the weight matrix, where neurons 0 to 703 are on the IO layer, neurons 704 to 764 are on the \(C_{f}\) layer and neurons 765 to 794 are on the \(C_s\) layer. The weight matrices in Fig. 5a, c and d look quite similar. In Fig. 5b, however, without the verb input, a large number of weights from the IO layer to \(C_{f}\) remain un-trained. To evaluate this observation quantitatively, Table 9 reports the Euclidean distances from the manipulated weight matrices to the base-line matrix, calculated as the 2-norm of the element-wise weight differences:

$$\begin{aligned} d(\mathbf {W}^m - \mathbf {W}^b) = \sqrt{\sum _{i=1}^n\sum _{j=1}^n(d^m_{ij}-d^b_{ij})^2} \end{aligned}$$
(15)

where \(\mathbf {W}^m\) is the weight matrix after data manipulation, \(\mathbf {W}^b\) is the weight matrix from the base-line experiment, and \(d^m_{ij}\) and \(d^b_{ij}\) are the weights from the i-th neuron to the j-th neuron in the respective matrices. Here \(n = 795\) is the total number of neurons.
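Eq. 15 is the square root of the summed squared element-wise differences between the two weight matrices. A minimal Python sketch is given below; the function name is hypothetical and the matrices are plain nested lists rather than the data structures used in Aquila.

```python
# Sketch of the distance in Eq. 15 between two weight matrices.
import math
from typing import List

Matrix = List[List[float]]


def weight_distance(w_m: Matrix, w_b: Matrix) -> float:
    """Euclidean distance between a manipulated and a base-line weight matrix."""
    if len(w_m) != len(w_b) or any(len(r) != len(s) for r, s in zip(w_m, w_b)):
        raise ValueError("weight matrices must have the same shape")
    return math.sqrt(sum((m - b) ** 2
                         for row_m, row_b in zip(w_m, w_b)
                         for m, b in zip(row_m, row_b)))


# Toy 2 x 2 example: the only differences are 3 and 4, so the distance is 5.
example_distance = weight_distance([[3.0, 0.0], [0.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]])
```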

From the comparisons of weight matrices and the Euclidean distances, we further verified our hypothesis that the semantic compositionality of verbs represented as motor behaviours plays a significant role in the network since it is further grounded in the differences of motor action trajectories, which dominate a large spatio-temporal space of the sequences.

4.2 Internal dynamics

In the previous analysis, we have looked at the generalisation ability of the MTRNN. A preliminary conclusion is that the lexical structure of the verb plays a significant role in maintaining the convergence of the temporal sensorimotor sequences. In this section, we are particularly interested in how the generalisation capabilities are brought about by the recurrently connected hierarchical structure. We believe that part of the answer can be found by observing the detailed neural activities on each context layer given different inputs. The neural activities were therefore examined using the \(9 \times 9\) data-set, with a previously trained MTRNN with the parameter setting of (70, 3, 50, 120).

Table 9 Euclidean distances between partial input matrices and normal training matrix
Fig. 6 Principal component analysis on the \(C_{f}\) neurons. By comparison, we can observe that differences in verbs (Fig. 6a) result in larger divergence than differences in nouns and locations. a Neural activation of \(C_{f}\) from selected sequences. It shows that the sequences with different nouns are clustered closer than those with different verbs. In particular, we can compare (verb-noun) combinations of \((0.3{-}0.5, 0)\) (red) and \((0.1, 0.0{-}0.2)\) (blue). b \(C_{f}\) with different nouns. c \(C_{f}\) with different object locations (Color figure online)

Fig. 7 Principal component analysis on the \(C_s\) neurons. By comparison, we can observe that differences in verbs result in larger divergence than differences in nouns and locations. a Neural activation of \(C_s\) from selected sequences. It shows that the sequences with different nouns are clustered closer than those with different verbs. In particular, we can compare (verb-noun) combinations of \((0.3{-}0.5, 0)\) (red) and \((0.1, 0.0{-}0.2)\) (blue). b \(C_s\) with different nouns. c \(C_s\) with different object locations (Color figure online)

Figures 6 and 7 show the PCA trajectories of the internal neural dynamics on the \(C_{f}\) and \(C_s\) layers, respectively. Since the complete \(9 \times 9\) data-set contains 486 sequences, whose patterns can hardly be observed in one single figure, only a few samples are presented in order to show the PCA trajectories clearly. Figures 6a and 7a show the selected PCA trajectories on the \(C_{f}\) and \(C_s\) layers. These trajectories mainly concern combinations of verb inputs and a few noun inputs. We can see that the verbs mainly determine the patterns of the trajectories, which implies that the motor processing of verbs mainly affects the temporal dynamics in the MTRNN. Since perception and action are intertwined, we expect such neural phenomena related to motor execution to exist during both action execution and observation, since the system needs a number of neural dynamics to maintain such motoric memories.
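A PCA trajectory of this kind can be obtained by projecting the context-layer activation at every time step onto the leading principal components. The Python sketch below is a self-contained illustration under stated assumptions, not the analysis code used for Figs. 6 and 7: it finds the components by power iteration on the covariance matrix, all function names are hypothetical, and at least two time steps are assumed. In practice a linear-algebra library would normally be used instead.

```python
# Sketch of PCA by power iteration for context-layer activations (illustrative only).
import math
from typing import List, Optional

Matrix = List[List[float]]
EPS = 1e-12  # below this norm a vector is treated as zero


def _orthonormalise(v: List[float], basis: Matrix) -> Optional[List[float]]:
    """Remove the parts of v along the basis vectors, then scale to unit length."""
    for b in basis:
        dot = sum(x * y for x, y in zip(v, b))
        v = [x - dot * y for x, y in zip(v, b)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > EPS else None


def _next_component(cov: Matrix, basis: Matrix, iterations: int = 500) -> List[float]:
    """Power iteration restricted to the subspace orthogonal to the found components."""
    d = len(cov)
    starts = [[1.0 / (i + 1) for i in range(d)]]
    starts += [[1.0 if i == k else 0.0 for i in range(d)] for k in range(d)]
    v = next(u for u in (_orthonormalise(s, basis) for s in starts) if u is not None)
    for _ in range(iterations):
        w = _orthonormalise([sum(c * x for c, x in zip(row, v)) for row in cov], basis)
        if w is None:  # no variance left in this subspace
            break
        v = w
    return v


def pca_trajectory(activations: Matrix, n_components: int = 2) -> Matrix:
    """Project each time step (row) onto the leading principal components."""
    n, d = len(activations), len(activations[0])
    means = [sum(row[j] for row in activations) / n for j in range(d)]
    centred = [[row[j] - means[j] for j in range(d)] for row in activations]
    cov = [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    components: Matrix = []
    for _ in range(n_components):
        components.append(_next_component(cov, components))
    return [[sum(x * c for x, c in zip(row, comp)) for comp in components]
            for row in centred]


# Toy activations lying on a straight line: all variance falls on the first component.
toy_activations = [[float(t), 2.0 * t, 0.0] for t in range(5)]
toy_projection = pca_trajectory(toy_activations)
```

Plotting the first projected coordinate against the second for each sequence gives one trajectory per sequence, which is the form in which divergence between verbs, nouns and locations can be compared.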

The remaining panels show how the differences in lexical structures and visual information result in differences in the PCA trajectories. Figures 6b and 7b show the PCA trajectories of the internal dynamics on the \(C_{f}\) and \(C_s\) layers with different noun inputs; Figs. 6c and 7c show the PCA trajectories with different object location inputs. We can observe that the differences of nouns on the \(C_{f}\) layer (Fig. 6b) cause divergences at the beginning of the trajectories, but not at the end. The comparisons in Fig. 6c show that the differences of visual inputs produce even smaller divergences in the trajectories, and that these divergences mainly occur in the middle of the trajectories. By comparison, on the \(C_s\) layer (Fig. 7b and c), the divergences of the trajectories caused by nouns and visual inputs are even smaller: the \(C_s\) layer mainly encodes the information from the verbs.

To summarise the MTRNN analysis, the model self-organises similar patterns on various levels for every sensorimotor sequence, reflecting the hierarchical structure of the vocal commands. In particular, the difference between verb inputs results in larger divergence of the trajectories than noun and object-location differences. Due to the data structure of our input vectors, the IO layer represents a collection of each word. With a slower adaptation rate than the IO layer, the \(C_{f}\) layer represents the grounded meaning of each verb, noun, and visual information. This grounding is learnt from all temporal sensorimotor sequences. Similarly, using even slower-changing neurons, the \(C_s\) layer represents the general motor behaviour (i.e. the verb) of the whole sensorimotor sequence.

Therefore, the \(C_{f}\) activation mainly represents the lexical structures (verbs and nouns). The visual location has a limited effect on the \(C_{f}\) activation, probably because the noun information already overlaps with the information about the visual location of the object. As the main factor on the \(C_{f}\) layer, the same verbs are represented as similar patterns on the fast context layer in all of Fig. 6a–c. The difference between nouns can be observed at the beginning of the trajectories. It corresponds to the difference in robot behaviours at the beginning of the time sequences, caused by the neck and eye tracking before the actual hand movement starts. Compared with the \(C_{f}\) layer, the \(C_s\) activation changes even more slowly. It generally represents the motor behaviours; only the verbs are represented by different patterns.

5 Discussion

5.1 Functional hierarchy of RNN and its bifurcation

It has been reported that quite a few RNN models based on functional hierarchy, such as RNNPB, MTRNN and conceptors (Jaeger 2014), allow bifurcation to occur in the RNN dynamics. We give a brief discussion of how this bifurcation happens. Assume we have a simple hierarchical RNN with an additional unit (which can be regarded as a simplified version of RNNPB), as depicted in Fig. 8. The system can be described by Eq. 16.

$$\begin{aligned} \left\{ \begin{aligned} \dot{x}_{1}(t)&= -x_1(t) + f(x_3(t)) \\ \dot{x}_{2}(t)&= -x_2(t) + a \cdot f(x_1(t)) + c \cdot PB \\ \dot{x}_{3}(t)&= -x_3(t) + b \cdot f(x_2(t)) \\ y(t)&= f(x_3(t)) \end{aligned} \right. \end{aligned}$$
(16)
Fig. 8 A simple recurrent network with parametric bias units

There are three fixed points in this network. After the network has been trained, i.e. once the weights a, b and c are fixed, the coordinates of the fixed points depend only upon the value of PB. Furthermore, the coordinates of the fixed points \([x_1, x_2, x_3]\) are first-order functions of the value of the PB units (please see the appendix for the detailed calculation). In other words, the coordinates of the fixed points further determine the domains of different bifurcation properties. This is the reason why changing the parameter of the PB units changes the qualitative structure of the non-linear dynamics of the network. This bifurcation explanation of the simplified RNNPB model can also be extended to other hierarchical RNNs such as the MTRNN, as they share a fundamentally similar theoretical foundation (Tani 2014).
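The dependence of the settled state on PB can be explored numerically by integrating Eq. 16 with a simple Euler scheme. The Python sketch below is an illustration under stated assumptions: the activation f is taken to be tanh, the weights a, b and c are set to 1, and the step size and duration are arbitrary choices; none of these values come from the paper. The sketch only shows where the state settles for a given PB when started from the origin; it does not enumerate the fixed points or reproduce the calculation in the appendix.

```python
# Sketch: Euler integration of Eq. 16 (tanh activation and unit weights are assumptions).
import math
from typing import Tuple


def simulate_rnnpb(pb: float, a: float = 1.0, b: float = 1.0, c: float = 1.0,
                   dt: float = 0.01, n_steps: int = 5000
                   ) -> Tuple[float, float, float, float]:
    """Integrate Eq. 16 from the origin and return (x1, x2, x3, y)."""
    f = math.tanh  # assumed activation; this section does not specify f
    x1 = x2 = x3 = 0.0
    for _ in range(n_steps):
        dx1 = -x1 + f(x3)
        dx2 = -x2 + a * f(x1) + c * pb
        dx3 = -x3 + b * f(x2)
        x1, x2, x3 = x1 + dt * dx1, x2 + dt * dx2, x3 + dt * dx3
    return x1, x2, x3, f(x3)


state_positive_pb = simulate_rnnpb(1.0)
state_negative_pb = simulate_rnnpb(-1.0)
```

With these settings the state reached for PB = 1 and the state reached for PB = -1 lie on opposite sides of the origin, illustrating how a single bias value shifts the point the dynamics settle into while the weights stay fixed.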

5.2 Generalisation ability of MTRNN

In our experiments, the MTRNN was trained under a particular input data structure: firstly, the language commands were recorded as auditory data and transformed into a discrete symbolic representation; secondly, the object locations and the motor behaviours were stored as the angles of motor joints. This structure is a simplified representation of the common coding theory, which proposes that perceptual inputs and motor actions share the same representational format within cognitive processes.

The neural dynamics in our MTRNN differ from those reported in Hinoshita et al. (2009) and Heinrich et al. (2015). Whereas the nouns (or object perceptual inputs) play a significant role in the dynamics of the context layers in these two examples, our network has minimised the effects of nouns or object perception. This is partly because of the input data structure, in which the motor joints of the iCub robot have much larger dimensions than the visual perception input. Also, the spatial information for objects in our experimental setting is much easier to learn than our diversified motor behaviours. The generalisation here mainly concerns the inference of the symbolic meaning of a language command through the composition of neural dynamics. During training in a hierarchical network, such as MTRNN or RNNPB, the neural connections between a particular type of sensorimotor sequence and visual perception are strengthened. In particular, in our case of the \(9 \times 9\) data-sets, most of our network weights store the memory of motor actions.

Note that the generalisation of commands over verb-noun combinations is not the same as the generalisation usually studied in generic recurrent neural networks (e.g. Ito and Tani 2004; Pineda 1987; Zhong et al. 2014), where the network is expected to interpolate or extrapolate to a novel input value in either temporal or spatial space. While generalising dynamical patterns by interpolation is a non-trivial task for training motor patterns in robots, our main concern is novel combinations in the context of lexicon acquisition. In our case, the learning of verbs and nouns results in the emergence of different dynamics that are mostly stored in different synaptic weights, and thus their combinatorial composition is realised by the non-linearity of the recurrent connections. Considering the different generalisation abilities of generic RNN, RNNPB (Kleesiek et al. 2013; Zhong et al. 2014) and MTRNN (Heinrich et al. 2015), the hierarchical RNNs appear particularly suitable for the simultaneous production of flexible motor behaviour and language expression in real-world social robot experiments.

5.3 Hierarchical recurrent networks and further development

The hierarchical architecture was proposed to capture the unpredictable information within the hierarchy. In our application, it mainly captures the verb/motor information.

Furthermore, some machine learning methods have recently been proposed that combine two hierarchical recurrent networks (Cho et al. 2014); these have achieved great performance in machine translation (Sutskever et al. 2014), image captioning (Vinyals et al. 2014), etc. This Encoder-Decoder (ED) architecture usually consists of two recurrent neural networks. One deep RNN encodes a sequence of input vectors of arbitrary length into a fixed-length vector representation \(c\) in a hierarchical way, while the other deep RNN decodes this representation into a target sequence of output vectors. Such representations between the encoder and the decoder RNNs are called “thought vectors”, which are claimed to represent the meaning of the sequence in a high-dimensional space. The training of such an architecture is done by maximising the conditional probability of the target sequence. If the input sequence is denoted as \((x_1, x_2, \cdots , x_T)\) and the corresponding output sequence as \((y_1, y_2, \cdots , y_{T'})\) (T does not necessarily equal \(T'\)), the next symbol generation is done by maximising Eq. 17.

$$\begin{aligned}&\prod _{t=1}^{T'} P(y_t|y_{t-1},y_{t-2},\cdots ,y_{1},c) \nonumber \\&\quad = P(y_{T'},y_{T'-1},\cdots ,y_{1}|x_{T},x_{T-1},\cdots ,x_{1}) \end{aligned}$$
(17)
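The left-hand side of Eq. 17 is a product of per-step conditional probabilities, which is usually evaluated as a sum of logarithms to avoid numerical underflow on long sequences. The Python sketch below illustrates this with hypothetical per-step probabilities; in a real ED model each value would come from the decoder output at step t, conditioned on the previous outputs and the encoded vector.

```python
# Sketch: log of the product in Eq. 17 from per-step conditional probabilities.
import math
from typing import Sequence


def sequence_log_probability(step_probabilities: Sequence[float]) -> float:
    """Sum log P(y_t | y_1..y_{t-1}, c) over all output steps."""
    if any(not (0.0 < p <= 1.0) for p in step_probabilities):
        raise ValueError("each conditional probability must lie in (0, 1]")
    return sum(math.log(p) for p in step_probabilities)


# Hypothetical decoder probabilities for a three-symbol output sequence.
log_p = sequence_log_probability([0.5, 0.5, 0.25])
sequence_probability = math.exp(log_p)  # 0.5 * 0.5 * 0.25
```

Maximising the summed log-probability over the training pairs is equivalent to maximising the product in Eq. 17, since the logarithm is monotonic.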

Generic RNNs are not able to approximate the probability of sequences with arbitrary length because of the vanishing gradient problem, but other novel RNNs, such as LSTM and BRNN (Bi-directional Recurrent Neural Networks), have been successfully employed to construct the ED architecture to “understand” (encode) and to “generate” (decode) the temporal sequences. Furthermore, due to the recent popularity of parallel computation on GPUs, it has become possible to train and use such architectures to solve problems such as machine translation and image captioning.

As the MTRNN can also avoid the vanishing gradient problem, and larger MTRNNs can be implemented on GPUs, it is also possible to embed the MTRNN into the ED architecture. In fact, the slow context level \(C_s\) already exhibits a feature similar to “thought vectors”, using a stable neural vector to represent the basic profiles of motor actions and object instances (in our robotic experiment). Both also have similar bi-directional information flows, which allow the networks to recognise and to generate the time sequences. Despite these similarities, compared with LSTM, the MTRNN has other distinct features. First, from the above experiments and from other MTRNN experiments (Heinrich et al. 2015; Hinoshita et al. 2009), it has been shown that the fast context layers and slow context layers exhibit various dynamics to explicitly represent the relationship between the verbs and nouns. The deep LSTM, on the contrary, has not been reported to have similar dynamics. Second, unlike the static vector representation of the LSTM, the context layers allow a “slow” change through time, which is more realistic for an interactive environment, where it can be used to dynamically exhibit the meaning of sentences and sensorimotor information.

Admittedly, the training of deep RNNs, e.g. LSTMs and MTRNNs, requires a large amount of computation. But the recent development of GPU computing provides an opportunity to construct and test such large-scale neural networks within a reasonable time and budget. The combination of MTRNN, the concept of “thought vectors” and its embodiment in robotic systems will allow us to further explore issues such as:

  1.

    Comparing the performance of MTRNN, LSTM, and BRNN within the ED architecture, and examining their performance on robotic platforms.

  2.

    Incorporating the robot motor action, as a natural temporal sequence, into the training of RNNs within the ED architecture, with connections to other modalities.

6 Conclusion

This paper presents a neurorobotic study on noun and verb generation and generalisation using MTRNN networks with a large data-set consisting of vocal language commands, visual object, and motor action data. Although the generalisation abilities of hierarchical RNNs (RNNPB, MTRNN) have been reported in previous research, this is the first study to demonstrate their generalisation capability using such a large data-set, which enables the robot to learn to handle real-world objects and actions. These experiments showed that generalisation is possible even with a large number of test-sets (9 motor actions and 9 objects placed in 6 different locations). This is particularly important because the recurrent connections between the verbs and nouns are associated with different modalities of the training data, and these associations are strengthened during embodied training by the sensorimotor interaction. Detailed analyses of the robot’s neural controller showed that the dynamics on different layers are self-organised in the MTRNN. These self-organised dynamics further constitute a functional hierarchical representation on different layers, which associates different lexical structures with different modalities of the sensorimotor inputs. The MTRNN showed how the embodied information about the verbs dominates a large portion of the network dynamics, since the proprioceptive information plays a significant role in the training sequences. As such, hierarchical RNNs, such as the MTRNN, are shown to be particularly beneficial in building a neurorobotic cognitive architecture for language learning in robotic systems, where the recurrent connections are able to self-organise and build associations between embodied information in different modalities and the lexical structure information.