Peers’ Experience Learning for Developmental Robots

Humans have a fundamental ability that humanoid robots lack: learning from others' experience for their own use. Several attempts have been made to address specific situations in the evolution and study of developmental robots, but such attempts have notable limitations; in particular, learning from others' experience is overlooked. This article proposes a peers' experience learning method. We first review the evolution and development of developmental robots through typical studies, tracing the shift from humanlike to developmental designs. These terms are then reconsidered from the humanoid robot's viewpoint, particularly in light of two developmental principles: the verification principle and the embodiment principle. Next, a conceptual model of peers' experience learning is proposed based on these principles, and simulation results show that robots can "copy" peers' experience to cognize and develop autonomously. Finally, a general discussion and proposals for addressing future issues are given.


Introduction
Peers' experience learning (PEL) lets humans grasp contexts they have never encountered before, which is important for their mental development. It is also regarded as an essential requirement of future intelligent robots. For humanoid robots, PEL has recently been addressed in brief studies from the perspective of mental development [1], and several attempts have been made to address specific PEL contexts (e.g. [2,3]). Designers aim to characterize PEL behaviors in humans and to transfer the mechanism to humanoid robots, and many efforts have been devoted to constructing more diligent and more humanlike robots.
Elman et al. [4] attempted to define "development" to clarify an approach to designing developmental robots (DR). However, their definitions are imprecise and do not seem to be supported by simulations or experimental results. Weng [5] therefore gave an implemented developmental algorithm based on image analysis, and further proposed a developmental model [6] for agents, enabling developmental robots to learn internally through interaction with people.
Before long, the number of designs and applications integrating developmental algorithms increased dramatically, bringing DR closer to real scientific practice [7]. Remarkable examples include the anticipating-rewards robot proposed by Blanchard and Canamero [8], the developmental framework by Meng and Lee [9], SS-RICS by Kelley [10], and the ACT-R-based LOC by Zelazo [11]. These works incorporate many concepts necessary for human intelligence, providing diverse ways to advance DR according to brain mechanisms and biobehavioral studies. However, humanlike interactive learning is still a complex problem.
The importance of interactive learning is discussed by Asada [3], whose team has constructed a conceptual model based on an infant robot, providing more authentic experience-sharing experiments [5,12,13]. Their work focuses mainly on how humans affect developmental processes through synthetic or constructive approaches, but PEL is overlooked, especially the transplanting and verification of peers' experience based on self-other distinction. Figure 1 gives a schematic depiction of evolution and development in the context of DR thus far: the horizontal axis is time and the vertical axis indicates "developmental level", tracing a trajectory from cognitive and humanlike robots, through self-developmental robots, to interactive and developmental robots. Figure 1 thus shows how research has progressed over time.
Following Stoytchev [14], there are two basic principles of DR: the verification principle and the embodiment principle. PEL is therefore expected to be achieved by interacting with other, experienced robots to engraft peers' experience, and then verifying that experience knowledge against the robot's own embodiment so that it becomes self-serving. Asada et al. [15] have advocated that the key technologies of DR are interaction and development. However, such work has not been sufficiently precise from the biobehavioral and brain-science perspective.
The present paper proposes a DR based on PEL, in order to better understand experience-sharing processes through synthetic and constructive approaches, especially the transplanting of peers' experience into a robot's own experience knowledge.
The rest of the article is organized as follows. Section 2 introduces PEL from humans' perspective. Section 3 provides a conceptual model of DR based on PEL. And the simulation results are presented in Sect. 4. Finally, the conclusions and future developments are discussed.

Peers' Experience Learning of Humans
Humans have a fundamental capacity to learn from others: they share experience even though every individual is unique. Suppose someone needs to cross a labyrinthine environment full of narrow paths. He asks peers who have traversed the environment before for more information, and they may warn him to watch out for the narrow paths because they are difficult to pass. He then compares himself with his peers to ensure a successful passage, and once he gets through, the experience becomes his own. This is a hypothetical scenario, but similar ones occur constantly in human PEL. PEL between humans is a necessary means of everyday social communication and paves the way for human development [16].
In the mammalian brain, feedback connections play important roles in PEL [17]. Hideyuki et al. [18] point out that PEL obtained through social interaction with a variety of individuals uniquely modulates the activity of brain networks. Moreover, they found two mental dimensions at work when one interacts with others (see Fig. 2): one, "mind-holderness" (red in Fig. 2), in which humans borrow from the experiences of others through interaction; the other, "mind-readerness" (blue in Fig. 2), in which humans use others' experiences for reference.
As a result, humans first "read" peers' experience, then judge its reasonableness and verify its feasibility themselves, and finally "hold" the judged and verified experience for their own use [19]. The "mind-readerness" and "mind-holderness" components are thus central to human PEL, and a human PEL mechanism can be derived from brain-science studies [20], as shown in Fig. 3. In Fig. 3, the black flow represents the PEL process: (1) Learn. Humans must have a way to learn peers' experience. (2) Sense. Humans use their eyes or hands, seeing or touching, to sense the environment [21]. (3) Preprocess. Information about obstacles is massive and unclassified, so the sensed information must be classified and preprocessed. (4) Think. Humans use the brain to decide whether to adopt peers' experience, to generate action strategies, and thereby to develop the brain itself. (5) Pre-act. The generated action strategies cannot be decoded and executed by the limbs directly, so they must be clarified and transformed into body language. (6) Act. The limbs follow the brain's commands to execute the actions.

Conceptual Architecture for DR
We adhere to the developmental principles outlined by Stoytchev [14] to meet the requirements of DR. The relevant points are as follows: (a) The verification principle: a DR can learn from and maintain peers' experience knowledge only to the extent that it can verify that knowledge (tried-and-true knowledge) by itself. (b) The embodiment principle: robots may be nearly identical from birth, yet they are distinct and self-specifying. A DR should therefore be able to re-modify peers' experience on the basis of its own embodiment parameters.
Based on these two principles and the human PEL mechanism, the DR architecture can be obtained as shown in Fig. 4, where the PEL of the DR is described by the black flow: (1) Learn. The DR transplants peers' experience via a uniform protocol between robots. (2) Sense. The DR uses onboard sensors to sense the surrounding environment. (3) Preprocess. The massive, unclassified information is preprocessed. (4) Brain. This part derives proper action strategies and stores them so that the DR develops itself. (5) Pre-act. The action strategies cannot be decoded and executed by the robot directly, so they are clarified and transformed into commands. (6) Act. The DR follows the commands to act in the environment.
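The six-stage loop above can be sketched in code. This is a minimal, hypothetical illustration, not the authors' implementation: the class and method names mirror the stage names in the text, while the internal representations (a dict "brain", string strategies) are our own simplifications.

```python
from dataclasses import dataclass, field

@dataclass
class DevelopmentalRobot:
    brain: dict = field(default_factory=dict)   # B(t): accumulated experience

    def learn(self, peers_experience):
        # (1) Learn: transplant peers' experience via a shared protocol.
        self.brain.setdefault("peers", []).append(peers_experience)

    def sense(self, environment):
        # (2) Sense: read onboard sensors (here, a plain dict of readings).
        return environment.get("readings", [])

    def preprocess(self, raw):
        # (3) Preprocess: classify the mass of unlabeled sensor data.
        return sorted(set(raw))

    def think(self, percepts):
        # (4) Brain: decide whether to adopt peers' experience and
        # produce an abstract action strategy.
        strategy = "follow_peer" if self.brain.get("peers") else "explore"
        self.brain["last_strategy"] = strategy
        return strategy

    def pre_act(self, strategy):
        # (5) Pre-act: translate the strategy into motor-level commands.
        return {"follow_peer": ["advance"], "explore": ["scan", "advance"]}[strategy]

    def act(self, commands):
        # (6) Act: execute the command sequence.
        return [f"executed:{c}" for c in commands]

    def step(self, environment, peers_experience=None):
        if peers_experience is not None:
            self.learn(peers_experience)
        percepts = self.preprocess(self.sense(environment))
        return self.act(self.pre_act(self.think(percepts)))
```

A robot with transplanted experience follows it directly, while a naive robot must first scan its surroundings, matching the flow of Fig. 4.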

Peers' Experience Learning
PEL is conducted at discrete time instances (t = 0, 1, 2, …) through the following definitions.

Definition 1 A robot agent M may have several sensors (including exteroceptive and interoceptive sensors that sense stimuli from the external and internal environment; see, e.g., Weng [22]) and effectors, whose sensory and control signals at time t are collectively denoted by S(t) and C(t). M's embodiment is denoted by E (it is not time-varying) and peers' experience by P(t).
Definition 2 M has a "brain" denoted by B(t), and the time-varying state-update function f_t updates B(t) at each time t based on: (1) sensory input S(t), (2) control output C(t), (3) peers' experience knowledge P(t), (4) embodiment parameters E and (5) the current "brain" B(t):

B(t + 1) = f_t(S(t), C(t), P(t), E, B(t)).    (1)

The control output C(t + 1) is generated by the action generation function g_t based on B(t + 1):

C(t + 1) = g_t(B(t + 1)).    (2)

As can be seen, the DR not only matches the habit of human cognition and behavior (thinking before acting), but also implies that a robot agent M cannot have two separate phases for learning and performing (it learns while performing).
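The update rule of Definition 2 can be exercised with a toy sketch. The functions below are illustrative placeholders for the real learned maps f_t and g_t, and the brain representation (a dict) is an assumption; only the signatures B(t+1) = f_t(S(t), C(t), P(t), E, B(t)) and C(t+1) = g_t(B(t+1)) come from the text.

```python
def f_t(S, C, P, E, B):
    """State-update: B(t+1) = f_t(S(t), C(t), P(t), E, B(t))."""
    B_next = dict(B)
    B_next["t"] = B.get("t", 0) + 1
    B_next["last_input"] = (tuple(S), tuple(C), tuple(P), tuple(E))
    return B_next

def g_t(B_next):
    """Action generation: C(t+1) = g_t(B(t+1))."""
    # Placeholder policy: act only after at least one brain update.
    return ["advance"] if B_next.get("t", 0) > 0 else ["wait"]

# Learning-while-performing loop: no separate learning and performing phases.
B = {}
C = ["wait"]
for t in range(3):
    S, P, E = [0.1 * t], ["node"], [0.2]
    B = f_t(S, C, P, E, B)   # update the brain first (think before acting)
    C = g_t(B)               # then generate the next control output
```

Note that g_t sees only B(t+1), so every action is mediated by the updated brain, exactly as the definition requires.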
As a PEL example, consider a robot that has to traverse a labyrinthine environment and finally succeeds, accumulating a large amount of information. Certainly not all of this information should be transplanted and borne in mind; the robots need only the necessary information, i.e., as little as possible. We therefore introduce "nodes knowledge" to express peers' experience.

Definition 4
Nodes knowledge: the labyrinthine environment is modeled with three kinds of nodes. (1) Turning nodes N_t: the crossings where the robot has to follow a circular-arc trajectory. (2) Straight nodes N_s: the straight ways. (3) Key nodes N_k: the nodes the robot has to reach or pass through, i.e., target nodes.
So peers' experience is represented as P = {N_t, N_s, N_k, E}. To the robot that receives peers' experience it is not time-varying knowledge; however, the robot develops it into its own knowledge as time goes by, so we revise P to P(t) in consideration of development over time.

Fig. 4 The DR learning mechanism

As shown in Fig. 5, the three kinds of nodes knowledge and the embodiment comprise peers' experience.
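A possible data structure for P = {N_t, N_s, N_k, E} is sketched below. The (x, y) node coordinates and the embodiment fields are illustrative assumptions; only the split into turning, straight and key nodes plus embodiment comes from the definition.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Node = Tuple[float, float]   # (x, y) position in the labyrinth (assumed)

@dataclass
class PeersExperience:
    turning_nodes: List[Node] = field(default_factory=list)   # N_t: arc-following crossings
    straight_nodes: List[Node] = field(default_factory=list)  # N_s: straight ways
    key_nodes: List[Node] = field(default_factory=list)       # N_k: must-visit targets
    embodiment: dict = field(default_factory=dict)            # E: the peer's body parameters

# A peer's compressed record of one successful traversal.
P = PeersExperience(
    turning_nodes=[(2.0, 3.0)],
    straight_nodes=[(0.0, 0.0), (2.0, 0.0)],
    key_nodes=[(5.0, 5.0)],
    embodiment={"radius": 0.3},
)
```

Storing only these nodes, rather than the full sensor trace, realizes the "as little as possible" requirement stated above.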

Brain
The DR aims not simply at transplanting peers' experience but, more challengingly, at building a paradigm that offers a new understanding of how to design a humanoid robot that interacts with and learns from others in order to enrich its brain. Roughly speaking, the brain of the DR consists of two phases that produce its own experience, moving from social development to individual development: first judging whether peers' experience is proper, then verifying whether it is feasible.

PEL Justification
Unacquainted with the environment described by peers, a person may ask, "If I were in his situation, what would I do?" Peers' experience is therefore judged on the basis of individual characteristics, which psychology calls "mental rehearsal" [23,24] (Fig. 6).
PEL justification is a mental rehearsal of peers' experience, which divides P(t) into feasible nodes P_fr(t) and infeasible nodes P_ir(t): P(t) = P_fr(t) + P_ir(t).
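The split into P_fr(t) and P_ir(t) can be illustrated with a toy feasibility test. The passability criterion (corridor width versus the robot's own diameter) and the node annotation are our assumptions for illustration; the source specifies only that justification partitions P(t) against the receiver's own characteristics.

```python
def justify(nodes, own_radius):
    """Mental rehearsal: split peers' nodes into feasible P_fr and infeasible P_ir."""
    feasible, infeasible = [], []
    for node in nodes:
        # node = (x, y, corridor_width) -- hypothetical annotation
        if node[2] >= 2 * own_radius:   # corridor must fit the robot's diameter
            feasible.append(node)
        else:
            infeasible.append(node)
    return feasible, infeasible

# A peer with a smaller body reports three corridors; the receiver rehearses
# them against its own radius of 0.3 (diameter 0.6).
nodes = [(1.0, 0.0, 0.8), (2.0, 1.0, 0.4), (3.0, 1.0, 1.2)]
P_fr, P_ir = justify(nodes, own_radius=0.3)
```

Here the 0.4 m corridor ends up in P_ir: feasible for the peer, but not for this embodiment.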

PEL Verification
In PEL justification, peers' experience knowledge is bound to mental rehearsal; yet "knowledge starts with practice". Peers' experience must be verified in the real environment without being given any other information.
Let the "brain" state of the robot be denoted by a state vector B(t), a random process as Eq. (1) implies, closely related to Markov decision processes (MDP) [25]. Taking the uncertainty in states into account, the state-update function f_t becomes

B(t + 1) ~ p(B(t + 1) | S(t), C(t), P(t), E, B(t)),

and the action generation function g_t

C(t + 1) ~ p(C(t + 1) | B(t + 1)),

where p(·) denotes the probability. Although both peers' experience and the verified information reach the "brain", only the verified information enters the action generation function g_t.
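The stochastic update can be sampled directly. The toy transition table below (verification either succeeds or not at each step, and verified knowledge stays verified) is entirely our invention, intended only to show the sampling form of the MDP-style update.

```python
import random

random.seed(0)

def sample_next_state(B, transition):
    """Sample B(t+1) from the conditional distribution for the current state B(t)."""
    states, probs = zip(*transition[B].items())
    return random.choices(states, weights=probs, k=1)[0]

# Hypothetical verification process: each practice step verifies the
# transplanted experience with probability 0.7; "verified" is absorbing.
transition = {
    "unverified": {"verified": 0.7, "unverified": 0.3},
    "verified":   {"verified": 1.0},
}

B = "unverified"
for _ in range(10):
    B = sample_next_state(B, transition)

# Only verified knowledge is allowed through to the action generator g_t.
action = "use_experience" if B == "verified" else "explore"
```

The gate at the end mirrors the rule that unverified peers' experience never drives actions directly.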

Brain
The DR needs to form its own "brain" to memorize high-dimensional experience as well as peers' experience. One robot may pass on its own experience together with the experience heard from others, so as to provide more information. Our proposed "brain" is constructed from the global states

B = {P_pe, E_pe, P_oe, E_oe},

where P_pe and P_oe are peers' experience and own experience, and E_pe and E_oe are peers' embodiment and own embodiment.
However, experience accumulation over time induces a severe problem, a "data disaster": because peers' experience spreads from one robot to another, P_oe and E_oe are added to P_pe and E_pe at each transplant. The brain at the k-th transplant is

B_k = {P_pe^(1), E_pe^(1), …, P_pe^(k), E_pe^(k), P_oe, E_oe}.

As can be seen, the "brain" may encounter a data disaster when k grows far beyond its initial value. Inspired by a new hierarchical statistical modeling method [26], the "brain" dimension is reduced with an incremental hierarchical discriminating regression (IHDR) tree [27], as Fig. 7 shows, which groups similar features into feature clusters so as to improve convergence performance.
So different peers' experience can be combined to reduce the data in the brain:
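IHDR builds a discriminating regression tree incrementally; as a greatly simplified, hypothetical stand-in for that reduction, the sketch below merely merges near-duplicate experience nodes so the "brain" does not grow linearly with every transplant. The distance threshold eps is an assumption.

```python
def merge_experience(all_nodes, eps=0.5):
    """Keep one representative for every cluster of nodes within eps of each other."""
    representatives = []
    for x, y in all_nodes:
        # Add a node only if no existing representative is within eps.
        if not any((x - rx) ** 2 + (y - ry) ** 2 <= eps ** 2
                   for rx, ry in representatives):
            representatives.append((x, y))
    return representatives

# Two peers report almost the same path; after merging, storage is halved.
peer1 = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
peer2 = [(0.1, 0.0), (1.1, 0.1), (2.0, 0.1)]
brain = merge_experience(peer1 + peer2)
```

Real IHDR additionally organizes the clusters hierarchically for fast retrieval, which this flat sketch omits.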

Embodiment for DR
Actions cannot be performed in the absence of embodiment: robots must have some way to affect the world, i.e., a body [28-30]. Damasio [31] holds the view that "the brain is the body's captive audience"; in other words, all commands must be compatible with the properties of the robot's own body. Robots are created uniquely and may have different properties (e.g., velocity and acceleration) even with the same morphological structure. In a robot agent M, the embodiment E is represented distributedly by different states e_i: E = {e_1, e_2, e_3, …, e_n}. Consider a differential-drive mobile robot (see Fig. 4); its embodiment can be described as

E = {v_l, v_r, r},

where v_l and v_r are the velocities of the left and right wheels, and r is the robot's radius (the robot is assumed circular). The speeds of the two wheels control the robot's motion: when they are equal, straight-line motion is attained, while for different speeds the trajectory follows a circular arc.
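The wheel-speed behavior can be checked with standard differential-drive kinematics. One assumption not stated in the text: we take the wheel separation to equal the robot diameter 2r (wheels at the rim).

```python
def body_velocities(v_l, v_r, r):
    """Map wheel speeds to body motion: return (linear velocity v, angular velocity omega)."""
    v = (v_l + v_r) / 2.0          # forward speed of the body center
    omega = (v_r - v_l) / (2.0 * r)  # turn rate, assuming wheel separation 2r
    return v, omega

# Equal wheel speeds -> straight-line motion (omega = 0).
v, omega = body_velocities(0.5, 0.5, r=0.2)
assert omega == 0.0

# Unequal speeds -> circular arc of radius v / omega.
v, omega = body_velocities(0.3, 0.5, r=0.2)
arc_radius = v / omega
```

This makes explicit why a wider robot (larger r) turns more gently for the same wheel-speed difference, which is precisely the embodiment dependence PEL justification must account for.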

PEL Algorithm Based on DR
The PEL approach is implemented and tested in eight challenging scenarios, classified into two categories (see Table 1). In addition, simulations of the last two scenarios were repeated in different environments to verify the effectiveness of the proposed algorithm. For each case, the state of the "brain" is measured as an indicator of the development achieved by the PEL algorithm.
Robot trajectories for the 2-ascending and 2-descending scenarios are shown in Fig. 8; later positions are drawn on top of earlier ones. Comparing the two scenarios shows that the DR can transplant peers' experience for its own use; moreover, once a robot holds justified peers' experience, trajectories are obtained without any path-planning algorithm.
The trajectories of the robots in the 3-ascending and 3-descending scenarios are shown in Figs. 9 and 10. Considering the state of the "brain" (see Figs. 9b, 10b; the brain is quantified by the number of trajectory nodes), the two scenarios are quite opposite: the "brain" in the 3-ascending scenario grows over time, while that in the 3-descending scenario stays the same. This is because the IHDR algorithm effectively reduces the "brain" dimension, improving robustness; it also makes PEL easier to implement and to develop incrementally.

Table 1 (fragment): the scenarios include 3-small-descending, 3-big-ascending, 3-mid-ascending and 3-mid-descending, each defined by an ordering of passage sizes (e.g. "big to middle, then middle to small").

Fig. 9 The robots' trajectories in the 3-ascending scenarios and the developmental brain

The 3-small-descending and 3-big-ascending scenarios of Fig. 11, together with the 3-mid-ascending and 3-mid-descending scenarios of Fig. 12, represent an out-of-order PEL flow sequence. For a more detailed comparison, the simulations in Fig. 12 were repeated in different environments and with various values of the robot radius. The results show that a robot can learn from its predecessors, turn PEL into its own knowledge, and then verify that knowledge in the environment to develop its "brain". Furthermore, although the brains of robot 2 and robot 3 are almost the same (see Fig. 12b; they hold the same number of trajectory nodes), robot 1 follows robot 3. This is because we take a safety factor into consideration: when a robot has two different peers' experiences with the same number of trajectory nodes, it prefers the larger and safer one (Fig. 13).

Fig. 10 The robots' trajectories in the 3-descending scenarios and the developmental brain

Fig. 11 The robots' trajectories in the 3-small-descending and 3-big-ascending scenarios

Comparison with Other Algorithms
To validate the feasibility and effectiveness of the DR-based PEL algorithm, particle swarm optimization (PSO) [32] and ant colony (AC) optimization [33] were chosen for comparison. All simulations were conducted in Matlab 2012a on an Intel Core i5.
Suppose that before entering the labyrinth (100 m long and 100 m wide) every environment is unknown, and the robot has to find a way out as fast as possible. The simulation results are shown in Fig. 14a and the corresponding computation times in Fig. 14b. Figure 14a shows that PEL, PSO and AC can all find a safe path; however, their computation times differ, as Fig. 14b indicates. Although the sampled times fluctuate, they remain stable within limits. The mean times of AC, PSO and PEL are 0.148 s, 0.117 s and 0.072 s respectively, showing that the PEL algorithm outperforms PSO and AC: PEL consumes the least time, PSO comes second, and AC needs the most. This is because PSO and AC must continuously re-plan the trajectory when they encounter a complex environment.
Even when entering the same environment again, PSO and AC must repeat the previous calculation because they cannot learn to develop themselves. The extra computation time causes instability in the robots (the more time consumed, the more unstable the robot). The PEL algorithm, by contrast, obtains a safe path directly and dispenses with repetitive calculation in similar environments by referring to peers' experience.

Conclusions
In terms of humans, we have argued how a DR can follow a developmental pathway similar to natural PEL. After reviewing the terminology from a biobehavioral perspective, a conceptual constructive model for acquiring peers' experience and turning it to the robot's own use has been proposed. Some concluding remarks follow.

1. The "brain" of the agent is closed after "birth", meaning that B(t) cannot be altered directly by human programmers; it can only be updated through interaction with the outer environment according to Eq. (1).
2. The DR has been proposed, and a conceptual constructive model for PEL has been devised in parallel with self-other cognitive development, on the basis of the justification and verification of peers' experience.
3. The proposed constructive model is expected to shed new light on our understanding of PEL for DR, which can be directly reflected in the design of the "brain".

Fig. 12 The robots' trajectories in the 3-mid-ascending and 3-mid-descending scenarios
Still, several issues need attention and further investigation, including practical studies and studies in dynamic environments.