Paladyn Journal of Behavioral Robotics
A Formalism for Learning from Demonstration

The paper describes and formalizes the concepts and assumptions involved in Learning from Demonstration (LFD), a common learning technique used in robotics. LFD-related concepts like goal, generalization, and repetition are here defined, analyzed, and put into context. Robot behaviors are described in terms of trajectories through information spaces and learning is formulated as mappings between some of these spaces. Finally, behavior primitives are introduced as one example of good bias in learning, dividing the learning process into the three stages of behavior segmentation, behavior recognition, and behavior coordination. The formalism is exemplified through a sequence learning task where a robot equipped with a gripper arm is to move objects to specific areas. The introduced concepts are illustrated with special focus on how bias of various kinds can be used to enable learning from a single demonstration, and how ambiguities in demonstrations can be identified and handled.


Introduction
Learning From Demonstration (LFD) is a well-established technique for teaching robots how to perform useful tasks. The basic idea is that the robot learns a behavior from one or several demonstrations performed by a teacher, most often a human. The research area is attractive both for its intuitive approach to human-robot interaction and as a framework for theoretical analysis of knowledge representation and transfer of knowledge between intelligent agents.
Research on LFD is influenced by a variety of fields, including control theory, artificial intelligence, psychology, ethology, and neurophysiology. While primarily a big asset, the multidisciplinary nature of LFD also contributes to the lack of a unified formalism for the different components constituting the research field. It should not come as a surprise that the terminology differs between works conducted by researchers from different areas. In this paper, we define and formalize the common ideas and principles involved in LFD. The presented work is both a survey of how these concepts are used in research and an attempt to describe them in the light of related concepts in machine learning, planning theory, and psychology. To our knowledge this has not previously been done in a unified way, and the result can be used both as a theoretical introduction to the field and as a framework for further development and research.
In contrast to other surveys of the area [4,12], the present work specifically focuses on LFD where the robot is directly controlled during demonstration, e.g. via teleoperation or kinematic teaching. While this direction removes some of the hard and important issues in LFD, it allows increased focus on other aspects, specifically how bias is introduced into the LFD process. The formalism is applied to a sequence learning task in which the introduced concepts are illustrated with a special focus on how bias of various kinds can be used to enable learning from a single demonstration, and how ambiguities in demonstrations can be handled. The formal approach is inspired by the work on planning and actuation by LaValle [53] and therefore does not always follow the terminology and notation found in common literature on LFD. Where this is the case, it is highlighted and the commonly used terms are referred to.
In Section 2, a few fundamental concepts that form the basis for the rest of the paper are introduced. Section 3 gives a formal description of the learning process using these concepts. In Section 4, the introduced formalism is applied to a sequence learning task using a Khepera robot equipped with a gripper arm. Section 5 summarizes the paper and discusses directions for future research. A symbol index summarizing the introduced notation can be found in Table 5.

Basic concepts

State space
One fundamental component in classical AI is the concept of a state space $X$, described by a world ontology [77, p.222]. The state space can be defined as a set of all possible situations that could arise in the world [53, p.17]. More specifically, the state space only includes the relevant aspects of the world, given a particular task or limited set of tasks. However, if the task is unknown it is very difficult to identify which aspects of the world are relevant. One could of course try to include all aspects that might be of interest, but even if possible, that would result in a huge and complex state space, implying tremendous sensing requirements when applied to a field such as LFD. Furthermore, defining a state space introduces many unnecessary assumptions about the world, and requirements for information, which make the problem much more complex than necessary. This observation is nicely illustrated by Simon's ant [81] and is also related to the frame problem [47,62].
Author copy
For these reasons, it is desirable to create new spaces, less task-specific and sensor-demanding, in which behaviors can be represented. Such a redefined representation is referred to as an information space [53, ch.11]. The concept of information spaces is also common within LFD, but appears under different names. In order to facilitate learning, approaches to LFD often utilize so-called primitives or skills. These primitives can be seen as building blocks from which more complex behaviors can be composed, which moves the learning process away from the state space into a new representational space composed of the available skills, e.g. [8,34,46,51,65,68].
Many of these approaches relate strongly to Behavior Based Control (BBC) [5,58,60]. BBC has its roots in the reactive paradigm, but emphasizes parallel, loosely connected behaviors for control of the robot as an emergent property, rather than a single stimulus-response loop. The possibility of applying the concept of information spaces within LFD is further investigated in Section 3, but first a few other basic concepts have to be introduced.

Sensing and acting
Imagine an agent interacting with the environment. It perceives the world through its sensors and acts upon the world with its actuators.
The sensors are defined as a function $h : X \to Y$ transforming a state $x \in X$ into a sensor state $y \in Y$ [53, p. 561]. $Y$ denotes the observation space, i.e., the set of all possible readings returned by the agent's sensors. Each $y \in Y$ is a vector $(y^{(1)}, y^{(2)}, \ldots)$ comprising simultaneous values from all sensors. Typical examples are a thermometer that maps physical temperatures to numbers $y^{(1)} \in \mathbb{R}$ or a GPS receiver that maps physical positions to latitude and longitude, $y^{(2)} \in \mathbb{R}^2$. $Y$ corresponds to the stimulus domain in behavior-based robotics [5].
On the actuator side, actions can be said to transform a state into another state. Hence, actuators implement the function $f : X \times U \to X$, where $U$ denotes the action space, i.e., the set of all possible actions the agent can execute. A typical example is the requested velocity for each motor of the robot. Note that this does not specify the actual motor velocity; only the outgoing information is represented in $U$. The actual velocity is usually represented in the state space $X$.
Now a description of how the agent behaves, i.e. generates actions, can be introduced. In general, such a description is referred to as a controller, but it is also known as a plan [53, p.560], behavior mapping [5,27,68,71], motor primitive [3], control policy [4] or inverse model [39]. Several important differences between these terms do exist, for example in terms of abstraction level and temporal extension, but for now they can all be said to implement the function $\pi$:
$$\pi : X \to U \tag{1}$$
Hence, $\pi$ maps states $x \in X$ to actions $u \in U$. As mentioned before, $X$ is not explicitly represented in the agent. Still, the physical sensors and actuators can be said to implement the functions $h$ and $f$, respectively. In contrast, $\pi$ cannot be implemented without an explicit definition of and access to $X$. To solve this issue, $\pi$ is later redefined so that it controls the agent based on the information space instead of the state space.
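The three mappings introduced in this section (sensing from X to Y, acting from X × U to X, and the controller π from X to U) can be sketched for a toy agent. This is only an illustration under stated assumptions: the one-dimensional world, the function names `sense`, `transition`, `controller` and `run`, and the target position 3.0 are all hypothetical and not taken from the paper.

```python
from typing import List, Tuple

State = float        # x in X: hypothetical 1-D position of the robot
Observation = float  # y in Y: what the position sensor reports
Action = float       # u in U: requested velocity, not the actual one


def sense(x: State) -> Observation:
    """Sensor mapping X -> Y: a position sensor with 0.5 resolution."""
    return round(x * 2) / 2


def transition(x: State, u: Action) -> State:
    """Actuator mapping X x U -> X: move for one unit time step."""
    return x + u


def controller(x: State) -> Action:
    """Controller pi: X -> U; drives toward position 3.0, speed capped at 1."""
    return max(-1.0, min(1.0, 3.0 - x))


def run(x0: State, stages: int) -> List[Tuple[State, Observation, Action]]:
    """Roll the agent forward and record (x, y, u) at every stage."""
    trace = []
    x = x0
    for _ in range(stages):
        y = sense(x)
        u = controller(x)  # needs x itself, which a real robot cannot access
        trace.append((x, y, u))
        x = transition(x, u)
    return trace
```

Note that `controller` reads the state directly. This is possible only in simulation, which is the limitation that motivates redefining π over the information space.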

Information space
The observation and action spaces are widely used by the robotics community. These spaces are often combined into an information space $I = U \times Y$, also known as the sensory-motor space [73].
In each stage $k$ the robot experiences a sensory-motor event $e_k = (u_{k-1}, y_k) \in I$. The action at $k-1$ is used since $u_k$ changes the current stage $k$ to $k+1$. One approach that extensively uses representations in $I$ is sensory-motor coordination (SMC) [72]. From an SMC perspective, sensing and acting are not two separate processes. In contrast to classical reactive systems, SMC does not view the information flow purely as going from sensors to actuators. Actions give rise to stimuli, just as much as stimuli influence actions. If the agent can predict these relations, it can intentionally control its interactions with the world. Hence, control is seen as a problem of coordination. Similar views are common within psychology, anthropology and cognitive science [37,45,82].
Let $\tilde{Y}_k$ be the history observation space, i.e., the set of all possible observation histories until the current stage $k$:
$$\tilde{y}_k = (y_1, y_2, \ldots, y_k) \tag{2}$$
where each vector $y_i \in Y$ is provided by the sensors at stage $i$. Similarly, let $\tilde{U}_k$ be the history action space, i.e., the set of all possible action histories until the current stage $k$:
$$\tilde{u}_k = (u_1, u_2, \ldots, u_k) \tag{3}$$
where each $u_i \in U$ is a particular action vector issued at stage $i$. The histories $\tilde{u}_{k-1}$ and $\tilde{y}_k$ in combination with the initial conditions $\eta_0$ form a history information state $\eta_k$, also referred to as an event history. $\eta_k$ includes all accumulated information up to stage $k$ [53, p.566]:
$$\eta_k = (\eta_0, \tilde{u}_{k-1}, \tilde{y}_k) \tag{4}$$
The initial conditions $\eta_0$ describe presumptions about the state of the world $X$ before stage 1. The history information state is a central concept in the formalism since it represents all the information the agent has received, and as a consequence $\eta_k$ is always known in stage $k$. $I_k$ is known as the history information space and should be understood as the set of all possible event histories up until stage $k$ [53, p.565]:
$$I_k = I_0 \times \tilde{U}_{k-1} \times \tilde{Y}_k \tag{5}$$
where $I_0$ represents the set of all possible initial conditions. The definition of $I_k$ becomes impractical in cases where the number of stages is not fixed. Instead, we normally refer to the information history space $I_{hist}$, which has an unspecified length [53, p.657]:
$$I_{hist} = I_0 \cup I_1 \cup I_2 \cup \cdots \tag{6}$$
$I_{hist}$ includes all possible combinations of everything the agent could possibly observe and do. Most $\eta \in I_{hist}$ will of course never appear, due to limitations imposed by the environment and the physical shape of the robot. For example, imagine a simple robot, equipped with a proximity sensor on each of its four sides, placed in a large, empty square box. In this environment, the robot never observes a $y$ with high activation of all proximity sensors simultaneously. This is a simple consequence of physical properties of the environment and the robot itself. The same reasoning could easily be applied to a human agent. There is a huge number of patterns the human senses could theoretically perceive, but only a fraction of these will actually be observed. Most of the formal definitions in this paper take place in the history information space $I_{hist}$.
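An event history can be pictured as a record that grows by one observation and one action per stage. The following is a minimal sketch, assuming tuple-valued observations and actions and a free-text description of the initial conditions; the class name `EventHistory` and its methods are hypothetical and introduced here only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Vector = Tuple[float, ...]


@dataclass
class EventHistory:
    """Event history: initial conditions, k-1 actions and k observations."""
    eta0: str                                      # initial conditions
    actions: List[Vector] = field(default_factory=list)
    observations: List[Vector] = field(default_factory=list)

    @property
    def stage(self) -> int:
        """Current stage k equals the number of observations received."""
        return len(self.observations)

    def observe(self, y: Vector) -> None:
        """Enter the next stage by appending its observation."""
        if len(self.observations) != len(self.actions):
            raise ValueError("an action must be issued before the next observation")
        self.observations.append(y)

    def act(self, u: Vector) -> None:
        """Append the action issued at the current stage."""
        if len(self.actions) != len(self.observations) - 1:
            raise ValueError("an observation must precede each action")
        self.actions.append(u)

    def last_event(self) -> Tuple[Optional[Vector], Vector]:
        """Most recent sensory-motor event: previous action, current observation."""
        k = self.stage
        u_prev = self.actions[k - 2] if k >= 2 else None
        return u_prev, self.observations[k - 1]
```

For example, after `observe`, `act`, `observe` the history is at stage 2 and holds two observations but only one action, matching the structure of the history information state.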
You might ask why representations take place in such a huge and complex space when only a fraction of its representational power is actually used. $I_{hist}$ should not be understood as the only representational space, but as one representational space, and a very basic one.
Any information the agent can acquire is representable as an event history $\eta \in I_{hist}$. Furthermore, $I_{hist}$ is, in contrast to the state space $X$, both well defined and completely task invariant, and is as such very suitable for learning purposes. However, in many other respects $I_{hist}$ is not the best representational space. $I_{hist}$ contains a lot of redundant information, making it difficult to extract features relevant to the specific task.
For this reason, a new derived information space $I_{der}$ may be created. $I_{der}$ should be seen as a simplification of $I_{hist}$, where relevant features are represented while irrelevant information is not retained [53, p.571]. The observant reader may think this sounds disturbingly similar to the formulation of the state space. This observation is highly relevant and reflects to some extent the purpose of inferring $I_{der}$. The use of derived information spaces as bias in learning, and its relation to the state space, is further discussed in Sections 3.2 and 3.4.
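As a rough illustration of a derived information space, the sketch below maps a full event history to two task-relevant features and discards everything else. The sensor layout (front distance, temperature, gripper load), the thresholds, and the function name `derive` are assumptions made for this example; the paper later denotes the general mapping κ.

```python
from typing import List, Optional, Tuple

# One sensory-motor event: (previous action, current observation).
# The observation layout (front distance, temperature, gripper load)
# is a hypothetical choice for this example.
Observation = Tuple[float, float, float]
Event = Tuple[Optional[str], Observation]


def derive(history: List[Event]) -> Tuple[bool, bool]:
    """Map an event history to the features (obstacle_ahead, holding_object).

    Only the latest observation is used; temperature and all earlier
    events are treated as irrelevant and dropped.
    """
    _, (front_distance, _temperature, gripper_load) = history[-1]
    return front_distance < 0.2, gripper_load > 0.5
```

Two histories that differ only in irrelevant details, such as temperature readings, map to the same derived element, which is what makes the derived space smaller and more task-specific.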

Controller
The controller defined in Equation 1 can now be reformulated in a form that allows it to be used without full access to the state space $X$:
$$u_k = \pi(\eta_k) \tag{7}$$
where $u_k \in U$ is the action vector issued at stage $k$ and $\eta_k \in I_{hist}$ is the agent's event history at stage $k$. $\pi$ is defined here as a function from information history space to action space:
$$\pi : I_{hist} \to U \tag{8}$$
In simple cases, a controller can be modeled as a function of only the most recent sensory-motor event. Systems based purely on such single-event controllers are called reactive systems [21]. Formally, these systems implement $\pi$ as
$$u_k = \pi(e_k) \tag{9}$$
where $e_k = (u_{k-1}, y_k)$ is the most recent sensory-motor event. This can be seen as a special case of Equation 7. This definition of $\pi$ is similar to Arkin's behavior mapping $\beta : S \to R$, where $S$ and $R$ are stimulus and response, respectively [5]. However, in the general case we use the definition of $\pi$ given in Equation 7.
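The relation between history-based and reactive controllers can be sketched as follows. As a simplifying assumption, the event history is represented as a plain list of sensory-motor events with scalar values, leaving out the initial conditions; `reactive`, `avoid` and `patrol` are hypothetical names used only in this example.

```python
from typing import Callable, List, Optional, Tuple

Event = Tuple[Optional[float], float]   # (previous action, current observation)
History = List[Event]                   # simplified event history


def reactive(policy: Callable[[Event], float]) -> Callable[[History], float]:
    """Lift a single-event policy to a controller over event histories."""
    return lambda eta: policy(eta[-1])


def avoid(event: Event) -> float:
    """Reactive policy: back off when the proximity reading is high."""
    _, proximity = event
    return -1.0 if proximity > 0.8 else 1.0


def patrol(eta: History) -> float:
    """History-dependent controller: reverse direction every three stages."""
    return 1.0 if (len(eta) // 3) % 2 == 0 else -1.0
```

Here `reactive(avoid)` ignores everything but the last event, whereas `patrol` cannot be written as a function of a single event because its output depends on the length of the history.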

Behavior
The word behavior is commonly understood as an agent's actions in relation to the environment, but in the robotics community it has many different meanings. In the present work, behavior is understood as a purposeful way of acting. This does not imply that behaviors include explicit representations of goals, but from an observer's point of view, the behavior can be said to implement some kind of purpose, or goal. This argument is developed in Section 3.3.
Using the introduced terminology, a behavior $B$ is defined as a subset of the information history space, $B \subset I_{hist}$. Each element in $B$ is an event history $\eta$ that represents one instance of the desired behavior.
Often, no explicit distinction is made between the observable interactions with the world and the mechanisms producing these interactions. However, $B$ describes nothing about how the behavior is produced, and therefore this notion of behavior differs from the terminology commonly used within behavior-based robot architectures [5,27,58,68]. $B$ is purely an intrinsic definition and describes the behavior exclusively from the agent's perspective.

Learning From Demonstration
Learning From Demonstration (LFD) is a well-established technique for robot learning. An overview of early work is found in the work by Bakker and Kuniyoshi [6], while recent work and a classification of the field are found in the survey by Argall et al. [4]. Another excellent survey of the area can be found in a recent book by Billard et al. [12]. The basic idea in LFD is that the robot learns to do things by observing other agents, be it human beings or other robots. The problem is often structured around four questions: what, how, when and who to imitate [11,12]. What to imitate refers to the problem of identifying which aspects of the demonstration are relevant for the task [20]. How to imitate is the question of how the skill is to be encoded. A central part of this issue is the correspondence problem [66,67], which refers to the process of mapping the observed sequence of events to corresponding actions of the pupil. In most practical situations the pupil is not given an explicit set of demonstrations, but must detect when the teacher is doing something related to the task to be learned. This problem is known as when to imitate. Finally, who to imitate refers to the identification of the teacher, which is also a difficult issue in many applications. These four questions are very general and can also be applied to learning situations with human or animal pupils. In practice, what and how to imitate are the most frequently studied problems within LFD.
New behavior can be demonstrated to a robot in many ways, for example by having the robot pupil watch the teacher demonstrate the desired behavior. Here we focus on LFD where the teacher directly controls the robot, e.g. by teleoperation. The recorded data sequence from such a control session, including both executed motor commands and sensor readings, is called a demonstration. The purpose of LFD is to create a controller $\pi$ capable of reproducing the demonstrated behavior.
While there are many other ways to demonstrate a new behavior to a robot, LFD via teleoperation constitutes a well-defined setting that can be generalized to many practical applications. Formally, a demonstration is, in this setting, an event history $\eta_k \in I_{hist}$ (refer to Equation 4) where $\tilde{u}_{k-1}$ is the sequence of actions issued by the teacher up to stage $k-1$ and $\tilde{y}_k$ is the sequence of observations up to stage $k$.
In this setting, a direct correspondence between recorded events in a demonstration and sensors and actuators is assumed (a direct record mapping and no embodiment mapping, following the terminology by Argall et al. [4]). The observations in the demonstration are assumed to correspond to the observations that are generated in real time by the sensors and sent to the controller. Furthermore, the observed action variables are assumed to directly correspond to the actuator signals generated by the controller $\pi$. This relates to self-imitation, i.e., the pupil learns by performing the actions itself, with help from a teacher [78,79]. Self-imitation, in contrast to imitation of others, avoids two difficult problems: firstly, the problem of observing the teacher's actions, and secondly, the correspondence problem.
LFD has its roots in the more general approach to creating computer programs from demonstrations, known as Programming By Demonstration (PBD) or Programming By Example (PBE), e.g. [26,54]. However, modern LFD is far from these general approaches. This paper presents a formalism for robot learning through demonstration, which, while it can be seen as the creation of a specific kind of computer program, does not aim at the wider interpretations of PBD or PBE.
The goal of LFD is, in this context, to generate a controller $\pi$ that enables a robot to repeat a demonstrated behavior $B$. $\pi$ may be a state-action mapping, a model of the world dynamics (system model) or a model of action pre- and postconditions (plans); see the work by Argall et al. [4] for details. If successful, the robot is said to have learned behavior $B$. Formally, the process of learning $B$ from a set of $N$ demonstrations is understood as selecting $\pi$ from the controller space $\Pi$ using a learning function $\lambda$:
$$\pi = \lambda(D), \quad \pi \in \Pi$$
where $D$ is the set of event histories $\eta$ that constitute the demonstrations. The LFD process is illustrated in Figure 1. $\Pi$ contains all possible controllers for a specific chosen observation space and action space. This is of course a huge space that is never computed explicitly.
The selected controller $\pi$ must have specific qualities for the learning to be regarded as successful. These qualities are related to the event histories $\eta$ that may be generated by a robot using controller $\pi$. The realization space $R \subset I_{hist}$ for a $\pi$ is defined as the set of all such event histories, generated by the realization function $\Lambda$:
$$R = \Lambda(\pi)$$
$\Lambda$ can be seen as an abstraction of the physical robot placed in a particular environment and controlled by a specific $\pi$, able to produce the set of all possible trajectories through $I_{hist}$. Of course, the robot cannot control the produced event histories $\eta \in R$ entirely on its own, but relies on an external component, the environment. This creates a nice analogy to $\lambda$, which also relies on an external component, called bias. Thus the learning function $\lambda$ can be seen as the inverse function of the robot represented by $\Lambda$: $\lambda$ maps a set of event histories to a controller and $\Lambda$ maps a controller to a set of event histories. This is further developed in Section 3.2.
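The interplay between the learning function and the realization function can be made concrete with a toy example. The sketch below is not the method of the paper: it assumes a deterministic one-dimensional world with positions 0 to 4, a reactive lookup-table controller, and histories recorded as (observation, action) pairs. The names `learn` and `realize` are hypothetical stand-ins for λ and Λ.

```python
from typing import Dict, Iterable, Set, Tuple

Step = Tuple[int, int]          # (observation, action) at one stage
History = Tuple[Step, ...]      # simplified event history
Controller = Dict[int, int]     # reactive lookup table: observation -> action


def learn(demos: Iterable[History]) -> Controller:
    """Stand-in for the learning function: memorize demonstrated actions."""
    table: Controller = {}
    for eta in demos:
        for y, u in eta:
            table[y] = u
    return table


def realize(pi: Controller, starts: Iterable[int], horizon: int,
            lo: int = 0, hi: int = 4) -> Set[History]:
    """Stand-in for the realization function: roll pi out from each start."""
    realized: Set[History] = set()
    for x in starts:
        eta = []
        for _ in range(horizon):
            y = x                    # the position is observed directly
            u = pi.get(y, 0)         # unseen observations: stand still
            eta.append((y, u))
            x = min(hi, max(lo, x + u))
        realized.add(tuple(eta))
    return realized


demo: History = ((0, 1), (1, 1), (2, 1), (3, 0))   # drive from 0 to 3, then stop
pi = learn([demo])
R = realize(pi, starts=[0, 4], horizon=4)
```

Started at position 0 the learned controller reproduces the demonstration, but started at position 4 it stands still, since nothing was demonstrated there. Which action to take for unseen observations is exactly the kind of decision that has to come from bias rather than from the data.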
The process of selecting $\pi$ has many similarities to system identification, where a model of the system is constructed from observed input and output data [55]. The system, consisting of the agent and its environment, is modeled such that the system output $y_{k+1}$ can be predicted given a sequence of previous inputs and outputs $\eta_k$ until stage $k$. However, the aim of system identification is in one sense much more ambitious than LFD, since the system's response to any input is to be predicted. In LFD, we are satisfied with a $\pi$ producing an action $u_k$ that, if possible, leads to an event sequence $\eta_{k+1} \in B$ given that $\eta_k \in B$.
In other words, LFD does not necessarily model the outcome of all possible actions in each state, only the ones that occur for the robot in a particular environment.
$B$ should be understood as the set of event histories the human teacher associates with a particular desired behavior. For example, if the teacher wants to teach the robot to move to a door, $B$ would contain all event histories where the robot ends up by a door, in an acceptable way. The behavior must be formulated such that the robot is able to reproduce the behavior in all desired environments. There may be situations in which the robot cannot distinguish between significant aspects of the world. In these cases, the robot's sensing capabilities or other aspects of the behavior have to be modified. Assume that the move-to-door behavior is to be applied to a robot in a hotel environment. The robot must now be able to distinguish between doors. One alternative is to add a new sensor allowing the robot to directly identify each door it approaches, resulting in a redefined $I_{hist}$. Another alternative is to change the behavior such that the robot can use existing sensors, e.g. wheel odometry, in order to distinguish different doors by their locations. This corresponds to a modification of $B$. The quality of the generated $\pi$ is typically described as the ability to "repeat a behavior", which is the topic of the next section.

What does it mean to repeat a behavior?
The goal of LFD is to generate a controller $\pi$ that enables a robot to repeat a demonstrated behavior $B$ given a set of demonstrations $D$.
This may sound like a well-defined mission, but it is actually both vague and ambiguous. Consider the following example of a seemingly trivial demonstration. Observe a sequence of sensory-motor events describing a robot arm moving over a table, finally stopping when positioned above a green cube (Figure 2). What does it mean to repeat this sequence of events?
1. Assuming that the path is the important aspect of the demonstration, a successful controller may be written as $u_k = \pi_{PATH}(y_k)$, where the function $\pi_{PATH}$ computes an action $u_k$ for each pose $y_k$, such that the arm follows the demonstrated path. This kind of learning scenario corresponds to traditional programming of industrial robot arms, as well as path-tracking autonomous vehicles, e.g. [43].
2. Instead, if the demonstration is seen as an example of how to reach the final position, the path itself becomes irrelevant and the controller described above would not be suitable. In this case, a successful controller could be written as $u_k = \pi_{TARGET}(y_k)$, where the function $\pi_{TARGET}$ uses inverse kinematics to compute actions such that the tip of the robot arm reaches the target.
Case 1 corresponds to what is often called action-level imitation [22], where the robot carries out the same actions as the demonstrator. Case 2 is often called functional imitation [29], in which the robot is supposed to achieve the same effect on the environment [67]. In the work by Alissandrakis et al. [2], the quality of action-level imitation is measured in state and action metrics while functional imitation is measured in effect metrics. State and action metrics define the similarity of behaviors in terms of the state and/or actions of the agent, while effect metrics define behavior in terms of their effect on the environment. Within these two categories one could imagine a vast number of interpretations. Should the observed sequence of positions be understood as fixed coordinates, or relative to the robot arm's starting position? Is the green cube really the relevant target, or is the target defined by an absolute position? Is the target a cube of any color, or is the target perhaps any green object? Using many demonstrations of the same behavior reduces some of the ambiguity, but in general it is impossible for the learner to tell which interpretation is "correct" without further information. In fact, the learner cannot even enumerate a set of possible interpretations without a specification of state variables relevant for the task to be learned.
The discussion about what it means to repeat a behavior is further complicated when the robot acts in a dynamic, non-deterministic and partially accessible [77, ch.2] environment. Demonstrated event sequences may be incomplete and may contain mistakes that should not be learned or repeated [28]. If the robot manages to successfully repeat a demonstrated behavior under different conditions than during the demonstration, we say that the robot is able to generalize the demonstrated behavior. More specifically, we refer to the robot's ability to produce an event history $\eta_k \in B$, under conditions $\eta_{k-1}$ not identical to the ones appearing during the demonstrations in $D$.
This can be formally described as how well the realization space $R$ corresponds to the desired behavior $B$, e.g. as a minimization of $R \setminus B$ and $B \setminus R$ (refer to Figure 1).
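For finite sets of event histories, the two set differences can be computed directly. The following sketch assumes histories encoded as tuples of (observation, action) pairs and a behavior specified by an acceptance predicate over a finite candidate set; `behavior_set`, `mismatch` and the example histories are hypothetical.

```python
from typing import Callable, Iterable, Set, Tuple

History = Tuple[Tuple[int, int], ...]   # (observation, action) pairs


def behavior_set(candidates: Iterable[History],
                 accept: Callable[[History], bool]) -> Set[History]:
    """Build a finite behavior B as all candidate histories the teacher accepts."""
    return {eta for eta in candidates if accept(eta)}


def mismatch(R: Set[History], B: Set[History]) -> Tuple[Set[History], Set[History]]:
    """Return (R minus B, B minus R): unwanted and missing histories."""
    return R - B, B - R


h_a: History = ((0, 1), (1, 1), (2, 1), (3, 0))   # reaches position 3 from the left
h_b: History = ((4, 0), (4, 0))                   # stands still at position 4
h_c: History = ((4, -1), (3, 0))                  # reaches position 3 from the right

B = behavior_set([h_a, h_b, h_c], accept=lambda eta: eta[-1][0] == 3)
unwanted, missing = mismatch({h_a, h_b}, B)
```

In this example the controller realizes one history the teacher would reject (`h_b`) and fails to realize one the teacher would accept (`h_c`); perfect generalization would make both sets empty.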
Generalization can also be viewed as an extension of the demonstration set $D$ by interpolation or extrapolation of the demonstrated event histories. For this to work one has to specify the aspects of the demonstrated data that are important, i.e., the previously mentioned problem of what to imitate (Section 3). One approach is to introduce a metric of imitation performance [1,2,10]. Repeating a demonstration means minimizing the distance between the demonstrations and the repetitions using this metric. To find the metric, the variability in many demonstrations is exploited such that the essential components of the task can be extracted. One promising approach to construct such a metric is to use the demonstrations to impose constraints in a dynamical system [24,38,44]. Giovannangeli and Gaussier [35] use human-robot interaction to improve generalization when learning sensory-motor behaviors for homing and path following. In the described work, teaching by error correction (proscriptive learning) is shown to give superior generalization compared to a regular demonstration (prescriptive learning).
The generalization problem is also acknowledged outside the LFD community. In Machine Learning, the generalization performance of a learning algorithm relates to "its prediction capability on independent test data" [41, p.193], which is identical to the common usage of the term in robotics. The general problem with machine learning in high-dimensional spaces is often expressed as the curse of dimensionality [33, p.170], and is highly relevant also for robots with high-dimensional observation and action spaces. Learning in such situations becomes inherently difficult since the demonstrated data fills the history information space very sparsely, and interpolation and extrapolation become highly risky operations. The situation is related to the No Free Lunch Theorem [85], which states that for a large class of machine learning algorithms, there is no universal best algorithm to solve all problems. Instead, an algorithm has to be specialized to the problem under consideration to guarantee its superiority over any random algorithm. This specialization consists of additional task-dependent information that has to be supplied to the learning algorithm as bias. In the case of LFD, possible sources of bias are the robot's prior knowledge, feedback from the environment when the robot tries to repeat the demonstrated behavior, and human feedback before, during, and after learning. The bias concept is further investigated in the next section.

Bias
The bias of a machine learning algorithm is defined as "any basis for choosing one generalization over another, other than strict consistency with the observed training instances" [63]. The basis may be seen as a form of pre-evidential judgment, or prejudice regarding the structure of the data or the data generating process. In the case of numerical regression, assuming a linear relation between input and output corresponds to a high bias, while a cubic model corresponds to a lower bias. In the case of LFD, bias can be applied to three different parts of the problem definition:
1. Sensor variables. This can involve selection of relevant sensors, or extraction of specific features that are judged relevant for the specific task. It may also involve creation of intelligent sensors to facilitate feature extraction.
2. Action variables. Most often this involves restricting the output of the controller $\pi$ to one or a few actuators. For example, when learning a grip operation, the actions for moving the robot may be regarded as irrelevant while the gripper motion is highly relevant. This reduces the size of the action space.
3. Controller function $\pi$. Bias can restrict the functional form of $\pi$, e.g. to an artificial neural network of a specific size and architecture. Bias can also be expressed as general requirements on $\pi$, such as a smoothness criterion or lower/upper bounds. The use of predefined skills as described below is another example.
Bias can be introduced into the learning process in a number of ways.
First of all, it may be hard-coded into the learning algorithm, e.g. by choosing a specific neural network [57] or a rule-based framework [42] to represent $\pi$. Another common and very powerful technique to introduce bias is to use predefined skills or behavior primitives. Besides being biologically motivated [36,64], the technique is commonly used in robotics research, e.g. [34,59,61,68]. Learning is in this case reduced to selection of the right primitives and parameter estimation to adjust the primitives to the demonstrated data. The introduction of primitives is a way to reduce the dimensionality of the learning problem (i.e. to deal with the curse of dimensionality mentioned above). The set of primitives is obviously much smaller than $\Pi$, which clearly simplifies learning. An analogy is numerical regression with a large feed-forward neural network compared to a low-order polynomial. The polynomial introduces bias that makes learning much easier, at the price of limiting the solution to the specific functional form of the bias.
Regarding bias for sensors and actuators, it is common to hard-code a set of relevant sensors and action variables for the task at hand, or to pre-process the data before feeding it to the learning algorithm. This kind of bias may also be introduced by interaction with the human teacher, who tells the robot to use specific sensor modalities. Saunders and coworkers present an approach where relevant elements of the state vector are weighted based on their information gain and on manual selection from a teacher [70,79]. Bias may also be subject to meta learning; suitable sensors can for example be selected based on demonstrated data. This relates to attention and saliency, which are important concepts in theories for human and animal learning. The term shared attention refers to a teacher's and a learner's simultaneous attention to the same objects. Scassellati used the Cog platform [80] to investigate shared attention between humans and robots. Saliency refers to the components of the environment that are important for a given task, and it clearly introduces a bias by reducing the size of the observation space $Y$. Breazeal and Scassellati [18] describe the relationship between attention and saliency and how the concepts can be used to facilitate learning in robotics.
These techniques relate to the psychological term scaffolding, which is used to denote interaction between caretakers and infants in order to reduce distractions, mark a task's important attributes and reduce the number of degrees of freedom in the learning task in general [19,87]. All these operations aim at simplifying the learning task by introducing bias to the problem definition.
From a formal perspective, bias regarding sensor and action variables may be introduced by moving away from $I_{hist}$ into a new, derived information space $I_{der}$ [53, p.571]. $I_{der}$ is a reformulated or pre-processed version of the information in $I_{hist}$. The mapping from $I_{hist}$ to $I_{der}$ is denoted $\kappa$, and may have an arbitrary shape:
$$\kappa : I_{hist} \to I_{der}$$
An element of $I_{der}$ is referred to as a derived event history $\eta_{der}$ and can be generated from $\eta \in I_{hist}$ using the mapping $\kappa$. Therefore, $I_{der}$ does not serve as a general-purpose representational space as $I_{hist}$ does, but rather as a task-specific representation where relevant features become salient, while irrelevant information is not retained. The purpose of $I_{der}$ is similar to the purpose of the state space $X$. In fact, a state space is one possible instance of $I_{der}$, but there are numerous other possible derived information spaces that do not aim at representing states in the world.
The LFD process with bias included is illustrated in Figure 3. Various ways to introduce bias regarding the control function $\pi$ result in a reduced set $\Pi' \subset \Pi$. The learning function $\lambda$ maps from the derived information space $I_{der}$ instead of straight from $I_{hist}$. This extended formulation of LFD is further discussed in Section 3.4.
Referring to Figure 3, the what to imitate question shows up as a transformation problem from $I_{hist}$ to $I_{der}$, i.e., an identification of the relevant aspects of the task. Since we are focusing on a self-imitation setting, the correspondence problem is not present here. However, there is still the problem of selecting a controller $\pi \in \Pi'$ based on the demonstration set $D$, reflecting the remaining parts of the how to imitate question.
When to imitate appears as ensuring that ⊆ B, i.e., that everything in the demonstration set is actually part of the desired behavior.  Our discussion about bias has so far been focused on knowledge intentionally introduced into the system to facilitate learning. We like to refer to this kind of information as ontological bias. However, there are also a vast number of restrictions to the problem introduced for other reasons. As mentioned before, selecting a specific type of algorithm to represent π will introduce bias. A particular configuration of the robot' s sensors and actuators restricts the ways in which it can solve a particular task. Often the choice of physical platform and software architecture is made for practical reasons rather than for an understanding of ontological implications. We like to phrase these kind of restrictions as pragmatical bias.
Independent of the type of bias being introduced into the system, it limits the behaviors the robot can learn. Consequently bias is not necessarily positive. Instead, one should aim at a suitable level of bias, such that the robot can learn as many interesting behaviors as possible, while still being able to generalize correctly. As mentioned above, using pre-defined skills or behavior primitives is a common way to define Π . The demonstrated data are in such cases used to identify a suitable primitive and may also be used to set parameters for the selected primitive. One way to define such primitives is to associate them with achievement of specific goals. This concept deserves special attention and is analyzed further in the next section.
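The role of the information mapping κ can be illustrated with a minimal Python sketch. It assumes that an event history is a list of dictionaries and that bias is introduced simply by keeping a hand-picked subset of features. The feature names (`distance`, `direction`, `ir_0`, `ir_1`) and the function name `kappa` are hypothetical and chosen for illustration only; κ may in general have an arbitrary shape.

```python
from typing import Dict, List, Sequence

Event = Dict[str, float]  # one step of an event history (hypothetical encoding)


def kappa(history: Sequence[Event], relevant: Sequence[str]) -> List[Event]:
    """Map a raw event history to a derived one by keeping only relevant features."""
    return [{key: event[key] for key in relevant if key in event} for event in history]


# Raw history with task-relevant features and (assumed) irrelevant proximity readings.
raw_history = [
    {"distance": 0.9, "direction": 0.1, "ir_0": 12.0, "ir_1": 14.0},
    {"distance": 0.5, "direction": 0.0, "ir_0": 11.0, "ir_1": 40.0},
]

derived_history = kappa(raw_history, relevant=["distance", "direction"])
```

The derived history has the same length as the raw one but a lower dimensionality per step, which is the sense in which this kind of bias simplifies learning while discarding information that other tasks might need.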

Goal
The success or failure to repeat the demonstrated behavior is most often judged by the human demonstrator, and to describe the human intentions we use the word goal. The goal of a behavior is a human concept and can be of two major types [68]:
1. Maintenance goals. A specific condition has to be maintained for a time interval, such as the path-tracking scenario described in Example 1 in Section 3.1.
2. Achievement goals. A specific condition has to be reached, such as the motion to a green cube in Example 2 in Section 3.1.

A behavior B was earlier introduced as a set of event histories that, from a teacher's perspective, fulfills some common purpose. This can be understood as follows: after performing B, specific conditions in the world are satisfied. This is analogous to the common goal formulation from classical AI, where a goal G is a set of states in state space [77]:

G ⊆ X (13)

All the information the agent acquires about G is accumulated over time in ũ and ỹ. Therefore, any goal G which can be measured with the agent's sensors can also be formulated as a set of event histories η ∈ I:

G_I ⊆ I (14)

This should be understood as follows: after observing an η ∈ G_I we know that G is satisfied. A consequence of this formulation is that behaviors and goals are represented in the same way, and since any η ∈ B by definition satisfies the goal of B, G_I and B become identical:

G_I = B (15)

This may also be explained from the reversed perspective. When X is viewed as a derived information space, G will cast a pre-image into I which per definition will be identical to B. Still, this formulation of goals is not very satisfying. In state space, G most often has an intensional definition, a neat formulation that describes the minimum requirements.
However, in the task-invariant I, a neat goal cannot be formulated since no bias has been introduced. When a human teacher speaks about goals, he or she uses task-specific information which in principle could be transferred to the robot as bias. This is partly what is done when a state space is defined in classical AI. But the information a human uses to formulate goals may not be necessary for executing the same acts, and may not even be helpful. This argument is nicely illustrated by the frame-of-reference problem [14,73]. By assuming the necessity of a human goal formulation we impose our own frame of reference upon the agent, and may make the representation of the behavior much more complicated than it is from the agent's perspective.
A common way to introduce this separation between the human's and the robot's frame of reference is to introduce pre-programmed primitives. The set of known primitives creates a space where the human teacher can easily get an understanding of what the robot is doing, while the specific controllers can create local information spaces suitable for the specific primitive. The use of primitives is further developed in the following section.
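The two goal types described in this section can be sketched as predicates over an event history. The Python fragment below is only an illustration under stated assumptions: an event history is a list of dictionaries, a maintenance goal requires a condition to hold at every step of an interval, and an achievement goal requires it to hold at the final step. The feature names `path_error` and `cube_distance` and the thresholds are hypothetical.

```python
from typing import Callable, Dict, Optional, Sequence

Event = Dict[str, float]
Condition = Callable[[Event], bool]


def maintenance_goal_met(history: Sequence[Event], condition: Condition,
                         start: int = 0, end: Optional[int] = None) -> bool:
    """True if the condition holds at every step of the interval [start, end)."""
    window = history[start:end]
    return len(window) > 0 and all(condition(event) for event in window)


def achievement_goal_met(history: Sequence[Event], condition: Condition) -> bool:
    """True if the condition holds at the final step of the history."""
    return len(history) > 0 and condition(history[-1])


# Path tracking (maintenance): the tracking error must stay small throughout.
path_tracking = [{"path_error": 0.02}, {"path_error": 0.04}, {"path_error": 0.03}]
# Moving to a cube (achievement): only the final distance matters.
approach = [{"cube_distance": 0.80}, {"cube_distance": 0.35}, {"cube_distance": 0.04}]

stays_on_path = maintenance_goal_met(path_tracking, lambda e: e["path_error"] < 0.05)
reaches_cube = achievement_goal_met(approach, lambda e: e["cube_distance"] < 0.05)
```

In this encoding, the set of all histories for which such a predicate returns true plays the role of a goal expressed as a set of event histories.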

Learning with behavior primitives
Based on the concepts of behavior, bias, and goal introduced above, the learning task defined in Equation 10 is here refined. In Section 3.1 it was concluded that λ requires some bias to be able to find a suitable controller, as illustrated in Figure 3. In the most basic form of LFD, learning simply consists of fitting the demonstrated data to a more or less general functional form, such as a neural network [57] or a rule base framework [42], which in such cases represents the reduced controller set Π_P in Figure 3. The use of primitives, which was introduced in Section 2.1, is fully compatible with this description of learning bias, such that learning consists of matching a demonstration with a predefined primitive. This process is denoted behavior recognition and can be approached in a number of ways, as described below. The description of LFD given above is valid for demonstrations of behaviors that can be repeated by choosing one single primitive. More complex behaviors demand sequences or combinations of primitives. For a given robot and class of learning scenarios, the set of primitives Π_P is normally chosen such that a demonstration may be divided into segments where each segment can be repeated by choosing the right primitive. The general LFD process illustrated in Figure 3 is here extended to include handling of such sequences. Some types of behaviors are better described as combinations of several primitives executed in parallel, e.g. [69]. This organization is common in behavior-based architectures, e.g. [27,58]. However, recognition of primitives executed in parallel is highly complex in the general case. Furthermore, these systems require a coordination function that integrates motor commands from parallel primitives. Due to these issues, parallel primitives are less common in LFD applications and we have therefore chosen to focus on the purely sequential case.
Let us first look, from a post-learning perspective, at how sequence control can be described for a robot using a set Π_P of predefined primitives π. To include the assignment of parameters for parameterized primitives into the learning, Π_P is in the following regarded as containing all possible parameterizations of primitives. Control can now be divided into two steps:
1. Action selection, where a function π_s selects a primitive π ∈ Π_P:
π = π_s(η_der) (16)
where π_s performs the mapping
π_s : I_der → Π_P (17)
Here η_der ∈ I_der is a pre-processed or derived version of the original event history η ∈ I, constructed by an information mapping function κ [53, p.571], defined in Equation 12.
2. Low-level control using the chosen controller π to generate an action.
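The two control steps above can be sketched in Python as follows. This is a toy illustration, not the controllers used on the robot: the primitives, the action labels, the `distance` feature, and the 0.1 threshold are all hypothetical, and the selection rule is hand-written here, whereas in LFD it would be the outcome of learning.

```python
from typing import Callable, Dict, List

Event = Dict[str, float]
Primitive = Callable[[Event], str]  # low-level controller: latest event -> action label


def move_to_object(event: Event) -> str:
    """Hypothetical primitive: drive forward until the object is close."""
    return "forward" if event["distance"] > 0.1 else "stop"


def grip(event: Event) -> str:
    """Hypothetical primitive: close the gripper."""
    return "close_gripper"


def select_primitive(derived_history: List[Event]) -> Primitive:
    """Step 1, action selection: choose a primitive from the derived event history."""
    latest = derived_history[-1]
    return move_to_object if latest["distance"] > 0.1 else grip


def control_step(derived_history: List[Event]) -> str:
    """Run both steps: select a primitive, then let it generate the action."""
    primitive = select_primitive(derived_history)  # step 1
    return primitive(derived_history[-1])          # step 2


far_action = control_step([{"distance": 0.5}])
near_action = control_step([{"distance": 0.9}, {"distance": 0.05}])
```

The separation means that the selection function only has to choose among a handful of primitives, while each primitive remains free to use whatever sensor information it needs for low-level control.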
Stepping back to the learning phase, the problem is now reduced to finding the action selection function π_s using demonstrated data pre-processed with the information mapping κ into the derived information space I_der (see Figure 4).¹ In this way, the dimensionality of the learning problem is drastically reduced, since λ now selects a suitable π_s ∈ Π_s based on the pre-processed trajectory information in I_der rather than working on the full I and Π spaces. Compare with Figures 1 and 3. While the approaches to sequence learning with primitives vary widely, the process of finding π_s can be divided into three tasks:
1. Behavior segmentation, where a demonstration η is divided into smaller segments, referred to as task segments.
2. Behavior recognition, where each segment is associated with a primitive π ∈ Π_P.
3. Behavior coordination, referring to identification of rules or switching conditions for how the primitives are to be combined.

¹ By comparing Equations 16 and 17 with Equations 7 and 8, the primitives π may be seen as generalized actions, generated by a controller π_s. Another interesting analogy can be made between action selection and the correspondence problem, i.e., the problem of finding the action(s) that correspond to an observed event sequence. Viewing the primitives as actions leads to an equivalent problem formulation for action selection: find the primitive that corresponds to an observed event sequence.
Referring to Figure 4, these tasks are realized by the function λ. In practice, Tasks 1 and 2 are often intertwined. For Task 1, several approaches exist, for example variance thresholding [46,51], repeated pattern correlation [49,75,76], thresholding mean velocity of joints [34,65], and entropy-based segmentation [25]. Auto-associative neural networks have also been used for segmentation, both by measuring network reconstruction performance [15] and by identifying bifurcations in the network attractor dynamics [49,50]. Calinon and coworkers [24] used Dynamic Time Warping in combination with Gaussian Mixture Regression to decompose movement trajectories of a humanoid robot.

Task 2 is commonly seen as a classification problem. For example, Bentivegna [8] uses a nearest-neighbor classifier on state data to identify skills in a marble maze and an air hockey game. In both these setups, each primitive is assigned a query point in state space, which is compared with the current system state. Pook and Ballard [74] present an approach where sliding windows of data are classified using Learning Vector Quantization in combination with a k-NN classifier. The complexity of the distance measure is highly dependent on the complexity of B. For simple behaviors, a Euclidean distance function has been shown to work well [9]. However, for more complex behaviors, other measures are necessary. Tani [83,84] performs both recognition of behavior primitives and segmentation with extended recurrent neural networks that model different behavior primitives depending on the parametric bias in the network model. Recognition is done by finding the optimal parametric bias for an observed sensory-motor sequence. Calinon and colleagues use Hidden Markov Models in combination with Principal Component Analysis to compute the likelihood that the observed data was generated by the model [23,24].
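Two of the simpler techniques mentioned above, segmentation by thresholding velocity and recognition by nearest-neighbor comparison against one query point per primitive, can be sketched in Python. This is a generic illustration and not a reproduction of any cited system; the speed values, the threshold, the two-dimensional state, and the primitive names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def segment_by_speed(speeds: Sequence[float], threshold: float) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, of runs with speed >= threshold."""
    segments: List[Tuple[int, int]] = []
    start = None
    for i, speed in enumerate(speeds):
        if speed >= threshold and start is None:
            start = i                      # a new movement segment begins
        elif speed < threshold and start is not None:
            segments.append((start, i))    # the robot paused: close the segment
            start = None
    if start is not None:
        segments.append((start, len(speeds)))
    return segments


def recognize(state: Sequence[float], query_points: Dict[str, Sequence[float]]) -> str:
    """Return the primitive whose query point is closest (Euclidean) to the state."""
    return min(query_points, key=lambda name: math.dist(state, query_points[name]))


speeds = [0.0, 0.4, 0.5, 0.0, 0.0, 0.3, 0.6, 0.2]
segments = segment_by_speed(speeds, threshold=0.1)

query_points = {"primitive_a": (0.0, 0.0), "primitive_b": (1.0, 1.0)}
label = recognize((0.9, 0.8), query_points)
```

A Euclidean distance is used here because it is the simplest choice; as noted above, more complex behaviors call for other distance measures.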
One approach that addresses the complexity of higher level primitives can be found in work by Nicolescu [68], where two behaviors are regarded as being similar if their respective preconditions and goals match, regardless of their internal differences. Nicolescu utilizes the postconditions to recognize primitives in demonstrated data, i.e., Tasks 1 and 2 as described above. Recognized primitives are arranged in a behavior network, and during execution the behaviors' preconditions in combination with the network links are used for behavior coordination, Task 3. Formally, any sequence of recognized primitives can be seen as an element in a derived information space I_der, and consequently a behavior network, represented as a set of behavior sequences, constitutes a subspace of that I_der. In this setting, the definition of postconditions for each primitive constitutes an information mapping κ from I to I_der, and the preconditions take part in the implementation of the coordination function π_s. The primitive controller itself is represented by π ∈ Π_P. Compare with Figure 4.
Demiris and Johnson [31] present a different approach where all primitive controllers are continuously running in parallel, predicting actions in response to incoming sensor data. The prediction errors are then used to estimate how well each primitive represents the demonstrated behavior. This approach is similar to our own method β-comparison, which is also used for some primitives in the present example, cf. Section 4. Although the approach is theoretically appealing and has strong connections to biological findings (see [31] for details), direct comparison of predicted actions becomes infeasible for complex primitives. The method presented by Demiris and Johnson, as well as our β-comparison, has problems capturing the similarity of behaviors that may be executed in many different ways, leading to the same goal. One way to handle these issues is to move from a direct comparison of actions in I to more abstract concepts of actions or events in a derived event history η_der ∈ I_der. An evaluation of β-comparison and two other methods for behavior recognition can be found in [15]. In a generalized sense these methods should be seen as an attempt to create a metric of imitation performance, as discussed in Section 3.1.
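The general idea of recognition through prediction errors can be sketched as follows. The Python fragment is a simplified illustration and neither the method of Demiris and Johnson nor β-comparison as published: it assumes scalar observations and actions, uses a mean squared error, and the two primitives `approach` and `retreat` are hypothetical.

```python
from typing import Callable, Dict, Sequence, Tuple

Predictor = Callable[[float], float]  # observation -> predicted action


def prediction_errors(demo: Sequence[Tuple[float, float]],
                      primitives: Dict[str, Predictor]) -> Dict[str, float]:
    """Mean squared error between each primitive's predicted and demonstrated actions."""
    errors = {}
    for name, predict in primitives.items():
        squared = [(predict(obs) - action) ** 2 for obs, action in demo]
        errors[name] = sum(squared) / len(squared)
    return errors


# Hypothetical primitives: speed proportional to the distance to a target.
primitives = {
    "approach": lambda distance: 0.5 * distance,
    "retreat": lambda distance: -0.5 * distance,
}

# Demonstration as (observed distance, demonstrated speed) pairs.
demo = [(1.0, 0.5), (0.8, 0.4), (0.4, 0.2)]

errors = prediction_errors(demo, primitives)
best = min(errors, key=errors.get)
```

The primitive with the lowest error is taken as the recognized one. The sketch also shows the limitation discussed above: two behaviors that reach the same goal through different actions would produce a large error even though they are equivalent from the teacher's perspective.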
Sometimes, a demonstrated behavior cannot be decomposed into a sequence of known discrete primitives. Several metrics may conflict and cause ambiguities in behavior recognition. In these situations, continuous task representations are preferable since they can better describe a smooth transition from one metric to another, see for instance [24]. A distributed approach to Task 3 is presented by Maes and Brooks [56]. Global feedback is used, allowing the primitives themselves to learn suitable activation conditions by correlating particular stimuli with positive or negative feedback. The feedback functions in combination with the primitives themselves constitute the coordination function π_s. Another approach to behavior coordination is found in the MOSAIC architecture [39,40,86]. MOSAIC utilizes forward models paired with primitive controllers. Each forward model computes a responsibility signal as a measure of how well the paired controller can handle the present situation. When combined with a responsibility predictor, this architecture forms a powerful coordination system. MOSAIC is a theoretical framework, but the HAMMER architecture [30,32], which has been implemented and tested on robots, captures many aspects of MOSAIC. Both these architectures are put in relation to LFD in our own recent work [13]. A key aspect of this approach is the pairing of forward models (predictors) and inverse models (controllers) in a model-free way. We analyze this issue in more depth and propose a possible solution based on the Predictive Sequence Learning algorithm in other recent publications [16,17]. There are several approaches to identifying relevant aspects of the task that do not employ behavior primitives. While we limit the present review to approaches using primitives, the work by Kulic et al. [52] is worth mentioning even though it does not directly apply behavior primitives.
In this approach, demonstrations of movement patterns are encoded in Hidden Markov Models and then clustered into groups using Hierarchical Agglomerative Clustering. Groups are formed incrementally as new demonstrations are added, which gives this approach many of the advantages of behavior primitives as described here. Furthermore, Kulic et al. put forward the advantage of a hierarchical organization of behavior, a claim we strongly support and discuss in more depth in other work [13].
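The responsibility signal mentioned for MOSAIC earlier in this section can be illustrated with a short Python sketch. It assumes one common formulation, in which each forward model's prediction error is passed through a Gaussian likelihood and the results are normalized across modules. The error values and the scale parameter `sigma` are hypothetical, and the responsibility predictor is left out.

```python
import math
from typing import List, Sequence


def responsibilities(prediction_errors: Sequence[float], sigma: float = 1.0) -> List[float]:
    """Normalized Gaussian likelihoods of the forward models' prediction errors."""
    likelihoods = [math.exp(-(error ** 2) / (2.0 * sigma ** 2))
                   for error in prediction_errors]
    total = sum(likelihoods)
    return [likelihood / total for likelihood in likelihoods]


# Two modules: the first forward model predicts the sensory outcome much better.
signals = responsibilities([0.1, 1.0], sigma=0.5)
```

The module with the smaller prediction error receives the larger responsibility, so its paired controller gets more influence over the motor command. Since the signals are continuous and sum to one, the scheme also supports smooth blending between controllers rather than hard switching.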

Adding to the motivations presented above, one important reason for the use of primitives in LFD is that primitives constitute high level representations of the demonstrated behavior. Primitives can be labeled in meaningful ways, which helps establish a common understanding between the human teacher and the robot pupil. It is natural for humans to break down sequences of actions into meaningful sections and adults appear to agree upon how segmentation should be made [7]. We therefore believe that identification and recombination of behavior primitives is a critical aspect of LFD.

Demonstrator
The concepts and theory introduced above are here illustrated with an experiment in which a Khepera robot [48] is used in an LFD setting. The experimental setup is deliberately simplified to illustrate how ambiguous even a very simple demonstration may be, and how the proposed formalism can be used to describe the LFD process. The Khepera robot has eight infra-red proximity sensors mounted around the rim of the robot. For this experiment, the limited sensing capabilities have been augmented by an external camera mounted above the robot arena. The setup can be seen in Figure 5 and an example image from the top-mounted camera can be seen in Figure 6. The robot is equipped with a gripper and is placed in an environment with a number of wood blocks and two colored areas located on one side of the scene.

Figure 5. Experimental setup. In the center is a Khepera robot [48] with a gripper that can be raised and lowered. The objects around the scene are painted wood blocks. Rubber bands have been placed around the objects to facilitate gripping. A camera has been mounted directly above the scene, see Figure 6.
The experiment comprises a sequence learning task in which a human intends to teach a robot to pick up cubes and place them in the blue-colored corner area. To demonstrate the wanted behavior, the human tele-operates the robot towards a red cube, grips it, lifts it, moves to the blue area, and puts the cube down. The robot should then be able to repeat the demonstrated behavior. The reader is referred to Figure 1, which summarizes much of the discussed formalism.
Observation space Y comprises the camera image (Figure 6), data from the eight proximity sensors, position sensors for gripper and gripper arm, and an optical barrier detecting objects in the gripper. Ac-  6). I is a huge space comprising all possible sensor and action sequences the robot in principle can experience. Given the task at hand, a more suitable derived information space I_der is defined. It comprises sequences of the following entities derived from Y and U: the object properties distance, direction, orientation, type, and color, where type is either cube or cylinder. Directions and orientations are given in a coordinate system relative to the robot. Distance and direction to the centroids of the two colored areas are also extracted. Technically, these entities are extracted from the camera image using a combination of image analysis tools, including color segmentation, Sobel edge detection, Hough transform, and mathematical morphology. Formally, these techniques are parts of the mapping κ, defined in Equation 12. The generation of I_der should be seen as the first of many kinds of bias that we introduce in order to make the learning task feasible.
This bias depends on the available sensors and actuators and also on the task at hand. It is clear that the dimensionality of the learning problem is significantly reduced by replacing the camera image in I by a small number of object properties. The demonstrator performs the wanted task by tele-operating the robot as described above. The resulting recorded data, a subset of I, is a set of event histories constituting the input to the learning function λ. To support this process, the universe Π of all possible controllers is reduced to a much smaller set Π_P that comprises pre-defined high-level behavior primitives. The following primitives are defined: move_to_object, move_to_area, grip, release, lift and put_down. The move_to_object primitive takes two parameters color and type, where color = C ⊆ { } and type = T ⊆ { }. The move_to_area primitive takes one argument color just like move_to_object, but does not have any type parameter. One could of course imagine many other possible parameters for these primitives, e.g. position and size, but the included parameters suffice for the present example. Referring to Section 3.3, each parametrization of the move_to_object and move_to_area primitives is associated with a specific goal G_I (Equation 14). As already mentioned, this is a very efficient way of introducing additional bias in learning such that complex behaviors can be learned from a few or even a single demonstration. Conceptually, Π_P comprises all possible parameterizations of the primitives. To keep the example simple, all primitives are hard-coded into the robot, i.e., both I_der and Π_P are defined manually. However, in a realistic setting primitives are often created during a previous learning phase, as has been shown in, for example, the work by Saunders et al. [70,79]. The use of primitives should be seen as a way to reuse knowledge that may come either from a programmer manually designing the primitive, or from a previous learning phase.
Formally, this is described as a gradual redefinition of I_der and Π_P, which corresponds to the concept of scaffolding described above.
To learn a sequence of these primitives, the three steps described in Section 3.4 are performed. Behavior segmentation and recognition are executed in one step by continuously matching each primitive against the demonstrated data. The recognition method differs between primitives. For grip, release, lift and put_down, the start and end positions of the gripper are used to indicate that the corresponding primitive has been executed. For move_to_object and move_to_area, the behavior recognition method β-comparison [15] is used. In this approach, an action vector for each parameterized primitive is computed, creating a set of hypotheses. Each action vector is then compared to the observed action, creating an error measure for each hypothesis. If the error remains low while the robot is approaching a specific target object, the hypothesis is confirmed and a move_to primitive with the corresponding parametrization can be inserted into the recognized sequence. Each primitive specifies a set of finish conditions, e.g. that the robot should be within gripping range of a target object for move_to_object to complete. The end result of the learning process λ is a function π_s ∈ Π_s that selects an appropriate primitive π ∈ Π_P given the current derived event history η_der (Equation 16). In this way, π_s acts as a sequencer, and the actual control of the robot and gripper motion is done by the currently selected primitive π. Even with the bias introduced by the construction of I_der and by the pre-defined behavior primitives in Π_P, a substantial uncertainty, similar to the one discussed in Section 3.1, remains. This is illustrated in Figure 7, where alternative interpretations at each step are drawn horizontally and time flows vertically from top to bottom. In the shown example, the first part of the demonstration may be interpreted in four ways: move_to_object(C,T), move_to_object(red,T), move_to_object(C,cube), and move_to_object(red,cube).
The second and third primitives, grip and lift, are uniquely identified, while move_to_area is subject to similar ambiguity as move_to_object. Finally, the primitives put_down and release are uniquely identified. The alternatives for each ambiguity represent generalizations along different feature axes. In Section 3.1 this is described as interpolation or extrapolation of the demonstrated event histories η. Figure 7 illustrates a subspace of Π_s. With the ambiguities resolved, through human feedback or other kinds of bias, the resulting sequence represents an instance of π_s ∈ Π_s as defined in Equation 17. Various types of feedback from the human can be applied such that the ambiguous sequence collapses into a single well-defined sequence of behavior primitives that will enable repetition of the demonstrated behavior according to the user's intentions. In the described experiment, the human teacher manually selects the appropriate alternatives in a dialog system such that the generated π_s will execute the sequence ⟨move_to_object(C,cube), grip, lift, move_to_area(blue), put_down, release⟩. The robot is then able to repeat the intended sequence of primitives and autonomously move cubes of any color to the blue area. To sum up, the present example demonstrates how the huge and complex I can be transformed into a significantly smaller and more human-interpretable space I_der. On the controller side, the set of all possible controllers Π has been reduced by introducing a set of primitives Π_P that can be composed into sequences by Π_s, cf. Figure 4. A more detailed description of the experimental setup will be presented in future work, including a graphical interface in which the human user is able to give feedback during and after a demonstration, in order to resolve ambiguities such as the one illustrated in Figure 7.
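The ambiguity described in this example can be reproduced with a small Python sketch. It assumes that each parameter of a recognized primitive is either kept at its demonstrated value or generalized to the full set, written C for color and T for type as in the text. The function name `generalizations` and the string encoding of primitives are hypothetical.

```python
from itertools import product
from typing import List, Sequence


def generalizations(primitive: str, demonstrated: Sequence[str],
                    wildcards: Sequence[str]) -> List[str]:
    """List every interpretation obtained by keeping or generalizing each parameter."""
    options = [(wild, value) for value, wild in zip(demonstrated, wildcards)]
    return [f"{primitive}({','.join(combo)})" for combo in product(*options)]


# The teacher moved to a red cube and then to the blue area.
object_hypotheses = generalizations("move_to_object", ("red", "cube"), ("C", "T"))
area_hypotheses = generalizations("move_to_area", ("blue",), ("C",))

# Feedback from the teacher collapses each ambiguity to a single alternative.
resolved = ["move_to_object(C,cube)", "grip", "lift",
            "move_to_area(blue)", "put_down", "release"]
```

Here `object_hypotheses` contains the four interpretations listed above and `area_hypotheses` contains two, so a single demonstration is consistent with eight different sequences until the teacher resolves the ambiguity.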

Summary
A formalism for robot behaviors and Learning from Demonstration (LFD) is presented. Building on terminology from LaValle [53, ch.11], an agent's sensory-motor history is conveniently described by an event history, and a controller maps event histories to actions in action space. As illustrated in Figure 1, a demonstration of a particular behavior can be seen as an event history η in the demonstration set, and the behavior itself as the large set B of allowed event histories, i.e., all possible ways to realize the desired behavior. The quality of the learned controller can be judged by the similarity between B and the realization space R. The vague and ill-posed meaning of repeating a demonstrated behavior is discussed from a machine learning perspective. The concept of generalization is defined in the framework of event histories and leads to a discussion of bias in learning. In LFD, bias is essential and can be introduced before, during, and after demonstration as feedback from the human teacher. The huge information history space may be reduced to a derived space, suitable for a limited set of tasks. Behavior primitives are another common way to introduce bias, and are often associated with specific goals, which are explicitly or implicitly defined for each primitive. LFD can at a higher level be described as controller selection. In this context, learning consists of finding and tuning a suitable primitive. More complex behaviors can be created by combining several primitives into sequences. LFD can then be described in three steps: behavior segmentation, behavior recognition, and behavior coordination.
When using primitives created during a previous learning phase, learning can be seen as an evolutionary process where new knowledge is gained through the use of previous knowledge as bias. Formally, this is described as a gradual redefinition of I_der, Π_P and Π_s, which relates to the concept of scaffolding. The formalism is applied to a sequence learning task in which the introduced concepts are illustrated with focus on how bias of various kinds can be used to enable learning from a single demonstration. The experiment shows how even a simple demonstration contains almost unavoidable ambiguities that have to be handled one way or another. In the context of the presented formalism, these ambiguities appear clearly as a transition problem from behavior B ⊂ I to controller π ∈ Π, or as a controller selection problem in Π_P. This research problem is believed to be crucial for the development of learning robots and is addressed in our ongoing research. The presented work is an attempt to structure and formalize general principles and assumptions in LFD. Our aim is not to present the single best way to talk about behaviors, generalization, goals, and other LFD-related concepts. Rather, we want to point out the importance of defining these concepts clearly. It is our hope that the presented work will provide useful insights into the mechanisms involved in LFD and thus contribute to further development of this powerful and promising area of robot learning.