Situated robot learning for multi-modal instruction and imitation of grasping

https://doi.org/10.1016/j.robot.2004.03.007

Abstract

A key prerequisite for making user instruction of work tasks by interactive demonstration effective and convenient is situated multi-modal interaction aimed at enhancing robot learning beyond simple low-level skill acquisition. We report the status of the Bielefeld GRAVIS robot system, which combines visual attention and gestural instruction with an intelligent interface for speech recognition and linguistic interpretation to allow multi-modal task-oriented instructions. With respect to this platform, we discuss the essential role of learning for robust functioning of the robot and sketch the concept of an integrated architecture for situated learning on the system level. Its long-term goal is to demonstrate speech-supported imitation learning of robot actions. We describe the current state of its realization, which enables imitation of human hand postures for flexible grasping, and give quantitative results for grasping a broad range of everyday objects.

Introduction

How can we endow robots with enough cognitive capabilities to enable them to serve as multi-functional personal assistants that can easily and intuitively be instructed by the human user? The ability of situated learning plays a key role in realizing this goal. Only when we can instruct robots to execute desired work tasks through a combination of spoken dialog, gestures, and visual demonstration will robots lose their predominant role as specialists for repeatable tasks and become effective in supporting humans in everyday life.

A basic element of situated learning is the capability to observe and successfully imitate actions and, as a prerequisite for that, to establish a common focus of attention with the human instructor. For multi-modal communication, additional perceptive capabilities in the fields of speech understanding, active vision, and the interpretation of non-verbal cues like gestures or body posture are essential and have to be included and coordinated.

We report on progress in building an integrated robot system within the framework of the Special Collaborative Research Unit SFB 360 ‘Situated Artificial Communicators’. In the course of this long-term program, many modules implementing partial skills were first realized and evaluated as stand-alone applications [4], [7], [18], [20], [34], but their integration is an additional research task and a key issue towards the realization of intelligent machines [25], [29].

As the development of integrated learning architectures for real-world tasks poses an enormous challenge [35], hardly any efforts can be found that scale learning from the lower level of training single skills up to multi-stage learning across the overall system. A primary reason is that most learning approaches rely on highly pre-structured information and search spaces. Prominent examples are supervised learning of target outputs, unsupervised learning of clusters, and learning of control tasks with a (usually small) number of predefined variables (pole balancing, trajectory learning). For such settings, well-understood approaches exist, such as gradient-based learning, support vector machines, vector quantization, or Q-learning, which for certain tasks yield remarkable results, e.g. in speech–image integration [26], trajectory learning [19], [22], [44], object recognition and determination of grasp postures [28], sensor fusion for grasp planning [1], or grasp optimization [30].
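
To make concrete how strongly such methods rely on a small, predefined state-action space, the following minimal Python sketch shows a tabular Q-learning update for a toy control task of the pole-balancing kind; the discretization, reward, and parameter values are hypothetical placeholders, not part of the system described in this paper.

```python
import numpy as np

# Minimal tabular Q-learning sketch for a toy control task with a small,
# predefined state-action space (hypothetical discretization and rewards).
N_STATES, N_ACTIONS = 20, 2          # e.g. coarsely discretized pole angle; push left/right
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1

Q = np.zeros((N_STATES, N_ACTIONS))

def choose_action(state, rng):
    """Epsilon-greedy action selection over the predefined action set."""
    if rng.random() < EPSILON:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(Q[state]))

def q_update(state, action, reward, next_state):
    """One temporal-difference update of the Q-table."""
    td_target = reward + GAMMA * np.max(Q[next_state])
    Q[state, action] += ALPHA * (td_target - Q[state, action])

# Hypothetical single learning step.
rng = np.random.default_rng(0)
a = choose_action(state=3, rng=rng)
q_update(state=3, action=a, reward=1.0, next_state=4)
```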

In real-world learning, a well-defined pre-structuring of the data with respect to the given task is an essential part of the learning itself; the system has to find lower-dimensional relevant manifolds in very high-dimensional data and detect important regularities in the course of learning in order to use them to improve its capabilities. Furthermore, for a sophisticated robot with many motor degrees of freedom, or for a cognitive system such as the one discussed here, finding a solution by exploration of new actions is not feasible because the search spaces involved are extremely high-dimensional and by far too complex.
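
To illustrate what ‘finding lower-dimensional relevant manifolds in very high-dimensional data’ can mean in the simplest case, the following sketch projects high-dimensional sensory feature vectors onto their leading principal components; PCA serves here only as a linear stand-in for the manifold-finding step, and the data dimensions are invented for the example.

```python
import numpy as np

def principal_subspace(X, k):
    """Project data X (n_samples x n_features) onto its k leading principal
    components -- a simple linear stand-in for extracting a lower-dimensional
    relevant manifold from high-dimensional sensory data."""
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data matrix; rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:k]
    return X_centered @ components.T, components

# Hypothetical example: 500 sensory feature vectors of dimension 1000,
# compressed to 10-dimensional codes.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1000))
Z, W = principal_subspace(X, k=10)
```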

Current practice aims at developing well-scalable, homogeneous and transparent architectures to create complex systems. Somewhat ironically, successful examples of this strategy tend to cluster in the small- or mid-size range, while truly large and complex systems seem to defy our wishes for ‘formatting away’ their complexity by good bookkeeping alone. It seems not unlikely that it is one of the hallmarks of complex systems that they confront us with limited homogeneity, evolutionarily grown layers of overlapping functionality and bugs that may even amalgamate with features. Looking at biological systems with their enormous complexity, we see that these by no means resemble orthogonal clockworks; instead, they consist of a tangle of interwoven loops stabilized by numerous mechanisms of error-tolerance and self-repair. This suggests that a major challenge for moving to higher complexity is to successfully adopt similar approaches to come to grips with systems that we cannot analyze in their full detail.

In the present paper, we address these issues in the context of a longer-term research project aiming at the realization of a robot system that is instructable by speech and gestures (Fig. 1). For the aforementioned reasons, we have pursued the development of this system in an evolutionary fashion, without the requirement that a global blueprint had to be available at each stage of its development. In Section 2, we report our experiences with this approach and give an overview of the current stage of the evolved system.

In Section 3, we focus our discussion on the issue of learning within such a system and argue for three major levels at which learning has to be organized: (i) an ontogenetic level which exploits learning methods in order to create initial system functions (such as object classifiers) from previously acquired training data in an off-line fashion, (ii) a refinement level at which on-line learning is used locally within a functional module, with the main effect of increasing the module’s robustness or refining its performance, but with no or little need of explicit coordination with adaptive processes in other modules, and (iii) a situated level at which different learning methods are combined in a highly structured way in order to achieve short-term situated learning at the task level. While all three learning levels are important, undoubtedly it is the uppermost, situated level which currently poses the most exciting research challenge.
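
The following skeleton is only an organizational sketch, not the authors' implementation, of how these three levels can be separated in code: off-line creation of initial functions, local on-line refinement inside a module, and a situated learner that coordinates several modules during a short interaction. All class and method names are hypothetical.

```python
# Organizational sketch of the three learning levels (hypothetical names).

class Module:
    def train_offline(self, dataset):
        """Ontogenetic level: create the initial function (e.g. an object
        classifier) from previously acquired training data."""
        raise NotImplementedError

    def refine_online(self, sample):
        """Refinement level: local adaptation that increases robustness,
        with little or no coordination with other modules."""
        raise NotImplementedError


class SituatedLearner:
    """Situated level: combines the adaptation of several modules in a
    structured way to acquire a new task from a short interaction."""

    def __init__(self, modules):
        self.modules = modules

    def learn_from_demonstration(self, observation):
        for m in self.modules:
            m.refine_online(observation)   # coordinated short-term adaptation
```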

In Section 4, we propose an approach for organizing learning at this level. Our proposal is strongly motivated by the idea of imitation learning [2], [3], [6], [8], [23], [24], [32], which attempts to find a successful ‘action template’ from the observation of a (human) instructor. This requires (i) endowing the robot system with sufficient perceptive capabilities to recognize and observe the action to imitate; (ii) transforming the observed action into an internal representation that is well matched to the system’s own operating characteristics (in particular, its different ‘sensory perspective’ and ‘instrumentation’ with actuators); and (iii) being able to physically execute a suitable action. Focusing on the important task of imitation grasping, we describe in Section 5 an initial implementation of this scheme, using our current system as a platform for the necessary and considerable perceptual and motor anchoring of such an imitation learner in its environment. Section 6 then presents some results on imitation grasping of common everyday objects with the system implemented so far.
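
A minimal sketch of this three-stage scheme, reduced to hand postures represented as joint-angle vectors, is given below; the placeholder estimation, the nearest-prototype mapping, and all data formats are simplifying assumptions for illustration and not the implementation described in Section 5.

```python
import numpy as np

def perceive_demonstration(frames):
    """(i) Observe the action to imitate; here reduced to averaging noisy
    hand-posture estimates (placeholder for the vision module)."""
    return np.asarray(frames).mean(axis=0)

def map_to_internal_representation(observed, own_postures):
    """(ii) Transform the observation into the system's own terms by choosing
    the closest posture the robot hand can actually realize."""
    distances = np.linalg.norm(own_postures - observed, axis=1)
    return own_postures[int(np.argmin(distances))]

def execute(posture, send_command):
    """(iii) Physically execute the selected posture."""
    send_command(posture)

# Hypothetical usage: three noisy observations of a 9-joint hand posture,
# mapped onto one of two grasp prototypes the robot hand can realize.
frames = np.random.default_rng(1).normal(0.5, 0.05, size=(3, 9))
prototypes = np.vstack([np.full(9, 0.2), np.full(9, 0.6)])   # open vs. closed
execute(map_to_internal_representation(perceive_demonstration(frames), prototypes),
        send_command=print)
```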

At all levels, the results of learning can, by their very nature, at best be partially predicted, further eroding the idea of the availability of a fixed system blueprint. In Section 7, we therefore argue for a datamining perspective for coping with systems of this kind. As a concrete example, we briefly describe a powerful multi-modal monitoring system (AVDisp) that has recently been developed in our lab, and we report some experiences from applying this approach to our robot system. Finally, Section 8 presents some conclusions.

Section snippets

System design and overview

Due to the long-term development of our system, the ideal of defining constraints and a unified framework beforehand to facilitate building a cognitive learning architecture had to be replaced by an ‘evolutionary approach’ that also integrates modules developed in different research contexts and not necessarily designed with the described system in mind. This led to the development of a rather flexible architecture, based on a distributed architecture
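
The communication framework actually used in the system is described in [14]; purely to illustrate the idea of coupling independently developed modules loosely through messages, the following generic publish/subscribe sketch (with invented topic names) may help.

```python
from collections import defaultdict

class MessageBus:
    """Toy publish/subscribe bus; an illustration only, not the framework of [14]."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self._subscribers[topic]:
            callback(message)

bus = MessageBus()
# Hypothetical modules: vision publishes detected objects, speech understanding
# publishes interpreted instructions, an integration module consumes both.
bus.subscribe("vision.objects", lambda m: print("integrator received objects:", m))
bus.subscribe("speech.instruction", lambda m: print("integrator received instruction:", m))
bus.publish("vision.objects", ["cup", "banana"])
bus.publish("speech.instruction", "take the cup")
```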

Structuring of learning

Learning is a very multi-faceted phenomenon and its complexity is amply reflected in the numerous different proposals on how to relate and implement its various aspects. Theoretical considerations motivate a ‘horizontal subdivision’ of learning into the major types of unsupervised, supervised, and reinforcement learning (with still a substantial number of approaches distinguishable both at the conceptual and algorithmic level within each type). Recently there has been a stimulating discussion

An architecture for situated learning

As pointed out in the previous section, enabling learning at the short time scale of the situated level will depend in an essential way on the highly structured interplay of several functional loops, complementing each other in a tightly coupled fashion to cope with the joint constraints of high-dimensional search spaces and a small number of training samples. In the following, we argue that a suitable interlocking of the three functional loops of observation, internal simulation, and
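
The interplay of such loops can be caricatured as follows: an observation loop proposes a few candidate action parametrizations from the demonstration, an internal-simulation loop evaluates them cheaply before anything is executed, and only the best surviving candidate reaches the physical control loop. The candidate generation and scoring below are hypothetical placeholders, not the architecture's actual components.

```python
def observe_candidates(demonstration):
    """Observation loop: derive candidate action parametrizations from the
    instructor's demonstration (placeholder: small perturbations)."""
    return [demonstration + delta for delta in (-0.1, 0.0, 0.1)]

def simulate_quality(candidate):
    """Internal-simulation loop: predict how well a candidate would work
    without moving the robot (placeholder scoring function)."""
    return -abs(candidate - 0.5)

def control(candidate):
    """Control loop: execute the selected candidate on the real hardware."""
    print(f"executing candidate parameter {candidate:.2f}")

best = max(observe_candidates(0.45), key=simulate_quality)
control(best)
```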

Towards imitation grasping: observation, simulation, and control of hand posture

While we do not yet have a full implementation of the described architecture, we can report an initial implementation of some of its major features for the scenario of situated learning of grasping of common everyday objects, such as those depicted in Fig. 8. In this scenario, the observation component is a vision module permitting observation and 3D identification of a human hand posture that indicates to the robot a sample of the grasp type to be used. The identified hand posture is transformed to the
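
As a rough sketch of such a mapping, assuming the observed posture is available as a vector of finger-joint angles, the observation can be classified against a few grasp prototypes and the corresponding pre-grasp of the robot hand retrieved. The joint dimensions, prototype values, and grasp labels below are invented for illustration and do not reproduce the paper's actual grasp taxonomy.

```python
import numpy as np

GRASP_PROTOTYPES = {                     # mean human finger-joint angles (hypothetical)
    "power":     np.full(9, 1.2),
    "precision": np.array([0.9, 0.9, 0.9, 0.3, 0.3, 0.3, 0.2, 0.2, 0.2]),
}
ROBOT_PREGRASPS = {                      # robot-hand joint targets per grasp type
    "power":     np.full(9, 1.0),
    "precision": np.full(9, 0.4),
}

def classify_grasp(observed_angles):
    """Pick the prototype closest to the observed human hand posture."""
    return min(GRASP_PROTOTYPES,
               key=lambda g: np.linalg.norm(GRASP_PROTOTYPES[g] - observed_angles))

observed = np.array([1.1, 1.2, 1.3, 1.1, 1.2, 1.0, 1.1, 1.2, 1.3])
grasp_type = classify_grasp(observed)
print(grasp_type, ROBOT_PREGRASPS[grasp_type])
```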

Results for imitation based grasping

Before implementation of the imitation grasping subsystem described above, our system had to use pre-programmed associations between known objects and suitable grasps that had been ‘hand-tuned’ for a limited range of objects in rather labor-intensive experiments. From this work, we also knew that our artificial hand, despite its serious limitations, can grasp a large number of real-world objects. However, due to the enormous range of possible shapes, generalizing pre-programmed grasps to new and
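
The kind of pre-programmed association replaced by the imitation subsystem can be pictured as a hand-tuned lookup table that simply has no entry for unknown objects; the object names and grasp labels below are illustrative only.

```python
# Hand-tuned object-to-grasp lookup (illustrative entries, not the system's table).
HAND_TUNED_GRASPS = {
    "cup": "power",
    "pen": "precision",
    "ball": "power",
}

def select_grasp(object_label):
    try:
        return HAND_TUNED_GRASPS[object_label]
    except KeyError:
        # No generalization: objects outside the tuned set require either new
        # manual tuning or, as proposed here, a grasp imitated from the user.
        raise ValueError(f"no pre-programmed grasp for '{object_label}'")
```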

A datamining perspective on robot learning

Regarding learning as a central ingredient to facilitate the construction of complex systems shifts our view from a complex robot whose behavior unfolds according to well-chosen, explicitly designed control mechanisms to a view in which a robot much more resembles a kind of ‘datamining engine’, foraging flexibly for information and regularities in the sensory images of its environment. This suggests adopting a perspective similar to that taken in the field of datamining, and exploiting algorithms from that
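
In this spirit, even very simple statistics mined from the stream of inter-module messages, such as message rates per sender, already give a useful picture of the running system; the log format and field names in the sketch below are assumptions, and the sketch does not reproduce AVDisp itself.

```python
from collections import Counter

def message_rates(log, window_seconds):
    """Aggregate message counts per sending module over a time window.
    log: iterable of (timestamp_seconds, sender, topic) tuples (assumed format)."""
    counts = Counter(sender for _, sender, _ in log)
    return {sender: n / window_seconds for sender, n in counts.items()}

# Hypothetical excerpt of a one-second message log.
example_log = [(0.1, "vision", "objects"), (0.4, "vision", "objects"),
               (0.5, "speech", "instruction"), (0.9, "arm", "status")]
print(message_rates(example_log, window_seconds=1.0))
```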

Conclusions and outlook

Our initial assumption is that situated and multi-modal interaction is a key prerequisite for learning in artificial intelligent perception–action systems. Thus, we will proceed with the development of the current platform and use it as a basis for a systematic refinement of the described learning architecture. The longer-term goal is to demonstrate speech-enabled imitation learning for instructing grasping tasks, because multi-fingered grasping combines many of the highly developed

Acknowledgments

Among many people who contributed to the robot system, we thank in particular G. Fink, J. Fritsch, G. Heidemann, T. Hermann, J. Jockusch, N. Jungclaus, F. Lömker, P. McGuire, R. Rae, G. Sagerer, S. Wrede, S. Wachsmuth, J. Walter. For further contributions of the SFB 360 ‘Situated Artificial Communicators’ and the neuroinformatics and applied informatics groups at the Faculty of Technology of the Bielefeld University see the references.

References (44)

  • K. Doya, What are the computations of the cerebellum, the basal ganglia, and the cerebral cortex?, Neural Networks, 1999.
  • Y. Rui et al., Image retrieval: current techniques, promising directions and open issues, Journal of Visual Communication and Image Representation, 1999.
  • S. Schaal, Is imitation learning the route to humanoid robots?, Trends in Cognitive Sciences, 1999.
  • P.K. Allen et al., Integration of vision, tactile sensing for grasping, International Journal of Intelligent Machines, 1999.
  • P. Andry et al., Learning and communication via imitation: an autonomous robot perspective, IEEE SMC, 2001.
  • P. Bakker, Y. Kuniyoshi, Robot see, robot do: an overview of robot imitation, in: Proceedings of the AISB Workshop on...
  • C. Bauckhage, G.A. Fink, J. Fritsch, F. Kummert, F. Lömker, G. Sagerer, S. Wachsmuth, An integrated system for...
  • A. Bicchi, V. Kumar, Robotic grasping and contact: a review, in: Proceedings of the Conference ICRA, 2000, pp....
  • A. Billard, M.J. Mataric, A biologically inspired robotic model for learning by imitation, in: Proceedings of the...
  • H. Brandt-Pook, G.A. Fink, S. Wachsmuth, G. Sagerer, Integrated recognition and interpretation of speech for a...
  • C. Breazeal, B. Scassellati, Challenges in building robots that imitate people, in: K. Dautenhahn, C. Nehaniv (Eds.),...
  • C. Breazeal, B. Scassellati, A context-dependent attention system for a social robot, in: Proceedings of the IJCAI,...
  • Y. Chen, J.Z. Wang, R. Krovetz, An unsupervised learning approach to content-based image retrieval, in: Proceedings of...
  • J.A. Driscoll, R. Alan Peters II, K.R. Cave, A visual attention network for a humanoid robot, in: Proceedings of the...
  • G.A. Fink, Developing HMM-based recognizers with ESMERALDA, in: V. Matoušek, P. Mautner, J. Ocelíková, P. Sojka...
  • G.A. Fink, N. Jungclaus, H. Ritter, G. Sagerer, A communication framework for heterogeneous distributed pattern...
  • L. Han et al., Grasp analysis as linear matrix inequality problems, IEEE Transactions on Robotics and Automation, 2000.
  • T. Hermann, C. Niehus, H. Ritter, Interactive visualization and sonification for monitoring complex processes, in:...
  • Ch. Borst, M. Fischer, G. Hirzinger, Calculating hand configurations for precision and pinch grasps, in: Proceedings of...
  • G. Heidemann, D. Lücke, H. Ritter, A system for various visual classification tasks based on neural networks, in: A....
  • A.J. Ijspeert, J. Nakanishi, S. Schaal, Trajectory formation for imitation with nonlinear dynamical systems, in:...
  • N. Jungclaus, R. Rae, H. Ritter, An integrated system for advanced human–computer interaction, in: Proceedings of the...

J.J. Steil received the Diploma in Mathematics from the University of Bielefeld, Germany, in 1993. Since then he has been a member of the Neuroinformatics Group at the University of Bielefeld, working mainly on recurrent networks. In 1995/1996 he spent a year at the St. Petersburg Electrotechnical University, Russia, supported by a German Academic Exchange Service (DAAD) grant. In 1999, he received the PhD degree with a dissertation on ‘Input–Output Stability of Recurrent Neural Networks’; in 2002 he was appointed Tenured Senior Researcher and Teacher (Akad. Rat). J.J. Steil is a staff member of the special research unit SFB 360 ‘Situated Artificial Communicators’ and the Graduate Program ‘Task-oriented Communication’, heading projects on robot learning and intelligent systems. His main research interests are analysis, stability, and control of recurrent dynamics and recurrent learning, as well as cognitively oriented architectures of complex robots for multi-modal communication and grasping. He is a member of the IEEE Neural Networks Society.

F. Röthling completed a 3-year apprenticeship in communication electronics, working in the field of remote communication at Deutsche Telekom. He received his Diploma in Computer Science from the University of Bielefeld, Germany, in June 2000. He is currently pursuing a PhD in Computer Science at the University of Bielefeld, working within the Neuroinformatics Group and the Collaborative Research Center SFB 360 ‘Situated Artificial Communicators’. His field of research is multi-fingered hand-control architectures and the development of grasping algorithms using tactile and visual sensors.

R. Haschke received the Diploma in Computer Science from the University of Bielefeld, Germany, in 1999. Since then he has been a member of the Neuroinformatics Group at the University of Bielefeld, working on recurrent neural networks and virtual robotics. He recently finished a PhD within the graduate program ‘Structure Forming Phenomena’. Currently he works on learning architectures in the special research unit SFB 360 ‘Situated Artificial Communicators’. His main research interests are analysis and control of recurrent dynamics, learning architectures for complex cognitive robots, and models, strategies, and evaluation criteria for grasping.

H. Ritter studied Physics and Mathematics at the Universities of Bayreuth, Heidelberg, and Munich and received a PhD in Physics from the Technical University of Munich in 1988. Since 1985, he has been engaged in research in the field of neural networks. In 1989 he moved as a Guest Scientist to the Laboratory of Computer and Information Science at Helsinki University of Technology. Subsequently he was an Assistant Research Professor at the then newly established Beckman Institute for Advanced Science and Technology and the Department of Physics at the University of Illinois at Urbana-Champaign. Since 1990, he has been Professor at the Department of Information Science, University of Bielefeld. His main interests are principles of neural computation, in particular self-organizing and learning systems, and their application to machine vision, robot control, data analysis, and interactive man–machine interfaces. In 1999, Helge Ritter was awarded the SEL Alcatel Research Prize, and in 2001 the Leibniz Prize of the German Research Foundation (DFG).
