1 Introduction

Robots are entering everyday life, not only as factory robots but also as service robots that help people improve the performance of their work and the quality of their life. While factory robots work in well-structured environments, service or personal robots will work in human environments that are less predictable and less structured. These robots will need the ability to adapt to changing situations and to continuously learn new information about the surrounding environment. Moreover, these robots should be able to learn without constant human supervision: learning should be autonomous and continuous, exploiting intermittent interactions with humans and performing autonomous actions in order to acquire information. Among many different skills, a robot working in a human environment should be able to perceive the space around it in order to identify meaningful elements such as parts of its own body, objects, and humans. In this paper, we focus on learning the appearances of, and recognizing, the elements that appear in the working space of a humanoid robot; we call these elements physical entities. Learning is performed based both on passive observation when a human manipulates objects in front of the robot and on interactive actions of the robot (Fig. 1).

Fig. 1

The main modules of the proposed approach: learning through passive observation when a human manipulates an object in front of the robot and learning through interactive actions of the robot

Various computer vision approaches achieve good performance in detecting specific physical entities of particular classes, such as human faces (Viola and Jones 2004), skin parts (Yang and Ahuja 1999), and coloured (Gevers and Smeulders 1999) or textured (Belongie et al. 1998) objects. Most of these approaches rely on prior knowledge, either assuming very specific objects (such as human hands (Wersing et al. 2007) or robot hands (Nagi et al. 2011) of a particular color, or using artificial markers (Fiala 2005)) or requiring carefully created image databases in which images of each object are labeled in order to perform supervised learning. For example, the organizers of the Pascal VOC challenge (Everingham et al. 2014) put considerable effort into creating and improving image databases that have greatly benefited algorithm performance over the years. Other approaches include a specific object learning phase, for example using a turntable to rotate an object and learn its appearance from different viewing angles. Prior knowledge and supervision facilitate object detection, but they are not easily applicable to autonomous robots that need to adapt to different human users and new objects at any time. Indeed, in such a setup, specific or supervised approaches limit the adaptability of the robot, since it is difficult to extend them to online, continuous detection and learning of new objects without specific human supervision. Therefore, we propose that object recognition in this context should be based on general high-level representations and learning methods that can be applied to all physical entities of the environment and can support learning by observation and by interaction.

Human development is a very motivating example of efficient learning about the environment without explicit supervision. Indeed, object representation is considered one of the few core knowledge systems that form the basis of human cognition (Spelke and Kinzler 2007). It is interesting to note that these capabilities are acquired progressively over a long period during infancy that plays an important role in human life. At first, a baby learns mostly through observation, because of its limited manipulation capabilities, in an environment where the parents are present most of the time. Thus the social environment is the cause of a large part of the sensory stimuli, even if the social engagement of the baby remains limited. Progressively, the baby learns about its own body and its control, which then makes it possible to manipulate objects (Piaget 1999). Many studies (e.g. Harman et al. 1999) have shown that this capability improves knowledge of the surrounding world and in particular of objects. Social interactions then take a growing importance as learning focuses on more complex activities. Infant development has inspired a variety of research studies on autonomous robot learning. The characteristics of the infant learning process, such as being continuous, incremental, and multi-modal, are reflected in different approaches in developmental robotics (Weng et al. 2001). In contrast to traditional robotics, a developmental approach does not focus on the fast achievement of predefined goals, but rather on an open-ended learning process, where performance improves over time and the learning process is flexible enough to adapt to changing circumstances.

In this paper, we propose a developmental approach that takes inspiration from human development related to object appearance learning and recognition (Spelke 1990). We describe a perceptual system that makes it possible for a robot to learn about physical entities in its environment in a two-stage developmental scenario (Fig. 1):

  1. learning by observation: the robot learns appearance models of moving elements, where the motion is mostly produced by a human partner who demonstrates different objects,

  2. interactive learning: the robot interacts with objects in order to improve its knowledge of object appearances after having identified the parts of its own body, the parts of a human partner, and manipulable objects.

Our main contribution is the integration of a generic perception capability, self- and others-identification, and interactive actions for active exploration of the surrounding environment and its objects. Our algorithm requires very limited prior knowledge and does not require predefined objects, image databases for learning, or dedicated detectors such as markers or human face/skin/skeleton detectors. Instead, using a color and depth camera (hereafter called RGB-D sensor), the visual space is autonomously segmented into physical entities whose appearances are continuously and incrementally learned over time and synthesized into multi-view representation models. All entities are then categorized into parts of the body of the robot, human parts, and manipulable objects, which makes it possible to correctly update object models during their manipulation. Note that even if social interactions may play a large part in the learning of objects, they are not the subject of the current paper; we refer to our previously published work on socially-guided learning, where the robot learns objects with a human partner providing additional feedback used to guide learning (Ivaldi et al. 2013; Nguyen et al. 2013).

The paper is organized as follows: Section 2 gives a brief overview of related work on unsupervised learning and interactive learning including self- and others-identification; the proposed perceptual approach of learning through observation is detailed in Sect. 3 and the interactive learning approach is described in Sect. 4; the experimental evaluation is reported in Sect. 5; and Sect. 6 is devoted to discussion of the results.

2 Related work

We are working on unsupervised object learning and interactive perception as a generic approach towards autonomous learning integrating perception and control. Object learning has been addressed in a huge number of computer vision approaches whose exhaustive review is outside the scope of this paper (see Grauman and Leibe (2011) or Everingham et al. (2014), for example). We will therefore restrict ourselves to the approaches closely related to our algorithmic choices. Interactive perception has been used for detecting and segmenting objects in a scene, for learning object properties and appearances, and for exploring affordances. Moreover, some studies on interactive perception integrate identification of parts of the robot (especially hands) and use their localization to improve object segmentation or learning algorithms. We will not cover the more general area of learning by demonstration, as our approach only depends on the entity motions produced by humans manipulating objects, which are used to learn appearance models; it does not rely on a detailed analysis of the human demonstrations and does not try to imitate the human behaviour.

2.1 Unsupervised object learning

In our approach, as suggested by studies on the development of object perception capabilities in humans (Spelke 1990), the perception of the environment begins with the detection of meaningful elements in the visual field of the robot. These elements are detected from generic principles such as cohesion and continuity, while most traditional object detection approaches are based on prior knowledge or dedicated algorithms providing robust detection of specific objects of particular categories. More generic approaches segment a scene into coherent image regions and further segment objects from the background based on consistency of visual characteristics (Southey and Little 2006) or motion behaviour (Prest et al. 2012). Similar principles have also been used to detect and model objects using laser range finders (Modayil and Kuipers 2008). Other unsupervised approaches in vision aim at detecting not a concrete object, but evidence of an object's existence, or a proto-object (Pylyshyn 2001; Rensink 2000). Taking inspiration from human vision, a proto-object is defined as a unit of attention or a localized visual area with certain properties, representing a possible object or part of one. Proto-object detection is often based on biologically motivated mechanisms of selective attention, for example visual saliency (Orabona et al. 2007; Walther and Koch 2006).

Once an object or a proto-object is detected, its visual appearance is analyzed and often encoded within more compact descriptors characterizing local features or general visual content, such as color or texture (Burger and Burge 2008). While balancing robustness, speed, and the ability to preserve information, a good descriptor should make it possible to discriminate different objects while accommodating intra-object variations. Based on extracted features, an efficient object representation should characterize a significant part of the visual content in a short description. In order to improve recognition, an object representation can combine several types of visual features. In this case, object recognition is more efficient with complementary descriptors characterizing different types of visual data while avoiding redundancy (Dickscheid et al. 2011).

A widely used object representation method is the Bag of Words (BoW). It represents objects or images as collections of unordered features quantized into dictionaries of visual words, and each object is encoded by its visual words. In this case, the learning procedure consists in training a classifier on extracted visual words, and the recognition procedure consists in applying the classifier to extracted visual words (Sivic and Zisserman 2003). Among existing studies, there are many variations of BoW based on pixel-level description (Aldavert et al. 2010), image patches (Shotton et al. 2008), or local features, for example keypoints (Sivic and Zisserman 2003; Filliat 2007), edges (Fergus et al. 2005), and regions (Russell 2006). Instead of using a simple list of visual words, the importance of each visual word can be taken into account using the term frequency-inverse document frequency (TF-IDF) approach (Sivic and Zisserman 2003). In this case, an object is encoded by the occurrence frequencies of its visual words, and the TF-IDF approach is used to evaluate the importance of words with respect to objects and to give higher weights to distinctive visual words. An inverted index makes it possible to quickly compare each set of extracted visual words with all memorized objects.

The main weakness of BoW approaches is the absence of spatial relations between visual words inside images. This limitation is addressed in variations of BoW, such as part-based models like the Constellation model (Fergus et al. 2003) or the k-fans model (Crandall et al. 2005). Part-based models combine appearance-based and geometrical models, where each part represents local visual properties, and the spatial configuration between parts is characterized by a statistical model or spring connections representing a “deformable” relation between parts. These models are based on learning the geometrical relations between image parts or features, such as local features (Fergus et al. 2003) or edges (Fergus et al. 2005).

2.2 Interactive learning

In the context of learning about the surrounding environment, some knowledge can be acquired through simple observation, without performing any action, through the image processing techniques reviewed in the previous section. However, it is not easy to bind all the gathered information into coherent object representations and to learn the overall appearances of the objects. Actions of the robot provide the ability to detect manipulable objects in a scene, segment them from the background, and better learn their overall appearances and properties, thus helping to find an appropriate way of interacting with these objects. Interactive actions are useful both for object learning and for object recognition in ambiguous situations, when dealing with several similar objects and when more evidence is needed for object identification (Browatzki et al. 2012).

Several approaches have been proposed to take advantage of interactive actions and various perceptual channels. For example, in Torres-Jara et al. (2005), an unknown object is manipulated and tapped with the robot finger in order to produce a sound that is used to recognize this object. The authors of Sinapov et al. (2011) propose a more complex approach that integrates auditory and proprioceptive feedback when performing five different actions on a set of 50 objects, showing very high recognition rates. In Chu et al. (2013), an advanced tactile sensor is used with five different exploratory procedures in order to associate haptic adjectives (i.e. categories like hard, soft, ...) with objects. And in Griffith et al. (2011), the evolution of the visual motion of objects during robot actions is analysed to classify objects into two categories, container and non-container. All these approaches take advantage of the behaviour of the object during or after manipulation, and therefore they are not applicable in the observation-based scenario that we use as a first stage in this paper. They could, however, be used as an interesting complement to our system for the integration of multi-modal information whenever the visual information is not sufficient for recognition.

We therefore focus on interactive approaches aimed at learning visual object appearances. In Natale et al. (2005), an object model is learned when the robot brings the object closer to the visual sensor and captures images at four positions and orientations of the object. In Ude et al. (2008), an object representation is generated from snapshots captured from several viewpoints, while the object is intentionally placed by the robot at the center of its visual field, rotated, and segmented from the background using a pre-learned background model. In Browatzki et al. (2012), an object representation is learned as a collection of views captured at orientations that are selected to maximize new information about the object. The object segmentation consists of cropping the central part of a captured image and subtracting the pre-learned background. A common limitation of these approaches is that the robot does not detect or grasp an object by itself; the object is provided by a human partner who places it directly in the hand of the robot. This scenario simplifies the system, since it does not require object detection, localization, or grasp planning.

Perception and action can also be integrated into autonomous object exploration performed without human assistance. In Schiebener et al. (2013), a pushing behaviour is used to move objects lying on a table in order to improve visual object segmentation and observe different views. The resulting images are used to train a classifier using a Bag of Words representation. In Gupta and Sukhatme (2012), two simple action primitives are used to spread piled Lego blocks in order to be able to sort them. A more advanced scheme is proposed in van Hoof et al. (2014) to decide which push to perform in order to segment cluttered scenes on a table using a complex probabilistic model. However, these two last approaches do not integrate object learning and recognition. In Kraft et al. (2008), a sophisticated vision system provides a set of 2D and 3D features that makes it possible to generate object grasping hypotheses. Successful grasping then allows precise object motion, which is used to integrate features from several views in order to produce coherent 3D models. In Krainin et al. (2011), object manipulation is used to autonomously generate complete 3D models of objects using an RGB-D camera. An initial grasp is performed through heuristics, before moving the object following an algorithm that optimizes the information gained by the new view. This approach relies mainly on 3D model matching using the dense data provided by the camera. In contrast, our interactive learning approach is not designed to improve object segmentation (and is thus limited in its capacity to segment cluttered scenes), but to improve the object appearance models by providing additional representative views. Moreover, we do not seek to produce precise 3D models of objects, but rather use multiple-view appearance models for their adaptability to changing observation conditions and their capacity to represent deformable objects (which is, however, not tested in the current paper).

Most interactive object exploration approaches make use of knowledge about the body of the robot. This knowledge can concern the body structure used for control, corresponding to the concept of body schema, or the appearance of the body, corresponding to the body image (Hoffmann et al. 2010). In Natale et al. (2005), hand tracking is used for fixation on the object during manipulation, whereas in Metta and Fitzpatrick (2003), hand localization is used to improve object segmentation. In Krainin et al. (2011), a precise 3D model of the robot hand is used to precisely localize objects and remove the robot's parts from the object models. Therefore, in interactive scenarios, the self-identification and localization of the parts of the robot in the visual field allow more efficient processing of visual information during and after interaction with objects. In our approach, we assume very limited prior knowledge of the body of the robot, and we show that, as far as perceptual learning is concerned, the raw motor values are sufficient to learn and continuously adapt a body image that is sufficient to learn about objects during manipulation.

2.3 Self- and others-discrimination

As explained before, knowledge about the body image of the robot provides advantages for interactive exploration of the environment. As an inspiration, child development, and especially the sensorimotor developmental stages, demonstrates the importance of exploring one's own body. An infant starts to learn about the world by developing a sense of its own body, and later on performs interactive actions directed at the exploration of the environment (Piaget 1999).

2.3.1 Robot self-discovery

Among the variety of studies on self-discovery for robots learning their body image, most are based on prior knowledge or resort to local approaches. Some strategies exploit a predefined motion pattern of the robot, a predefined appearance of the body, or a known body schema, such as the joint-link structure. For example, in Hulse et al. (2009), the hand of the robot is detected based on a grasped object of known appearance, and hand tracking is based on tracking the object. In Nagi et al. (2011), the identification of the hand of the robot is based on wearing a glove of known color. These techniques simplify robot self-identification but impose some limitations. Since these algorithms depend on a fixed appearance or behaviour, they cannot easily adapt to changes in the appearance or motion pattern of the robot. Independence from prior knowledge would make it possible to overcome these limitations and generalize self-identification to new appearances and new end-effectors, such as grasped tools.

In early studies, the detection of the hand of the robot was based on its motion (Marjanovic et al. 1996). An important limitation of this approach is the assumption of a single source of motion. However, in real environments, visual motion can be produced not only by the robot itself but also by other agents, which can be robots or humans.

Considering visual motion as the response to an action, the visual motion that follows almost immediately after an action of the robot can be used as a cue to localize the parts of the robot in the visual field. Based on this principle, self-identification based on the time-correlation between an executed action and visual motion is performed in Metta and Fitzpatrick (2003), Michel et al. (2004), and Gold and Scassellati (2006). In Michel et al. (2004), localization of the hand is based on a pre-learned time delay between the initiation of an action and the emergence of the hand in the visual field. Assuming a single source of motion at a time, the hand is identified as the moving region appearing first within the pre-learned time window after the initiation of the action. In Metta and Fitzpatrick (2003), localization of the hand is based on the amount of correlation between the velocity of the movement and the optical flow in the visual field. This method makes it possible to identify the hand among multiple sources of motion without requiring a priori information about the hand appearance.

A developmental approach to the identification of the body of the robot based on visuomotor correlation is proposed in Saegusa et al. (2012). Visuomotor correlation is estimated from proprioceptive and visual data acquired during head-arm movements. In the learning stage, the robot performs motor babbling and gathers visual and proprioceptive feedback in terms of visual motion and changes of motor states. In case of high correlation, the moving region is identified as a part of the robot, and the visuomotor information, such as the body posture and visual features, is stored in the visuomotor memory. This self-identification method is also adaptable to extended body parts.

2.3.2 Identification of self and others

A generic method aimed at understanding a dynamic environment based on contingency is proposed in Gold and Scassellati (2006). The method makes it possible to discriminate actions performed by the robot from actions performed by other physical actors by considering the time delays between the actions and the responses and their respective durations. Autonomous identification of the hand of the robot during natural interaction with a human is proposed in Kemp and Edsinger (2006). The approach is based on mutual information estimated between the visual data and proprioceptive sensing. The value of mutual information is used to identify which visual features in a scene are influenced by actions of the robot. Since the system is aimed at detecting parts of humans and robots, it mainly focuses on visual regions that are close to the visual sensor and regions moving with high speed.

We are interested in the identification of the parts of the robot during natural human-robot interaction, and also in the identification of the parts of human partners and of possible objects. We therefore propose a generic identification algorithm that is independent of the appearance and motion pattern of the robot and capable of identifying these three categories. This algorithm will be integrated with interactive object exploration in order to enhance the learning process.

3 Learning by observation

In this section, we describe the first stage of the proposed developmental approach, which allows a robot to detect physical entities in its close environment and learn their appearances during demonstrations by a human partner. Our approach is based on online incremental learning, and it does not require image databases or specialized face/skin/skeleton detectors. All knowledge is iteratively acquired by analyzing the visual data. Starting from the extraction of low-level image features, the gathered information is synthesized into higher-level representation models of physical entities. Given the placement of the visual sensor, the visual field of the robot covers the interaction area including parts of the body of the robot, parts of a human partner, and manipulated objects.

In this work, we have chosen to use a Kinect RGB-D sensor (Kinect, Zhang (2012)) instead of stereo-vision based on the cameras in the robot's eyes. Our choice is justified by the efficiency and precision of the RGB-D data, since the Kinect sensor allows fast acquisition of reasonably accurate depth data, as will be discussed in Sect. 6. Both RGB and depth data are acquired with the OpenNI library. Depth data are only used during the proto-object detection procedure to refine the boundaries of possible objects. The overall object learning algorithm can therefore work without this optional step requiring depth data, using RGB data only, which could be provided by the embedded visual sensor.

3.1 Segmentation of the visual space into proto-objects

Learning about the close environment of the robot begins by segmenting the visual space into proto-objects as salient units of attention that correspond to possible isolated or connected physical entities. The main processing steps towards detection and segmentation of proto-objects are shown in Fig. 2.

Fig. 2

Detection and segmentation of proto-objects

Our proto-object detection approach relies on motion-based visual attention, since motion carries a significant part of information about events happening in the environment and their actors (Goldstein 2010). In our scenario, moving regions in the visual field correspond mainly to parts of the body of the robot, parts of a human partner, and manipulated objects, which are the entities we seek to learn. Moreover, in the case of motion-based visual attention, a human partner can attract the attention of the robot by simply interacting with an object in order to produce visual motion.

Our motion detection algorithm is based on a running average background model and image differencing. After detecting moving pixels, we fill holes and remove noisy pixels by applying the erosion and dilation operators from mathematical morphology (Shih 2009). Further, based on the constraints of the working area of the robot, we ignore the visual areas that are unreachable for the robot.

The detected moving regions of the visual field are analyzed as probable locations of proto-objects. Inside each moving region, we extract Good Features to Track (GFT) (Shi and Tomasi 1994), developed especially for tracking purposes. The extracted GFT points are tracked between consecutive images using the Lucas-Kanade method (Lucas and Kanade 1981), chosen for its small processing cost, accuracy, and robustness. We analyse the motion behaviour of the tracked points in order to detect areas of uniform motion, which make it possible to isolate proto-objects inside moving image regions. Tracked points are grouped into clusters based on their relative position and velocity, following an agglomerative clustering algorithm. Initially, each tracked point composes its own cluster; then, at each iteration, we merge the two clusters with the smallest distance given by:

$$\begin{aligned} d(c_i,c_j) = \alpha *\Delta V(c_i,c_j) + (1-\alpha )*\Delta L(c_i,c_j); \end{aligned}$$
(1)

where \(d(c_i,c_j)\) is the distance measure between two clusters \(c_i\) and \(c_j\), \(\Delta L(c_i,c_j)\) is the Euclidean distance between the clusters’ mean positions, \(\Delta V(c_i,c_j)\) is the difference between the clusters’ mean velocities, and \(\alpha \) is a coefficient giving more importance to one of the two characteristics. We set this coefficient to \(\alpha =0.8\) (giving more importance to velocity) by optimizing the proto-object detection rate (see Sect. 5.2) on a set of object demonstrations.

We continue to merge GFT points into clusters until a specified threshold on the minimal distance is reached. This threshold is set to 0.0087, also by optimizing the proto-object detection rate. Each resulting cluster of coherent GFT points is a proto-object, which is the basic element of the following processing. Each detected proto-object is tracked over images: it is considered tracked from the previous image if more than half of its GFT points are successfully tracked.
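As an illustration, the clustering step can be sketched in Python as follows, using the distance of Eq. 1 with \(\alpha =0.8\) and the stopping threshold 0.0087. This is a minimal sketch, not the authors' implementation; the function name is ours, and positions and velocities are assumed to be expressed in normalized image coordinates.

import numpy as np

ALPHA = 0.8               # weight on the velocity difference (Eq. 1)
MERGE_THRESHOLD = 0.0087  # stop merging when the smallest distance exceeds this

def cluster_tracked_points(positions, velocities):
    """Greedy agglomerative clustering of tracked GFT points.

    positions, velocities: (N, 2) arrays. Returns a list of index lists,
    one list per proto-object.
    """
    clusters = [[i] for i in range(len(positions))]

    def cluster_stats(c):
        return positions[c].mean(axis=0), velocities[c].mean(axis=0)

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            li, vi = cluster_stats(clusters[i])
            for j in range(i + 1, len(clusters)):
                lj, vj = cluster_stats(clusters[j])
                d = (ALPHA * np.linalg.norm(vi - vj)
                     + (1 - ALPHA) * np.linalg.norm(li - lj))  # Eq. 1
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > MERGE_THRESHOLD:   # no pair close enough: stop merging
            break
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge the two closest clusters
    return clusters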

Each proto-object can be segmented from the background based on a convex hull of its GFT points. However, this convex hull does not always correspond to the real object boundary. If a convex hull is based on few GFT points, it often cuts the proto-object border or captures the background and surrounding items. In order to improve the proto-object segmentation, the results of tracking performed on RGB images are consolidated with processing of the depth data, and the depth variation in the visual field is used to obtain more precise boundaries.

When processing the depth data, a median blur filter (Huang et al. 1979) is first applied to smooth the depth values and reduce noise in the data. Then, the Sobel operator (Duda et al. 2000), based on the first derivative, is used to detect horizontal and vertical edges, revealing the depth variation in the visual field. Noisy and non-significant edges are filtered out by thresholding the obtained results, and the dilation and erosion operations (Shih 2009) are then used to close broken contours. The obtained continuous contours are transformed into binary masks. An additional benefit of this step is its ability to separate several static physical entities located close to each other; if a convex hull of GFT points groups together several static entities, the processing of the depth data makes it possible to isolate the corresponding proto-objects inside a single convex hull.
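A minimal OpenCV sketch of this depth-based boundary refinement is given below; the kernel sizes and the edge threshold are illustrative assumptions, not values reported in the paper.

import cv2
import numpy as np

def depth_boundary_mask(depth, blur_ksize=5, edge_thresh=30.0):
    """Build a binary edge mask from a depth image.

    The mask of closed depth contours can then be intersected with the
    convex hull of a proto-object's GFT points to refine its boundary.
    """
    depth = cv2.medianBlur(depth.astype(np.float32), blur_ksize)  # smooth sensor noise
    gx = cv2.Sobel(depth, cv2.CV_32F, 1, 0, ksize=3)              # horizontal depth variation
    gy = cv2.Sobel(depth, cv2.CV_32F, 0, 1, ksize=3)              # vertical depth variation
    edges = cv2.magnitude(gx, gy)
    _, edges = cv2.threshold(edges, edge_thresh, 255, cv2.THRESH_BINARY)
    edges = edges.astype(np.uint8)
    kernel = np.ones((3, 3), np.uint8)
    edges = cv2.dilate(edges, kernel)   # close broken contours
    edges = cv2.erode(edges, kernel)
    return edges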

3.2 Entity appearance representation

The appearance of each of the proto-object regions obtained in the previous section must then be characterized in order to be learned or recognized later in our system. For this objective, we use complementary low-level visual features that are further organized into hierarchical representations, as shown in Fig. 3. The appearance of a proto-object corresponds to a view, i.e., the appearance of an entity observed from one perspective. The view representation is based on the incremental Bag of visual Words (BoW) approach (Filliat 2007) extended by an additional feature layer incorporating local visual geometry. An entity then gathers the different appearances of the physical entity in a multi-view model encoded as a set of views. Note that a view can appear in several entities when two different objects share a common appearance from a particular point of view.

Fig. 3

Construction of an entity representation model

The robot should be able to deal with various entities, ranging from simple homogeneous objects with few features to complex textured objects. We choose a combination of complementary visual features that can represent all these objects. As a local descriptor, we use SURF (Bay et al. 2008) for its efficient and accurate characterization of local image areas, providing a good description of objects with many details. In order to deal with both textured and homogeneous coloured objects, we develop an additional descriptor operating at the level of regularly segmented image regions. The superpixel algorithm (Micusik and Kosecka 2009) is used to segment images into relatively homogeneous regions by grouping similar adjacent pixels. For segmentation, we use the watershed algorithm (Beucher and Meyer 1993) on the image convolved with a Laplacian of Gaussian, initialised with regularly spaced seeds. Each resulting superpixel is characterized by its average color encoded in the HSV space (hue, saturation, and value). Note that this segmentation is used to represent a proto-object as a set of colored regions and does not modify the proto-object segmentation obtained in Sect. 3.1.

The extracted low-level feature descriptors are incrementally quantized into dictionaries of visual words (Filliat 2007). Starting with a dictionary containing the first feature, each new feature is assigned to its nearest dictionary entry (a visual word) based on the Euclidean distance between their descriptors. If the distance between the current descriptor and every dictionary entry exceeds a threshold, a new visual word is added to the dictionary (see Algorithm 1). The quantization procedure provides two dictionaries, one for SURF descriptors and one for superpixel colors. The thresholds for the dictionaries were empirically chosen by optimizing the object recognition rate (see Sect. 5.3) on a small set of representative objects (both textured and textureless).

Algorithm 1

The size of the color dictionary remains relatively stable after processing several objects, since colors are often repeated among different objects. However, the SURF dictionary grows continuously with the number of objects. In order to avoid rapid growth of the SURF dictionary, we filter the SURF features before including them: only features that are seen over several consecutive frames (we use three consecutive frames) are stored in the dictionary, which is then used in the following processing as the ground level for view representation.
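The incremental quantization can be sketched as follows. This is a minimal Python illustration of the behaviour described for Algorithm 1; the class name and the example threshold values are our own assumptions, not the values used by the system.

import numpy as np

class IncrementalDictionary:
    """Incremental quantization of descriptors into visual words.

    A new descriptor is assigned to its nearest word if the Euclidean
    distance is below `threshold`; otherwise it becomes a new word.
    """
    def __init__(self, threshold):
        self.threshold = threshold
        self.words = []                        # list of descriptor vectors

    def quantize(self, descriptor):
        descriptor = np.asarray(descriptor, dtype=np.float32)
        if self.words:
            dists = np.linalg.norm(np.stack(self.words) - descriptor, axis=1)
            nearest = int(np.argmin(dists))
            if dists[nearest] < self.threshold:
                return nearest                 # id of an existing visual word
        self.words.append(descriptor)
        return len(self.words) - 1             # id of the newly created word

# one dictionary per feature type (thresholds here are hypothetical):
surf_dictionary = IncrementalDictionary(threshold=0.25)
color_dictionary = IncrementalDictionary(threshold=0.05)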

The low-level features are grouped into more complex mid-level features defined as pairs of low-level features. This feature layer incorporates local visual geometry and allows views to be characterized not only by a set of isolated features, i.e. colors or SURF points, but also by a more robust description that takes relative feature positions into account. For both types of features, each low-level feature is used to construct mid-features with its 4 neighboring low-level features that are closest in terms of the Euclidean distance in image space. Thus, each mid-feature \(m_{k}\) is a pair of visual words, implicitly encoding the corresponding visual features that have been perceived close to each other in image space:

$$\begin{aligned} m_{k} = (w_{a}, w_{b}), \end{aligned}$$
(2)

where \(m_{k}\) is a mid-feature, \(w_a\) and \(w_b\) are two visual words corresponding to neighbouring visual features.

Mid-features are incrementally quantized into dictionaries following the same procedure used for the quantization of low-level features. The dissimilarity between two mid-features is estimated as the minimum, over the two possible pairings of their components, of the summed Euclidean distances between the descriptors (Eq. 3). The quantization procedure provides dictionaries of SURF-pairs and superpixel-color-pairs.

$$\begin{aligned} \Delta (m_1, m_2) = min {\left\{ \begin{array}{ll} \Delta F(a_1, a_2) + \Delta F(b_1, b_2),\\ \Delta F(a_1, b_2) + \Delta F(b_1, a_2), \end{array}\right. } \end{aligned}$$
(3)

where \(m_1\) and \(m_2\) are two compared mid-features, and each mid-feature is a pair of features a and b; \(\Delta F\) is the dissimilarity between two features (one feature from the first pair and another feature from the second pair).
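For illustration, Eq. 3 can be written compactly as below; this is a sketch under the assumption that each mid-feature is stored as a pair of descriptor vectors, and the function name is hypothetical.

import numpy as np

def mid_feature_distance(m1, m2):
    """Dissimilarity between two mid-features (Eq. 3).

    Each mid-feature is a pair of low-level descriptors (a, b); the
    distance is the smaller of the two possible component pairings.
    """
    (a1, b1), (a2, b2) = m1, m2
    d = lambda x, y: np.linalg.norm(np.asarray(x) - np.asarray(y))
    return min(d(a1, a2) + d(b1, b2),
               d(a1, b2) + d(b1, a2))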

According to our representation model, all constructed mid-features are used to characterize proto-object appearances, i.e., views, and each view is encoded by the occurrence frequencies of its mid-features:

$$\begin{aligned} v_{j} = \{m_{k}\}, \end{aligned}$$
(4)

where \(v_{j}\) is a view and \(m_{k}\) is a mid-feature.

In images captured by a visual sensor, a 3D object is perceived as a 2D projection that depends on its position and viewing angle. These projections can differ significantly depending on the object's appearance and shape, and can also depend on the illumination, when reflected light produces shadows and saturation that make some parts of the object invisible (Goldstein 2010). In our approach, the overall appearance of each physical entity is characterized by a multi-view representation model (see Fig. 4) that covers possible changes in the appearance of an entity arising from different viewing angles and varying illumination. Each entity is encoded as a collection of views, where each view characterizes the appearance of one perspective of the entity:

$$\begin{aligned} E_i = \{v_{j}\}, \end{aligned}$$
(5)

where \(E_i\) is an entity and \(v_{j}\) is its observed view. Note that one view may be a part of several entities.

Fig. 4

Examples of representation models of four different entities (each model with its views is shown in one line)

3.3 View learning and recognition

Each proto-object detected in the visual space is either recognized as a known view or learned as a new view. The view recognition procedure consists of a likelihood estimation using a voting method based on the TF-IDF (term frequency-inverse document frequency) approach (Sivic and Zisserman 2003), followed by a Bayesian filter estimating the a posteriori probability of being one of the known views.

The voting method (see Fig. 5) is used to estimate the likelihood of a set of mid-features (extracted from the proto-object region) being one of the known views. Each mid-feature quantized into a visual word votes, with its TF-IDF score, for the views where it has been seen before. The TF-IDF score evaluates the importance of visual words with respect to views and gives higher weights to distinctive visual words. The voting method is fast, since it uses an inverted index that restricts the comparison to views sharing at least one mid-feature with the analyzed proto-object. The advantage of this approach with respect to supervised algorithms, such as support vector machines or boosting, is the ability to learn new views incrementally by updating mid-feature occurrence statistics, without knowing the number of views in advance and without re-processing all the data when adding a new view.

Fig. 5

The voting method: each extracted mid-feature votes for views, where it has been seen before

More formally, the likelihood of a mid-feature set \(\{m_k\}\) being the view \(v_j\) is computed as the sum of the products of the mid-feature frequencies and the inverse view frequencies:

$$\begin{aligned} L(v_{j}) = \sum _{m_{k}} tf(m_{k})idf(m_{k}), \end{aligned}$$
(6)

where \(tf(m_{k})\) is the occurrence frequency of the mid-feature \(m_{k}\), and \(idf(m_{k})\) is the inverse view frequency for the mid-feature \(m_{k}\).

The occurrence frequency of the mid-feature is computed as:

$$\begin{aligned} tf(m_{k}) = \frac{n_{m_{k}v_{j}}}{n_{v_{j}}}, \end{aligned}$$
(7)

where \(n_{m_{k}v_{j}}\) is the number of occurrences of the mid-feature \(m_{k}\) in the view \(v_{j}\), and \(n_{v_{j}}\) is the total number of mid-features in the view \(v_{j}\).

The inverse view frequency \(idf(m_{k})\) is related to the occurrence frequency of a mid-feature among all seen views; it is used to decrease the weight of mid-features that are often present in different views, and it is computed as:

$$\begin{aligned} idf(m_{k}) = log\frac{N_v}{n_{m_{k}}}, \end{aligned}$$
(8)

where \(n_{m_{k}}\) is the number of views with the mid-feature \(m_{k}\), and \(N_v\) is the total number of seen views.
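The following Python sketch illustrates how an inverted index supports this voting scheme (Eqs. 6-8). The class and attribute names are ours, and the real system may store these statistics differently; this is only a minimal sketch of the technique.

import math
from collections import defaultdict

class ViewMemory:
    """TF-IDF voting over mid-feature words.

    n_occ[v][m] : occurrences of mid-feature word m in view v
    n_view[v]   : total number of mid-features stored for view v
    inverted[m] : set of views containing word m
    """
    def __init__(self):
        self.n_occ = defaultdict(lambda: defaultdict(int))
        self.n_view = defaultdict(int)
        self.inverted = defaultdict(set)

    def add(self, view_id, mid_features):
        for m in mid_features:
            self.n_occ[view_id][m] += 1
            self.n_view[view_id] += 1
            self.inverted[m].add(view_id)

    def likelihoods(self, mid_features):
        """Return {view_id: L(v_j)} considering only views sharing a word."""
        n_views_total = max(len(self.n_view), 1)
        scores = defaultdict(float)
        for m in mid_features:
            candidates = self.inverted.get(m, ())
            if not candidates:
                continue
            idf = math.log(n_views_total / len(candidates))   # Eq. 8
            for v in candidates:
                tf = self.n_occ[v][m] / self.n_view[v]        # Eq. 7
                scores[v] += tf * idf                         # Eq. 6
        return scores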

The estimated likelihood is used for appearance-based recognition of views. However, views of different objects can be similar, and one object observed from a certain perspective can resemble another object. Recognition becomes even more difficult if an object is occluded, which often happens during manipulation. In our approach, the temporal consistency of recognition is improved by applying a Bayesian filter in order to reduce the potential confusion between entities recognized on a short time scale. Based on tracking, we predict the probability of recognizing the view from the a priori probability computed in the previous image and the probability of being tracked from the previous image. The final a posteriori probability of recognizing a view is estimated recursively using its likelihood and its prediction:

$$\begin{aligned} p_{t}(v_{j}) = \eta L(v_{j})\displaystyle \sum \limits _{l}p(v_{j}|v_{l})p_{t-1}(v_{l}), \end{aligned}$$
(9)

where \(L(v_{j})\) is the likelihood of recognizing the view \(v_{j}\), \(p(v_{j}|v_{l})\) is the probability that the current view is \(v_j\) if the view \(v_l\) was recognized in the previous image (we set \(p(v_{j}|v_{l})\) equal to 0.8 if \(v_{j}=v_{l}\) and \(0.2/(N_v-1)\) otherwise, with the total number of views being \(N_v\)), \(p_{t-1}(v_{l})\) is the a priori probability of the view \(v_{l}\) computed in the previous image, and \(\eta \) is the normalization term.

Depending on the highest a posteriori probability obtained among all known views, the proto-object can be

  • stored as a new view with the set of current mid-features, if the highest probability is lower than the threshold \(th_{v.n.}\),

  • recognized as the view with the highest probability and updated with the current set of mid-features, if the probability is higher than the threshold \(th_{v.u.}\),

  • recognized as the view with the highest probability but not updated, otherwise.

The thresholds \(th_{v.n.}\) and \(th_{v.u.}\) ensure that updates are performed only in case of high recognition confidence and that new views are created only in case of low recognition probability, thus avoiding duplicate views in memory. The update of the recognized view simply consists in updating the number of occurrences \(n_{m_{k}v_{j}}\) and \(n_{v_{j}}\) of each mid-feature in the view and the number of views containing the mid-feature \(n_{m_{k}}\), used for computing the \(tf-idf\) score (Eqs. 7 and 8).
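A minimal sketch of the filtering and decision step described above (Eq. 9 followed by the three-way threshold test); the uniform prior used when no previous estimate is available is our assumption, as are the function names.

def update_view_probabilities(likelihoods, prev_probs, n_views, p_same=0.8):
    """A posteriori view probabilities (Eq. 9) from TF-IDF likelihoods
    and the probabilities estimated for the previous image."""
    if not prev_probs:                          # no history: assume a uniform prior
        prev_probs = {v: 1.0 / max(n_views, 1) for v in likelihoods}
    posterior = {}
    for v, lik in likelihoods.items():
        # prediction: stay on the same view with p_same, switch otherwise
        pred = sum((p_same if v == l else (1.0 - p_same) / max(n_views - 1, 1)) * p
                   for l, p in prev_probs.items())
        posterior[v] = lik * pred
    norm = sum(posterior.values()) or 1.0       # normalization term eta
    return {v: p / norm for v, p in posterior.items()}

def view_decision(posterior, th_new, th_update):
    """Store a new view, update the recognized view, or recognize only."""
    if not posterior:
        return ("new", None)
    best = max(posterior, key=posterior.get)
    if posterior[best] < th_new:
        return ("new", None)
    if posterior[best] > th_update:
        return ("update", best)
    return ("recognize", best)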

3.4 Entity learning and recognition

The multi-view appearance model of the corresponding entity should finally be updated with the current view. Each identified view is therefore associated with an entity using either tracking or appearance-based recognition. In the case of successful tracking from the previous image, the current view is simply associated with the entity recognized in the previous image (see Fig. 6). When the entity is not tracked from the previous image, because the entity just appeared or because of a tracking failure due to motion blur for example, the entity is recognized using a maximum likelihood approach based on a voting method similar to the one used for recognizing views.

Fig. 6

The main steps of the entity learning/recognition; where \(v_{j}\) is the current view, \(E_i\) is the entity corresponding to \(v_j\) with the maximal likelihood \(L(E_i)\), \(E_l\) is the entity tracked from the previous image

The likelihood of the view \(v_j\) being part of one of the already known entities is computed as:

$$\begin{aligned} L(E_{i}) = tf(v_{j})idf(v_{j}), \end{aligned}$$
(10)

where \(tf(v_{j})\) is the occurrence frequency of the view \(v_{j}\), and \(idf(v_{j})\) is the inverse entity frequency for the view \(v_{j}\).

The occurrence frequency of the view is computed as \(tf(v_{j}) = \frac{n_{v_{j}E_{i}}}{n_{E_i}}\), where \(n_{v_{j}E_{i}}\) is the number of occurrences of the view \(v_{j}\) in the entity model \(E_{i}\), and \(n_{E_i}\) is the number of views in the entity model \(E_{i}\).

The inverse entity frequency is related to the view occurrence among all entities; it is used to decrease the weight of views that are often present in the models of different entities: \(idf(v_{j}) = log\frac{N_E}{n_{v_{j}}}\), where \(n_{v_{j}}\) is the number of entities with the view \(v_{j}\), and \(N_E\) is the total number of seen entities.

The entity recognition decision is based on several thresholds (similar to the recognition of views). The entity can be

  • stored as a new entity with the current view, if the maximal likelihood is lower than the threshold \(th_{e.n.}\),

  • recognized as the entity with the maximal likelihood and updated with the current view, if the likelihood is higher than the threshold \(th_{e.u.}\);

  • recognized as the entity with the maximal likelihood but not updated, otherwise.

As physical entities are identified and tracked over time, their multi-view representation models (see Fig. 4) are constructed and updated with the observed views.
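The entity-level decision flow (Fig. 6) can be sketched as follows. Here entity_memory is assumed to expose the same kind of tf-idf statistics as the ViewMemory sketch above, but over (view, entity) pairs (Eq. 10); the helper names likelihoods, add, and new_entity are hypothetical.

def associate_view_with_entity(view_id, tracked_entity, entity_memory,
                               th_e_new, th_e_update):
    """Associate the current view with an entity (sketch of Fig. 6)."""
    if tracked_entity is not None:                  # tracking succeeded
        entity_memory.add(tracked_entity, [view_id])
        return tracked_entity
    scores = entity_memory.likelihoods([view_id])   # L(E_i) for candidate entities
    if not scores or max(scores.values()) < th_e_new:
        return entity_memory.new_entity(view_id)    # store a new entity model
    best = max(scores, key=scores.get)
    if scores[best] > th_e_update:
        entity_memory.add(best, [view_id])          # update the recognized entity
    return best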

3.5 Connected entities recognition

In our scenario, objects are explored through manipulation. As we have observed during our experiments, object manipulation introduces additional difficulties in the processing of the visual data: both the hand and the grasped object are detected inside a single proto-object, and moreover, a hand holding the object produces multiple occlusions and sometimes divides the grasped object into parts. Therefore, we process each proto-object in a way that allows it to be recognized as several connected entities. This problem requires object segregation, as it is called in psychology. The object segregation capability is an important aspect of our approach, which is able to segment connected entities based on already acquired knowledge about entities seen alone.

In our approach, each proto-object is recognized either as a single entity or as two connected entities based on the following double-check procedure (see Fig. 7):

  1. all mid-features of the proto-object are used for recognition of the most probable view among all known views, as described in Sect. 3.3,

  2. the mid-features that do not appear in the most probable view are used for recognition of a possible connected view using the same procedure. The connected view is recognized if its probability is higher than \(th_{v.c.}\). If this recognition probability is low and more than 20% of the mid-features do not correspond to the first recognized view, then a new view is stored with these mid-features.

Fig. 7

Connected entities recognition: a all extracted mid-features (HSV pairs); b the mid-features of the first recognized view, c the mid-features of the first recognized view (shown by pink color) and the mid-features of the connected view (shown by blue color) (Color figure online)

Further, each identified view is associated with one of the physical entities, as described earlier. If both the object and the hand have already been seen separately, the corresponding entities exist in the visual memory, and they can be recognized as connected entities.

The ability to recognize connected entities is particularly important in scenarios with object manipulation. It helps prevent erroneous updates of view and entity models when the object is grasped. If both the object and the hand are identified as connected entities, then the view of the object will not be updated with the mid-features of the hand. Furthermore, the information about connected entities is also used during entity categorization and interactive object learning, presented in the following section.
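A sketch of the double-check procedure of this section is given below. The wrappers recognize_view and store_new_view are hypothetical names around the voting and Bayesian filtering machinery of Sect. 3.3; recognize_view is assumed to return the recognized view, its probability, and the mid-features it matched.

def recognize_connected_views(mid_features, recognize_view, store_new_view,
                              th_v_c, min_leftover_ratio=0.2):
    """Recognize a proto-object as one view or two connected views."""
    first_view, _, matched = recognize_view(mid_features)
    leftover = [m for m in mid_features if m not in matched]
    if not leftover:
        return [first_view]
    second_view, p2, _ = recognize_view(leftover)
    if p2 > th_v_c:
        return [first_view, second_view]               # two connected views
    if len(leftover) > min_leftover_ratio * len(mid_features):
        return [first_view, store_new_view(leftover)]  # unmatched part becomes a new view
    return [first_view]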

4 Interactive learning

In this section, we describe our second developmental stage, in which the robot manipulates objects to improve their models. As a pre-requisite, our approach first categorizes all entities into parts of the robot, parts of a human partner, and manipulable objects. This categorization process makes our approach robust to changes in the appearance of the robot's effector and allows object models to be updated efficiently without adding parts of the robot or human hands to them. Note that all the processes described in the previous sections remain active, thus making it possible to interleave the two developmental stages by introducing new entities at any time.

4.1 Entity categorization

The categorization procedure is aimed at identifying the nature of the physical entities detected in the visual field in the interactive scenario, when the robot and a human partner manipulate objects. Each physical entity is classified into one of the following categories: a part of the robot \(c_{r}\), a part of a human partner \(c_{h}\), an object \(c_{o}\), an object grasped by the robot \(c_{o+r}\), or an object grasped by a human partner \(c_{o+h}\). Before the identification of the body of the robot, which is a requirement for the identification of the other categories, all entities are temporarily associated with the unknown category \(c_{u}\), and their correct categories are identified later. Therefore, within the categorization procedure, the parts of the body of the robot are first discriminated among all physical entities, and the remaining single entities are then distinguished as either human parts or manipulable objects, as shown in Fig. 8.

Fig. 8

The categorization flowchart: parts of the robot \(c_{r}\) are discriminated based on mutual information (MI) between the visual and proprioceptive data; parts of a human partner \(c_{h}\) and objects \(c_{o}\) are distinguished based on both MI and statistics on entities motion; connected entities are categorized either as an object grasped by the robot \(c_{o+r}\) or an object grasped by a human partner \(c_{o+h}\)

4.1.1 Robot self-identification

Our goal is to implement a strategy that requires minimal prior knowledge and avoids the need for a predefined appearance of the robot, a joint-link structure, or a predefined behaviour. Independence from the appearance should allow robust recognition of the hands of the robot when their appearance changes, when they are occluded, and when they are extended by grasped tools. Independence from the behaviour makes it possible to perform recognition at any time, during a variety of interactive actions, without requiring a specific identification phase.

Therefore, during the motor activity of the robot (the actions performed by the robot will be described in Sect. 5.1.3), the visual information is gathered together with the proprioceptive data, and based on mutual information (MI) between these senses, the system identifies the parts of the body of the robot among detected physical entities. As the input data, we acquire and process:

  • visual information: the position of detected entities in the visual field,

  • proprioceptive information: joint values of the robot’s motors, accessed through YARP ports:

    • arm joints: shoulder (pitch, roll, and yaw), elbow, and wrist joints (pronosupination, pitch, and yaw),

    • torso joints: pitch, roll, and yaw.

The data acquisition is driven by the visual perception, and the states of the motors are acquired after receiving a new image from the visual sensor. The motor states are acquired as a set of arm-torso joint values without considering the functionality of each joint or the character of its impact on the displacement of the hands of the robot. We acquire one set of joint values per arm, with the torso joints included in each set. The head motors, however, are not analysed, since they do not affect the position of the hand observed from our external visual sensor.

Both visual and proprioceptive data need to be quantized in order to compute mutual information. Since the visual space has only two dimensions, a simple regular grid is used: for each detected entity, its position in image space is quantized into one of the visual clusters obtained by dividing the image space with a regular grid of 12 columns and 10 rows, producing 120 rectangular visual clusters. The joint space, however, has a higher dimensionality, and it is not possible to use a regular discretization along all the dimensions. The joint values are therefore quantized into a dictionary of arm-torso configurations, with each entry encoded as a vector of joint values. The quantization is incremental, i.e., it adds new clusters as required by the data, and we use the same algorithm as for the visual dictionary creation (Algorithm 1). This leads to a sparse representation of the joint space that adapts to any new joint configuration experienced by the robot. In our experiments, the mean number of arm-torso configurations generated by this procedure was about 40.

Mutual information is used to evaluate the dependencies between the arm-torso configurations \(A_k\) (either left or right arm) and the localization of each physical entity \(E_i\) in the visual cluster \(L_{E_i}\):

$$\begin{aligned} MI(L_{E_i}, A_{k}) = H(L_{E_i}) - Hc(L_{E_i}|A_{k}), \end{aligned}$$
(11)

where \(L_{E_i}\) is the position of the entity quantized into the visual cluster, \(A_{k}\) is the state of the arm k of the robot quantized into the arm-torso configuration, \(H(L_{E_i})\) is the marginal entropy, and \(Hc(L_{E_i}|A_{k})\) is the conditional entropy computed in the following way:

$$\begin{aligned} H(L_{E_i})= & {} - \sum \limits _{l} p(L_{E_i}=l) log(p(L_{E_i}=l)), \end{aligned}$$
(12)
$$\begin{aligned} Hc(L_{E_i}|A_{k})&= - \sum \limits _{a} p(A_{k}=a) \sum \limits _{l} p(L_{E_i}=l|A_{k}=a) \nonumber \\&\quad \times \, log(p(L_{E_i}=l|A_{k}=a)) \end{aligned}$$
(13)

where \(p(L_{E_i}=l)\) is the probability that the localization of entity \(E_i\) is the visual cluster l; \(p(A_{k}=a)\) is the probability that the arm-torso configuration \(A_{k}\) is the configuration a; and \(p(L_{E_i}=l|A_{k}=a)\) is the probability of the entity being in the cluster l when the arm k is in the configuration a.

While the robot moves its hands in the visual field, gathering statistics about the localization of entities and the occurrences of arm-torso configurations, MI grows for the physical entities that correspond to the hands of the robot. When the MI value reaches a specified threshold \(th_{r}\), the entity is identified as the robot category \(c_{r}\). In contrast, the human and object categories should have smaller MI due to their independence from the motors of the robot. The threshold for identifying the robot category was empirically chosen based on the MI distribution obtained on a small labelled set of robot and non-robot entities. Thus, a physical entity is identified as the robot category \(c_{r}\) if its MI is higher than \(th_{r}\); otherwise, it is considered as one of the non-robot categories that are identified as described in the next subsection.
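A minimal sketch of the MI computation from the accumulated co-occurrence statistics (Eqs. 11-13) is shown below; the array layout of the counts is our assumption.

import numpy as np

def mutual_information(counts):
    """MI between the quantized entity location L and an arm-torso
    configuration A (Eq. 11), from co-occurrence counts.

    counts[l, a] : number of frames in which the entity was observed in
    visual cluster l while the arm was in configuration a.
    """
    total = counts.sum()
    if total == 0:
        return 0.0
    joint = counts / total                      # P(L, A)
    p_l = joint.sum(axis=1)                     # P(L)
    p_a = joint.sum(axis=0)                     # P(A)
    nz = p_l > 0
    h_l = -np.sum(p_l[nz] * np.log(p_l[nz]))    # marginal entropy, Eq. 12
    h_l_given_a = 0.0                           # conditional entropy, Eq. 13
    for a in range(joint.shape[1]):
        if p_a[a] == 0:
            continue
        p_cond = joint[:, a] / p_a[a]
        nz = p_cond > 0
        h_l_given_a -= p_a[a] * np.sum(p_cond[nz] * np.log(p_cond[nz]))
    return h_l - h_l_given_a                    # MI = H(L) - Hc(L|A)

# an entity is labelled as a robot part (c_r) when its MI exceeds th_r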

When the appearance of the hand of the robot changes (for example when it wears gloves, as we do during our experiments presented in Sect. 5.4), the robot category \(c_{r}\) can be associated with several entities, each characterizing a different appearance of the hand (see Fig. 9).

Fig. 9

Examples of multi-view models of three entities characterizing different appearances of the hand of the robot (each model with its views is shown in one line)

4.1.2 Discrimination of manipulable objects and human parts

Among the non-robot physical entities, human parts and objects are discriminated based on their motion behaviour. Most objects, like the ones used in our experiments, are static most of the time and are displaced by the robot or its human partner. Among the categories analyzed in this work, only the robot and human categories can move alone (i.e. while not connected to other entities). Thus, we accumulate statistics on entity motion and use them to identify the object category as a mostly static entity that moves only when connected to other entities (see Algorithm 2). Note that this definition is linked to our scenario and is not universal: we would recognize autonomously moving objects or animals as human parts, while a human moving his left hand only when it is touched by his right hand would see the right hand categorized as a human part and the left hand as an object.

While detecting physical entities, we accumulate the statistics on their motion over time. Based on these statistics and the output from the self-identification algorithm, the following probabilities are estimated:

  • \(p_s = p(E_{i}|c_{E_{i}}\not =c_{r})\) the probability of seeing the entity \(E_i\) moving as a single entity while being identified as a non-robot category,

  • \(p_c = p(E_{i}|c_{E_{i}}\not =c_{r},c_{E_{i2}}=c_{r})\) the probability of seeing the entity \(E_i\) identified as a non-robot category and moving together with the connected entity \(E_{i2}\) identified as a robot category.

Analysing the motion statistics of single entities, the probability \(p_s\) should be lower for the object category, since object entities usually do not move alone, as discussed earlier. Analysing the motion statistics of connected entities, the probability \(p_c\) should be higher for the object category, since object entities often move together with other entities, for example when they are manipulated. Each non-robot entity is thereby categorized as:

  • the object category \(c_{o}\), if \(p_c > th_{o.c}\) and \(p_s < th_{o.s}\);

  • the human category \(c_{h}\), otherwise.

Algorithm 2
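A minimal sketch of this decision rule is given below, assuming the motion statistics are stored as simple counters; the threshold values shown are placeholders, not those used in our experiments.

```python
def categorize_non_robot_entity(n_alone, n_with_robot, n_seen, th_os=0.1, th_oc=0.3):
    """Discriminate objects from human parts based on motion statistics.
    n_alone: how often the entity moved as a single (unconnected) entity,
    n_with_robot: how often it moved while connected to a robot entity,
    n_seen: total number of observations of the entity."""
    p_s = n_alone / max(n_seen, 1)        # probability of moving alone
    p_c = n_with_robot / max(n_seen, 1)   # probability of moving with the robot
    if p_c > th_oc and p_s < th_os:
        return "c_o"                      # manipulable object
    return "c_h"                          # human part
```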

Following our approach, the parts of the body of the robot are identified first, so that before the robot starts interacting with objects it has already accumulated some statistics on entity motion. Once the robot starts interacting with objects, it accumulates statistics on the motion of entities connected to its hands. Applying our categorization algorithm to each detected entity, we identify each single entity as one of the following categories: \(c_{o}\), \(c_{h}\), or \(c_{r}\) (see Fig. 10). Connected entities are identified either as an object grasped by the robot, category \(c_{o+r}\), or an object grasped by a human, category \(c_{o+h}\), based on the categorization statistics gathered when the corresponding entities have been seen alone.

Fig. 10
figure 10

Entity categorization examples: a the human hand identified as \(c_h\); b the hand of the robot identified as \(c_r\) and the object identified as \(c_o\); c the object grasped by the robot identified as \(c_{o+r}\); d the object grasped by the human identified as \(c_{o+h}\)

4.2 Interactive object learning

Once the robot is able to detect and categorize physical entities in the visual space, it starts to interact with object entities (see Figs. 15 and 16). The actions executed by the robot are described in Sect. 5.1.3. While interacting with an entity, the system remembers the grasped entity as \(E_g\), and the model of this entity is updated during the action of the robot. This is a form of self-supervision, where the object is assumed to remain the same during its manipulation.

According to our algorithm, the system continuously detects entities in the visual space and categorizes them. While the robot interacts with an object, we are able to discriminate between the object entity and the robot entity, whether they move separately or together (e.g., when the object is grasped). The information about the identified categories of entities is used by our interactive learning algorithm, summarized in Fig. 11. If, during interaction with an object, the system detects connected entities categorized as an object grasped by the robot, we verify the categories of the connected views. For this purpose, we retrieve the set of entities \(\{E_{i}\}\) whose models contain the current view. For each entity, we retrieve its category \(\{c_{E_{i}}\}\) from the statistics stored in the memory, and based on these categories the view is identified as:

  • a robot view, if at least one corresponding entity is identified as the robot category (\(\exists i, c_{E_{i}}=c_{r}\));

  • a non-robot view, if none of the corresponding entities is identified as the robot category (\(\forall i, c_{E_{i}}\ne c_{r}\)).

If connected views are identified as a robot view and a non-robot view (see Fig. 12), the model of the grasped entity \(E_g\) is updated with the non-robot view.
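A sketch of this update step is given below; the data structures (a mapping from views to the entities whose models contain them, a mapping from entities to their categories, and an add_view method on the grasped entity) are hypothetical stand-ins for the visual memory described in the paper.

```python
def update_grasped_entity(connected_views, view_to_entities, entity_category, E_g):
    """If the pair of connected views contains a robot view and a non-robot
    view, add the non-robot view(s) to the model of the grasped entity E_g."""
    def is_robot_view(v):
        # a view is a robot view if at least one entity containing it is c_r
        return any(entity_category.get(e) == "c_r"
                   for e in view_to_entities.get(v, ()))

    robot_views = [v for v in connected_views if is_robot_view(v)]
    non_robot_views = [v for v in connected_views if not is_robot_view(v)]
    if robot_views and non_robot_views:
        for v in non_robot_views:
            E_g.add_view(v)   # hypothetical method updating the entity model
```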

Fig. 11
figure 11

Improving the object representation model during interaction. During the action of the robot, the manipulated entity \(E_g\) can be detected either as an entity connected to the hand of the robot and identified as the object + robot category \(c_{o+r}\), or as a single entity identified as the object category \(c_{o}\). In both cases, the manipulated entity \(E_g\) can be updated with the non-robot view \(v_j\) recognized in the current image (see text for details)

Fig. 12
figure 12

Examples of connected views and their mid-features (HSV pairs) during interactive object learning: the red mid-features correspond to one of the connected views (in this case, the hand of the robot), and the blue mid-features correspond to the other connected view (in this case, the object)

At the end of the object manipulation, the robot opens its hand and the grasped object falls down. In this case, if the object is detected as a single entity with an unknown view (corresponding to a perspective that has not yet been observed), a new view is stored in the memory. The model of this entity can then be updated with this new view based on tracking in the following images. The robot can thereby explore an object's appearance by grasping and dropping it, while updating the model of the manipulated entity with the observed views.

After the manipulations, the system checks the visual memory and cleans the dictionaries of entities and views. The entity dictionary is cleaned by suppressing noisy entities that have no proper views (i.e., entities whose views are all shared with other entities). The view dictionary is cleaned by suppressing views that have no associated entities; such views can be created during interaction with an entity but never added to its model. This cleaning makes the knowledge about physical entities more coherent and improves object recognition, as shown in the next section.
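The cleaning step can be sketched as follows, assuming the memory is represented by two cross-referencing dictionaries (entity id to the set of views in its model, and view id to the set of entities containing it); these structures are illustrative, not the actual implementation.

```python
def clean_memory(entity_views, view_entities):
    """Remove noisy entities (no view of their own) and orphan views.
    entity_views: dict entity_id -> set of view ids in the entity model,
    view_entities: dict view_id -> set of entity ids containing the view."""
    # an entity is noisy if every one of its views is shared with another entity
    noisy = [e for e, views in entity_views.items()
             if all(len(view_entities[v]) > 1 for v in views)]
    for e in noisy:
        for v in entity_views[e]:
            view_entities[v].discard(e)
        del entity_views[e]
    # a view is orphan if it is no longer associated with any entity
    for v in [v for v, ents in view_entities.items() if not ents]:
        del view_entities[v]
```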

5 Experimental evaluation

The proposed perceptual system is evaluated on the iCub (see Fig. 14b) and the Meka (see Fig. 14a) humanoid robots exploring their environment in interactive scenarios. More precisely, all quantitative data reported in this paper were acquired on the iCub robot, in its first version (Natale et al. 2013), with a mean frame rate of 10 Hz. In our experiments, the robot first learns about its close environment through observation, while a human partner demonstrates objects to it, and then explores its close environment and the surrounding objects through interaction. The first actions of the robot aim at identifying the parts of its own body; it then discriminates manipulable objects and parts of human partners. Once the robot is able to categorize the entities in its visual field, it starts learning object appearances through manipulation.

The whole set of objects used in our experiments is shown in Fig. 13. We chose both simple homogeneous objects (such as toys) and more complex textured objects (such as everyday products, including bottles and boxes).

Fig. 13
figure 13

The 20 objects used in our experiments. The objects are numbered from 1 to 20, from top left to bottom right, and this order is preserved in the reported experiments. These images are the real images acquired by the Kinect sensor and used by our system

5.1 Experimental setup

In our setup, the robot is placed in front of a table, and the visual input is taken from an external Kinect sensor mounted above the head of the robot, as shown in Fig. 14. When using an external visual sensor, interaction with entities requires localizing them not only in the image space but also with respect to the robot. Therefore, at the beginning of our experiments, the visual sensor is calibrated with respect to the robot. During the experiments, each detected entity is localized in the operational space of the robot and characterized by its orientation and size.

Fig. 14
figure 14

a The experimental setup for the Meka robot with the relative position of the sensor, the robot, and the table. b The experimental setup for the iCub robot. c The acquisition of the position of the pattern in the operational space of the robot, shown for the iCub robot

5.1.1 Visual sensor calibration

The calibration of the visual sensor relative to the base of the robot is performed with a calibration pattern, a chessboard, and the OpenCV library is used to compute the position of the chessboard relative to the sensor. The computation of the transformation matrix requires both the position and the orientation of the chessboard in the operational space of the robot. The orientation of the chessboard is known, since it is placed horizontally in front of the robot. To obtain its position, we place the hand of the robot above the origin of the chessboard (see Fig. 14) and acquire the position of the hand. The transformation matrix is then computed as:

$$\begin{aligned} T_{sensor\rightarrow robot} = T_{sensor\rightarrow chessbrd} \times T_{chessbrd\rightarrow robot}. \end{aligned}$$
(14)
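A sketch of this calibration using the OpenCV Python bindings is given below; the function calls are standard OpenCV, but the parameters (pattern size, square size, intrinsics) and the convention used for composing the transforms are assumptions of this sketch.

```python
import cv2
import numpy as np

def sensor_to_robot_transform(image, pattern_size, square_size,
                              camera_matrix, dist_coeffs, T_chessbrd_robot):
    """Compute T_sensor->robot (Eq. 14) from a chessboard detection.
    T_chessbrd_robot: assumed 4x4 pose of the chessboard in the robot frame,
    obtained by placing the robot hand on the pattern origin."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, pattern_size)
    if not found:
        return None
    # 3D corner coordinates in the chessboard frame
    objp = np.zeros((pattern_size[0] * pattern_size[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:pattern_size[0], 0:pattern_size[1]].T.reshape(-1, 2) * square_size
    _, rvec, tvec = cv2.solvePnP(objp, corners, camera_matrix, dist_coeffs)
    # chessboard pose in the sensor frame -> T_sensor->chessbrd
    T_sensor_chessbrd = np.eye(4)
    T_sensor_chessbrd[:3, :3], _ = cv2.Rodrigues(rvec)
    T_sensor_chessbrd[:3, 3] = tvec.ravel()
    # Eq. (14): T_sensor->robot = T_sensor->chessbrd x T_chessbrd->robot
    return T_sensor_chessbrd @ T_chessbrd_robot
```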

5.1.2 Entity localization

For each detected entity, its 3D position in the visual space is estimated with respect to the sensor by processing the RGB-D data as a point cloud and computing the average position of its 3D points. The orientation of the entity is estimated from the eigenvectors and eigenvalues of the covariance matrix of the points. The eigenvectors correspond to three orthogonal vectors oriented along the directions that maximize the variance of the points of the entity, and they are used as the reference frame of the entity. A quaternion is chosen to represent the orientation of the entity, since this representation is compact, fast, and stable (Gaël and Benoît 2010). The position and orientation of the entity are then expressed in the reference frame of the robot using the transformation obtained through the calibration and the Eigen3 library.
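As an illustration, the pose estimation of an entity can be sketched as follows with NumPy and SciPy; the transform convention and the right-handedness fix are assumptions of this sketch, not details taken from the implementation.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def entity_pose(points, T_sensor_to_robot):
    """points: (N, 3) array of the entity 3D points in the sensor frame;
    T_sensor_to_robot: 4x4 transform assumed to map sensor-frame coordinates
    into the robot frame. Returns position and quaternion in the robot frame."""
    centroid = points.mean(axis=0)
    # eigenvectors of the covariance matrix define the entity reference frame
    _, eigvecs = np.linalg.eigh(np.cov((points - centroid).T))
    if np.linalg.det(eigvecs) < 0:          # keep a right-handed frame
        eigvecs[:, 0] *= -1
    R, t = T_sensor_to_robot[:3, :3], T_sensor_to_robot[:3, 3]
    position = R @ centroid + t
    orientation = Rotation.from_matrix(R @ eigvecs).as_quat()  # (x, y, z, w)
    return position, orientation
```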

5.1.3 Actions

The interactive actions of the robot are aimed at achieving two main goals: categorization of entities (including self-identification and discrimination of manipulable objects) and learning object appearances. Both simple action primitives and more complex manipulations have been implemented and used in Ivaldi et al. (2012, 2014). In this paper, we use two complex manipulations aimed at observing an object from different viewing angles and at different scales:

  • TakeLiftFall manipulation (see Fig. 15) consists of reaching an object from above, taking it with a three-finger pinch grasp, lifting it, and releasing it. This action generates a random view of the object when it falls on the table;

  • TakeObserve manipulation (see Fig. 16) consists of reaching an object from above, taking it with a three-finger pinch grasp, turning the object while bringing it closer to the camera, and returning it to the table. This action allows the robot to observe several perspectives of the object from different viewing angles and at a closer scale.

Fig. 15
figure 15

TakeLiftFall manipulation: the object is a grasped, b lifted, and c released; d when the object falls on the table, it turns into a random perspective

Fig. 16
figure 16

TakeObserve manipulation: the object is a grasped, b lifted and brought closer to the camera, c turned around, and d returned to the table

These “complex” manipulations are encoded as sequences of simple “atomic” action primitives, such as reach or grasp. Depending on the current state of an object (i.e., its position on the table) and of the robot (i.e., the position of its hands and its joint values), actions can have different durations and finger movement speeds. In order to grasp an object, the robot moves its hand towards the top of the object, as estimated by the visual system, and executes a three-finger pinch grasp from the top. The grasp is pre-encoded and designed to be robust to different kinds of objects. As the fingers are tendon-driven, the grasp is naturally compliant, adapting to the shape of the object. Once the object is grasped, the robot continues the sequence of actions to execute the required manipulation.
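Conceptually, such a manipulation can be sketched as a list of primitives executed in order; the primitive names and the robot interface below are hypothetical, chosen only to illustrate the sequencing.

```python
# Hypothetical primitive names illustrating how complex manipulations
# are encoded as sequences of atomic action primitives.
TAKE_LIFT_FALL = ["reach_above", "pinch_grasp", "lift", "release"]
TAKE_OBSERVE = ["reach_above", "pinch_grasp", "lift",
                "approach_camera", "turn", "return_to_table", "release"]

def execute_manipulation(robot, primitives, object_position):
    """Run a complex manipulation as a sequence of atomic primitives,
    each parameterized by the current object position."""
    for primitive in primitives:
        robot.execute(primitive, target=object_position)  # hypothetical API
```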

5.1.4 Evaluation methodology

Since our work is aimed at interactive learning about the close environment of the robot, it is difficult to evaluate the learning performance using existing image databases. Moreover, as learning is incremental and iterative, it is difficult to precisely evaluate the performance at a given time during real-time operation. The performance is therefore evaluated at several stages of developmental learning, and the evaluation is based on pre-recorded sequences of images labelled with a reference ground truth. The evaluation procedure includes the estimation of the following characteristics:

  • detection rate is obtained based on manually labelled images,

  • categorization rate: self-identification is evaluated using the forward kinematics model as a reference, while discrimination of objects and human parts is evaluated based on images manually labelled with the correct categories,

  • recognition rate is obtained using a separate evaluation image database.

In order to evaluate object recognition, we built a database with 50 images of each object used in the experiments, each object being shown from different perspectives. During evaluation, the perceptual system assigns the images of objects from the database to physical entities, and we then compute the number of entities and views assigned to each real object. The object recognition rate is estimated based on the following entities chosen for each object:

  • a major entity as the most frequently associated entity among all entities associated with this particular object,

  • pure entities as the entities associated with this particular object, but never with other objects.

Examples of major and pure entities are illustrated in the association matrix in Fig. 17. The object recognition rate is computed as the percentage of the object instances associated with its major / pure entities, with respect to the total number of images containing the object.

Fig. 17
figure 17

The association matrix obtained for the 20 objects (shown in rows) and the corresponding physical entities (shown in columns); the color range (from white 0 % to black 100 %) represents the percentage of object instances associated with each entity; the columns are sorted by the order of entity creation, which nearly follows the order of learned objects. Among the entities associated with each object, we distinguish one major entity that was the most frequently associated (for example, the entity 19 for the object \(o_4\), shown in red solid line) and pure entities that were associated with one object, but never with other objects (for example the entities 19, 20, and 21 for the object \(o_4\), shown in green dashed line) (Color figure online)
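For completeness, the evaluation metrics can be computed from such an association matrix as in the sketch below (rows: objects, columns: entities, values: counts of evaluation images assigned to each entity; the layout is an assumption mirroring Fig. 17).

```python
import numpy as np

def recognition_rates(association):
    """Per-object recognition rates based on major and pure entities.
    association[o, e]: number of evaluation images of object o assigned
    to entity e."""
    totals = association.sum(axis=1)
    # major entity: the single entity most frequently associated with the object
    major_rate = association.max(axis=1) / totals
    # pure entities: entities associated with this object and with no other
    used = association > 0
    pure_columns = used.sum(axis=0, keepdims=True) == 1
    pure_rate = (association * (used & pure_columns)).sum(axis=1) / totals
    return major_rate, pure_rate
```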

For all the thresholds used in our algorithms (Sects. 3.4, 4.1, and 4.2), we ran a first experiment with 10 objects and the initial appearance of the robot and experimentally varied the thresholds to optimize the recognition rates and the categorization performance. We then kept these thresholds for all reported experiments.

5.2 Evaluation of detection and tracking

In this experiment, the robot learns about its close environment through observation, while a human partner demonstrates the 20 objects (see Fig. 13) one by one. Each object is manipulated for about 1 min (corresponding to about 600 images), allowing different perspectives of the object to be observed. In total, the experiment lasts about 20 min and contains about 12,000 images.

The object detection rate is estimated as the percentage of images in which the object is segmented, with respect to the total number of images containing the object. On average, our system achieves an object detection rate of 98 % when segmenting entities based on depth contours. We have also compared the detection rate with and without using the depth data. Using motion only (without the depth data), we obtained a detection rate of 86 %, showing that our system could also work with embedded cameras, at a loss of performance.

The tracking rate is estimated as the percentage of tracked instances of the object with respect to the total number of occurrences of the object in consecutive images. On average, our system achieves a tracking rate of 77 %, which does not depend on the use of the depth information. Note that tracking failures mainly happen with a few objects (\(O_1\), \(O_3\), \(O_{10}\), and \(O_{15}\)) that have only few features.

5.3 Evaluation of learning through observation

Using the same experiment as in the previous section (i.e., after observing each object for 1 min), the object recognition rates computed on the separate evaluation database are reported in Table 1. The average recognition rate based on pure entities (i.e., the set of all entities associated with only one object) is 85.7 %. The average recognition rate based on major entities (i.e., the entity most frequently associated with the object) is 55.8 %. The obtained recognition rates differ between objects. Intuitively, objects with distinct appearances have been recognized better than objects that are similar to each other. From the association matrix (see Fig. 17), the maximal confusion occurred between the objects \(O_{11}\) and \(O_6\), which have similar colors and similar Lego parts. However, the two identical objects \(O_1\) and \(O_3\), which differ only by color, have been distinguished rather well.

Table 1 Performance of object learning: each value is presented as a pair comparing the result of learning through interaction (2nd stage) with the result of learning through observation (1st stage)

The objects of our dataset that show lower tracking rates (\(O_1\), \(O_3\), \(O_{10}\), and \(O_{15}\)) also show smaller recognition rates based on major entities (see Table 1, column 2) compared to other objects. This is caused by the fact that a tracking failure often leads to the creation of a new entity and prevents several views from being associated with a single entity.

From Table 1 and Fig. 17, most objects have been associated with several entities, with an average of 4.1 entities per object. This is a common limitation of unsupervised learning approaches, where the robot decides by itself whether it observes a new object or a known one. We will see that interactive learning makes it possible to reduce this fragmentation of objects into several entities.

We also evaluate our system for simultaneous processing of multiple objects in a single image. The system has been tested with up to 10 objects demonstrated at the same time (see Fig. 18), and all objects have been detected and recognized.

Fig. 18
figure 18

Simultaneous processing of several objects: a 10 objects detected and recognized in the visual space of the robot, b the resulting segmentation of the objects

During our experiments on object learning, the average processing time was 0.13 s for images with one object. The time required to process one object varies significantly between objects, depending on their complexity and the number of extracted features. Among all processing stages, the highest computational cost belongs to the recognition and learning of views, and in particular to searching features in the dictionaries. Moreover, this cost increases with the growth of the dictionaries, which was observed to be approximately linear in our experiments. The other processing stages (object detection, segmentation, feature extraction, tracking, and categorization) take altogether about 0.06 s per image, and this cost stays relatively stable over time.

5.4 Evaluation of entity categorization

The categorization performance is evaluated in the interactive scenario where both the robot and the human partner perform actions aimed at exploration of the objects close to the robot.

5.4.1 Evaluation of self-identification

In this experiment, the iCub robot performs free hand motions and the interactive actions described in Sect. 5.1.3, while the human partner also moves his hands in the visual space. In total, the experiment lasts about 12 min and contains 7200 images. The identification of the body parts of the robot was evaluated using the forward kinematics model as a reference. Our approach was evaluated with the normal appearance of the robot hand and also with its appearance changed by wearing coloured gloves (see Fig. 9). The categorization procedure was able to identify the hand appearances after a duration varying between 5 and 12 s of motion in the visual field (\(c_u \rightarrow c_r\) in Fig. 19), corresponding to the processing of between 50 and 120 images (see Fig. 19). These variations depend on the particular motions performed by the robot: motions of the hand across the whole visual field are more informative than motions producing little visible variation, and therefore lead to a faster increase of mutual information and a faster hand identification. Once the hand of the robot was first identified, the system showed an average self-recognition rate of 98.2 % for the initial appearance of the hand. The self-recognition rates for the other appearances were 98.1 % for the blue glove and 98.0 % for the pink glove. Similar results, confirming the independence of our approach from the hand appearance, were obtained with the Meka robot wearing coloured gloves.

Fig. 19
figure 19

Categorization of entities performed while both the human partner and the robot (with three different appearances of the hand) perform free hand motions and the human partner also interacts with the first five objects: the graph shows the normalized MI value for each entity; each entity appears in the timeline as an unknown category \(c_{u}\), and once it is categorized, its category is marked in the timeline (in this case, the category \(c_{r}\)). The curves corresponding to the five objects do not appear in the graph, as their probability remains close to 0 and they are hidden by the curve corresponding to the human

5.4.2 Evaluation of categorization of objects and human parts

Once the robot identifies its hands among the physical entities detected in the visual field, it continues the interactive exploration of the other entities. While both the robot and its human partner perform interactive actions with the objects, the perceptual system continuously analyses the behaviour of the entities and categorizes them. In total, this experiment lasts about 60 min and contains about 36,000 images, where the human manipulates each of the 20 objects (about 20 min in total) and the robot manipulates each of the 20 objects (about 40 min in total). The ability to discriminate objects and human parts is evaluated a posteriori based on images labelled with the correct entity categories. During the experiment, each object has been successfully identified in the object category within 5–10 s of motion during interaction (corresponding to 50–100 images), leading to a total correct categorization rate of 84 %. Human parts have been categorized correctly in 89 % of all images. Figure 20 shows the evolution of the probability of each non-robot entity being an object; it also shows the probability of being a human, given that the two probabilities sum to 1.

Fig. 20
figure 20

Categorization of entities performed while the robot interacts with the first five objects: the graph shows the probability of being in the object category, based on \(p_c\) and \(p_s\), for each entity. Each entity appears in the timeline as an unknown category \(c_{u}\), and once categorized as an object, it is marked \(c_{o}\). The entities with a probability below the threshold fall into the human category

5.5 Evaluation of interactive object learning

Once the robot is able to categorize the physical entities detected in the visual field, it focuses on interactive object exploration. The robot manipulates each object following the TakeLiftFall or TakeObserve schemes described in Sect. 5.1.3. Each manipulation lasts about one and a half minutes (corresponding to about 900 images). In total, the experiment lasts about 30 min for each type of manipulation and contains about 18,000 images. The performance of interactive learning is evaluated using the database described in Sect. 5.1.4. The evaluation results are reported in Table 1, where each value is paired with the corresponding result obtained during learning through observation, presented in Sect. 5.3.

For most objects, interactive learning improves the recognition rate based on the major entity with respect to the results of learning through observation (see Fig. 21). The recognition rate based on pure entities remains nearly stable in comparison with learning through observation. These results can be explained by the design of the learning algorithm, which updates the best model of a grasped entity during its manipulation. Thus, the interactive learning procedure mostly improves the major entity, while leaving the other pure entities without significant changes.

Fig. 21
figure 21

Improvement of the object recognition rate: the recognition rate (based on major entities) obtained during learning through observation is shown in blue, its improvement during interactive learning is shown in orange, and the final recognition rate (based on pure entities) is shown in yellow (Color figure online)

Interactive learning yields enhanced object models with an increased number of views. It is especially useful for objects whose appearances vary significantly between perspectives. While manipulating an object, the perceptual system integrates the recognized views into the representation model of the entity, thus enhancing the model and making it more complete. Moreover, the system creates new views when it observes previously unknown perspectives of the object. In our experiments, interactive learning enhanced the entity models of the objects \(O_1\), \(O_2\), \(O_3\), \(O_8\), \(O_9\), and \(O_{11}\). Examples of model improvements (in particular, the views added to these models) are shown in Fig. 22.

Fig. 22
figure 22

The representation models of the major entities of the objects \(O_1\), \(O_2\), and \(O_3\) (each model and its views are shown on one line), where the views added during interactive learning are shown after the \(+\) sign

As discussed in Sect. 5.3, learning through observation results in the association of some objects with several physical entities. Interactive learning, however, allows the knowledge about an object to be consolidated within its major entity and decreases the number of entities associated with the object. The total number of entities and views decreases mostly due to the dictionary cleaning performed after manipulation and described in Sect. 4.2. Cleaning the dictionaries makes the knowledge more coherent by removing noisy entities, thus improving the object recognition rate based on major entities, as fewer views are associated with the noisy entities.

6 Discussion

We have evaluated our system with a set of objects varying in color and texture, showing its ability to integrate both types of information for recognition, and its capability to recognize and learn objects even when they are manipulated. However, the choice of the bag-of-words approach for object representation and of hand-crafted features could probably be improved, for example by using even more geometric information than we have used in feature pairs. Another interesting approach would be to learn the visual features themselves, which has proved efficient in a number of applications (LeCun et al. 2004). Regarding the kind of objects our system can learn, our multi-view model should be well adapted to objects that change shape, such as articulated objects. The different appearances corresponding to the changes of an articulated object would be integrated as additional views, as long as object tracking remains possible during the object modification.

From a computational point of view, scaling our approach to a larger set of objects will face the issue of feature dictionary growth (Sect. 5.3), which increases the view learning and recognition time. In our system, the mean computation time for 20 objects is 0.07 s for view learning and recognition and 0.06 s for all the other processing steps, which are independent of the number of objects. Assuming a linear growth of the dictionaries, our system could recognize 40 objects with a mean computation time of 0.2 s. In order to learn a much larger set of objects, the dictionary growth should be limited by introducing additional filtering of the dictionaries, keeping only the most frequently repeated features. Another approach could be to learn a fixed dictionary of visual features in a first phase, before learning the objects. Such an approach would not be incremental like ours, but would make it possible to use much more efficient data structures, as used in image retrieval (e.g., Jégou et al. 2010), that would scale to a much larger number of objects.
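As a quick check of this estimate (assuming only the view learning and recognition time grows linearly with the number of learned objects, while the other stages remain constant):

$$\begin{aligned} t(40\ \text {objects}) \approx 2 \times 0.07\,\text {s} + 0.06\,\text {s} = 0.20\,\text {s}. \end{aligned}$$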

The object representation and learning approach presented in this paper takes advantage of social interactions, as these interactions produce the object motions that are important in our system, but it does not explicitly engage in such interactions. In a related work, however, our system has been integrated within a curiosity-based active object exploration architecture (Ivaldi et al. 2013; Nguyen et al. 2013) that took advantage of the social environment by asking the human partner to manipulate a particular object. This was possible because our approach provides an assessment of the quality of an object model through its number of views and its recognition probability. This quality measure has been used to guide the choice of an object, an action, and an actor (i.e., the robot itself or the human partner) to explore, based on the achieved learning progress.

This work made a number of engineering choices whose consequences can be questioned. Among these, the choice of a fixed external RGB-D sensor made it possible to simplify the implementation and to improve the quality of the data, and therefore the system performance. In particular, it avoids the complex problem of learning gaze control, which involves eye and neck joints that have not been considered in this work (Law et al. 2014). However, this removes the possibility for the system to control its gaze direction. If our system were implemented with a gaze-controlled camera on the robot head, the image processing stream should not be strongly affected (besides the loss of performance illustrated in Sect. 5.2), as long as object tracking remains possible. The entity classification, however, would require improvements, as it currently relies on a static camera to analyse entity motions. A new component computing entity motion in the robot body frame would therefore be required. As an alternative for entity categorization, we could extend our algorithm by including the head pose (the states of the neck joints) and the gaze direction in the arm-torso dictionary. This modification would allow the relation between the entity localization over time and the camera pose to be considered, thus allowing camera motion. The calibration of the sensor, currently performed by an initial calibration procedure, could also be performed in a more natural way, following for example approaches that learn visuo-motor coordination (e.g., Chinellato et al. 2011; Chao et al. 2014).

Concerning gaze control on the actual robot as a social cue, our engineering solution does make it possible for the robot to look at an object or at humans during social interaction (thanks to the position of the object in the robot reference frame given by the RGB-D camera). However, the point of view of the external camera is not the point of view of the robot eyes that humans assume, which can cause problems in human-robot interaction scenarios. Indeed, the side of the object that the human assumes the robot sees may differ from the one actually observed by the overhead camera.

Several parts of the proposed approach could also be extended with more general learning approaches in place of the current hand-designed algorithms. For example, an interesting future work would be to replace the entity categorization algorithm proposed in Fig. 8 by a more adaptive approach. A first step would be to learn the thresholds used in this procedure from data, but a more generic approach, learning the entity behaviours and performing unsupervised categorization of these behaviours to define the entity categories, would be more appealing.

7 Conclusion and future work

We have developed a perceptual approach that enables a humanoid robot to explore its close environment in an interactive scenario, following the context of developmental learning. Without using image databases, pre-specified objects, a known robot appearance, or direct supervision, but rather taking inspiration from infant development, the robot first learns by observing its surroundings and then through its own interactive actions, thanks to the identification of its own body.

This was achieved thanks to the integration of a generic physical entity appearance representation, a self- and others-identification capability, and actions for active exploration of the objects. The main lessons learned from this system are that:

  • it is possible to build efficient models of all physical entities in front of a robot with a unified appearance model that can represent both textured objects, such as the robot hands or soda cans, and textureless objects, such as toys or human hands,

  • it is possible to categorize objects, human parts, and parts of the robot without prior knowledge of their appearances, using only their motion behaviour and its correlation with the robot's proprioceptive sensing,

  • the knowledge of these three categories is sufficient to update object models during manipulation, even when the object is in the robot hand, without the need for a precise body schema or initial knowledge of the robot appearance.

An interesting extension of this work would be to better integrate the experience gathered by the robot through interaction with its environment into the processing pipeline itself. In infants, the development of the capability to manipulate objects influences their perception and especially their attention (Needham et al. 2002). It would be advantageous to implement a similar feature: once the robot has explored an object manually at a close scale, it has acquired more knowledge about the importance of its visual features for interaction or correct recognition. This experience could provide feedback to the perceptual system, for example by changing the attention model or the notion of saliency, so as to detect these objects at a greater distance.

Our developmental approach could be further extended by learning action primitives instead of using hand-designed actions. While we focus on perception in this work, infants develop their recognition and action capabilities simultaneously. It would be interesting to work towards a more complete developmental approach for robots by learning the appropriate actions to manipulate objects (following for example Law et al. 2014) at the same time as learning to recognize these objects, or by learning the affordances that make it possible to decide which actions apply to a given object. Learning these actions should be coupled with learning a more complete body schema than the partial body image learned in our current approach. Learning the full body schema would make it possible to extend self-recognition to more complex parts of the body of the robot and to perform more efficient manipulation actions.

Finally, it would also be interesting to extend our approach by integrating audio information into our system. Seeking multimodal learning and taking inspiration from infant-directed interaction, where an adult names an object while showing it to the infant, the robot could learn about objects not only from visual data but also from audio information. This can be viewed as a step towards the development of a common language between the robot and its human partner, where the robot is able to learn objects associated with any names its user would like to use, which could also help improve object recognition in more complex interactive scenarios.