Psychologically Inspired Sensory-Motor Development in Early Robot Learning

We present an implementation of a model of very early sensory-motor development, guided by results from developmental psychology. Behavioural acquisition and growth are demonstrated through constraint-lifting mechanisms initiated by global state variables. The results show how staged competence can be shaped by qualitative behaviour changes produced by anatomical, computational and maturational constraints.


Introduction: developmental learning
In the last five years developmental robotics has emerged as a vibrant new research area. Previously, many research projects have explored the issues involved in creating truly autonomous embodied learning agents, but only recently has the idea of a developmental approach been investigated as a serious strategy for robot learning. For a review of developmental robotics see (Lungarella et al., 2003), and for recent results see new conference series such as (Epigenetics, 2004).
In this paper we describe an approach to sensory-motor learning and coordination that draws from psychology rather than neuroscience. There have been many models of sensory-motor coordination (Lungarella et al., 2003), but most of these have been based on specific, usually connectionist, architectures and tend to focus on a single behavioral task. We are interested in exploring mechanisms that can support not only the growth of behavior but also the transitions that are observed as behavior moves through distinct stages of competence.
Developmental psychology concerns the study of behavior and changes in behavior over time, and attempts to infer internal mechanisms of adaptation that could account for the external manifestations. We are interested in very early development, in particular the control of the limbs during the first three months of life. The newborn human infant faces a formidable learning task, and yet advances from undirected, uncoordinated, apparently random behavior to eventual skilled control of motor and sensory systems that support goal-directed action and increasing levels of competence. This is the kind of scenario that will face future robots, and we need to understand how some of the infant's learning behavior might be reproduced.
Our main inspiration comes from Jean Piaget's emphasis on the importance of sensory-motor interaction, staged competence learning and the sequential lifting of constraints (or scaffolding) (Piaget, 1973). Others, such as Jerome Bruner, have reinforced this by suggesting mechanisms that could explain the plasticity seen in infant studies (Bruner, 1990). We agree that sensory-motor coordination is likely to be a significant general principle of cognition (Pfeifer and Scheier, 1997), and we are investigating mechanisms for development in terms of stages (periods of similar behavior) and transitions (phases where new behavior patterns emerge).
We describe an experimental framework that acts as a substrate for building models that can shed light on the key requirements. Our objective is the implementation of a flexible learning framework for an embodied hand/eye system which exhibits a prolonged epigenetic developmental process. The aim is to approach some of the skills achieved by the newborn human infant. This includes discovering the structure of the various local representations of space (visual, tactile and motor), learning how to integrate these, and learning how to master their coordination for the control of action.

An Experimental System for Development
We now set the context by describing the features and organization of our laboratory system. Our robot consists of two manipulator arms and a visual sensor that acts as an "eye". These are configured in a manner similar to the spatial arrangement of an infant's arms and head: the arms are mounted, spaced apart, on a vertical backplane and operate in the horizontal plane, working a few centimeters above a work surface, while the "eye", which is a color imaging camera, is mounted above and looks down on the work area. Figure 1 shows the general configuration of the system. The effector part of the system comprises two industrial-quality Adept robot arms, each with six degrees of freedom. In the present experiments only two joints are used, the others being held fixed, so that each arm operates as a two-link mechanism consisting of "forearm" and "upper-arm" and sweeps horizontally across the work area. The plan view of this arrangement is shown diagrammatically in figure 2.
The camera is mounted on a computer-controlled pan-and-tilt head. This allows fast scanning of the work space (saccades), and vision processing software is used to detect shape and color patches from the pixels within a central image region.
The arm end-points can each carry a "hand", i.e. an electrically driven two-finger gripper fitted with tactile sensing contact pads on all surfaces. However, for the present experiments we fitted one arm with a simple probe consisting of a 10mm rod containing a small proximity sensor. This sensor faces downwards so that, as the arm sweeps across the work surface, any objects passed underneath will be detected. Normally, small objects will not be disturbed, but if an object is taller than the arm/table gap then it may be swept out of the environment during arm action.
This experimental setup provides a set of rich visual, tactile and motor spaces, which are crucial for our experimental program.

The Motor Coordination Problem
Even before any cross-modal spatial integration can begin, it is necessary first to discover the structure of the local spaces within each modality. By virtue of its given physical structure and constraints, each modality will have its own coding of space. Thus, when the eye refers to a spatial location, that data will only have meaning in terms of the actions required to move or direct the eye to that position. Similarly for a hand: locations in end-effector space are encodings of signals that correspond to the hand being at a certain location.
During the first months of life the neonate may seem to show no purpose or pattern in motor acts, but the infant actually displays very considerable learning skills: from spontaneous, apparently random movements of the limbs, the infant gradually gains control of the parameters and coordinates sensory and motor signals to produce purposive acts in egocentric space (Gallahue, 1982). Various stages in behavior can be discerned, and during these stages the local egocentric limb space becomes assimilated into the infant's awareness and forms a substrate for future cross-modal skilled behaviors. This essential correlation between proprioceptive space and motor space seems to be a foundation stone for development, and occurs at many levels (Pfeifer and Scheier, 1997). Sensory-motor growth in the limbs appears to precede visual development (it may begin in the womb), and even though it continues concurrently with visual development in the first few months, the eye is too functionally restricted (tunnel vision) to correlate with other modalities (Westermann and Mareschal, 2004). For this reason, the experiments reported here do not involve the eye system. Also, there is no experimental advantage in driving two arms, and so, for simplicity, we use only one arm.

Motor Coordination in a Single Modality
A two-section limb requires a motor system that can drive each section independently. A muscle pair, i.e. extensors and flexors, could actuate each degree of freedom, but this can be abstracted into a single motor parameter that defines the overall applied drive strength. As we are operating in two dimensions, two motor parameters are required, one for each limb section: M1 and M2, which are real valued in the range +1 to −1 (zero represents no actuation). We recognize Bernstein's valuable observation that motor control is an ill-posed problem because there can be no simple one-to-one relation between motor cortex neurons and individual muscle fibers (Bernstein, 1967). This is because the external forces generated by dynamics and gravity require continual compensation. However, if we operate the arms at a slow rate we do not need to take account of these effects, and we can use our motor abstraction to capture an overall representation of output motor activity.
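As a minimal sketch of this abstraction (the function name and the clamping step are our own illustration, not from the paper), an antagonistic extensor/flexor pair can be collapsed into one signed drive value:

```python
def drive_signal(extensor: float, flexor: float) -> float:
    """Collapse an antagonistic muscle pair into a single motor parameter.

    extensor and flexor are activation levels in [0, 1]; the result is a
    drive strength M in [-1, +1], where 0 represents no net actuation.
    The sign convention (which direction is positive) is arbitrary.
    """
    m = extensor - flexor  # net drive from the opposing activations
    return max(-1.0, min(1.0, m))
```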
The sensing possibilities for a limb include internal proprioception sensors and exterior tactile or contact sensors. The actual biological mechanisms of proprioceptive feedback are not entirely known, but a simple and very "natural" method would be to sense the angles of the individual joints. Thus, if we assume proprioceptive neurons generate joint-related signals, then these can be represented by S1 = f(θ1) and S2 = f(θ2), where θ1 is the angle between the upper-arm and the body baseline, θ2 is the angle between the upper-arm and the axis of the forearm (see figure 2), and f is a near-linear or at least monotonic function. We refer to this encoding as a joint angle coordinate scheme.
However, there are other, more complex, possibilities. If the location of the limb end-point can be sensed then the end-effector can be positioned at a desired spatial location; this would be very useful for many actions. In this case the feedback signals could be S1 = √(l1² + l2² + 2 l1 l2 cos θ2), the effective length of the arm axis from shoulder to hand, and S2 = θ1 + arctan[(l2 sin θ2)/(l1 + l2 cos θ2)], the angle that axis makes at the shoulder, where l1 and l2 are the lengths of the upper-arm and forearm respectively. We refer to this coordinate frame as a shoulder encoding.
Another, even more attractive, scheme would be to relate the arm end-points to the body center line. This body-centered encoding would be appropriate for a "mouth-centered" space, in accordance with early egocentric spatial behavior. To obtain this encoding we shift the shoulder vector given above (S1 and S2) by the distance B, the separation between the shoulder and the body center: writing the shoulder vector in Cartesian form as x = S1 cos S2 and y = S1 sin S2, the body-centered signals are the length and angle of the shifted vector, S1′ = √((x + B)² + y²) and S2′ = arctan[y/(x + B)]. One other notable spatial encoding is a Cartesian frame where the orthogonal coordinates are lateral distance (left and right) and distance from the body (near and far). The signals for this case are simply the location values of the end-points in a rectangular space, thus: S1 = x and S2 = y. This encoding, referred to as the Cartesian encoding, seems the most unlikely for a biological system; however, we include it because of its importance in human spatial reasoning (Newcombe and Huttenlocher, 2000).
Before vision comes into play, it is difficult to see how such useful but complex feedback as given by the three latter encodings could be generated and calibrated for local space. The dependency on trigonometrical relations and limb lengths, at a time when the limbs are growing significantly, makes it unlikely that these codings could be phylogenetically evolved. Only the joint angle scheme could be effective immediately, but the others may develop through growth processes. Recent research on the hind limbs of adult cats (Bosco et al., 2000) has discovered that both joint angle and shoulder encodings can coexist, with some neuronal groups giving joint angle outputs while other neurons give foot/hand position encodings independently of limb geometry. We investigated all four systems as candidate encodings for proprioception signals.
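The four candidate encodings can be compared directly in code. The following sketch computes each (S1, S2) pair from the joint angles; the link lengths, the shoulder offset B, and the sign conventions are illustrative assumptions rather than values from our system:

```python
import math

# Illustrative parameters (metres): link lengths and shoulder-to-midline offset.
L1, L2, B = 0.30, 0.25, 0.15

def encodings(theta1, theta2):
    """Return the four candidate (S1, S2) pairs for given joint angles.

    theta1: angle between the upper-arm and the body baseline.
    theta2: angle between the upper-arm and the forearm axis.
    """
    # Joint angle encoding: proprioceptors report the angles directly.
    joint = (theta1, theta2)
    # Hand position relative to the shoulder (intermediate Cartesian form).
    x = L1 * math.cos(theta1) + L2 * math.cos(theta1 + theta2)
    y = L1 * math.sin(theta1) + L2 * math.sin(theta1 + theta2)
    # Shoulder encoding: effective arm length and the angle at the shoulder.
    shoulder = (math.hypot(x, y), math.atan2(y, x))
    # Body-centered encoding: the same vector shifted by B to the midline.
    body = (math.hypot(x + B, y), math.atan2(y, x + B))
    # Cartesian encoding: rectangular coordinates of the end-point.
    cartesian = (x, y)
    return joint, shoulder, body, cartesian
```

With the arm straight (θ1 = θ2 = 0) the shoulder encoding gives S1 = l1 + l2, as the law-of-cosines form in the text requires.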

Mappings as a Computational Substrate for Sensory-Motor Learning
We have developed a computational framework for investigating this problem based on a two-dimensional mapping scheme. Our mappings consist of two-dimensional sheets of elements, each element being represented by a patch of receptive area known as a field. The fields are circular, regularly spaced, and overlapping. Every field in a map has a set of associated variables that can record state and other properties during operation: F{s, e, f, m}. These variables are described as follows:
1. Stimulus value F(s): the value experienced, e.g. a color or shape value for an eye map or a contact value for a proprioceptive map.
2. Excitation level F(e) ∈ [0, 1]: the current degree of stimulation of a field, as a result of excitation or inhibition.
3. Frequency level F(f): records how often the field has been accessed or visited.
4. Motor values F(m): records the motor parameters that were in force when this field was stimulated.
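A field and its variables can be sketched as a small record type. The attribute names, and the center/radius attributes describing the circular receptive area, are our own additions; the paper specifies only F{s, e, f, m}:

```python
from dataclasses import dataclass

@dataclass
class Field:
    """One circular receptive field of a map, carrying F{s, e, f, m}."""
    x: float                  # field center on the map surface
    y: float
    radius: float             # receptive area; neighboring fields overlap
    s: float = 0.0            # stimulus value last experienced
    e: float = 0.0            # excitation level in [0, 1]
    f: int = 0                # frequency: how often the field was visited
    m: tuple = (0.0, 0.0)     # motor values (M1, M2) in force when stimulated
```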
The stimulus values held in a map's fields are effectively a form of short-term memory. If a stimulus value changes, i.e. is different from that expected by the map, then the field is excited. For the first stimulation the excitation level is set to 1.0, but repeated stimulations are reduced by a habituation function (Stanley, 1976) that recovers when stimulation ceases (Meng and Lee, 2005). A very slow decay function also causes all excitation levels to fall over time. By this means, the fields with the highest excitation levels are those that have most recently experienced unexpected change. The immediate neighbors of stimulated fields also receive a proportion of the excitation levels.
The above variables are local to individual fields, but some global variables are obtained by simple summation of various field properties over the map. Global excitation, Ge ∈ [0, 1], is a measure of total excitation: the normalized sum of all field excitations above a nominal lower threshold. Global familiarity, Gf ∈ [0, 1], is a normalized sum of field access frequencies. Notice that initially all F(f) = 0, so Gf rises from 0 asymptotically towards 1.0 as all fields come to have been visited many times. Such global indicators can be used to signal when changes have effectively ceased and the map has become saturated.
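The excitation update and the global indicators might be sketched as follows. The habituation factor, the lower threshold, and the saturation count are assumed values, and a saturating visit count stands in for the asymptotic rise of Gf described above (fields are plain dictionaries here for brevity):

```python
def stimulate(fld, stimulus, tau=0.5):
    """Excite a field on unexpected change; habituate on repeats."""
    if stimulus != fld["s"]:   # value differs from that expected by the map
        fld["s"] = stimulus
        fld["e"] = 1.0         # first stimulation: full excitation
    else:
        fld["e"] *= tau        # repeated stimulation is reduced
    fld["f"] += 1              # record the visit

def global_indicators(fields, threshold=0.1, sat=50):
    """Return (Ge, Gf), both normalized into [0, 1]."""
    ge = sum(f["e"] for f in fields if f["e"] > threshold) / len(fields)
    gf = sum(min(f["f"], sat) for f in fields) / (sat * len(fields))
    return min(ge, 1.0), gf
```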
We assume that basic uniform map structures are produced by prior growth processes, but they are not pre-wired for any spatial system. Our arm system has to learn the correlations between its sensory and motor signals, and the mapping structure is the mechanism that supports this. We use two variables, X and Y, to reference locations on any given map; these simply define a point on the two-dimensional surface and have no intrinsic relation to any external space.

System organization
The software implemented for the learning system is based on a set of five modules which operate consecutively. The modules are as follows.
Action Selection This module determines which motor actions should be executed, i.e. it sets the values for M1 and M2. When a target field is nominated by the Attention Selection module, both the target field and the field corresponding to the current arm state are addressed, and motor values are extracted and passed to the Motor Driver. If no potential targets are specified by Attention Selection then random values are selected.
Motor Driver This module executes an action based on the supplied motor values. For non-zero values of M the arm segments start moving at constant speed and continue until either they reach their maximum extent or a sensory interrupt is raised. The ratio between the values of M1 and M2 determines the trajectory that the arm will take during an action. Note that a small but varying degree of noise is added to the output values.
Stimulus Processing Upon interrupt, or at the completion of an action, this module examines the position of the arm and returns values for proprioception, i.e. S1 and S2. A contact value, S(c), is also returned.

Map Processing Using S1 and S2 as values for X and Y, this module accesses the map and identifies the set of all fields that cover the point addressed by S1 and S2. A field selector process is then used to choose a single key field, F, from the set (we currently use a nearest-neighbor algorithm). Any stimulus value is then entered into the field, F(s) = S(c), and the excitation level is computed. The field frequency level is then incremented.
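The field lookup in Map Processing can be sketched as below; the dictionary keys and the field layout are illustrative assumptions:

```python
import math

def key_field(fields, s1, s2):
    """Select the key field for the point (S1, S2): of all fields whose
    circular receptive area covers the point, take the nearest-center one
    (a nearest-neighbor selector, as in the text)."""
    covering = [f for f in fields
                if math.hypot(f["x"] - s1, f["y"] - s2) <= f["r"]]
    if not covering:
        return None  # the point falls outside every receptive field
    return min(covering, key=lambda f: math.hypot(f["x"] - s1, f["y"] - s2))
```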
Attention Selection This module directs the focus of attention based on the levels of stimulation received from different sources. All fields are scanned, and the field with the highest level of excitation becomes a candidate target for the next focus of attention. In this way, motor acts are directed towards the most stimulating experiences, in an attempt to learn more about them.
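The interplay of Attention Selection and Action Selection reduces to a simple rule. This sketch (the excitation threshold and the random fallback range are assumptions) repeats the motor values stored in the most excited field, or babbles randomly when nothing is sufficiently exciting:

```python
import random

def select_action(fields, threshold=0.1):
    """Return (M1, M2): the motor values of the most excited field, or a
    random exploratory action when no field is sufficiently exciting."""
    target = max(fields, key=lambda f: f["e"], default=None)
    if target is not None and target["e"] > threshold:
        return target["m"]  # move back towards the stimulating experience
    # No candidate target: motor babbling with random drive values.
    return (random.uniform(-1, 1), random.uniform(-1, 1))
```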
Two special regions of local space form part of the system structure. We assume that the arm starts from a "rest position" (equivalent to the arm lying in the lateral position), and that driving the motors "full on" (M1 = M2 = +1) brings the hand to the body center-line, in a position equivalent to the "mouth". These areas each cover a group of fields, as shown in figure 2. The rest area provides a fiducial point for the start of actions and the mouth area provides the first target.

Constraint lifting and reflexes
Human cognitive development is characterized by progression through distinct stages of competence, each stage building on accumulated experience from the level before. This can be achieved by lifting constraints (removing "scaffolding") when high competence at a level has been reached (Rutkowska, 1994). Any constraint on sensing or action effectively reduces the complexity of the inputs and/or actions, thus reducing the task space and providing a scaffold which shapes learning (Bruner, 1990; Rutkowska, 1994). Such constraints have been observed or postulated in the form of sensory restrictions, environmental or anatomical limitations, and internal or computational limits (Hendriks-Jensen, 1996).
We have several possible constraints available in our system: the availability of contact sensing, the resolution of the proprioception sense, and the parameters of the motor system. Another constraint is, of course, the absence of a visual sense, but this very early stage of infant growth does not rely on vision (Piek and Carman, 1994). Transitions must be related to internal global states, not local events, and we use global state indicators to lift constraints in two ways: finer-resolution sensory maps are used when global familiarity is high, and the degree of motor randomness increases when global excitation is very low.
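The two constraint-lifting rules can be stated compactly. The thresholds and the number of map levels here are assumed values, not taken from the paper:

```python
def lift_constraints(ge, gf, map_level, n_levels=3,
                     familiar=0.9, bored=0.05):
    """Apply the two rules driven by the global indicators: high
    familiarity Gf moves the system to the next finer-resolution map,
    and very low excitation Ge turns on motor randomness.
    Returns (new map level, randomness level)."""
    if gf > familiar and map_level < n_levels - 1:
        map_level += 1                     # constraint lifted: finer map
    randomness = 1.0 if ge < bored else 0.0  # babble when nothing excites
    return map_level, randomness
```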
Novelty is the motivational driver for our system, and the motor system attempts to repeat actions that cause stimulation. But without an initial stimulus there would be no reason to act, and hence we provide a basic "reflex" to initiate the system. For the first action, M1 and M2 are set to +1 and the hand then moves to the mouth region. From then on, the stimulated fields drive the system. A similar reflex returns the arm to the rest position.

Experiments and results
Given the single-modality arm described above, we can now logically examine all the experimental parameters that we may vary, and experiment with relevant combinations. There are five areas to be considered: environmental structure, sensing schedule, proprioception encoding, map field sizes, and attention/excitation parameters.
As the hand contact sensor is binary valued, there is little scope for any environmental scaffolding to occur through different object regimes: objects are either present or not. However, the contact sensor can be turned off, in which case a contact event does not interrupt movement and some objects may be moved or even pushed out of the environment. This is an internal constraint, and so we should investigate active/inactive contact sensing.
Regarding proprioception, we have four candidate encoding schemes (Section 3.1) and can arrange that the signals S1 and S2 are computed from each of these in turn.
The effects of different field sizes also need to be examined. We achieved this by creating three maps, each with a different field size (small, medium and large, see figure 5), and running the learning system on all three simultaneously: the S and M signals were processed for each map separately and simultaneously. However, only one map can be used for attention and action selection, because different field locations may be selected on the different maps. By driving attention and action from each map in turn (starting with the largest fields) we can observe the behavior and effectiveness of the mapping parameters.
Finally, we need to experiment with the possible excitation schedules for field stimulation. In the present system these consist of the habituation time constants. The first trials used no contact sensing, and objects on the table were either ignored or pushed out of range. Figure 3 illustrates the behavior as traces of movements. As the stimulation levels of the mouth area fall due to the habituation function, random motor signals are introduced, which produce hand sweeps to points on the extreme boundary. When contact sensing is active, figure 4 shows rest/mouth moves being interrupted by contact with an object on the path. These results are further illustrated in figures 5 and 6, which show the field maps produced by each of the above cases respectively.
From these figures we see that the arm moved between the mouth and rest areas first, but as these became less stimulating, random moves were introduced and fields on the boundary of the local body space were explored. Then, when contact sensing was allowed (a constraint lifted), internal fields and their neighbors were stimulated by object contact. Figure 7 shows map growth in terms of four "types" of fields.
The observed behavior is seen as a series of stages: first a "blind groping" mainly directed at the mouth area, then more groping at the boundary, accompanied by unaware pushing of objects, and then more directed and repeated "touching" of detected objects. If more than one object is detected then attention shifts to each object in turn, as each becomes habituated, so that a roughly cyclic behavior pattern is produced, similar to eye scanpaths. All these behaviors, including motor babbling and the rather ballistic approach to motor action, are widely reported in young infants (Piek and Carman, 1994).

Regarding proprioception, we did not observe any clear advantage for any one encoding scheme. As they are all two-dimensional and map into the X, Y space of the mappings, there seem to be no crucial differences. We recognize that there may be difficulties when operating in the more restricted zones of the nonlinear encodings (see the operating space in figure 8), but these are at the extremities, where mobility is restricted and which humans actually avoid (Bernstein, 1967). It is likely that the encoding scheme will matter much more when hand/eye coordination is to be learned, and this may account for the presence of two or more encodings in animals (Bosco et al., 2000).

From the field size experiments we see a trade-off: speed of exploration versus accuracy. When larger fields are used they cover more sensory space, and thus the mapping is learned much faster. If smaller fields are used then movements to reach these locations are more likely to be accurate, but more exploration is needed to map out the fields. Figure 9 shows how the system started on a coarse map and progressively transitioned to a finer-scale map as the global familiarity variable reached a steady plateau. It is interesting that the receptive field size of visual neurons in infants is reported to decrease with development, leading to more selective responses (Westermann and Mareschal, 2004).
Regarding the excitation parameters, we did not find any significant effect from quite large variations in these. The main effects are to vary the persistence of actions, i.e. the number of repetitions performed on a stimulus, and to alter the order in which attention is given to different objects. Neither of these had much effect on map generation for the single-limb case. For more details of the excitation and habituation model used, see (Meng and Lee, 2005).

Discussion and conclusions
There have been many models of sensory-motor coordination, frequently using connectionist architectures (Kalaska, 1995). For example, Baraduc et al. designed a neural architecture that computes motor commands from arm positions and desired directions (Baraduc et al., 2001). Other models use basis functions (Pouget and Snyder, 2000), but all these involve weight training schedules that require in the region of 20,000 iterations (Baraduc et al., 2001). They also tend to use very large numbers of neuronal elements. While "motor babbling" is seen in the behavioral output of many systems, very few are inspired by the psychological literature on development, and even fewer deal with transitions between more than one behavioral skill pattern. As has been said: "their behavioral capacity is usually limited" (Kalaska, 1995).
The system described here records sensory-motor schemas in topological mappings of sensory-motor events, pays attention to novel or recent stimuli, repeats successful behavior, and detects plateaus in experience which correspond to competence being achieved at a given level. The behavior observed in the experiments displays initially spontaneous movements of the limbs, followed by more "exploratory" movements, and then directed action towards contact with objects. Our approach is supported by the findings cited, and by reports such as (Gomez, 2004), which show that starting with low resolution in the sensor and motor systems and then increasing resolution leads to more effective learning.
From the experience of motor acts leading to spatial locations, the S-M maps support the generation of motor commands to achieve an action (from a given start field to a destination field). We note that they could also be used to support "higher-level" cognitive functions by allowing rehearsal of motor acts without actual performance, and thus lead to the processing of patterns of sensory-motor behavior that are "perceived", "imagined" or "desired", rather than actual.
The reported work is part of a larger program. For the eye system we have already achieved a similar mapping between the image space and the motor drive for the camera. The next stage will be to allow cross-modal mappings to develop between the eye and hand mapping frames. This will use Hebbian cross-links between the associated map fields, and should allow unskilled reaching to seen objects to develop. This will produce a further rich range of attentional, action selection, and sensing issues to deal with, but the foundations laid by this work will provide a logical framework.

Figure 1: The laboratory robot system used in experiments

Figure 2: A plan view of the arm spatial configuration.

Figure 3: Arm movements with no contact sensing. Initial moves between the rest and mouth areas (lower right and upper left respectively) gradually changed to random moves that explored the boundaries of the motor space.

Figure 4: Arm movements with active contact sensing. An object (near the center of the diagram) caused sensory interrupts, which were followed by rest/object moves.

Figure 5: Three-scale mapping with the contact sensor off. The highlighted fields indicate the fields visited.

Figure 7: Growth of the S-M map. Only initial field visits are counted; repeated visits are ignored. The figure shows the number of fields visited in the rest area, mouth area, boundary and internal area.

Figure 8: Nonlinear relationship between joint space and Cartesian space of the robot arm.

Figure 9: