A Neurodynamic Architecture for the Autonomous Control of a Spatial Language System



I. INTRODUCTION
In order to communicate about the locations of objects in the world, humans commonly employ relational spatial language. Descriptions like "the pencil is to the right of the monitor" provide a very flexible means to convey spatial information for object-oriented action. This makes spatial language understanding and production relevant for robotic applications as well [1], where it can provide a natural means of communication. Spatial language has also been a focus of psychological research [2], [3], as it allows researchers to investigate the connection between symbolic linguistic representations and the metric sensorimotor representations formed for spatial perception.
We have recently made steps towards a neurally grounded model of flexible spatial language behavior that is on the one hand capable of solving practical tasks in a robotic environment and on the other hand provides explanations for psychophysical findings in this field of research [4], [5]. This is accomplished by a neurodynamic architecture that links simple visual perception with spatial selection and transformation processes and discrete verbal representations.
The system is capable of solving different types of tasks: determining the spatial relation between two objects in a scene, localizing and identifying an object based on a given relational spatial description, or producing a full description of an object's location in a scene. This is achieved by providing a sequence of symbolic task inputs that identify objects or spatial relations together with task-specific control signals that modulate the global activation levels within the architecture. These inputs produce a series of local decisions, which are reached through transitions between different attractor states of the neurodynamic system, and which ultimately lead to the selection of a response.
In our previous work, we used a fixed, externally controlled timing of the input sequence. While this is sufficient to produce the desired behavior under standard conditions, a more flexible control of the timing becomes critical if the system has to solve tasks autonomously in more open settings. For instance, to determine a spatial relation, the system first has to localize the reference and target objects. The time required for these subtasks may vary considerably depending on object saliency and scene layout. An autonomous system must be able to pursue such a subtask until it is completed, detect its completion, and only then proceed to the next subtask.
In a separate line of work, we have developed a neurodynamic mechanism for sequence generation that can provide this kind of autonomy and flexibility in timing [6]. We present here the integration of the two systems. With this work, we also provide a general account of how complex cognitive tasks that flexibly combine elementary operations can be controlled in a neurally grounded architecture.

II. MODEL ARCHITECTURE
The spatial language system [4] consists of multiple interconnected Dynamic Neural Fields (DNFs), which describe neural activity at the population level through continuous activity distributions over different feature spaces. These activity distributions evolve continuously in time under the influence of external inputs and lateral interactions, governed by a set of differential equations. The field interactions promote the formation of localized bumps of activity, or peaks, that are stabilized through local excitation and surround inhibition and form the units of representation in the system.
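As a rough illustration of these field dynamics, the following Python sketch integrates a one-dimensional field of the Amari type with Euler steps. It is not the implementation used in [4]: the function names, the resting level, the kernel shape (a Gaussian for local excitation plus global inhibition standing in for surround inhibition) and all numerical values are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(u, beta=4.0):
    """Soft threshold: field output is close to 1 where activation is positive."""
    return 1.0 / (1.0 + np.exp(-beta * u))

def gauss(x, center, sigma):
    """Unnormalized Gaussian profile."""
    return np.exp(-0.5 * ((x - center) / sigma) ** 2)

def simulate_field(stimulus, h=-5.0, tau=20.0, dt=1.0, steps=500,
                   c_exc=1.5, sigma_exc=3.0, c_glob=0.5):
    """Euler integration of tau * du/dt = -u + h + stimulus + lateral interaction.

    All parameter values are illustrative assumptions.
    """
    u = np.full(stimulus.shape, h, dtype=float)   # start at the resting level
    offsets = np.arange(-15, 16)
    kernel = c_exc * gauss(offsets, 0.0, sigma_exc)  # local excitation
    for _ in range(steps):
        out = sigmoid(u)
        # local excitation minus global inhibition from all active sites
        lateral = np.convolve(out, kernel, mode="same") - c_glob * out.sum()
        u += dt / tau * (-u + h + stimulus + lateral)
    return u

if __name__ == "__main__":
    x = np.arange(100)
    # two localized inputs of different strength (hypothetical values)
    stimulus = 9.0 * gauss(x, 30, 3.0) + 5.5 * gauss(x, 70, 3.0)
    u = simulate_field(stimulus)
    print("peak location:", int(np.argmax(u)))
    print("activation at weaker input:", float(u[70]))
```

With these assumed values, the weaker input alone would be sufficient to form a peak, but in the presence of the stronger input the global inhibition keeps it below threshold, which is the kind of local selection decision the text refers to.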
The architecture comprises two modules, each of which performs one elementary operation: a space-color association and a reference frame shift, respectively. Both operations can be performed in different directions and combined to solve different tasks. The system receives minimally preprocessed visual input from a camera and symbolic inputs reflecting the elements of a spatial language task. For a task of the type "Where is the red item relative to the green one?", a sequence of elementary operations is performed that emulates the processing steps proposed by Logan and Sadler [2]: The locations of the target and reference objects in the scene are sequentially determined by the space-color association module based on symbolic color inputs, resulting in the formation of activity peaks in two DNFs that represent those locations. This location information, which is in the reference frame of the camera image, is autonomously transformed into a representation of the position of the target relative to the reference object. Once an activity peak indicating this relative position has formed, a competition between a set of discrete nodes that reflect the available spatial terms is initiated to select the response.
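The reference frame shift and the subsequent choice of a spatial term can be sketched in a strongly simplified, static form. The Python example below assumes a single (horizontal) image dimension, treats the peaks in the target and reference fields as fixed Gaussian output profiles, and computes the relative position by cross-correlating the two. The ramp-shaped templates for "left" and "right" and the winner-take-all by maximum overlap are illustrative assumptions standing in for the dynamic competition between spatial term nodes; none of the names below come from the original system.

```python
import numpy as np

def gauss(x, center, sigma=3.0):
    """Unnormalized Gaussian, used here as a stand-in for a field's output peak."""
    return np.exp(-0.5 * ((x - center) / sigma) ** 2)

def relative_position(target_out, reference_out):
    """Cross-correlate target and reference outputs.

    The result is indexed by lag = x_target - x_reference, i.e. the target
    position expressed in a frame centered on the reference object.
    """
    n_ref = len(reference_out)
    rel = np.correlate(target_out, reference_out, mode="full")
    lags = np.arange(-(n_ref - 1), len(target_out))
    return lags, rel

def select_term(lags, rel):
    """Pick the spatial term whose (assumed) template overlaps most with rel."""
    templates = {
        "left": np.clip(-lags, 0, None).astype(float),   # grows with leftward offset
        "right": np.clip(lags, 0, None).astype(float),   # grows with rightward offset
    }
    scores = {term: float(np.dot(w, rel)) for term, w in templates.items()}
    return max(scores, key=scores.get), scores

if __name__ == "__main__":
    x = np.arange(100)
    target = gauss(x, 70)      # hypothetical location of the red item
    reference = gauss(x, 40)   # hypothetical location of the green item
    lags, rel = relative_position(target, reference)
    term, _ = select_term(lags, rel)
    print("relative position:", int(lags[np.argmax(rel)]), "term:", term)
```

In the architecture itself, the transformation and the selection are carried out by field and node dynamics rather than by a one-shot computation; the sketch only shows which quantity those dynamics represent.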
Each elementary operation is initiated by globally boosting the activity in one or more DNFs in the architecture, and is completed once an activity peak has formed as a result of a local selection decision. To endow the system with the desired autonomy, we link these initiation events and the test of completion to a neurodynamic sequence generation system [6]. This system represents each step of a sequence through a discrete ordinal node that follows the same dynamic principles as the DNFs. Self-excitatory and mutually inhibitory interactions ensure that only a single ordinal node is active at any time, indicating the current step in the sequence. Each ordinal node has an associated memory node, which indicates that a certain step has previously been activated, and a condition-of-satisfaction (CoS) node that drives the transition between steps.
For the spatial language system, we generate one set of ordinal nodes for each task the system can perform. Excitatory connections are defined from the ordinal nodes to different DNFs of the spatial language architecture to provide the boosting inputs that initiate the elementary operations of each behavior. To test the completion of these operations, we introduce a peak detector node for each DNF, which is driven by the summed output of the whole field and signals that an activity peak has formed. Each CoS node receives input from one or a few peak detectors, reflecting which local decisions have to be reached for the completion of a particular step. When a CoS node is activated, it inhibits all ordinal nodes. This in turn deactivates the CoS node, which allows the ordinal nodes to become active again. A biasing input from the memory nodes ensures that the ordinal node for the subsequent step is selected next.
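The interplay of ordinal, memory, and CoS nodes described above can be sketched as a small dynamical system. The Python example below is a minimal sketch under several assumptions that are not taken from [6]: node outputs are hard thresholds, each boosted DNF is replaced by a single activation variable whose time constant `peak_tau` stands in for how long peak formation takes, the peak detector is simply the thresholded value of that variable, each step has exactly one CoS node, and all coupling strengths are hand-picked. The optional `visible_from` argument is a hypothetical way to model an object that only appears in the scene after some delay.

```python
import numpy as np

def on(u):
    """Hard-threshold output: 1.0 where activation is positive, else 0.0."""
    return (np.asarray(u) > 0.0).astype(float)

def run_sequence(peak_tau, visible_from=None, t_max=4000.0, dt=1.0, tau=20.0):
    """Simulate one ordinal/memory/CoS chain; all weights are assumptions."""
    peak_tau = np.asarray(peak_tau, dtype=float)
    n = len(peak_tau)
    visible_from = (np.zeros(n) if visible_from is None
                    else np.asarray(visible_from, dtype=float))
    d = np.full(n, -1.0)  # ordinal nodes
    m = np.full(n, -1.0)  # memory nodes
    a = np.full(n, -1.0)  # stand-ins for the boosted fields
    c = np.full(n, -1.0)  # condition-of-satisfaction nodes
    onset = np.full(n, np.nan)
    offset = np.full(n, np.nan)
    was_on = np.zeros(n)
    max_active = 0
    for k in range(int(t_max / dt)):
        t = k * dt
        gd, gm, peak, gc = on(d), on(m), on(a), on(c)
        # bookkeeping: when does each ordinal node switch on and off?
        onset[(gd > was_on) & np.isnan(onset)] = t
        offset[gd < was_on] = t
        max_active = max(max_active, int(gd.sum()))
        was_on = gd
        # bias from the preceding memory node; step 0 gets a constant start input
        prev_mem = np.concatenate(([1.0], gm[:-1]))
        rate_d = (-d - 1.0 + 4.0 * gd          # self-excitation
                  - 4.0 * (gd.sum() - gd)      # mutual inhibition
                  + 2.0 * prev_mem             # "you are next" bias
                  - 4.0 * gm                   # own memory blocks re-activation
                  - 4.0 * gc.sum())            # any active CoS inhibits all
        rate_m = -m - 1.0 + 2.0 * gm + 2.0 * gd        # self-sustained memory
        rate_a = -a - 1.0 + 2.0 * gd * (t >= visible_from)  # boost from ordinal node
        rate_c = -c - 1.0 + 2.0 * peak                 # CoS driven by peak detector
        d += dt / tau * rate_d
        m += dt / tau * rate_m
        a += dt / peak_tau * rate_a
        c += dt / tau * rate_c
    return {"onset": onset, "offset": offset, "duration": offset - onset,
            "max_active": max_active, "memory": on(m)}

if __name__ == "__main__":
    result = run_sequence(peak_tau=[20.0, 200.0, 60.0])
    print("step durations (ms):", result["duration"])
```

In this sketch, a step with a slowly forming peak simply keeps its ordinal node active for longer, and a step whose input is not yet available waits until it is; no step duration is specified anywhere in the code.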

III. RESULTS
We tested the combined system on the same tasks that were previously performed with fixed timing of inputs in [4]. The system produced the desired sequence of local decisions required for each task, reliably detecting the completion of each elementary operation and proceeding to the next step. The duration of each step was autonomously adjusted to different time courses for the formation of activity peaks. For instance, in scenarios where the involved items were salient and could quickly be localized in the scene, the system advanced more quickly through the steps, whereas it allowed more time for localization if items were less salient. This led to an overall faster completion of the tasks compared to the variant with fixed timing, as the latter had to introduce sufficient delays in each step to ensure its completion even under adverse conditions.
To further test the robustness of this system, we introduced a situation in which the reference object of a spatial description was initially not visible in the camera image, and was only uncovered after a substantial delay. With a fixed timing of inputs, the system progressed to the next step without having localized the reference object, leading to the selection of a random response when a selection between the spatial terms was initiated. With the sequential dynamics, the system successfully delayed the subsequent step until the reference object was localized, and was ultimately able to produce the correct response.
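The timing argument behind these observations can be made concrete with a small back-of-the-envelope calculation. The Python sketch below is purely illustrative: the peak formation times, the fixed window, and the transition overhead are hypothetical numbers, not measurements from the experiments reported here.

```python
def fixed_timing(peak_times, window):
    """Every step gets the same externally set window (ms).

    A step only completes if its peak forms within that window.
    """
    completed = [t <= window for t in peak_times]
    total = window * len(peak_times)
    return total, completed

def autonomous_timing(peak_times, overhead=0):
    """Every step lasts until its peak has formed, plus a transition overhead (ms)."""
    total = sum(t + overhead for t in peak_times)
    return total, [True] * len(peak_times)

if __name__ == "__main__":
    normal = [60, 150, 40]      # hypothetical peak formation times in ms
    occluded = [60, 1200, 40]   # second object uncovered only after a long delay
    print(fixed_timing(normal, window=200))
    print(autonomous_timing(normal, overhead=30))
    print(fixed_timing(occluded, window=200))
    print(autonomous_timing(occluded, overhead=30))
```

Under these assumed numbers, the fixed schedule has to budget for the slowest expected step in every window and still fails when a step takes longer than anticipated, whereas the completion-driven schedule takes only as long as each step actually needs.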

IV. CONCLUSION
We have outlined how a neurodynamic mechanism for sequence generation can be used in a system for spatial language behavior to control the execution of complex cognitive tasks. The tasks are broken down into sequences of local decisions, represented by attractor states in the dynamical neural model. The sequence generation mechanism drives the transitions between these attractor states by applying homogeneous inputs to individual structures in the architecture, and detects the arrival at a certain state through a condition-of-satisfaction system. Beyond the realm of spatial language, this model demonstrates a general neural mechanism for the control of complex cognitive behaviors that builds on general-purpose structures for sensorimotor processing and sequence generation. It also provides a starting point to explain the learning of abstract task competence independent of the concrete content of the task. In this architecture, the acquisition of a new task only requires recruiting a new set of ordinal neurons and setting up the connectivity for the control inputs and the condition-of-satisfaction system. This may generally be acquired from few examples. The architecture thus separates the problem of task learning from the formation of the underlying sensorimotor representations, which typically occurs over a much longer time scale during development.