A developmental model of initiating joint attention through constructing state space

Tsukasa Nakano∗, Shinya Fujiki∗, Yuichiro Yoshikawa† and Minoru Asada∗
∗Dept. of Adaptive Machine Systems, Graduate School of Engineering, Osaka University, 2-1 Yamadaoka, Suita, Osaka, Japan 565-0871
Email: {tsukasa.nakano, shinya fujiki, asada}@ams.eng.osaka-u.ac.jp
†Dept. of Systems Innovation, Graduate School of Engineering, Osaka University, 1-3 Machikaneyama, Toyonaka, Osaka, Japan 560-8531
Email: {yoshikawa}@sys.es.osaka-u.ac.jp


I. INTRODUCTION
Inferring others' intentions is an important factor that can accelerate the development of joint attention (JA). In this paper, we aim to reveal the developmental mechanism of JA based on this inference capability through a synthetic approach.
Human infants gradually come to follow the gaze direction of caregivers from 6 months of age [1]. By 12 months, they can accurately find an object that the caregiver's gaze and/or pointing refers to [2], [3]. These behaviors are called responding to JA (RJA). Meanwhile, they develop the ability to lead the caregiver's attention to an object by pointing, uttering, etc. [2]. These actions are called initiative behaviors of JA (IJA). However, the internal mechanism that enables the development of both RJA and IJA is not well understood.
On the other hand, some studies have proposed synthetic developmental models of JA in human infants, as advocated in cognitive developmental robotics [4]. They showed that a robot can acquire gaze-following solely by statistically mapping gaze to locations when it finds something salient [5], [6], and that a robot can acquire gaze-alternation in JA by discovering and reproducing the contingency of caregiver-infant interactions based on information theory [7]. However, they have not dealt with the development of inferring the caregiver's internal states, nor have they explained the relationship between it and JA.
This work presents our working model of a mechanism for learning optimal actions for JA within the framework of reinforcement learning (RL). It integrates methods for autonomous state space construction and hidden state estimation. Estimating the caregiver's hidden state, that is, whether she is interactive for JA or not, seems an important but formidable step for infants, since the state cannot be directly observed by them but largely affects their future rewards. To overcome this, the proposed method has three steps. The first step is to divide the state space, taking into account the differences between the caregiver's hidden states, based on an extension of the method of Ishiguro et al. [8]. The second step is to learn a probabilistic model of state transitions based on a hidden Markov model (HMM). Through these two steps, the learner can acquire the ability to estimate the hidden states according to not only the direct observation at the time but also the history of their interactions. The final step is to learn the optimal action in each state based on RL. We applied the proposed model in a computer simulation and tried to reproduce the developmental processes of both RJA and IJA in human infants.

II. LEARNING MODEL OF JA
Here, we assume a certain type of caregiver-infant interaction. The learner first observes the caregiver and the objects in their environment, and estimates her attention and hidden state. Then it selects one of the possible actions: looking at the caregiver, looking at an object, and uttering. If JA between them is accomplished, the caregiver gives a reward. After experiencing these interactions, the learner constructs the state space, and learns the state transition model and the optimal action in each state. By repeating this learning process, it can gradually understand her hidden states and acquire the behaviors of RJA and IJA.
In this work, we assume that the caregiver's hidden states are represented as three modes: initiative mode, responsive mode, and non-interactive mode. These modes affect her policies for giving rewards and her state transitions. When she is in the initiative mode or the responsive mode, the learner can get rewards through JA. On the other hand, it cannot get rewards when she is in the non-interactive mode, even if it follows her gaze direction. However, the learner can make her change from the initiative or non-interactive mode to the responsive mode by particular behaviors: gaze-alternation and speaking to her. Therefore, it should induce her responsive mode in order to get rewards when she is in the non-interactive mode.
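The caregiver's behavior described above can be illustrated with a minimal sketch. The mode and action labels, the reward value of 1.0, the deterministic mode switch, and the rule that the mode otherwise stays unchanged are all assumptions made for illustration; the text does not specify them.

```python
# Hypothetical labels for the three caregiver modes and the inducing actions.
MODES = ("initiative", "responsive", "non_interactive")
INDUCING_ACTIONS = ("gaze_alternation", "utterance")


def caregiver_step(mode, action, joint_attention):
    """Return (reward, next_mode) for one interaction.

    Assumed rules: JA is rewarded unless the caregiver is non-interactive,
    and an inducing action moves her from the initiative or non-interactive
    mode to the responsive mode. Otherwise the mode is kept (assumption).
    """
    reward = 1.0 if joint_attention and mode != "non_interactive" else 0.0
    if action in INDUCING_ACTIONS and mode in ("initiative", "non_interactive"):
        next_mode = "responsive"
    else:
        next_mode = mode
    return reward, next_mode
```

In this sketch, an utterance toward a non-interactive caregiver yields no immediate reward but opens the way to rewarded JA in the following interaction.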

A. System for JA
The system has three modules: a state estimation module, an action selection module, and a memory module. The learner infers the caregiver's mode with the state estimation module, selects an action to get rewards with the action selection module, and stores the interaction data for learning in the memory module.
The state estimation module is composed of two parts: a sensor classifier and a state transition model. The sensor classifier calculates the current state probabilities from observable sensor data: the gaze directions of the caregiver and the learner, and the locations of two objects. Meanwhile, with the state transition model, the learner also calculates those probabilities from prior data: the state probabilities, the learner's action, and the caregiver's reward value at the last interaction. The learner estimates the current state by integrating the state probabilities obtained in these two ways. Note that the way rewards are given differs depending on the caregiver's mode, the learner's action, and the current state. Therefore, the state transition model must be used to estimate her current mode. In this way, the learner identifies the current state in accordance with her mode.
In the action selection module, an action is stochastically selected based on the identified state and the state-action values. A state-action value in RL is the expected reward of taking an action in a state.
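One way to picture the integration of the two probability estimates is a normalized product, as in a Bayes filter. This is only a sketch under that assumption; the text does not state the integration rule, and the function names and the table layout `transition[(action, reward)][i][j]` are hypothetical.

```python
def predict(prev_probs, transition, action, reward):
    """Propagate the previous state probabilities through the transition model.

    transition[(action, reward)][i][j] is assumed to hold
    P(state j now | state i before, last action, last reward).
    """
    table = transition[(action, reward)]
    n = len(prev_probs)
    return [sum(prev_probs[i] * table[i][j] for i in range(n)) for j in range(n)]


def integrate(sensor_probs, predicted_probs):
    """Combine both estimates by a normalized product (assumed rule)."""
    joint = [s * p for s, p in zip(sensor_probs, predicted_probs)]
    total = sum(joint)
    if total == 0.0:
        # The two estimates contradict each other; fall back to the sensors.
        return list(sensor_probs)
    return [x / total for x in joint]
```

When the sensor classifier cannot separate two states, for example because they differ only in the caregiver's mode, the combined estimate in this sketch is decided by the transition model alone.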
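Stochastic selection from state-action values could, for instance, use a softmax (Boltzmann) rule. The sketch below assumes that rule and a hypothetical `temperature` parameter; the text says only that selection is stochastic.

```python
import math
import random


def softmax_probs(q_values, temperature=1.0):
    """Turn state-action values into selection probabilities.

    temperature must be positive; lower values make the choice greedier.
    """
    m = max(q_values)  # subtract the maximum for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(q_values, temperature=1.0, rng=random):
    """Sample an action index in proportion to its softmax probability."""
    probs = softmax_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

With such a rule, actions whose values are still low keep a nonzero chance of being tried, which lets the learner encounter behaviors whose benefit only appears later.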

B. Learning algorithm
The learner needs to recognize a variety of states to achieve JA depending on the caregiver's mode, which changes in response to the infant's behavior. In order to discriminate these modes, which are not observable at the moment, the system has two stages: (1) state space construction for RL, which finds adequate features based on an extension of the method of Ishiguro et al. [8], and (2) parameter estimation of the state transition model based on an HMM.
At the state space construction stage, the system first divides the continuous sensor space into two kinds of areas after each action, according to whether the learner is given a reward in the area or not. Moreover, a divided area may be divided further if the histogram of its reward values has multiple peaks arising from the different modes. As a result, the learner constructs a state space that is expected to reflect the differences among the modes.
At the parameter estimation stage of the state transition model, the system re-estimates the states from the past interactions. Based on the HMM, it calculates the state transition probabilities in terms of the learner's prior action, the caregiver's prior reward, and the previous state. The learner consequently understands how the states and modes change depending on its own behavior.
Moreover, the learner needs to learn adequate behaviors to achieve JA in each of the caregiver's modes. In action learning, the system calculates state-action values by Q-learning, an RL method. RL allows the learner to take near-future rewards into account.
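The multiple-peak check could be sketched as follows. This is not the extended method of [8]; the bin count, the reward range, and the peak definition are assumptions chosen only to show how a bimodal reward histogram might trigger a further division.

```python
def reward_histogram(rewards, n_bins=5, low=0.0, high=1.0):
    """Count reward values per equal-width bin over [low, high]."""
    counts = [0] * n_bins
    width = (high - low) / n_bins
    for r in rewards:
        idx = min(int((r - low) / width), n_bins - 1)
        counts[idx] += 1
    return counts


def count_peaks(counts):
    """Count local maxima of a histogram; a plateau counts as one peak."""
    # Collapse runs of equal neighbouring bins so a plateau is a single value.
    collapsed = [c for i, c in enumerate(counts) if i == 0 or c != counts[i - 1]]
    padded = [0] + collapsed + [0]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
    )


def needs_further_division(rewards):
    """An area is a candidate for division if its rewards are multimodal."""
    return count_peaks(reward_histogram(rewards)) > 1
```

An area in which the same action is sometimes rewarded and sometimes not, as would happen when responsive and non-interactive modes share one sensor region, produces two peaks in this sketch and is flagged for division.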
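As a rough illustration, transition probabilities conditioned on the previous state, action, and reward can be estimated by counting over re-estimated state sequences. This is a simplified hard-assignment counting step, not the full HMM parameter estimation; the tuple layout and the `smoothing` parameter are assumptions.

```python
from collections import defaultdict


def estimate_transitions(history, n_states, smoothing=1.0):
    """Estimate P(next_state | state, action, reward) by counting.

    history is an iterable of (state, action, reward, next_state) tuples,
    where the states are the re-estimated (most likely) state indices.
    smoothing is an additive pseudo-count per next state.
    """
    counts = defaultdict(lambda: [0.0] * n_states)
    for state, action, reward, next_state in history:
        counts[(state, action, reward)][next_state] += 1.0
    model = {}
    for key, row in counts.items():
        total = sum(row) + smoothing * n_states
        model[key] = [(c + smoothing) / total for c in row]
    return model
```

Because the table is keyed on the learner's own action and the reward it received, it records how particular behaviors tend to move the interaction from one state to another.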
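The standard one-step Q-learning update can be sketched as below. The learning rate `alpha`, the discount factor `gamma`, and the list-of-lists table layout are assumed for illustration and are not taken from the text.

```python
def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update in place and return the new value.

    q[s][a] holds the state-action value of action a in state s.
    """
    best_next = max(q[next_state])
    td_error = reward + gamma * best_next - q[state][action]
    q[state][action] += alpha * td_error
    return q[state][action]
```

The discounted term `gamma * best_next` is what lets an unrewarded action gain value: if an action leads from a state with no reward to a state where JA is rewarded, its value rises even though its immediate reward is zero.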

III. SIMULATION OF DEVELOPMENT OF IJA
By analyzing the learner's action selection in the computer simulation, we studied how the learner's JA behaviors changed along with the development of hidden state estimation. As a result, the learner first learned gaze-following, and then learned initiative behaviors of JA after acquiring hidden state estimation. This was because the learner first found gaze-following, which is a behavior directly connected to rewards. After learning hidden state estimation, it found that, when the caregiver is in the non-interactive mode, it should induce her responsive mode by initiative behaviors of JA, which are only indirectly connected to rewards but are needed to get them.

IV. CONCLUSION
In this work, we proposed a method that can acquire optimal actions based on reinforcement learning and that integrates methods for state space construction and hidden state estimation. The results of the computer simulation showed that the proposed method can learn not only gaze-following but also initiative behaviors of joint attention by estimating the caregiver's hidden states. These results indicate the possibility that the learning method reproduces one of the developmental processes of joint attention in human infants, which is affected by understanding others' intentions.
The caregiver model in this work was a specific setting. Therefore, future work will investigate which parameters of caregiver models are important for the acquisition of joint attention behaviors in human infants.