Personalizing a Service Robot by Learning Human Habits from Behavioral Footprints
Robotics—Article

For a domestic personal robot, personalized services are as important as predesigned tasks, because the robot needs to adjust the home state based on the operator's habits. An operator's habits are composed of cues, behaviors, and rewards. This article introduces behavioral footprints to describe the operator's behaviors in a house, and applies the inverse reinforcement learning technique to extract the operator's habits, represented by a reward function. We implemented the proposed approach with a mobile robot on indoor temperature adjustment, and compared this approach with a baseline method that recorded all the cues and behaviors of the operator. The results show that the proposed approach allows the robot to reveal the operator's habits accurately and adjust the environment state accordingly.


Introduction
Traditionally, a personal robot is designed to provide standard services in different scenarios. For example, by incorporating a door recognition and manipulation algorithm, the robot can open various kinds of doors in different houses in exactly the same way. This strategy, combined with commands from the operator, allows the robot to complete each task consistently in different environments. This feature is desirable when the robot is used in fixed and repeating scenarios, but if the operator requires personalized services, this strategy is inadequate.
The requirement for personalization is particularly evident in a smart home, where the robot needs to both monitor and adjust the home state intelligently. For example, the robot may need to open a door to different extents, as some operators like it to be fully open, while others may prefer it to be half open. This kind of state adjustment, if designed in an offline way, requires a remarkable amount of manual work. To solve the problem, the robot must be personalized by having it learn the habits of the operator, in order to adjust itself according to the habit of each operator.
To learn a habit, the robot needs to observe the environment and extract information related to the habit. A habit is determined by three factors: the cue, the behavior, and the reward [1]. After sufficient experience with the three factors, the operator behaves involuntarily upon seeing the cue, instead of acting intentionally to collect the maximum reward. For a robot to understand the operator's habit, it may either collect all pairs of cues and behaviors from its observations to guide its future actions, or try to learn the rewards underlying the observations in order to determine its future actions. The first solution is straightforward, because the robot can search its memory for the best-matching behavior when it faces a cue, but it handles newly emerging cues poorly; the second solution requires an additional learning process, but the learned reward can guide the robot's actions when new cues occur. In this work, the first solution is implemented as the baseline method, and we focus on the second solution.
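The baseline method can be sketched as a simple nearest-cue lookup. The class name, the two-dimensional cue vectors (temperature, humidity), and the behavior labels below are hypothetical, and the article does not specify the baseline's matching rule, so Euclidean nearest-neighbor matching is an assumption:

```python
import math

class HabitLookupBaseline:
    """Baseline sketch: memorize (cue, behavior) pairs and replay the
    behavior of the nearest previously seen cue."""

    def __init__(self):
        self.memory = []  # list of (cue_vector, behavior) pairs

    def record(self, cue, behavior):
        self.memory.append((list(cue), behavior))

    def act(self, cue):
        if not self.memory:
            return None
        # Euclidean distance to every stored cue; replay the closest match.
        best = min(self.memory, key=lambda m: math.dist(m[0], cue))
        return best[1]

robot = HabitLookupBaseline()
robot.record([30.0, 0.8], "turn_on_ac")   # hot and humid -> AC on
robot.record([15.0, 0.4], "close_door")   # cool and dry  -> door closed
print(robot.act([29.0, 0.7]))  # nearest stored cue is the hot one
```

Because the lookup can only replay memorized pairs, a cue far from every record yields an arbitrary match, which is exactly the weakness the reward-learning solution addresses.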
In this article, we propose a method that learns the habit of an operator from observations, in the framework of inverse reinforcement learning. Behaviors are described by the environment state changes they cause, namely the behavioral footprints. Meanwhile, the robot observes the cues based on the contacts between the operator and the objects, and learns the habits as a reward function based on the operator's behaviors in the house. It then uses the reward function to guide its future actions, in order to serve the operator autonomously. The method is implemented in a case study of autonomous indoor temperature adjustment. Our contributions include the incorporation of behavioral footprints to represent the operator's behaviors and a robot personalization scheme based on the operator's habits.

Related work
Traditional research on personal robots focuses on designing hardware and software to make each robot generally applicable. For example, Meeussen et al. [2] develop a personal robot that can open doors and charge itself. Rusu et al. [3] develop a perception system with visual sensors to guide the robot's motion in different environments. Gorostiza et al. [4] use multiple sensors to develop a framework for human-robot interaction.
Wyrobek et al. [5] develop a personal robot that is both safe and useful. In a domestic environment, Falcone et al. [6] develop a personal rover that can serve both children and adults.
Many publications address the deployment of a personal robot in a house. For example, in Ref. [7], an electroencephalography signal is used to control a tele-presence robot and assist motor-disabled people. In Ref. [8], a tele-presence robot is designed to help the elderly with interpersonal communications. In Ref. [9], a tele-medicine system is designed to monitor the health and activity of the elderly. To include robot actions during home monitoring, the service robots in Ref. [10] use sensor networking and radio frequency identification to guide their actions.
With different types of sensors installed in a house, the environment state can be described using hierarchical states, and its changes can be described with either a layered hidden Markov model [11], where multiple layers of hidden Markov models are stacked to describe the hierarchical state transitions, or a hierarchical hidden Markov model [12], where each state of the higher layer incorporates a hidden Markov model in the lower layer.
To personalize the robot's service, the robot needs to learn the operator's habits from observation. To combine robot actions and environment state modeling, many methods have been proposed within the framework of reinforcement learning [13]. In addition, the learning-by-demonstration technique [14] allows a robot to imitate an operator and learn different behaviors. In our application, the robot can observe the behavior of an operator; it therefore adopts inverse reinforcement learning [15] to encode the operator's habits.
In this work, we use inverse reinforcement learning to enable a robot to learn a reward function as the operator's habit. During learning, the operator's behaviors are represented with behavioral footprints, and after collecting a set of observations on these behaviors, the robot tries to learn the operator's habits.

Behavioral footprints
A habit is determined by three parts: the cue, the behavior, and the reward. To learn the operator's habit, the robot must observe the environment to obtain the cues, and observe the operator to get the behaviors; it can then learn the reward function that describes the operator's habits. For this purpose, the robot needs to represent the environment accurately, and in this work, we use the objects inside a room to describe the home state:

E = (C_1, ..., C_n)

where E denotes the environment state and C_i (i = 1, ..., n) denotes the ith object in the environment. An illustration is shown in Figure 1.
To represent the operator's behaviors, A, we adopt behavioral footprints, defined as the changes of object states due to the operator's actions, because this representation describes different types of behaviors more meaningfully and excludes behaviors that do not change the environment states:

A: E_i → E_j

where E_i and E_j denote the home states before and after the operator's behaviors. An illustration is shown in Figure 2.

Cues and behaviors
With the behavioral footprints, the robot can observe the operator's behaviors, along with the cues that trigger the behaviors.
The behaviors are represented by changes in object states due to the operator's contact. However, some behaviors are random and do not follow the operator's habits; these need to be excluded. To evaluate the regularity of the operator's behaviors, we use a measurement r, defined as the standard deviation of the cues leading to behavior A. With the measured regularity level and a threshold value selected based on experiments, the robot keeps only the samples with regular behaviors.
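These two steps can be sketched minimally as follows. The object names, state encodings, and the threshold value are hypothetical; the footprint keeps only the object states that changed, and the regularity check thresholds the per-dimension standard deviation of the cues:

```python
import statistics

def footprint(before, after):
    """Behavioral footprint: the object-state changes caused by the
    operator (object names are hypothetical)."""
    return {k: (before[k], after[k]) for k in before if before[k] != after[k]}

def is_regular(cues, threshold=2.0):
    """Keep a behavior only if its triggering cues are consistent: the
    standard deviation of every cue dimension must stay under a
    threshold (chosen experimentally, as in the text)."""
    dims = zip(*cues)  # regroup cue vectors into per-dimension tuples
    return all(statistics.pstdev(d) <= threshold for d in dims)

before = {"door": "closed", "ac": "off", "indoor_temp": 31}
after  = {"door": "closed", "ac": "on",  "indoor_temp": 31}
print(footprint(before, after))  # only the AC state changed
print(is_regular([[30.1, 0.8], [29.8, 0.75], [30.5, 0.82]]))  # consistent cues
```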
Another important factor of a habit is the cue, defined as the environment state when the behavior occurs. The cue is identified by collecting data samples right before the operator's emergence:

D_i → (S_tk, ..., S_tn), i = 1, ..., m

where D_i denotes the moment when the operator appears, and each (S_tk, ..., S_tn) denotes a set of home states around the operator's emergence.
Two types of cues exist: agreeable ones, where the operator does not change the environment states, and disagreeable ones, where the operator manually changes certain object states. Based on the observations, each sample is assigned a binary indicator of agreeability: +1 for an agreeable cue and −1 for a disagreeable one.
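The labeling step might look like the following sketch, where the episode format and the state fields are assumptions; a cue is marked agreeable (+1) when the operator leaves the home state untouched:

```python
def label_agreeability(episodes):
    """Each episode pairs the home state right before the operator
    appears (the cue) with the state after the operator acts. The cue
    is labeled +1 (agreeable) if nothing changed, -1 otherwise."""
    samples = []
    for cue_state, state_after in episodes:
        label = 1 if cue_state == state_after else -1
        samples.append((cue_state, label))
    return samples

episodes = [
    # Operator entered and touched nothing: the cue state was agreeable.
    ({"ac": "off", "indoor_temp": 24}, {"ac": "off", "indoor_temp": 24}),
    # Operator turned the AC on: the cue state was disagreeable.
    ({"ac": "off", "indoor_temp": 31}, {"ac": "on", "indoor_temp": 31}),
]
print(label_agreeability(episodes))
```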

Rewards
Using the samples of the operator's regular behaviors and the binary indicators of the environment's agreeability, the robot infers the operator's habits. This problem is formulated as inverse reinforcement learning, where the robot learns a reward function by observing the operator's actions [16]:

α* = arg max_α ( V_α(π_op) − V_α(π) )    (1)

where α denotes the parameter of the reward function, V_α(π) denotes the expected discounted reward under a policy π, and π_op denotes the operator's policy. An illustration is shown in Figure 3.
With only binary indicators of the environment states' agreeability, the maximization in Eq. (1) is simplified as:

max_α ( Σ_{a_1} R_α − Σ_{a_2} R_α )    (3)

where a_1 denotes the actions agreeable with the operator's habits, and a_2 denotes the actions disagreeable with the operator's habits; the agreeability is measured with the binary indicators. With Eq. (3), the robot learns a reward function, a function of the environment states.
The learning of the reward function is based on the formulation in Ref. [15], where the reward function is a linear combination of a set of predesigned basis functions:

R_α(s) = Σ_i α_i ϕ_i(s)

where ϕ_i is a basis function. In a personalized environment, the reward function must encode potential changes of the environment states due to the appearance and disappearance of objects inside the environment. With behavioral footprints, this problem is solved by clustering the state-space dimensions into multiple abstracted dimensions, using the correlations between different dimensions as the distances:

(cst_1, ..., cst_n) = partition(S, RLT)

The clustering not only excludes redundant information due to object-state correlations, but also reveals invisible state transitions. In addition, it avoids redesigning the basis functions when the number of objects changes, because only an object uncorrelated with all existing dimensions requires redesigned basis functions. This clustering also allows the robot to use one action to change the states of all related objects.
Inspired by the work in Ref. [15], we transform this maximization into an optimization similar to that of a support vector machine (SVM). This optimization maximizes the difference between the operator's actions and other actions, allowing the robot to learn the operator's habits, and is solved with an existing SVM implementation [18].

Robot actions
Using the learned reward function to indicate the operator's habits, the robot can guide its actions as a normal reinforcement learning problem.
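As an illustration of the linear-reward idea, the sketch below learns α for R_α(s) = α · ϕ(s) from agreeability-labeled states. The basis functions and training samples are hypothetical, and a plain perceptron-style margin update stands in for the SVM solver [18] used in the article:

```python
def phi(state):
    """Hypothetical basis functions over the home state: deviation of
    the indoor temperature from a comfort point, and the AC status."""
    return [abs(state["indoor_temp"] - 24.0),
            1.0 if state["ac"] == "on" else 0.0]

def learn_reward(samples, epochs=100, lr=0.1):
    """Learn alpha so that R(s) = alpha . phi(s) scores agreeable
    states (+1) above disagreeable ones (-1). A perceptron update
    stands in for the SVM optimization."""
    alpha = [0.0] * len(phi(samples[0][0]))
    for _ in range(epochs):
        for state, label in samples:
            feats = phi(state)
            score = sum(a * f for a, f in zip(alpha, feats))
            if label * score <= 0:  # misclassified: nudge alpha
                alpha = [a + lr * label * f for a, f in zip(alpha, feats)]
    return alpha

samples = [
    ({"indoor_temp": 24, "ac": "off"}, +1),  # agreeable: comfortable room
    ({"indoor_temp": 31, "ac": "off"}, -1),  # disagreeable: hot, AC off
    ({"indoor_temp": 31, "ac": "on"},  +1),  # agreeable: hot, but AC running
]
alpha = learn_reward(samples)
reward = lambda s: sum(a * f for a, f in zip(alpha, phi(s)))
print(reward({"indoor_temp": 24, "ac": "off"})
      > reward({"indoor_temp": 31, "ac": "off"}))  # → True
```

The learned weights penalize deviation from the comfort temperature and reward a running AC when the room is hot, mirroring how the reward function generalizes beyond memorized cue-behavior pairs.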

Setup
We use Turtlebot as the personalized robot to observe the behaviors of people in an environment composed of four outdoor states and four indoor states. The four outdoor states include outside temperature, humidity, wind, and rain; and the four indoor states include a thermometer, a door, air conditioner switches, and the state of the operator.
To observe the indoor objects accurately, a map is built with the GMapping package [17] in the Robot Operating System (ROS). After collecting the states for about seven days, the robot learns the habit and uses it to guide future actions.
Our robot is not equipped with a robot hand to physically change the object states, so the robot actions are simulated.

Habit observation
Four weather conditions are observed, including the temperature, humidity, rain, and wind, which are extracted from a weather website (www.weather.com). These environment states are collected for seven days for the city of Hong Kong, as shown in Figure 4.
Four indoor objects are observed, including a thermometer, a door, the air conditioner switch, and the status of the operator in a house. The states of these objects are measured by the robot based on their visual appearances, as shown in Figure 5.

Habit learning
Based on the observations, the robot collects the operator's behaviors and the cues leading to the behaviors. The cues are collected as the environment states when the operator has contact with the objects. For example, when the operator enters the room and turns on the air conditioner, the current environment states are collected as the cue that leads to changes to the air conditioner switches.
The behaviors are collected as the changes of environment states due to the operator's actions, such as the switching of the air conditioner, the opening of the door, and so on.
After collecting cues and behaviors for seven days, the robot uses them to learn the operator's habit and to update the result based on new observations. This habit is represented with the reward function.

Robot actions
With the learned reward function, the robot searches for the optimal actions to adjust the environment. In this work, the generated actions are applied manually to evaluate their effects.
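The search can be sketched as a greedy one-step policy: simulate each candidate action's effect on the home state and pick the one whose predicted outcome scores highest under the reward function. The action model and the hand-set reward below are hypothetical stand-ins for the learned components:

```python
def simulate(state, action):
    """Hypothetical action model: predict the home state after an action."""
    new_state = dict(state)
    if action == "turn_on_ac":
        new_state["ac"] = "on"
        new_state["indoor_temp"] = state["indoor_temp"] - 5  # assumed cooling effect
    elif action == "turn_off_ac":
        new_state["ac"] = "off"
    return new_state  # "do_nothing" falls through unchanged

def best_action(state, reward,
                actions=("do_nothing", "turn_on_ac", "turn_off_ac")):
    """Greedy one-step policy: apply the action whose predicted
    outcome has the highest learned reward."""
    return max(actions, key=lambda a: reward(simulate(state, a)))

# A hand-set reward standing in for the learned one: prefer ~24 degC.
reward = lambda s: -abs(s["indoor_temp"] - 24.0)
print(best_action({"ac": "off", "indoor_temp": 31}, reward))  # → turn_on_ac
```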

Results
After collecting observations and learning the operator's habits for one week, the robot extracts a set of reward functions, corresponding to increasing numbers of samples. To evaluate these learned reward functions, two indexes are adopted: the accuracy of the reward function, r_A, computed by comparing the robot's evaluation of the home states' agreeability with the true values provided by the operator; and the accuracy of the robot's actions, r_D, indicated by the ratio of disagreement on actions between the robot and the operator.
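Under the definitions above, the two indexes reduce to simple ratios over the evaluation set; the label and action encodings below are assumptions:

```python
def reward_accuracy(predicted, true):
    """r_A: fraction of home states whose agreeability the robot judges
    the same way as the operator (labels in {+1, -1})."""
    return sum(p == t for p, t in zip(predicted, true)) / len(true)

def action_disagreement(robot_actions, operator_actions):
    """r_D: ratio of states where the robot's chosen action differs
    from the operator's."""
    return (sum(r != o for r, o in zip(robot_actions, operator_actions))
            / len(robot_actions))

print(reward_accuracy([1, -1, 1, 1], [1, -1, -1, 1]))               # → 0.75
print(action_disagreement(["ac_on", "none"], ["ac_on", "ac_off"]))  # → 0.5
```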
Two sets of experiments are conducted, corresponding to different numbers of objects in the environment. The results show that the two methods have similar accuracy in evaluating home states, but the proposed method is much more accurate in guiding the robot's actions. The reason is that in a new state, a robot using the baseline method has to search its records; if the action-cue pair is not in the records, the baseline method cannot find a correct strategy. By learning the reward function, the proposed method can generate different actions according to the environment states.

Conclusions
In this article, we propose a method to enable a robot to learn the habit of an operator based on observations, in the framework of inverse reinforcement learning. The behavior is described by the environment state changes due to the behaviors, namely the behavioral footprints. The robot learns the cue based on the contact between the operator and the objects, and learns the habits as a reward function based on the operator's behaviors in the house. After that, it uses the reward function to guide its future actions, in order to serve the operator autonomously. This work concentrates on the robot learning how to adjust indoor temperatures, and compares the proposed method with a baseline method on home state evaluation and robot action selection. The results show that the proposed method is more accurate in guiding the robot's actions in complicated scenarios.
In future work, the proposed method can be improved in multiple aspects. First, the basis function can be designed more flexibly, in order to analytically describe the change of environment states. The learning method can also be improved to cover different types of habits, in addition to the one represented by a set of basis functions.