Autonomous Robust Skill Generation Using Reinforcement Learning with Plant Variation

This paper discusses an autonomous space robot for a truss structure assembly using some reinforcement learning. It is difficult for a space robot to complete contact tasks within a real environment, for example, a peg-in-hole task, because of error between the real environment and the controller model. In order to solve problems, we propose an autonomous space robot able to obtain proficient and robust skills by overcoming error to complete a task. The proposed approach develops skills by reinforcement learning that considers plant variation, that is, modeling error. Numerical simulations and experiments show the proposed method is useful in real environments.


Introduction
This study discusses an unresolved robotics issue: how to make a robot autonomous.A robot with manipulative skills capable of flexibly achieving tasks, like a human being, is desired.Autonomy is defined as "automation to achieve a task robustly." Skills can be considered to be solutions to achieve autonomy.Another aspect of a skill is including a solution method.Most human skill proficiency is acquired by experience.Since how to realize autonomy is not clear, skill development must include solution methods for unknown situations.Our problem is how to acquire skills autonomously, that is, how to robustly and automatically complete a task when the solution is unknown.
Reinforcement learning [1] is a promising solution, whereas direct applications of existing methods with reinforcement learning do not robustly complete tasks.Reinforcement learning is a framework in which a robot learns a policy or a control that optimizes an evaluation through trial and error.It is teacherless learning.By means of reinforcement learning, a robot develops an appropriate policy as mapping from state to action when an evaluation is given.The task objective is prescribed, but no specific action is taught.Reinforcement learning often needs many samples.The large number of samples is due to the large number of states and actions.So, online learning in a real environment is usually impractical.Most learning algorithms consist of two processes [2]: (1) online identification by trial and error sampling and (2) finding the optimal policy for the identified model.These two processes are not separated in typical learning algorithms such as Q-learning [3].Reinforcement learning is said to be adaptive because it uses online identification and on-site optimal control design.Robustness attained using this adaptability is often impractical.It takes a very long time for online identification by means of trial and error.
In our approach, by learning a robust policy rather than by online identification, reinforcement learning is used to achieve a solution to an unknown task.
Using our approach, this study addresses an autonomous space robot for a truss structure assembly.It is difficult for a space robot to achieve a task, for example, peg-in-hole task, by contact with a real environment, because of the error between the real environment and the controller model.In order to solve the problem, a space robot must autonomously obtain proficiency and robust skills to counter the error in the model.Using the proposed approach, reinforcement learning can achieve a policy that is robust in the face of plant variation, that is, the modeling error.Numerical simulations and experiments show that a robust policy is effective in a real environment and the proposed method is used.

Need for Autonomy and Approach
2.1.Need for Autonomy.Autonomy is needed wherever robots work.Below, we discuss why and what kind of autonomy is required [4,5] for space robots.Space robots are required to complete tasks in the place of extra vehicular activity by an astronaut.Studies of autonomous systems are needed to realize space robots that can achieve missions under human-operator command.There are many applications for the autonomous space robots [6].
We developed ground experiments to simulate a freeflying space robot under orbital microgravity conditions (Figure 1).Using this apparatus, we have studied robot autonomy.Targeting control-based autonomy, we developed an automatic truss structure assembly, and so forth.However, it has been hard to achieve perfect robot automation because of various factors.To make a robot autonomous in the face of the following challenges, we must (c) overcome the difficulty of comprehensively predicting a wide variety of states; (d) identify tasks and find unknown solutions to realize robust robot autonomy.

Approach to Autonomy Based on Reinforcement Learning.
Human beings achieve tasks regardless of the above complicating factors, but robots cannot.Many discussions have rationalized the difficulties as originating in the nature of human skills.We have not established a means by which we can realize such skills.This study takes the following approach to this problem.Section 4 approaches factors (a) and (b) in part by way of control-based automation taking account of the robustness of controls.The autonomous level of this approach is low because only small variations are allowable.
Section 5 considers how to surmount factors (a), (b), and (c) by learning.Using a predicted model and transferring the learned results to the actual system [7] is studied.This model is identified online and relearned.This procedure is applied where adaptability or policy modification is needed to thwart variations in the real environment.In some cases, the robot completes the targeted tasks autonomously.Learning is completed within an acceptable calculation time.
Section 6 considers factors (a), (b), (c), and (d).A peg-inhole task has higher failure rates using the same approach as used in Section 5.Because of small differences between the model and the real environment, failure rates are excessive.The robot thus must gain skills autonomously.Skills similar to those of human beings must be generated autonomously.These skills are developed by reinforcement learning and additional procedures.
Section 7 evaluates approach robustness and control performance of the newly acquired skills.The skills obtained in Section 6 are better than those in Section 5.

Experimental System and Tasks
3.1.Outline of Experimental System.Our experimental system (Figures 1 and 2) simulates a free-flying space robot in orbit.Robot model movement is restricted to a twodimensional plane [5].
The robot model consists of a satellite vehicle and dual three degrees of freedom (3-DOF) rigid SCARA manipulators.The satellite vehicle has CCD cameras for stereovision and a position/attitude control system.Each joint has a torque sensor and a servo controller for fine torque control of the output axis.Applied force and torque at the end-effector are calculated using measured joint torque.Air pads are used to support the space robot model on a frictionless planar table and to simulate the space environment.RTLinux is installed on a control computer to control the space robot model in real time.Stereo images from the two CCD cameras are captured by a video board and sent into an image-processing computer with a Windows OS.The imageprocessing computer measures the position and orientation of target objects in the worksite by triangulation.Visual information is sent to the control computer via Ethernet.
The position and orientation measured by the stereo vision system involve errors caused by quantized images, lighting conditions at the worksite, and so forth.Timeaveraged errors are almost constant in each measurement.Evaluated errors in the peg-in-hole experiments are modeled as described below.Hole position errors are modeled as a normal probability distribution, where the mean is  = 0 [mm] and standard deviation is  = 0.75 [mm].Hole orientation errors are modeled as a normal probability distribution, where the mean is  = 0 [rad] and standard deviation is  = 0.5/180 [rad].
We accomplish a hand-eye calibration to achieve tasks in the following sections.An end-effector, a manipulator hand, grasps a marker with a light-emitting diode (LED).The arm directs the marker to various locations.The robot calculates the marker location by using sensors mounted at the joints of the manipulator arm.The vision system also measures the marker location by using the stereo image by triangulation.Measurements using these joint-angle sensors have more precise resolution and accuracy.Hence, we calibrate the visual measurements based on measurements using the joint angle sensors.We consider the joint angle sensor measurement data to be the true value.

Tasks
3.2.1.Truss Assembly Task. Figure 3 illustrates the truss structure assembly sequence.The robot manipulates a truss component, connects it to a node, and proceeds each assembly step.Later this task is achieved by controls based upon mechanics understanding [5].The truss design is robot friendly for easy assembling.

Peg-in-Hole
Task.The peg-in-hole task is an example that is intrinsic to the nature of assembly.The peg-in-hole task involves interaction within the environment that is easily affected by uncertainties and variations, for example, errors in force applied by the robot, manufacturing accuracy, friction at contact points, and so forth.The peg easily transits to a state in which it can no longer move, for example, wedging or jamming [8].Such variations cannot be modeled with required accuracy.
To complete a task in a given environment, a proposed method analyzes the human working process and applies the results to a robot [9].Even if the human skill for a task can be analyzed, the results are not guaranteed to be applicable to a robot.Another method uses parameters in a force control designed by means of a simulation [10] but was not found to be effective in an environment with uncertainty.In yet another method [11], the task achievement ratios evaluated several predesigned paths in an environment with uncertainty.An optimal path is determined among the predesigned paths.There was the possibility a feasible solution did not exist among predesigned paths.
In the peg-in-hole experiment (Figure 4), the position and orientation of the hole are measured using a stereo camera.The robot manipulator inserts a square peg into a similar sized hole.This experiment is a two-dimensional plane problem (Figure 5).The space robot model coordinate system is defined as Σ 0 , the end-effector coordinate system as Σ  , and the hole coordinate system as Σ ℎ .While the space robot completes its task, the robot grasps the structural site with another manipulator; the relative relation between Σ 0 and Σ ℎ is fixed.State variables are defined as [  ,   ,   ,   ,   ,   ], where (  ,   ) is the position of Σ  in Σ 0 ,   is the orientation about k 0 -axis, (  ,  ), and   are the forces and torque in Σ 0 that end-effector applies to the environment.
The peg width is 74.0 [mm] and the hole width is 74.25 [mm].The hole is only 0.25 [mm] wider than the peg.The positioning error is composed of the measurement error and the control error.The robot cannot insert the peg in the hole by position control if the positioning error is beyond ±0.125 [mm].Just a single measurement error by stereo vision often moves the peg outside of the acceptable region.

Control-Based Automation
4.1.Truss Assembly Task.Automatic truss assembly was studied via control-based automation with mechanics understanding.The robot achieved an automatic truss structure assembly [5] by developing basic techniques and integrating them within the experimental system.
The following sensory feedback control [12] is used for controlling manipulators: where  is the control input to the manipulator and J is the Jacobean matrix.The y is the manipulation variable whose elements are the hand position/orientation [  ,   ,   ], y  is the reference value of y, q is the joint angle vector, and K  and K  are feedback gains.When the end-effector contacts the environment and manipulation variable y is stationary under constraint, the end-effector applies force and torque to the environment: The force and torque can be controlled by y  .This is a compliant control.Figure 3 is a series of photographs of the experimental assembly sequence.As shown in panel (i), the robot holds on to the worksite with its right arm to compensate for any reaction force during the assembly.The robot installs the first component, member 1, during panels (ii) and (iii).The robot installs other members successively and assembles one truss unit, panels (iv)-(vi).
Control procedures for the assembly sequence are as follows.There are target markers in the experiment environment as shown in Figures 1 and 3. Target markers are located at the base of the truss structure and at the storage site for structural parts.Each target marker has three LEDs at triangular vertices.The vision system measures the marker position and orientation simultaneously.The robot recognizes the position of a part relative to the target marker before assembly.The robot places the end-effector position at the pick-up point, which is calculated from the target marker position as measured by the stereo vision system.At the pick-up point, the end-effector grasps a handgrip attached to the targeted part.The position and orientation of the part to the end-effector are settled uniquely when the end-effector grasps the handgrip.The robot plans the path of the arm and part to avoid collision with any other object in the work environment.It controls the arm to track along the planned path.The robot plans a path from the pick-up point to the placement point, avoiding obstacles by means of an artificial potential method [13].Objects in the environment, for example, the truss under assembly, are regarded as obstacles.The arm is then directed along a planned trajectory by the sensory feedback control (1).The end-effector only makes contact with the environment when it picks up or places the part.Hence, feedback gains in (1) are chosen to make it a compliant control.Consequently, the truss assembly task is successfully completed by control-based automation.However, measurement error in the vision system sensors, and so forth, prevents assembly from being guaranteed.Irrespective of uncertainties and variations at the worksite, the space robot model requires autonomy to complete the task goal.

Peg-in-Hole
Task.The positioning control of (1) tries to complete the peg-in-hole task.The peg first is positioned at   = 0 [rad], it transits to the central axis of the hole, and it moves in a positive direction, toward i 0 .The peg does not contact the environment during transition from the initial state to the goal.Insertion is successful if the position control of the peg relative to the hole is free from error.Unfortunately, the peg-in-hole task is often unsuccessful because of the existing error.The robot cannot insert the peg in the hole by position control if the positioning error  is greater than ±0.125 [mm].So, single measurement error using stereo vision is often beyond the acceptable error.The manner in which the task fails is almost the same as shown in Figure 8 in the next section.

Outline of Existing Method with Reinforcement Learning.
Reinforcement learning [1] is used to generate autonomous robot action.
In "standard" learning (Figure 6(a)), controller K  is designed in advance by learning the nominal plant model P  , and it is applied to real plant P. We use a policy called controller K  , which is designed with reinforcement learning methods.When variations exist, for example, measurement error in the vision system, unexpected obstacles appear in the environment, and so forth, and K  cannot complete tasks due to poor robustness and adaptability.
As shown in Figure 6(b), new plant model P   is reconstructed using visual measurement.Controller K   is designed for the reconstructed model P   .Controller K   is then applied to real plant P. Learning converges within a practical calculation time and the new policy is applicable to the truss structure assembly [5].This method works well because it treats the kinematic problem without force interaction between the robot and the environment.The plant model for learning is reconstructed by visual measurement within a short time.This method has adaptability only if the model can be reconstructed accurately within a short time.
If the robot cannot complete the task with the controller due to error between the model and the real plant, the robot switches to online learning.Adaptability is realized by online identification and learning.However, this cannot be used for peg-in-hole task, because online identification requires too much time.In the next section, a reinforcement learning problem for a peg-in-hole task requires several tens of thousands of state-action pairs.It requires tens of days for online identification if a hundred samples are selected for each state-action pair and each sampling takes one second.There is a cost-free termination state  0 , where  00 (  ) = 1, ( 0 ,   ,  0 ) = 0, and ( 0 ,   ) = 0, ∀  .We assume that the state transition probability   (  ) is dependent on only current state   and action   .This is called a discrete-time finite Markov decision process (MDP).The system does not explicitly depend on time.Stationary policy  is a function mapping states into actions with (  ) =   ∈ U, and  is given by the corresponding time-independent action selection probability (  ,   ).

Existing Method with Reinforcement Learning
In this study, we deal with an infinite horizon problem where the cost accumulates indefinitely.The expected total cost starting from an initial state  0 =   at time  = 0 and using a stationary policy  is where   [⋅] denotes an expected value, and this cost is called -factor.Because of the Markov property, a -factor of a policy  satisfies A policy  is said to be proper if  satisfies   (  ) < ∞, ∀  .We regard the -factor of every state as an evaluation value, and the optimal policy  * is defined as the policy that minimizes the -factor: The -factor of the optimal policy is defined as the optimal -factor.It is denoted by  * (  ).
The optimal policy defined by ( 5) satisfies Bellman's principle of optimality.Then, the optimal policy is stationary and deterministic.The optimal policy can be solved by minimizing the -factor of each state independently.Hence, the optimal -factors satisfy the following Bellman equation, and the optimal policy is derived from the optimal -factors:

Solutions.
The existing type of reinforcement learning problem is solved as "standard" learning in Figure 6(a).It obtains the optimal policy  * nom , which minimizes the factors of (7) for the nominal plant.It corresponds to controller K  in Figure 6(a).The optimal -factor  * nom of  * nom can be obtained by the DP-based solutions.The solutions are mentioned in [1,2], but they are omitted here.[5,5,5,4,3,3].Robot action at the end-effector is  1 ,  2 ,  3 , and  4 , at each of the end-effector states transiting by ±1 in the direction of the i 0 -axis or j 0 -axis, and  5 and  6 , at each of the end-effector states transiting by ±1 about the k 0axis of rotation.State-action space is described in the space robot model coordinate system Σ 0 .The hole is 0.25 [mm] wider than the peg, and (  ,   ) are quantized larger than this difference.

Learning Skill for
Control in ( 1) is used to transit from present state   to the next state   by action   .The reference manipulation variable to make the transition to   is y given by y 6), where y (  )  is kept constant during transition.When the end-effector contacts the environment and manipulation variable y is stationary under constraint, the end-effector applies force and torque to the environment f = −K  (y − y  ) as (2), where y  is the reference manipulation variable.Force and torque are controlled by y  , which is changed by an action.This is a compliant control, a force control.Tasks in which the endeffector contacts the environment, for example, peg-in-hole, demand a control with compliance.Therefore, a compliant control is essential as a basic control.
The robot takes the next action after (1) control settles and the peg becomes stationary.Regardless of whether the peg is in contact with the environment, the robot waits for settling and proceeds to the next action.For this reason, state variables do not include velocity.
The goal is to achieve states with the largest   , the peg position in i 0 -direction, in the state space.The one-step cost is (  ,   ,   ) = 1 for all states other than the goal state.Hence, the -factor is the expected step number from   to the goal.

Learning Method.
State transition probabilities for the state-action space in the previous section are calculated with sample data.Sample data are calculated with a dynamic simulator in a spatially continuous state space.The dynamic simulator is constructed with an open-source library, the open dynamics engine (ODE) developed by Russell Smith.The numerical model of this simulator has continuous space, force, and time in contrast to the discretized models for reinforcement learning in Section 5.2.This discretized state transition model is regarded as plant P  , and the method in Figure 6(a) is applied.The optimal policy  * nom , that is, controller K  , is derived from the solution in Section 5.2.The optimal policy  * nom is applied to the dynamic simulator with a continuous state-action space or the hardware experimental setup, the real plant P in Figure 6(a).This study does not deal with the online learning.

Learning
Result.The result of a numerical simulation in which controller K  is applied to the environment with no position/orientation error is shown.The peg moves and arrives at the goal as shown in Figure 7. Peg positioning is first changed to   = 0 [rad].After the peg transits to the hole central axis, it is moved in a positive direction toward i 0 .Then the peg is inserted into the hole.During the transition from the initial state to the goal, the peg does not make contact with the environment, and the end-effector applies force and torque, [  ,   ,   ] = [0, 0, 0].
In an environment with the hole position error of −0.5 [mm] in j 0 direction, the peg does not arrive at the goal with controller K  ; see Figure 8.The task is not completed using K  due to small errors caused by visual measurement, and so forth.

Autonomous Acquisition of
Skill by Learning

Autonomous Acquisition of Skill for Peg-in-Hole.
A robot can achieve peg positioning or movement with contact force, and it must have basic control functions same as a human being.Human vision measurement and positioning control are not accurate enough.However, the rate of a human failure in the same task is not as high as that of a robot.One reason for this may be the skills a human being brings to the task.A human being conducting peg-in-hole task uses a typical sequence of actions (Figure 9).First, the human being puts a corner of the peg inside the hole.The peg orientation is inclined.The peg is in contact with the environment.Two points of the peg, the bottom and a side, are in contact with the environment, as shown in the close-up in Figure 9.The human then rotates the peg and pushes it against the hole and maintains the two contact points.The two corners are then inserted into the hole.Finally, the human inserts the peg into the hole and completes the task.
Human vision measurement accuracy and positioning control accuracy are not high.A human presumably develops skill while manipulating this situation.We conducted an experiment to check whether robot learning in the same situation could achieve the task as well as a human.
This situation conceptually corresponds to Figure 6(c).Plant P  + ΔP denotes a variation plant with error caused by visual measurement, and so forth.Variation plant set {P  +

Advances in Mechanical Engineering
Close-up ΔP} is composed of all the variation plants that can exist.Real plant P is supposed to be a member of variation plant set {P  + ΔP}.The learning robot obtains controller K   .The controller is able to complete the task for all of the variation plants in {P  + ΔP}.

Problem Definition for Reinforcement
Learning with Variation.We assume there are  variation plants around the estimated plant (the nominal plant).We use a set composed of  variation plants for learning.
We consider difference   between a variation plant and the nominal plant in each state, which is a discrete variable and the element of finite set W. Finite set W is composed of  differences denoted by  0 ,  1 , . . .,  −1 .Difference  0 indicates no difference.If an agent is in state   with difference   and chooses action   , it will move to   within a state transition probability   (  ;   ) and incur a one-step cost (  ,   ,   ;   ).This transition is denoted by (  ,   ,   ;   ).Difference   can be considered as the disturbance that causes state transition probability   (  ;  0 ) to vary to   (  ;   ).We assume that the   (  ;   ) and (  ,   ,   ;   ) are given.
Difference   at each state is determined by a variation plant.Variation  is a function mapping states into difference with (  ) =   ∈ W. The nominal plant is defined by  0 (  ) =  0 for all states   .The plant does not explicitly depend on time, so variation  is time-invariant.We assume that (  ) =   is given.
A plant set composed of  plants used for learning is represented by H = { 0 ,  1 , . . .,  −1 }.Set H corresponds to {P  + ΔP}.Let (  ) denote the probability that the plant variation is   .We call this the existing probability of variation plant   .We assume that () is given at time  = 0.
For set H, the expected cost of a policy  starting from an initial state  0 =   at  = 0 is which is the -factor of this problem.This -factor formula using the plant existing probability is where  ,  (  ) denotes the expected cost using the policy  on a plant   starting from an initial state   .It satisfies We define the optimal policy as which minimizes the -factor of every state.The -factor of the optimal policy  * is defined as the optimal -factor, represented by  * (  ).The objective is to obtain the optimal policy.We assume that there is at least one policy  satisfying   (  ) < ∞, ∀  , in this problem.Henceforth, we will call this problem the original problem.The variation plant in the original problem correlates with differences between any two states.Due to this correlation, the optimal policy does not satisfy Bellman's principle of optimality [14].Therefore, the optimal policy and the optimal -factor in this problem do not satisfy ( 6) and (7).In general, the optimal policy is not stationary.If policies are limited to stationary, the optimal policy is stochastic.
Therefore, another problem definition or another solution method is needed.

Solutions for a Relaxed Problem of Reinforcement Learning with Variation.
We relax the original problem to recover the principle of optimality.Then, we can find the optimal -factor efficiently by applying DP algorithms to the relaxed problem.We treat a reinforcement learning problem based on a twoplayer zero-sum game.
We assume that differences,  0 ,  1 , . . .,   , exist independently in each state   .Then the original problem is relaxed to a reinforcement learning problem [15][16][17] based on a two-player zero-sum game [18] whose objective is to obtain the optimal policy for the worst variation maximizing the expected cost.Since the correlations of differences in a variation plant are ignored, the principle of optimality is recovered.
The H 2 is defined as the set of variation plants consisting of all possible combinations of any differences.Since each state has  types of differences, the number of variation plants in H 2 is    .We define the optimal policy  * 2 as the policy minimizing the expected cost against the worst variation  * 2 maximizing the expected cost and the optimal -factor  * 2 (  ) is defined as the -factor of the optimal policy and the worst variation.
Therefore, the optimal -factor can be obtained by a DP algorithm.Using the optimal -factor, the optimal policy and the worst variation are obtained by The optimal policy  * 2 is applicable to all    variation plants in H 2 .The optimal policy  * 2 is proper for all plants in H of Section 6.2 because H ⊆ H 2 holds.However, the actual number of plants to which the policy should be applied is only  and  ≪    .Hence, the optimal policy of the reinforcement learning problem based on the two-player zero-sum game is often conservative and yields poor performance because the problem does not consider the existence of variation plants.We cannot solve this problem if there is no policy satisfying  , (  ) < ∞, ∀  , ∀ ∈ H 2 , even though the policy  exists and satisfies ∑   ,  (  ) < ∞, ∀  ∈ H. Hence, a solution method to solve the original problem is desired.

Learning of Peg-in-Hole
Task with Variation 6.4.1.Problem Definition of Peg-in-Hole Task with Variations.This section uses the same problem definition for peg-in-hole as Section 5.3.1.The following is added to take variations into account.
The hole position and orientation are measured by the stereo vision system.These measurements involve errors caused by quantized images, lighting conditions at a worksite, and so forth.Time-averaged errors are almost constant while the space robot performs the task, unlike white noise whose time-averaged error is zero.Error evaluations are modeled as described below.Hole position errors are modeled as normal probability distributions, where the mean is  = 0  for learning, as listed in Table 1, where each plant has a combination of errors among [−1.0,0.0, 1.0] [mm] in j 0 -axis direction and [−(0.5/180),0.0, (0.5/180)] [rad] in k 0 -axis rotation.Plant  0 with no error both in j 0 -axis direction and in k 0 -axis rotation is the nominal plant.The plant existing probabilities followed the above-mentioned normal probability distributions.
In the later simulations and experiments to evaluate the learned results, the hole position and orientation are derived from the above normal probability distribution.In the simulations, a variation in the hole position and attitude is chosen for each episode, but the variation is invariant during the episode.

Learning Method.
Under the conditions of the above problem definition, a policy is obtained by the solution in Section 6.3.It is the optimal policy  * 2 of the two-player zero-sum game, that is, K   in Figure 6(c).The optimal policy  * 2 is applied to the dynamic simulator with a continuous state-action space or the experimental hardware setup, which is a real plant P in Figure 6(c).No online learning is needed.
There is no proper policy for all plants in H 2 if the variations in Table 1 are too large.In this case, there is no policy satisfying the reinforcement learning problem based on the two-player zero-sum game.A typical approach for this situation is to make the variations smaller, to reconstruct H 2 , and to solve the two-player zero-sum game again.This approach is repeated if we cannot obtain any solutions.This approach reduces the robustness of solutions.

Control Results.
In results for numerical simulation (Figure 10), where the peg arrives at the goal using controller K   in the environment without hole position/orientation error.Peg positioning is firstly inclined, and the peg moves in the positive direction, toward i 0 .Then, the peg's corner is inserted in the hole.The peg makes contact with a corner of the hole.The peg transits in a positive direction, toward j 0 , while maintaining contact.Another corner of the peg is put inside the hole when the action in the direction of i 0 and j 0 is repeated.Peg positioning is changed to   = 0 [rad], and the peg slips into the hole.The task is completed.The learned result is similar to that of human skill for the peg-in-hole task in Figure 9.
The peg has arrived at the goal using controller K   for variation plants  1 - 8 .The numerical results for  1 - 4 are shown in Figures 11,12,13,and 14.Each transition is similar to the case of  0 , and the peg is inserted into the hole.
The task is achieved with controller K   in the same environment with error, where K  previously did not work at all.This means that the action generated by controller K   is robust against variations as well as human skill.We judge the robot, that is, controller K   , to have obtained a skill, the ability to complete a task when the vision measurement accuracy is low.

Evaluation of Robustness and Control Performance.
Robustness and control performance of controllers K  and K   are evaluated by simulations and hardware experiments, the peg-in-hole task.
Variation plants, that is, error in hole position and orientation, are derived from the normal probability distribution in Section 6.4.The robustness and the control performance are evaluated, respectively, by the task achievement ratio and the average step number to the goal.The achievement ratio equals the number of successes divided by the number of trials.Table 2 shows the achievement ratios and the average step number of K  and K   as evaluated by simulations and hardware experiments.The simulations and the experiments are executed 10,000 times and 50 times, respectively.The achievement ratios of K  are 59% and 64% in simulation and hardware experiments.Those of K   dramatically increase to 99% in numerical simulation and 96% in hardware experiments.These results show that the robot autonomously  generates robust skill using the proposed learning method.The difference in step numbers between hardware experiments and simulations, an increase in hardware steps, may be due to variations, for example, irregular friction in the environment, joint flexibility, and so forth.Such variables are not considered in the numerical simulation.Robust skills are thus autonomously generated by learning in this situation, where variations make task achievement difficult.

Conclusions
We have applied reinforcement learning to obtain successful completion of a given task when a robot normally cannot complete the task using controller designed in advance.Pegin-hole achievement ratios are usually low when we use conventional learning without consideration of plant variations.In the proposed method, using variation consideration, the robot autonomously obtains robust skills which enabled the robot to achieve the task.Simulation and hardware experiments have confirmed the effectiveness of our proposal.Our proposal also ensures robust control by conducting learning stages for a set of plant variations.

Figure 1 :Figure 2 :
Figure 1: Photograph of space robot model and truss.
(a) solve problems in the actual robot environment; (b) operate robustly in the face of uncertainties and variations;
5.2.1.Problem Definition.Following general dynamic programming (DP) formulations, this paper treats a discretetime dynamic system in a reinforcement learning problem.A state   and an action   are the discrete variables and the elements of finite sets S and U, respectively.The state set S is composed of   states denoted by  1 ,  2 , . . .,    and an additional termination state  0 .The action set U is composed of  actions denoted by  1 ,  2 , . . .,   .If an agent is in state   and chooses action   , it will move to state   and incur a one-step cost (  ,   ,   ) within state transition probability   (  ).This transition is denoted by (  ,   ,   ).

Figure 6 :
Figure 6: Learning using (a) nominal plant model, (b) updated plant model, and (c) plant model with variation.

Figure 7 :Figure 8 :
Figure 7: Trajectory of controller K  in a simulation without any hole position error.
[mm] and standard deviation is  = 0.75 [mm].Hole orientation errors are modeled as normal probability distributions, where the mean is  = 0[rad] and standard deviation is  = 0.5/180 [rad].If the error's statistical values gradually vary, we have to estimate them online.The relative position and orientation between Σ 0 and Σ ℎ are fixed during the task.The plant variations are modeled as hole position and orientation measurement errors.Consider these errors as variations ΔP added to nominal model P  (Figure6(c)).We constructed 9 plants  0 ∼  8

Figure 10 :Figure 11 :Figure 12 :
Figure 10: Trajectory of controller K   in a simulation using nominal plant  0 without any hole position error.

Figure 13 :Figure 14 :
Figure 13: Trajectory of controller K   in a simulation using plant  3 .

7. 1 .
Results of Hardware Experiments.Example results in the hardware experiment using controllers K  and K   are shown in Figure 15.The following variations are used: +0.3 [mm] in i 0 , +1.2 [mm] in j 0 , and +0.5/180 [rad] rotation about the k 0 -axis.Controller K  cannot complete the task due to environmental variations, but controller K   can.

Figure 15 :
Figure 15: Experimental trajectories using two controllers in an environment with error (a) controller K  and (b) controller K   .

Table 1 :
Hole position of variation plant   from the nominal plant.

Table 2 :
Achievement ratios and averaged step numbers of peg-inhole task with controllers K  and K   .