Crossing the Reality Gap: a Survey on Sim-to-Real Transferability of Robot Controllers in Reinforcement Learning

The growing demand for robots able to act autonomously in complex scenarios has widely accelerated the introduction of Reinforcement Learning (RL) in robot control applications. However, the intrinsic trial-and-error nature of RL may result in long training times on real robots and, moreover, may lead to dangerous outcomes. While simulators are useful tools to accelerate RL training and to ensure safety, they are often provided with only an approximated model of the robot dynamics and of its interaction with the surrounding environment, thus resulting in what is called the reality gap (RG): a mismatch between simulated and real control-law performances caused by the inaccurate representation of the real environment in simulation. The most undesirable result occurs when the controller learnt in simulation fails the task on the real robot, thus resulting in an unsuccessful sim-to-real transfer. The goal of the present survey is threefold: (1) to identify the main approaches to face the RG problem in the context of robot control with RL, (2) to point out their shortcomings, and (3) to outline new potential research areas.


I. INTRODUCTION
R EINFORCEMENT Learning (RL) [1] makes it possible to design controllers (often referred to as agents) with the capability of learning an optimal behaviour by interacting with the environment. The behaviour is defined in terms of state-action pairs, also known as a policy, which is learnt through a trial and error process.
In some well-known RL tasks, such as pole-balancing, grid-world, or mountain car, the state and action spaces of the system are small enough to allow approximating policies through tables [1], i.e., actions and states can be dealt with as finite discrete variables. However, the higher the complexity of the system to control, the more ineffective the tabular approaches become [1]. Indeed, an increase in system complexity is often related to an increase in the dimensions of the state and action spaces, which makes a tabular approach intractable. In challenging cases, such as robot control, treating states and actions as continuous variables in a compact set is a more appropriate way to deal with the problem [2,3]. For this purpose, approximators of the policy, of some supporting element such as the value function, or of both, are required [1]. When Deep Neural Networks (DNNs) are employed as approximators, the approach is referred to as Deep Reinforcement Learning (DRL); it allows developing RL controllers with less manual feature engineering than classic tools (radial basis functions, tile coding, etc.) [4,5,6]. On the other hand, when DRL is directly employed on the robot in real-time, it results in considerably long training times. Moreover, due to the intrinsic trial and error nature of RL, real-world training, in particular during the exploration phase of the state and action spaces, can lead to unsafe actions of the robot. Therefore, a way to train robots safely and quickly is needed.
Simulators make it easy to address these problems, once they are provided with a model of the robot dynamics able to replicate the actual behaviour as closely as possible. In principle, simulators allow training the controller with faster and safer procedures: once the policy has been learnt, it is transferred to the real system (sim-to-real transfer) [7]. However, sim-to-real transfer is only effective when the simulator is given a sufficiently accurate model of the real robot and the environment [8]; unfortunately, the more accurate the simulation, the heavier the computational cost. A less accurate simulator is therefore often preferred, although it may result in a less effective sim-to-real transfer. The phenomenon in which a controller learnt in simulation degrades once applied in the real world is the so-called reality gap (RG) [9]. In the worst case, the RG leads to a failure of the policy when applied in the real world, which means a robot unable to achieve its goal.
RL is not the only approach that can be affected by the RG. Any technique in which the controller design relies on a simulator of the real system can potentially exhibit a reality gap [10]. Indeed, several works have addressed the RG problem in other frameworks, such as Evolutionary Computation [9,11,12,13,14,15,16,17] or Model Predictive Control [18,19,20,21].
However, here we focus our attention only on those works facing the RG problem on robot controllers learnt with RL. Most of the many solutions proposed in the literature are task-dependent and/or have been tested on a specific task only. The outcomes are that: (a) generalisation is not ensured, and (b) a comparison between different approaches is not feasible.
Although a sketch of the current state of the art is already proposed in [22], here we conduct a more in-depth analysis. We introduce the main concept behind the RG in a general RL framework and we survey relevant and recent literature concerning RG in the context of robot control with RL. According to our analysis, the approaches for coping with RG in this context fall into three broad categories: domain randomisation (DR), adversarial reinforcement learning (ARL), and transfer learning (TL). Hence, we introduce a general RL framework suitable to be specialised for each of the mentioned approaches.
The aims of the present work are: (1) to provide a systematic picture of the literature concerning how to solve the RG problem in robot control tasks with RL; (2) to clarify the differences between the three main identified approaches by highlighting the relative pros and cons; (3) to identify new possible research areas.
The remainder of the paper is organised as follows. In Section II we describe a formal framework for RL useful to better explain the key elements of DR, ARL, and TL. In Section III we introduce RL as a means for robot control and discuss the RG in the context of robotics. In Section IV we survey and discuss the current methodologies to face the RG, relying, where possible, on the provided formalism. Finally, in Section V, we draw the conclusions.

II. REINFORCEMENT LEARNING
Reinforcement Learning (RL) can be employed to perform optimal data-driven control without the need to rely on a mathematical model of system dynamics [1,23,24].
It is typically described as a Markov Decision Process (MDP), i.e., a tuple (X, A, p, r) defined by the state set X, the control set A, the transition probability p, and the immediate reward r ∈ R. In brief, it expresses a discrete-time stochastic control process in which a scalar r is employed to assess the quality of the controller's choices in terms of task achievement.
Here, we adopt a control systems-oriented formalism similar to the one employed in [25]. However, what follows also applies to MDP settings by defining some of the following elements in terms of expected values.

A. ADOPTED FORMALISM
A dynamical discrete-time system Ω is a tuple (X, A, O, f, g) composed of the state set X, the control set A, the observation set O, a transition function f : X × A → X, and an observation function g : X → O. Let x^(k) ∈ X, a^(k) ∈ A, and o^(k) ∈ O be the state, the applied control input, and the observation, respectively, at the k-th time instant. A dynamical discrete-time system, starting from an initial state x^(0) and subjected to a control sequence a^(0), a^(1), . . . , evolves according to the following laws:

x^(k+1) = f(x^(k), a^(k)),    (1)
o^(k+1) = g(x^(k+1)).    (2)

An environment E is a dynamical discrete-time system described by a tuple (X, A, O, f, g, h) composed of the same elements of Ω plus a reward function h : X × A → R. Let r^(k+1) ∈ R be the reward at the (k+1)-th time instant. An environment E, starting from an initial state x^(0) and subjected to a control sequence a^(0), a^(1), . . . , evolves according to Equations (1) and (2) and:

r^(k+1) = h(x^(k), a^(k)).    (3)

Overall, we assume that g(f(·)) = f(·), thus resulting in o^(k+1) = x^(k+1). The following is also pertinent, with appropriate adjustments, in the case of g(f(·)) ≠ f(·), which simply means dealing with a partial observability of the state. In the MDP setting, this amounts to employing a Partially Observable Markov Decision Process (POMDP), which requires introducing the concept of belief [26]. However, this is not necessary for the purpose of the present work.
A controller (or policy) for a dynamical discrete-time system, and therefore also for an environment, is a function π : O → A. Given a discount factor γ ∈ [0, 1], an optimal policy π* is a policy that satisfies, for any initial state x^(0):

π* = argmax_{π∈Π} J_π(x^(0)),  with  J_π(x^(0)) = Σ_{k=0}^{∞} γ^k r^(k+1),    (4)

where Π is the set of policies, while J_π(x^(0)) is the infinite horizon discounted reward starting from x^(0) under the policy π.
We denote with π ↔ E the closed-loop system where π determines the control input to be applied on E, represented in Figure 1. In particular, according to the above definitions and to the assumption that g(f(·)) = f(·), the scheme represents a state feedback control. Clearly, when g(f(·)) ≠ f(·), the same scheme could represent an output feedback control.

FIGURE 1: The closed-loop system π ↔ E where a policy (light blue block) applies a control input a^(k) to an environment (gray block) that outputs an observation o^(k+1) and a reward r^(k+1).

FIGURE 2: The closed-loop system L ↔ E where an RL-based agent L (green block) applies a control input a^(k) to an environment (gray block) that outputs an observation o^(k+1) and a reward r^(k+1).
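To make these definitions concrete, the following is a minimal sketch of the closed loop π ↔ E and of the discounted reward of Equation (4) (ours, in Python; the one-dimensional point-mass dynamics and the proportional policy are illustrative toys, not taken from any surveyed system):

```python
# Toy environment E = (X, A, O, f, g, h): a one-dimensional point we want to drive to the origin.
def f(x, a):                    # transition function f: X x A -> X
    return x + 0.1 * a

def g(x):                       # observation function g: X -> O (here full state: o = x)
    return x

def h(x, a):                    # reward function h: X x A -> R
    return -(x ** 2) - 0.01 * (a ** 2)

def discounted_return(policy, x0, gamma=0.99, T=200):
    """Roll out the closed loop pi <-> E and accumulate the discounted reward from x0."""
    x, J = x0, 0.0
    for k in range(T):
        a = policy(g(x))        # the policy maps observations to control inputs
        J += (gamma ** k) * h(x, a)
        x = f(x, a)             # the state evolves according to Equation (1)
    return J

print(discounted_return(lambda o: -o, x0=1.0))  # a simple proportional controller
```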
A policy learning algorithm L is an algorithm that, given an environment E = (X, A, O, f, g, h) and a learnable policies set Π_L ⊆ Π, outputs (learns) a policy π_L = L(E) ∈ Π_L. When the learning encompasses an interaction with the environment, the policy learning algorithm can be seen as an agent L that learns a controller π by interacting with an environment E. We denote by L ↔ E the resulting closed-loop system (Figure 2). A policy learning algorithm is said to be model-based if it relies on the knowledge of f and g (either known or identified from collected data), or model-free otherwise.
A typical RL agent can be seen as a policy learning algorithm L. Indeed, during training, it interacts with the environment E = (X, A, O, f, g, h) and updates a controller π by observing the consequences (in terms of reward r) of selected control inputs. Its global goal is to learn a controller π * in E which satisfies (4).
The learning procedure can either be episodic or continuing (non-episodic): in the former case, the learning is performed through episodes and the state is reset in case of a failure, a goal achievement, or the achievement of the maximum episode length T. In addition, the agent can update the controller either by using data from the current policy (on-policy) or independently of it (off-policy). RL approaches can be grouped into three main categories [1]:
• Value-function approaches, based on the idea of the value of a state (value function V_π(x)) or of a state-control input pair (action-value function Q_π(x, a)). The value function and the action-value function represent, respectively, the cumulative reward obtained from x^(k) by applying π, and the cumulative reward obtained from x^(k) by first choosing an a^(k) and then applying π. The optimal policy corresponds to the optimal V or Q functions:

V*(x) = max_{π∈Π} V_π(x),  Q*(x, a) = max_{π∈Π} Q_π(x, a),    (5)
π*(x) = argmax_{a∈A} Q*(x, a).    (6)

In brief, by interacting with the environment and observing rewards, either V* or Q* is estimated, thus leading to the optimal policy (a minimal tabular sketch of this family is given after this list).
• Policy-search approaches, in which a parametrized policy π_θ is defined, whose parameters θ are updated based on the observed reward in order to maximise J_{π_θ}, employing either gradient-based or gradient-free optimisation techniques [27].
• Actor-Critic approaches, which integrate the ideas of both previous categories: V_π(x), in this case, is employed as a baseline (Critic) for policy gradient optimisation (Actor).
Further technicalities are not necessary for the purpose of this work. For additional details on RL we refer readers to [1,25,28,29].
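As an illustration of the value-function family, the following is a minimal tabular Q-learning sketch (ours; a textbook update rule, not tied to any surveyed paper), in which Q* is estimated from observed transitions when X and A are finite:

```python
import numpy as np

n_states, n_actions = 10, 2
Q = np.zeros((n_states, n_actions))      # tabular action-value function estimate
alpha, gamma, eps = 0.1, 0.99, 0.1

def q_learning_update(x, a, r, x_next):
    """One trial-and-error step towards Q*: bootstrap on the greedy value of x_next."""
    td_target = r + gamma * Q[x_next].max()
    Q[x, a] += alpha * (td_target - Q[x, a])

def policy(x):
    """eps-greedy control input selection from the current Q estimate."""
    if np.random.rand() < eps:
        return np.random.randint(n_actions)   # exploration
    return int(Q[x].argmax())                 # exploitation
```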

III. REINFORCEMENT LEARNING IN ROBOTICS
The motivation for using RL in robotics is to make a robot autonomous in finding an optimal policy, through trial and error interactions with its environment, without explicit knowledge of the model (model-free). However, as a matter of fact, the most effective methods to date are model-based [30,31,32]. In addition, policy-search approaches result in more efficient training in terms of the time needed for convergence [33,34,35,36,37,38]. Regardless of the specific RL method being used, the application of RL has enabled researchers and practitioners to tackle several significant robotic tasks (e.g., manipulation, navigation, motion control, etc.). In Table 1, we report the tasks that are most often considered, along with a few significant research papers, some of which (those dealing with sim-to-real transferability) are surveyed in the present study.
A first issue concerns the representation of V_π(x) and Q_π(x, a). While in some simple RL scenarios the O and A sets can be discretised over a finite range, hence allowing a tabular representation of V_π(x) or Q_π(x, a), in robot control problems the physical nature of observations and control inputs advocates for a finely discretised (ideally, continuous) representation of O and A. In this case, the table may require a huge amount of memory and the problem becomes practically intractable. An alternative and often suitable solution to escape the tabular representation is function approximation, recently addressed by DRL. Here, DNNs are used as function approximators of V_π(x), Q_π(x, a), or directly of π, and a loss function η_θ is designed in order to guide the training of the network itself. For a more detailed description of DRL, including recent efforts and some applications, we refer the reader to [4,63].
A second issue, pivotal for the aim of the present work, derives from the trial and error process employed by RL and its direct application to the robot: the learning time may become too long, thus impractical, and the risk of damaging the robot or, more in general, the environment may be too high. Typically, simulators are used for addressing this issue; i.e., an environment mapping operator φ is assumed to exist (albeit unknown in practice) such that E′ = φ(E) = (X′, A′, O′, f′, g′, h′) is a digital approximated copy of E. Intuitively, φ corresponds to modelling a real system described by E as a simulated system described by E′, such that a policy learnt on E′ can be applied to E. The resulting controller design and test are, therefore, split in two distinct phases that are performed on two different environments: (i) an agent L′ interacts with the simulated environment E′ = φ(E) and outputs a controller π*; (ii) the resulting π* is applied on the real E. In this scenario, outlined in Figure 3, the latter step, which in general can be defined as E′-to-E transfer, can be renamed sim-to-real transfer, given the nature of the considered E′ and E.
As will be clear from the examples reviewed in the following sections, a controller learnt in simulation often exhibits performance losses when applied on the real robot and, in the worst scenario, totally fails the task. From the point of view of the designer, this issue becomes relevant when, although a safer and faster training and an effective test have been carried out on E′, the policy learnt in simulation does not achieve the goal in the real world. In such a situation, the learning algorithm is affected by an intolerable RG and the corresponding learnt policy π* is said to be non-transferable.
Note that, although it is not always clearly emphasised, a mandatory step before the sim-to-real transfer is to test the learnt policy on the simulator itself¹. Otherwise, there is no guarantee that the learnt controller is effective in achieving the task even in simulation. For instance, a typical learning stop criterion consists in terminating the training when the moving average of the cumulative reward settles down. However, this empirical rule does not ensure that the learnt controller is able to correctly perform the task. The learning algorithm may have been trapped in a local maximum, and the resulting controller may exhibit a completely unexpected behaviour in tests, even on the simulator. If the test on the simulator is effective, the possible reasons for a performance loss in a test on the real robot, and therefore for RG, are: (a) the φ operator is unrealistic: the differences between E and E′ in f, f′ and/or g, g′ are such that, applying the same input on both environments, the resulting o and o′ are different; or, possibly, the differences between E and E′ in h and h′ lead to two different optimisation problems, J_π(x) ≠ J′_π(x′); (b) L is unable to output a π* sufficiently robust to the possibly small and unavoidable errors in φ. However, the particular case in which h′ is very different from h is unusual in practice. Typically, h is properly designed and remains "the same" for both E and E′ (for this reason, hereinafter we consider h′ = h). Therefore, the RG can be essentially attributed to a mismatch between E and E′, and the possible solutions may be: (a) improve E′, by properly adjusting φ (not always possible, because it could result in an excessive computational effort or because of lack of knowledge); (b) make the controller more robust to model errors.

¹This only applies to those situations where the controller is transferred to the real robot. As shown in Section IV-C, in some scenarios the training continues on the real robot; hence, a test procedure is not needed in these cases.
Note that, in general, it is not required that the controller behaves identically when applied to E′ and E, but, rather, that it is E′-to-E transferable, i.e., sim-to-real transferable. In practice, this requirement translates to a (subjectively) properly bounded RG. Clearly, if an appropriate behaviour is only reached when E′ is an identical copy of the robot E (E′ = φ(E) = E), using E′ rather than E has no benefit in reducing the overall learning time. Still, E′ may be useful for addressing risks concerning safety. Figure 4 summarises a generic routine for investigating the presence of the RG. We start by (1) performing a training on E′; (2) then we test the resulting policy π on the same E′, to check its effectiveness in achieving the task, and, only if the test ends successfully, (3) we test π also on E; otherwise, we revert to (1). Once (3) has been executed, (4) we compare the performance obtained in the tests of π on E′ and E.
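This routine can be summarised by the following sketch (ours; `train`, `test`, and `task_achieved` stand for task-specific choices that the designer must supply, and the scalar gap criterion is an illustrative assumption):

```python
def reality_gap_routine(train, test, task_achieved, E_sim, E_real, tol):
    """Generic routine for investigating the presence of a reality gap (Figure 4).
    train/test/task_achieved are task-specific callables supplied by the designer."""
    while True:
        pi = train(E_sim)            # (1) train on the simulated environment E'
        J_sim = test(pi, E_sim)      # (2) test the learnt policy on E' itself...
        if task_achieved(J_sim):     # ...and transfer only if it achieves the task there
            break
    J_real = test(pi, E_real)        # (3) test pi on the real environment E
    gap = J_sim - J_real             # (4) compare the performances on E' and E
    return pi, gap, gap <= tol       # pi is transferable if the RG is properly bounded
```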

IV. METHODOLOGIES FOR SOLVING RG
In the present paper, we focus only on those articles that meet all the following requirements: (i) they explicitly address the RG problem, (ii) they deal with robot control applications, and (iii) they employ RL techniques. We have identified three major categories of approaches for addressing the RG in this scenario: domain randomisation (DR), adversarial RL (ARL), and transfer learning (TL). All the articles discussed below are summarised in Table 2 according to this categorisation. The table also shows, for each article, whether the authors conducted experiments only in simulation (sim-to-sim) or (also) on a real robot (sim-to-real), and specifies the employed simulators. Due to task diversity (Figure 5 shows a visual summary of the robotic tasks) and the lack of a common theoretical framework for the RG, the surveyed articles do not present their results in a way that permits a systematic comparison. However, we provide a general formal definition for each of the previously mentioned categories, according to the formalism of Section II, which allows each approach to be understood.

A. DOMAIN RANDOMISATION
Domain randomisation (DR) has already achieved good results in the sim-to-real transfer of robotic controllers outside RL [54,75,76,77,78,79]. The main idea behind this approach is what in control theory is called robust control under either parametric or non-parametric uncertainty [80,81], that is, the design of controllers able to guarantee certain properties despite tolerable parameter variations and/or noise.
FIGURE 5: Overview of some robotic tasks considered in the surveyed articles to address the RG. (a) The Fetch robot used in [49]. (b) The Minitaur of [54]. (c) The robot employed for the manipulation task in [74]. (d) The robotic arm engaged in deformable object manipulation of [45]. (e) The marble maze game of [56]. (f) The ball on plate system used by [53]. (g) The Fetch robot used for the pushing task of [44]. (h) The quadrotor employed for the autonomous navigation task of [40]. (i) The five-finger humanoid hand used in [57]. (j) The classical OpenAI Gym environments used to test several strategies in simulation [49,50,51,52].

We call Ẽ′ = φ̃(E) a corrupted simulator described by a tuple (X̃′, Ã′, Õ′, Z′, Υ′, f̃_ξ′, g̃_ψ′, h), where Z′ is the process disturbances set, Υ′ is the measurement disturbances set, f̃_ξ′ : X̃′ × Ã′ × Z′ → X̃′ is the corrupted and parametric transition function, with parameters ξ′ ∈ Ξ′, and g̃_ψ′ : X̃′ → Õ′ is the corrupted and parametric observation function, with parameters ψ′ ∈ Ψ′. Given a parametrisation ξ′, ψ′, starting from an initial state x̃′^(0) and subject to a control sequence ã′^(0), ã′^(1), . . . , a process disturbance sequence ζ^(0), ζ^(1), . . . , and a measurement disturbance sequence υ^(0), υ^(1), . . . , a corrupted simulator Ẽ′ evolves according to:

x̃′^(k+1) = f̃_ξ′(x̃′^(k), ã′^(k), ζ^(k)),    (7)
õ′^(k+1) = g̃_ψ′(x̃′^(k+1)) + υ^(k+1),    (8)
r^(k+1) = h(x̃′^(k), ã′^(k)).    (9)

The main idea behind DR is that, during training, L′ selects ξ′ and ψ′, interacts with the resulting environment Ẽ′, and updates a controller π by observing the consequences (in terms of reward) of the selected control inputs ã′^(k), process disturbances ζ^(k), and measurement disturbances υ^(k). Its final goal is twofold: (a) maximise the finite horizon discounted reward in a perturbed environment (see Equation (4)) and (b) find a solution π* which ensures a loss in performance lower than a threshold when applied on different domains of the same distribution. In particular, the sim-to-real transferability of the final controller π* is here seen as a form of controller robustness, obtained by training π on a collection of environment models, chosen by L′, instead of a single one; Figure 6 graphically summarises this process. The resulting controller π*, learnt by maximising the finite horizon discounted reward J^{π,T} under these conditions, is expected to be robust to perturbations. Therefore, if these perturbations are such that the L′ ↔ Ẽ′ interaction returns a π* affected by a tolerable RG, the result is a sim-to-real transferable controller. Table 3 summarises the articles that tackle the RG using the DR approach, along with the employed learning algorithms and the considered tasks.

TABLE 3: Articles facing the RG with DR, with the employed learning algorithms and the considered tasks.

Ref.   Algorithm        Task
[39]   DQN^h            Vision-based quadrotor indoor navigation
[50]   TRPO^a           Benchmark control tasks (Inverted pendulum, Half cheetah, Hopper, Walker)
[53]   TRPO^a           Ball on plate with robotic arm
[44]   HER^b + RDPG^c   Pushing task with robotic arm
[45]   DDPGfD^f         Deformable object manipulation
[54]   PPO^e            Trotting and galloping of quadruped
[56]   A3C^d            Marble maze game with robotic arm
[57]   PPO^e            Rubik's cube with robotic hand
[46]   QT-Opt^g         Vision-based control task of a robotic arm for grasping
[61]   PPO^e            Climb and descend stairs with bipedal robot
[41]   A3C^d            Wheeled mobile platform navigation

^a Trust Region Policy Optimisation [83], ^b Hindsight Experience Replay [84], ^c Recurrent Deterministic Policy Gradient [85], ^d Asynchronous Actor-Critic Agents [86], ^e Proximal Policy Optimization [87], ^f Deep Deterministic Policy Gradient from Demonstration [88], ^g Q-function Targets via Optimization [89], ^h Deep Q Network [90].
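The DR training loop can be sketched as follows (ours, in Python; the point-mass dynamics, the sampling ranges for ξ′ and ψ′, and the placeholder proportional policy are illustrative assumptions, not taken from any surveyed paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_simulator():
    """Draw a corrupted simulator by sampling dynamics (xi) and observation (psi) parameters."""
    xi  = {"mass": rng.uniform(0.5, 2.0), "friction": rng.uniform(0.0, 0.3)}
    psi = {"obs_noise": rng.uniform(0.0, 0.05)}
    return xi, psi

def step(x, a, xi, psi):
    """One step of the corrupted simulator: parametric dynamics plus process and measurement noise."""
    zeta = rng.normal(0.0, 0.01)                     # process disturbance zeta(k), Eq. (7)
    x_next = x + 0.1 * (a / xi["mass"]) - xi["friction"] * x + zeta
    upsilon = rng.normal(0.0, psi["obs_noise"])      # measurement disturbance upsilon(k+1), Eq. (8)
    o_next = x_next + upsilon
    r_next = -x_next ** 2                            # reward h is kept un-randomised, Eq. (9)
    return x_next, o_next, r_next

for episode in range(1000):
    xi, psi = sample_simulator()    # a new environment model from the collection, per episode
    x = rng.normal()
    o = x
    for k in range(100):
        a = -o                      # placeholder policy; in DR this is the policy being learnt
        x, o, r = step(x, a, xi, psi)
        # ... policy update with (o, a, r) using any RL algorithm ...
```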
We remark that in some of these studies the actual sim-to-real transferability is not evaluated (see Table 2); instead, the controller robustness with respect to the perturbations is tested. We discuss each of the papers below.
Sadeghi and Levine [39] train a vision-based navigation policy entirely in simulation and use it on a real quadrotor without performing additional real training runs. During training, at each time k, the state of the system is represented by an indoor synthetic image I^(k) generated by a renderer. Images are generated so as to reproduce different hallways and a variety of environment parametric settings (ξ′, ψ′). First, a Deep Convolutional Neural Network is learnt in order to predict the collision probability for each pair I^(k), a^(k). Then, a Deep RL agent is trained to fine-tune the previous model so as to provide the action-value function Q(I^(k), a^(k)). The hallway randomisation enacts a wide variety of environments, and the resulting policy shows very good performance during test, both in simulation and in the real world, even in environments never seen during training. However, performance falls when the drone encounters reflective glass doors, resulting in a crash.

Mandlekar et al. [50] introduce an algorithm, called Adversarially Robust Policy Learning (ARPL), to teach a controller to behave correctly in the presence of increasing adversarial perturbations. The agent uses a curriculum learning approach [91], in which ξ′, υ^(k), and ζ^(k) alternately assume the form of isometrically scaled versions of Fast Gradient Sign Method (FGSM) perturbations [92]. Here, the controller is parametrized by θ (π_θ) and updated following the on-policy vanilla Trust Region Policy Optimisation (TRPO) [83]. The key idea is to use a corrupted simulator Ẽ′ in training, and then to test the resulting π* on a different corrupted simulator Ẽ″ = (X̃′, Ã′, Õ′, Z″, Υ″, f̃_ξ″, g̃_ψ″, h), obtained with different perturbation sets Z″ ≠ Z′, Υ″ ≠ Υ′, Ξ″ ≠ Ξ′, and Ψ″ ≠ Ψ′. These perturbations are such that π* is misled into providing wrong control inputs a^(k+1).
The choice of adversarial perturbations is motivated by the fact that models trained under them are likely to generalise well [93].
The ARPL algorithm has been tested in several benchmark examples (Inverted pendulum, Half cheetah, Hopper, Walker) and seems to deliver promising results, exhibiting significant robustness. However, examples of sim-to-real controller transferability have not yet been provided.
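The FGSM-style perturbations at the core of ARPL can be sketched as follows (ours, with PyTorch; the small policy network and the surrogate objective on its output are illustrative assumptions, since the exact objective used for the gradient is task-dependent):

```python
import torch

# Illustrative stand-in for a differentiable policy pi_theta.
policy = torch.nn.Sequential(torch.nn.Linear(4, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))

def fgsm_observation(o, eps=0.01):
    """Perturb an observation along the direction that most changes the policy output,
    i.e. upsilon(k) = eps * sign(grad_o J) (Fast Gradient Sign Method)."""
    o = o.clone().detach().requires_grad_(True)
    objective = policy(o).sum()       # scalar surrogate objective on the policy output
    objective.backward()
    upsilon = eps * o.grad.sign()     # isometrically scaled adversarial disturbance
    return (o + upsilon).detach()

o_adv = fgsm_observation(torch.randn(4))
```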
The Simulation-based Policy Optimisation with Transferability Assessment (SPOTA) algorithm, designed in [53], uses randomised physics parameters, drawn from a probability distribution ξ′ ∼ ρ_κ(ξ′) (parametrized by κ), to perform a robust optimisation of the controller. In SPOTA, the controller is trained on model ensembles, according to the following four phases: (i) learn a candidate solution π_θ^C using a TRPO updating rule; (ii) learn n_R reference solutions π_θ^{R_j}, j = 1, . . . , n_R, on n_R different Ẽ′, each obtained for different ξ′ and ψ′ settings; (iii) compare the performance of the candidate C with that of each reference R_j under the same conditions as R_j; and, finally, (iv) decide whether or not to stop the learning. The last step is carried out by introducing the concept of Simulation Optimisation Bias (SOB): an error caused by an optimistic bias of the optimisation procedure, whose existence has been proven in [94]. The authors assume that it can be treated as the error between the finite horizon discounted reward J^{π*_R,T} obtained by considering the reference solutions and the finite horizon discounted reward J^{π_C,T} of the candidate solution. Taking into account that an RL approach in a stochastic setting only allows finding estimates of J^{π*_R,T} and J^{π_C,T}, the authors derive an upper bound for the tolerated SOB of the candidate solution, called Upper Confidence bound on Simulation Optimisation Bias (UCSOB). In order to ensure a desired performance β, the UCSOB of the final candidate solution must be lower than β. In this framework, the authors have tested the algorithm by developing a controller for a ball on plate task, governed by a robotic arm, within the same simulator (obtaining satisfactory results) and varying the physics engine (with worse outcomes). However, although a sim-to-sim transferability test has been carried out in this case, a sim-to-real controller test has not.
Peng et al. [44] show the effectiveness of memory-based policies (i.e., policies learnt by using past memory for future learning [95,96,97]) to deal with the RG, introducing DR to generalise over the environment dynamics. Hindsight Experience Replay [84] has been used for this purpose: a technique able to generalise over different goals using past experience as a baseline. In this case, the parameters ξ′, the measurement noise υ^(k), and the time step ∆t are sampled according to a distribution, which is a design parameter. In particular, ξ′ is kept fixed for an entire episode, while the others are varied at each time step. The proposed solution, learnt using the RDPG algorithm (off-policy), has been tested on a robotic pushing task and, when transferred to reality, shows performance comparable to that obtained in simulation, despite poor calibration.
An improved version of DDPG [88] is adopted in [45] to solve deformable object manipulation tasks in simulation, and the transferability of the resulting controller to the real robot is then tested. In particular, a robotic arm is involved in three different towel folding tasks, in which RGB images are included in the observation o. DR is here implemented by sampling some environment values from either normal or uniform distributions centred around noisy ground truth estimates. Experimental results suggest that randomisation of the extrinsic camera parameters (i.e., position and orientation) is particularly useful for sim-to-real transfer, since the controller has an evident sensitivity to changes in camera position. Besides, they show that heavy randomisation can lead to unsuccessful transfers.
Controller sim-to-real transferability has also been tested on locomotion tasks of a Minitaur quadruped by Ghost Robotics [54]. Here, the authors have used Proximal Policy Optimisation (PPO) to learn π* and have observed the impact of two different solutions to reduce the RG: (a) improving the simulated model via system identification; (b) using randomised ζ^(k), υ^(k), and ξ′ to learn controllers that remain robust as the observations change. The obtained results suggest that simulator improvement is an essential requirement since, as the model becomes less adequate, not even a robust controller is able to avoid a large RG. The authors also point out that considering a large observation space does not always bring benefits: on the contrary, their evaluations show that controllers learnt in simulation with a large observation space lead to bad results when transferred to the real robot.
Van Baar et al. [56] show the benefits of using DR with the Asynchronous Actor-Critic Agents (A3C) algorithm [86] (on-policy) for learning the controller, with respect to not using DR. The parameters ξ′ are here randomly sampled according to a uniform distribution. Both controllers are then applied to a real-world robot, and the fine-tuning time required for convergence is compared. The analysed task is a marble maze game driven by a robotic arm; the results show that the controller learnt through DR requires fewer fine-tuning steps than the other one, further evidence of an existing trade-off between efficiency and RG.
In a quite recent work [57], Automatic Domain Randomisation (ADR) is proposed in order to transfer a policy learnt in simulation to the real system, framed in a Rubik's cube manipulation task with a robotic hand. Here, the RL agent does not solve the Rubik's cube but "only" learns, using a PPO algorithm (on-policy), how to correctly move the robotic hand in order to perform the control inputs suggested by another, non-AI based, algorithm. What changes with respect to the standard DR idea are the distributions ρ_κ(ξ′) and ρ_κ(υ^(k)) (parametrized by κ) from which υ^(k) and ξ′ are randomly selected. Indeed, while in other DR approaches these distributions are parametrized with a fixed κ, in ADR κ changes during the learning procedure. In particular, additional environments, obtained with different κ, are added to the considered collection of environment models only when a lower performance limit is reached (i.e., a fixed number of successful episodes is performed). The developed controller has first been tested in environments in which the distributions were manually tuned, achieving good results. In addition, a sim-to-real transfer has been performed, with worse results.
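The core mechanism of ADR, i.e., adapting κ so that the randomisation distribution widens only once the current collection of environments is mastered, can be sketched as follows (ours; the thresholds, the step size, and the use of a uniform distribution are illustrative assumptions):

```python
def adr_update(bounds, success_rate, hi=0.8, lo=0.2, delta=0.05):
    """Adapt kappa, here the bounds of a uniform distribution for one physics parameter xi.
    Widen the distribution once the policy masters the current one; shrink it otherwise."""
    low, high = bounds
    if success_rate >= hi:                        # performance threshold reached:
        low, high = low - delta, high + delta     #   harder environments join the collection
    elif success_rate <= lo:
        low, high = low + delta, high - delta     # too hard: back off
    low, high = min(low, high), max(low, high)    # keep the range well-formed
    return (low, high)

bounds = (0.9, 1.1)                   # initial randomisation range
for evaluation in range(100):
    success_rate = 0.85               # placeholder for the measured fraction of successful episodes
    bounds = adr_update(bounds, success_rate)
```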
A Randomised-to-Canonical Adaptation Network (RCAN) is instead introduced in [46]. The main idea is to map the observations collected in the simulated domain, as well as those collected in the real domain, into a common further domain called the canonical domain. The approach has been applied to a vision-based robot grasping task, where the canonical domain consists of extremely simplified images whose purpose is to capture just the information relevant for the task. The map is learnt by using an image-conditioned Generative Adversarial Network (cGAN) [98], able to map an image of a domain D into an adapted image of the canonical domain D_c, i.e., G : D → D_c. The resulting image of D_c is then sent to the controller, which is learnt by using Q-function Targets via Optimisation (QT-Opt) [89]. During training, the cGAN receives randomised simulated images, sampled from the trajectories, and learns to convert them into canonical images. The resulting observations are then used by QT-Opt to produce the policy. In the test procedure, the real-world images are mapped into canonical images and sent to the controller. The proposed approach returns excellent results; however, an effective transfer is not always achieved. In particular, when the cGAN is fed during training with images sampled only from non-successful trajectories, the final controller results in an unsatisfactory transfer.
Siekmann et al. [61] propose a simple terrain randomisation to learn robust proprioceptive controllers for bipedal robots involved in the task of climbing and descending stairs. They model the policy with a Long Short-Term Memory (LSTM) network, for its capability of processing temporal sequences. Indeed, unlike feed-forward neural networks, LSTMs are equipped with a feedback mechanism that allows them to process sequences of input data without treating each sample of the sequence independently: they retain useful information about earlier data points in the sequence, aiding the processing of new ones. The authors compare the performance of three different controllers π: (i) an LSTM controller learnt with varying terrain parameters ψ, (ii) a feed-forward NN controller learnt with the same varying terrain parameters ψ, and (iii) an LSTM controller learnt on a single simulated environment. The experimental results show that the first π is the one with the highest overall probability of success in the task. Thus, the combination of LSTM and DR seems to be an effective solution for the tested task.
Finally, Hu et al. [41] face the reality gap for a controller involved in a wheeled robot navigation task. The proposed solution tries to render the controller robust to possible parametric errors in the model, as well as to possible disturbances corrupting its dynamics. To this end, both the model parameters and the disturbances are randomised during training, and irregular terrain is also taken into account. The resulting π, learnt entirely in simulation, results in an effective application on the real-world environment. Moreover, a comparison with some state-of-the-art solutions for robot navigation highlights the better performance of the proposed approach in terms of success rate, as well as cumulative travel distance and time required for task execution.

B. ADVERSARIAL RL
In adversarial RL (ARL), the agent L is composed of two sub-agents: the protagonist L_P and the antagonist L_A. The underlying idea resembles the one behind domain randomisation: enforce robustness (and, hence, improve controller transferability) by training the controller on a collection of environment models instead of a single one. In the case of ARL, however, the diversity is obtained by training a secondary controller (the adversary) to generate models that are more difficult to handle (those that minimise the cumulative reward). Figure 7 graphically summarises the process of ARL.
Given a corrupted simulator Ẽ′ of E, defined by a tuple (X̃′, Ã′, Õ′, Z′, Υ′, f̃_ξ′, g̃_ψ′, h) and evolving according to Equations (7) to (9), L_P and L_A interact with Ẽ′, each seeking to maximise its own discounted cumulative reward [99].
We denote with r̃_P^(k+1) the reward provided to L_P; a typical choice [100] is to provide L_A with a reward r̃_A^(k+1) = −r̃_P^(k+1), i.e., a zero-sum setting. As a result, two controllers are learnt:
• π_P, whose target is to maximise the cumulative reward over time, resulting in the final controller which will be tested in an Ẽ′-to-E transfer;
• π_A : Õ′ → Z′ × Υ′ × Ξ′ × Ψ′, which searches for those environment perturbations or parameter variations that minimise the same cumulative reward over time.
The outputs of L_A and π_A are perturbations and parameter variations of Ẽ′. In a noise-corrupted simulated environment, the observed reward, and hence the discounted reward, depends on the disturbances as well as on the policy. Since, in ARL, disturbances are generated by the adversarial agent, the finite horizon discounted reward depends on both policies. To capture this dependency, we write J^{π_P,π_A}(x^(0)). The resulting goal of L can be compactly stated as:

max_{π_P} min_{π_A} J^{π_P,π_A}(x^(0)),    (10)

falling into a worst-case approach that in control theory is known as min-max optimal control (also referred to as H∞ control) [101,102,103]. Consequently, in ARL, Ẽ′ interacts simultaneously with L_A and L_P, thus L ↔ Ẽ′ results in L_A ↔ Ẽ′ ↔ L_P, and the global task of L is to find:

π*_P = argmax_{π_P} min_{π_A} J^{π_P,π_A}(x^(0)).    (11)

Table 4 summarises the articles that use ARL for learning robust controllers, along with the respective learning algorithms and the considered tasks.
The ARL approach was firstly introduced by [51], with the Robust Adversarial Reinforcement Learning (RARL) algorithm. There, π P (the protagonist's policy) is trained to work in presence of an adversary (π A ), able to inject destabilising disturbances to environment (in particular only ξ disturbances). The proposed solution can be summarised in two main steps that are repeated n iter times: (i) learn the protagonist policy π P while keeping the adversary one fixed; (ii) learn the adversary policy π A while keeping π P fixed. The experiments have been done on several OpenAI Gym environments [82]. In a first experiment, the authors compare mean and variance of the cumulative reward over 50 RARL policies, obtained using different seeds and initialisation, with TRPO ones. For all tasks RARL behaves better than TRPO in terms of mean and variance. In a second experiment, [51] shows that RARL behaves better than TRPO under adversarial attacks while keeping hold the protagonist. Finally, in a third experiment, the authors introduce different ζ (k) in the test phase, and again obtain better results with respect to TRPO.
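The two alternating steps can be sketched as follows (ours; `play` stands for a rollout of the closed loop L_A ↔ Ẽ′ ↔ L_P, and `update` for any policy-gradient step such as TRPO; the zero-sum reward convention follows the description above):

```python
def rarl(play, pi_P, pi_A, n_iter=100, episodes_per_iter=10):
    """Alternating optimisation sketch of protagonist and antagonist (as in RARL [51]).
    play(pi_P, pi_A) rolls out the closed loop and returns the observed rewards."""
    for _ in range(n_iter):
        for _ in range(episodes_per_iter):          # (i) learn pi_P, keeping pi_A fixed
            rewards = play(pi_P, pi_A)
            pi_P.update(rewards)                    # protagonist maximises J
        for _ in range(episodes_per_iter):          # (ii) learn pi_A, keeping pi_P fixed
            rewards = play(pi_P, pi_A)
            pi_A.update([-r for r in rewards])      # zero-sum: r_A = -r_P
    return pi_P
```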
Pan et al. [58] build on the RARL idea by introducing the Risk-Averse Robust Adversarial RL (RARARL) concept: a RARL algorithm in which the protagonist is trained to be risk-averse and the adversary, in contrast, risk-seeking. The authors state that "a robust policy should not only maximise long-term expected reward, but should also select actions with low variance of that expected reward". For that purpose, they train κ different Q-value networks that return κ action-value outputs. The risk of an action is estimated by the empirical variance of these κ Q-values (Var_κ(Q)). At the beginning of each training episode, one of the κ networks is randomly chosen and employed for control input selection during the entire episode. The protagonist and the antagonist take actions sequentially: the protagonist action-value function Q_{π_P} is augmented by a risk-averse term (Var_κ(Q_{π_P})), which encourages the choice of lower-variance control inputs; the adversary Q_{π_A}, instead, is reduced by a risk-seeking term (Var_κ(Q_{π_A})) in order to guide it towards higher-variance outputs. The algorithm has been tested in a simulated self-driving task, and the obtained experimental results highlight the better robustness of the RARARL controller with respect to one subjected to random perturbations in training. During the test, control inputs are selected according to the mean value of the κ networks.

C. TRANSFER LEARNING
The previously discussed approaches aim at controllers that, once learnt in simulation, can be directly transferred to real robots without any (or, at most, very few) additional training steps. Basically, the agent searches for a sim-to-real transferable controller.
A different perspective is the one adopted in the transfer learning (TL) approach. Indeed, its aim is not to find a solution to the RG, but rather to avoid its occurrence by means of two subsequent or simultaneous training phases (the first in simulation and the second in reality), at the cost of a reduced efficiency of the resulting L.
Let us ignore the RG for the moment and suppose we are in a classic RL training scenario, described by L ↔ E. The basic idea of TL is that generalisation is possible not only within a task but also between tasks [105]. Therefore, since a task can be entirely defined by an environment, considering a second, different but compatible², environment, the controller learnt for the first is expected to be a helpful tool to speed up the second learning process, whether or not it involves the same agent. Thereby, in RL, in which the controller is the result of a trial and error process, TL can be employed to speed up the learning, thus avoiding training from scratch.

²Two environments E = (X, A, O, f, g, h) and E′ = (X′, A′, O′, f′, g′, h′) are compatible if and only if A = A′ and O = O′. Thus, a policy for a system E is also a policy for every system E′ that is compatible with E. However, a policy which is optimal for E is in general not optimal for E′.
Back to the RG, the idea of "recycling" policies between tasks can be useful for speeding up the learning procedure on the real robot or, possibly, for performing a fine-tuning of the simulated and real agents (L′ and L, respectively).
In the first scenario, what has been learnt in simulation by using L′ is reused (in its entirety or in part) in subsequent phases of real-world training performed by using L. The expected result is a faster real-world training that bridges the discrepancy between simulator and real robot at its root, i.e., while the real agent L is learning its π. Here, the transfer occurs once, and only in one direction (from simulation to reality).
In the second case, by receiving some real-world information and learning from it, the simulator can adapt itself to reality, thus reducing the performance misalignment and, thereby, the RG. In this situation, the transfer is repeated, and in both directions (sim-to-real and real-to-sim).
Overall, we will refer to the exchanged information as u_exch^(k) = [u′^(k) u^(k)]^T, where u′^(k) is the sim-to-real transferred information, while u^(k) is the real-to-sim one (Figure 8).
We categorise the different TL approaches based on:
• the kind of information passed through u_exch^(k) (e.g., weights of an image processing net, state-control input pairs, policy parameters, or even the policy itself);
• the transfer timing of u_exch^(k) (one-shot or continuous);
• the direction of the transfer (sim-to-real only, or both directions);
• whether the simulated and real training phases are subsequent or simultaneous.
Table 5 summarises the articles that use TL and characterises them in terms of these four factors. The table also shows the tasks and the algorithms.
Christiano et al. [49] adopt TL to perform a sim-to-real control input adaptation. They observe that, if the simulator does not exactly replicate the real robot behaviour, applying the same control input in both scenarios does not necessarily lead to the same observation. However, they assume that the observation o′^(k) obtained in simulation is the one that should be achieved also in the real environment. Therefore, given the simulated observation o′^(k), they employ past history to discover which real control input a^(k) can lead to o^(k) ≈ o′^(k). For this purpose, a neural network is trained to predict the control input a^(k) that leads to a specific o′^(k). In particular, it is assumed that a simulation-based policy π′, a forward dynamic robot model F′, and a sequence τ_i = (o^(0), a^(0), . . . , o^(i−1), a^(i−1), o^(i)) of past real observations and control inputs are known. Here, TRPO is used to learn π′, but any agent L′ (not necessarily RL) could be used for the purpose. While training the policy π, the policy π′ returns a control input a′^(i) based on the provided history τ_i. However, rather than being applied to the real robot, it is sent to F′, and the resulting observation o′^(i) is provided, together with τ_i, as input to L. Finally, the learnt policy π provides the control input a^(i) that results in an o^(k) similar to o′^(k). Therefore, here the communication is continuous and in both directions, i.e., both components of u_exch^(k) are non-null. It is worth remarking that what is actually learnt in this case is an inverse dynamics model (implemented as a neural network), which must be employed in conjunction with the controller learnt in simulation. The authors evaluate their approach first in a sim-to-sim scenario with several OpenAI Gym environments, and then in a sim-to-real scenario based on a Fetch robot. Results highlight the effectiveness of the method and the relatively low number of samples required for convergence. However, the approach assumes that, when the consequences of a control input applied in simulation differ from those applied on the real robot, the real observations should match the simulated ones, and that is not always true.
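The resulting adaptation step can be sketched as follows (ours; `pi_sim`, `F_sim`, and `inverse_dynamics` are placeholders for the simulated policy π′, the forward model F′, and the learnt inverse-dynamics network described above):

```python
def adapted_control(tau, pi_sim, F_sim, inverse_dynamics):
    """Given the real history tau = (o(0), a(0), ..., o(i)), choose the real control input
    predicted to reproduce the observation the simulator would reach."""
    o_i = tau[-1]                              # latest real observation
    a_sim = pi_sim(tau)                        # input the simulated policy would apply
    o_target = F_sim(o_i, a_sim)               # observation the simulator would reach
    a_real = inverse_dynamics(o_i, o_target)   # real input predicted to lead to o ~ o_target
    return a_real
```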
In [60], the Grounded Action Transformation (GAT) algorithm is proposed to learn a humanoid bipedal locomotion policy. Inspired by the Grounded Simulation Learning (GSL) idea introduced in [109], the authors reproduce it in an RL framework, additionally improving some aspects. GSL is based on two main principles: grounding and guiding. The former refers to making the simulator E′ closer to the real robot E by properly modifying some parameters of E′ on the basis of data collected from E. The latter consists in having an expert able to guide the optimisation algorithm in finding the proper parameters of E′ to be tuned. In practice, given an evaluation function J_eval, such as a penalty function (for example, the opposite of the reward), a policy π is applied to the real robot in order to collect end-effector trajectories D. By replaying D both in simulation and in reality, and collecting the resulting real end-effector trajectories, an optimisation problem is solved to find those E′ parameters that minimise the Kullback-Leibler divergence between the probabilities of observing the same trajectories in the two cases. The resulting E′ is then used to find a set of candidate policies Π_C minimising J_eval. The optimal policy is the π ∈ Π_C such that J_eval is minimised when executed on the real robot E. The above procedure, however, is aimed at finding the correct values of the E′ parameters. Conversely, GAT introduces an action transformation function a′^(k) = m(a^(k)), learnt in a supervised fashion, able to map each action a^(k) ∈ A into an action a′^(k) ∈ A′. In particular, a forward robot dynamics model is trained to compute the real x^(k+1) resulting from a^(k); an inverse simulated dynamics model, instead, is trained to find the simulated action a′^(k) able to lead the simulator to x′^(k+1) = x^(k+1). The resulting procedure leads to an exchange of information in both directions: real data ground the simulated dynamics, while the policy improved in simulation is transferred to the robot. Both sim-to-sim and sim-to-real experiments provide good results; however, as the authors point out, the drawback of using a supervised method for learning m(·) is that policies are no longer effective when there are changes between the training and testing distributions. Moreover, neglecting the contact dynamics can lead to simulation bias.
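The grounded action transformation can be sketched as the composition of the two supervised models just described (ours; `forward_real` and `inverse_sim` are placeholders for those models):

```python
def grounded_action(x, a, forward_real, inverse_sim):
    """GAT action transformation m: map the agent's action a into the simulated action a'
    that makes the simulator reproduce the state the real robot would reach."""
    x_next_real = forward_real(x, a)        # predicted real next state for (x, a)
    a_sim = inverse_sim(x, x_next_real)     # simulated action leading E' to that state
    return a_sim
```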
Wulfmeier et al. [52] propose Mutual Alignment Transfer Learning (MATL), a method that relies on a Generative Adversarial Network (GAN) [110]. The main idea is similar to [49]: enforcing a similarity between the observations o^(k) and o′^(k). Indeed, although a control sequence achieving the goal in simulation may not produce the same effects in the real environment, the corresponding sequence of simulated observations, if reproduced on the real system, can lead to task accomplishment. For this purpose, here, the simulator E′ and the real robot E work in parallel as generators and interact with two different agents (L′ and L, respectively). A discriminator D, instead, is employed (and trained along with L′ and L) to classify the environment from which an input sequence of observations τ_κ has been collected. L′ and L are trained not only to maximise their respective environment rewards, but also to mislead the discriminator, based on the assumption that the more the discriminator is misled, the more "aligned" the observations are. Therefore, each time a sequence of observations τ_κ is collected, D receives it as input and outputs the probability D(τ_κ) that it was generated by E′. A term log(D(τ_κ)) is respectively added to and subtracted from the E′ and E rewards, thus encouraging misleading actions (here, TRPO is the employed algorithm). The proposed solution results in an experience exchange between L′ and L and a consequent alignment of the collected observations. To exploit the simulator, L′ is updated M times more frequently than L, thus accelerating learning. Thereby, in this case u_exch^(k) is the output of D and is equal to [−log(D(τ_κ)) log(D(τ_κ))]^T. Wulfmeier et al. [52] evaluate their approach on various RL tasks from rllab [111], OpenAI Gym, and DartEnv [82]. Results show that MATL is able to work with significantly different environments of the same simulator, in which only a parameter variation is performed. Less encouraging results are reported when employing different simulators.
A different solution is proposed in [43], where progressive nets [74] are employed for sim-to-real information transfer. Here, by exploiting the capability of these nets to learn a sequence of tasks through lateral connections, simulated knowledge can be used to avoid training from scratch on the real robot. A progressive net is composed of l "columns", where the i-th column represents an independent network of κ hidden activations. Each j-th activation act_{j,i} of the i-th column is a function of the (j−1)-th activation of the same column (act_{j−1,i}) and of the (j−1)-th activations of all the previous columns (act_{j−1,1}, . . . , act_{j−1,i−1}). Rusu et al. [43] propose to use this tool to learn a simulated controller π′, via the A3C algorithm, in the first column, and subsequently to transfer its knowledge to the real robot by means of lateral connections headed towards the second column, which represents the real agent L. Then, the training of the second net (L) begins. Therefore, letting s be the simulated column, the sim-to-real transferred information u′^(k) consists of the activations of column s passed through the lateral connections. The authors evaluate their approach on a robot manipulation task on a Jaco arm [112], in which a visual target must be reached. The performances are compared with those obtained using a fine-tuning approach, showing the superiority of progressive nets.
In [55], a Neural Augmented Simulation (NAS) approach is used to reduce the RG. A Long Short-Term Memory network is trained on the differences between the simulated and the real robot, and used to adapt the simulator on the basis of real-world data. The policy π resulting from a Proximal Policy Optimisation algorithm is learnt on the simulated environment, whose next state at each step is adjusted by using the correction term ∆ provided by the LSTM. The resulting policy, therefore, associates to each estimated value of the real robot state x̂^(i) = x′^(i) + ∆ an action a^(i) = a′^(i). The transfer is in this case only from the simulator to the real robot, and u_exch^(k) = [u′^(k) 0]^T = [π′ 0]^T. NAS has been tested in a sim-to-sim and in a sim-to-real transfer. For the former, the authors create an artificial RG by varying some parameters of two different simulated robotic environments of OpenAI Gym, one of which is considered as the real environment. For the sim-to-real transfer, two Poppy Ergo Jr robots [113] have been used in an ErgoShield task, in which an attacker (one of the two involved robots) is controlled to touch as often as possible the shield attached to the end-effector of a defender (the other robot), which moves the shield in random poses. Experimental results show good performance both in sim-to-sim and in sim-to-real transfer. Moreover, since a policy-specific fine-tuning is not required, the method can be appropriate for multi-task robotic applications.
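The augmented simulator can be sketched as follows (ours; `sim_step` and `lstm_delta` are placeholders for the nominal simulator and the LSTM correction model):

```python
def nas_step(x, a, sim_step, lstm_delta):
    """Neural Augmented Simulation: the simulator's next state is corrected by a
    learnt term so that it better matches the real robot."""
    x_sim_next = sim_step(x, a)      # nominal simulated transition f'(x, a)
    delta = lstm_delta(x, a)         # correction learnt from real-world data
    return x_sim_next + delta        # estimate of the real next state
```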
In [40], conversely, the authors propose to learn an RL controller in simulation (using a deep Q-learning approach) whose first stages represent a visual perception module, parametrized by a vector θ_VP. Then, keeping these weights fixed, they use this module to work with real data and learn a DNN predicting the rewards of h planned control inputs. The predictor is trained by means of a real-world data-set in order to minimise the reward prediction error. In the second phase, the real application, the predictor is used by an MPC controller. Hence, at each step, the MPC controller computes a sequence of h control inputs which maximises the expected discounted predicted reward within a horizon h. In this context, the sim-to-real transferred information u′^(k) consists of the visual perception module weights θ_VP. As prescribed by the MPC approach, only the first control input is applied; then the process is repeated. Kang et al. [40] consider a nano aerial vehicle collision avoidance task to assess the proposed solution. Moreover, the authors compare it with other approaches: simulation only, simulation with fine-tuning, simulation with fine-tuning and fixed perception, real world only, supervised and unsupervised. Their solution outperforms all the others tested, and shows the best result in terms of pre-collision time.
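The deployment phase can be sketched as a sampling-based MPC over the learnt reward predictor (ours; the random candidate-sampling scheme is an illustrative assumption, and `predict_rewards`/`sample_actions` are placeholders):

```python
def mpc_control(o, predict_rewards, sample_actions, n_candidates=128, gamma=0.99):
    """Pick the first input of the h-step candidate sequence with the best
    expected discounted predicted reward, as prescribed by the MPC approach."""
    best_a, best_J = None, float("-inf")
    for _ in range(n_candidates):
        seq = sample_actions()                 # candidate control sequence a(k), ..., a(k+h-1)
        rewards = predict_rewards(o, seq)      # h predicted rewards (perception weights fixed)
        J = sum(gamma ** i * r for i, r in enumerate(rewards))
        if J > best_J:
            best_a, best_J = seq[0], J
    return best_a                              # only the first control input is applied
```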
Yuan et al. [47] perform an action-value function adaptation in a supervised fashion. Here, a Baxter robot is asked to solve a nonprehensile rearrangement task, i.e., the problem of pushing an object into a predefined goal pose. The proposed procedure consists of three sequential steps: (1) learning, in simulation, an optimal action-value function Q_sim able to select the best action a′ to perform when an image of the scene is provided as observation o′; (2) collecting a data-set of real and simulated observation pairs (o, o′); and (3) using it, along with the pre-trained Q_sim, to create a Q_real useful to adapt the agent for the real-world application. In the first step, a deep Q-network is used to approximate Q_sim. In the second step, starting from real scenes o, the positions of the obstacle and of the portable object are used to recreate the same scenes in simulation, o′. In particular, randomly setting o_0, the RL agent learnt in step (1) is applied to the real robot in order to collect a set O_real = {o_0, o_1, o_2, . . .}, from which the respective simulated counterpart set O_sim = {o′_0, o′_1, o′_2, . . .} is created. The resulting data-set, composed of (O_real, O_sim, Q_sim), is used to learn a Q_real minimising a loss function defined in terms of r + γQ_sim − Q_real. Three different strategies have been deployed: (a) train Q_real keeping the Q_sim structure but retraining the network in its entirety; (b) use Q_sim as a baseline and adapt only the parameters of the convolutional layers; (c) add two new fully connected layers to increase the flexibility of the network and learn their parameters together with those of the convolutional layers. Therefore, in this case, the sim-to-real transferred information is the pre-trained action-value function Q_sim. The authors compare their results with those obtained by using the same domain randomisation idea of [75]. Experimental results show that their approaches surpass the one of [75] in terms of performance and, in particular, (c) turns out to be the best. However, collecting data both in simulation and in reality in order to build the data-set used to learn Q_real may be costly and time-consuming.

D. DISCUSSION AND PROMISING IDEAS
Despite the several attempts found in the literature to obtain sim-to-real transferable controllers, many of which associate the idea of robustness with that of sim-to-real transferability (DR and ARL), the lack of uniformity of the considered tasks does not allow determining which solution is the most appropriate in terms of the RG. Besides, a considerable fraction of the proposed approaches have not been experimentally evaluated in an actual sim-to-real scenario. A controller robust to certain model disturbances or parametric variations is not necessarily sim-to-real transferable: in fact, if these variations and disturbances do not correctly represent the simulator inaccuracies with respect to reality, it might turn out not to be sim-to-real transferable. This suggests that an interaction with the real system during training is still needed; in this respect, TL approaches appear promising. On the other hand, the TL approaches surveyed here often lack an assessment of robustness. Moreover, since TL requires two successive (or simultaneous) training phases, it may exhibit low efficiency.
Some mixed approaches exist that borrow ideas from DR, ARL, and TL. A first attempt at merging DR and TL is proposed in [114]. Here, the authors propose to learn a policy in a randomised simulation and to adapt the distribution of simulation parameters on the basis of real-world performance.
A promising research direction to tackle the RG problem could be the application of meta-learning strategies in RL, in order to quickly adapt experience gained in simulation to the real system [115,116]. In Meta-RL, given a distribution over tasks, the agent learns an adaptive policy that maximises the expected reward for a new task drawn from the distribution. A recent work [117] has shown the great ability of this approach to generalise to environments totally different from those used during training.
Another promising solution to avoid a direct RL training on the real robot seems to be the Probabilistic Inference for Learning Control (PILCO) framework proposed in [118]. Here, a probabilistic model of the system dynamics, incorporating uncertainty, is learnt by using only a few trials on the real system, and a policy is learnt through it. This solution allows avoiding the RG in the first place, by training a simulator with few real interactions and using it for the trial and error procedure. However, despite the potential usefulness of PILCO and Meta-RL to cope with the RG, the former underestimates the state uncertainty at future time steps [119], thus possibly leading to a decrease in performance; the latter, on the other hand, is computationally demanding and needs a high number of real-world evaluations [120].

V. CONCLUSIONS AND OPEN CHALLENGES
Training an RL robotic controller in real-time in its actual environment is a costly process. While simulators can alleviate the problem, the approximations often present in the employed models play a crucial role in determining the effectiveness of the learnt controllers. The more accurate the model of the robot (and of its surroundings), the more effective the controller but, at the same time, the greater the computational cost of learning it. When the performances of the controller on the simulator and on the real robot differ, an RG exists and the controller is not sim-to-real transferable.
In the present article, we provided a formal framework for the RG and reviewed the most relevant existing methods aiming to achieve sim-to-real transferable controllers in robotics RL applications. We surveyed the literature concerning RL and RG and categorised the approaches as: domain randomisation, adversarial reinforcement learning, and transfer learning. Moreover, we described them in detail according to the proposed framework and in terms of the employed algorithms, involved tasks, and evaluation methods.
We conclude by commenting on some significant open challenges. Each of the examined approaches appears tailored to a specific task, and its applicability to other, potentially different, tasks is not clear. A general task-independent approach able to guarantee an effective sim-to-real transferability of the controller is still missing. With this in mind, we believe that a significant open problem is that of providing a proper index able to reveal and quantify the RG. Indeed, being able to characterise and quantify the RG would (i) enable a systematic comparison among different techniques, hence favouring the advancement of research, and (ii) allow using the measure of the RG directly as an optimisation objective, hence making transferability a direct goal in the learning of RG-aware controllers. Another significant open problem is the sample efficiency of the developed approaches. As highlighted in [121], RL is very data-intensive. The computational effort required for learning an RL agent involved in a complex task can be huge, thus limiting the practical applicability of such methods. The surveyed approaches lead to learning paradigms that unquestionably aggravate this issue, especially in those cases in which an interaction with different environment domains is involved.