Proximal Policy Optimization with Adaptive Threshold for Symmetric Relative Density Ratio

Deep reinforcement learning (DRL) is a promising approach for introducing robots into complicated environments. The recent remarkable progress of DRL rests on policy regularization, which allows the policy to improve stably and efficiently. A popular method, so-called proximal policy optimization (PPO), and its variants constrain the density ratio between the latest and baseline policies when the density ratio exceeds a given threshold. This threshold can be designed relatively intuitively, and a recommended value range has in fact been suggested. However, the density ratio is asymmetric about its center, and the possible error scale from that center, which should be close to the threshold, depends on how the baseline policy is given. To maximize the value of policy regularization, this paper proposes a new PPO variant derived using the relative Pearson (RPE) divergence, so-called PPO-RPE, to design the threshold adaptively. In PPO-RPE, the relative density ratio, which can be formed symmetrically, replaces the raw density ratio. Thanks to this symmetry, its error scale from the center can easily be estimated, and the threshold can thus be adapted to the estimated error scale. Three simple benchmark simulations reveal the importance of algorithm-dependent threshold design. Simulations of four additional locomotion tasks verify that the proposed method statistically contributes to task accomplishment by appropriately restricting the policy updates.


Introduction
Reinforcement learning (RL) (Sutton and Barto, 2018) is one of the promising methodologies for resolving complicated robot control problems. Its remarkable developments have attracted much attention, especially thanks to the combination with deep neural networks (DNNs) (LeCun et al., 2015) to approximate value and stochastic policy functions, so-called deep RL (DRL).
The basic learning laws of RL were established along two directions at a relatively early stage (Sutton and Barto, 2018). One is value-function-based methods, which can find global solutions for problems with discrete action spaces. The other is policy-gradient-based methods, which can handle continuous action spaces, although they obtain local solutions. This paper focuses on the policy-gradient-based methods for applications such as continuous joint control of robots. Note that these basic learning laws carry over to DRL.
In basic research on these methods, various learning algorithms have been proposed to alleviate the numerical instability of the learning process caused by nonlinear regression with DNNs: e.g. experience replay (Lin, 1992; Schaul et al., 2015); target networks (Mnih et al., 2015; Kobayashi and Ilboudo, 2021); entropy maximization of the policy (Haarnoja et al., 2018; Shi et al., 2019); parallelized learning with asynchronous agents (Mnih et al., 2016); and utilization of a learned model (i.e. model-based RL) (Chua et al., 2018; Clavera et al., 2020). These methods, alone or in combination, have stabilized and accelerated the learning of given tasks by DRL, which is steadily approaching the stage of practical use.
Among them, research on policy regularization (e.g. (Haarnoja et al., 2018; Shi et al., 2019)) has recently been systematized theoretically as policy-regularized RL (Geist et al., 2019), and its importance is evidently high. The motivation for policy regularization comes from the fact that policy improvement is an indirect optimization through the value function and is prone to learning instability; in other words, the variance of learning performance across random seeds is large. By constraining the amount and direction of policy updates, policy regularization makes the policy improve smoothly and acquire a near-optimal one.
This study further focuses on proximal policy optimization (PPO) (Schulman et al., 2017) as one of the policy regularization methods. The main concept of PPO is a proximal update that softly constrains the latest policy to the baseline (basically old) one. Although PPO has two versions, one with a soft constraint based on Kullback-Leibler (KL) divergence and another with clipping of the density ratio, the latter is the major version due to its intuitive and practical implementation while outperforming the former. In this paper, therefore, PPO refers to the clipping version.
To improve PPO, several methods have been proposed by combining it with other techniques. For example, (Hämäläinen et al., 2020) optimized the covariance of policy based on CMA-ES to improve the exploration efficiency. In (Imagawa et al., 2019), an optimistic bonus according to the uncertainty of return is added to facilitate the exploration. A single demonstration has been provided in (Libardi et al., 2021) to accomplish the tasks where rewards are sparsely given. All of these methods utilize PPO to stabilize learning, and then improve learning efficiency by using other methods in combination.
On the other hand, the way the policy is regularized in PPO has recently been revisited. In PPO-RB, the density ratio over the clipping threshold is explicitly rolled back into the clipping range for more certain regularization. PPOS (Zhu and Rosendo, 2021) relaxes the regularization in PPO-RB by weakening it according to the degree of exceedance from the threshold. In the earlier work of this paper (Kobayashi, 2021a), a new regularization method was proposed based on f-divergence to revise the ambiguous regularization caused by clipping.
While PPO has obtained many successes, open issues related to the clipping-based regularization can be raised.
1. PPO has no capability to softly constrain the latest policy to the baseline one (Zhu and Rosendo, 2021).
2. Although a symmetric threshold for clipping is given, the density ratio lies in an asymmetric domain (Kobayashi, 2021a).
3. A recommended threshold range is provided, but it is not valid for arbitrary learning algorithms.
As for the third problem, generalized advantage estimation (Schulman et al., 2015) was utilized for the update law in the original PPO, but the clipping regularization is also applicable to other learning algorithms. Although threshold-based regularization achieves task-invariant tuning, it is easy to imagine that the desirable threshold varies depending on the update speed of the policy, which in turn depends on the learning algorithm combined with the regularization. The optimization or adaptive design of the threshold for the employed learning algorithm has not been proposed yet. This should be done by considering the error scale of the density ratio from its center (i.e. 1), but in practice, its asymmetry interferes with the estimation of this error scale. Hence, this paper proposes a new PPO variant to resolve the above issues by theoretically considering a new regularization problem based on the relative Pearson (RPE) divergence (Sugiyama et al., 2013). Since RPE is one of the divergence metrics between two probabilities, its minimization can softly constrain the latest policy to the baseline one. To inherit threshold-based regularization like PPO, the strength of regularization in PPO-RPE is mathematically reshaped into the corresponding threshold. On the other hand, the relative density ratio is introduced instead of the raw density ratio, and by adjusting its relativity parameter, it can lie in a symmetric domain. Thanks to this symmetry, in addition to the earlier work of this paper (Kobayashi, 2021a), the error scale of the relative density ratio can easily be estimated and utilized as the adaptive threshold. Hence, PPO-RPE would be a versatile method that provides mathematically appropriate policy regularization and has a mechanism that automatically adjusts its strength (i.e. threshold) without any consideration of the task or learning algorithm.
As an investigation of PPO-RPE, three simple benchmark tasks provided in Pybullet (Coumans and Bai, 2016) are simulated with two types of learning algorithms. The results reveal the importance of algorithm-dependent threshold design. In addition, four locomotion tasks are simulated using PPO and PPO-RPE. Statistical tests after learning show that only PPO-RPE achieves high scores even in unstable tasks by restricting the policy updates more appropriately than PPO. Therefore, it can be concluded that PPO-RPE does not require task- and algorithm-dependent tuning, making it easy to apply to a variety of practical problems.

Policy-regularized reinforcement learning
Reinforcement learning (RL) (Sutton and Barto, 2018) optimizes an agent's policy π to maximize the sum of rewards r, the so-called return, defined at time step t as R_t = Σ_{k=0}^{∞} γ^k r_{t+k} with discount factor γ ∈ [0, 1). This problem can be solved in a Markov decision process (MDP) with the tuple (S, A, R, p_0, p_e). Here, S and A denote the state and action spaces, respectively, and R is the reward set. p_0 and p_e represent the initial state probability and the state transition probability.
Specifically, MDP assumes the following process. At time step t, the state s_t ∈ S is first sampled from the environment with one of two probabilities: s_0 ∼ p_0(s_0) as the initial random state or s_t ∼ p_e(s_t | s_{t−1}, a_{t−1}) as the state transition. According to s_t, the agent decides the action a_t ∈ A using the learnable policy, a_t ∼ π(a_t | s_t). a_t acts on the environment and stochastically updates the state to the next one, p_e(s_{t+1} | s_t, a_t). At that time, the agent obtains a reward r_t ∈ R from the environment: r_t = r(s_t, a_t). By repeating this process with the experienced data (s_t, a_t, s_{t+1}, r_t), the agent estimates the expected R_t as a value function. In on-policy learning like PPO (Schulman et al., 2017), the policy-dependent action value function Q^π(s_t, a_t) = E[R_t | s_t, a_t] and/or the policy-dependent state value function V^π(s_t) = E_{a_t∼π}[Q^π(s_t, a_t)] can be learned through the Bellman equation as a minimization problem. This paper utilizes V^π, although Q^π can be learned by replacing V^π above with Q^π. Note that, with DNNs, a target network is often employed to output V^π(s_{t+1}) for numerical stability. The optimization of π is formulated as the following minimization problem.
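The value-function update described above can be sketched as minimizing the squared Bellman residual; the function names here are illustrative, not the paper's implementation:

```python
def value_loss(r_t, v_s, v_s_next, gamma=0.99):
    """Squared Bellman residual for learning V^pi.

    v_s_next would come from a (target) network whose gradient is not taken,
    mirroring the target-network trick mentioned in the text; all names here
    are illustrative assumptions.
    """
    target = r_t + gamma * v_s_next   # one-step Bellman target
    return 0.5 * (target - v_s) ** 2  # minimized w.r.t. the parameters behind v_s
```

In practice, this scalar loss would be averaged over a minibatch of experienced tuples (s_t, a_t, s_{t+1}, r_t) before the gradient step.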
where A(s_t, a_t) = Q^π(s_t, a_t) − V^π(s_t) denotes the advantage function, which actually corresponds to the temporal difference (TD) error. Note that A also depends on π, but this is omitted to simplify the notation. With a policy regularization term Ω (Geist et al., 2019), the original minimization target is reformulated as follows: where a baseline policy b(a_t | s_t) (e.g. the old version of π or the one outputted from a slowly updated target network (Mnih et al., 2015)) is introduced and a density ratio ρ = π/b is derived in L†. This allows the agent to reuse past empirical data for training via experience replay (Lin, 1992; Schaul et al., 2015), where the actions have not been sampled from the current policy. A†(s_t, a_t) and ρ†(s_t, a_t) are the surrogate versions defined as A†(s_t, a_t) = A(s_t, a_t) − Ω(s_t, a_t) and ρ†(s_t, a_t) = ρ(s_t, a_t)(1 − Ω(s_t, a_t)/A(s_t, a_t)), respectively. By applying a policy-gradient method to minimize L†(π) with Monte Carlo approximation, which eliminates the expectation operation, the policy parameterized by a parameter set θ (e.g. weights and biases in DNNs) can be optimized.
If A† has no direct computational graph with π, Ã† = A†. In general DRL, θ is updated using one of the stochastic gradient descent (SGD) optimizers, e.g. (Ilboudo et al., 2020).

Proximal policy optimization
In the clipping version of proximal policy optimization (PPO) (Schulman et al., 2017) (and its variant named PPO-RB), eq. (4) is given as the following condition, ρ† = ρ_PPO, instead of the explicit Ω, with a threshold parameter ε > 0. Note that, to simplify the rest of the description, the arguments (e.g. (s_t, a_t)) are omitted in most cases.
where σ denotes the sign of A. η ≥ 0 is for rollback to the baseline policy, developed for PPO-RB; only if η = 0 does it revert to the original PPO. In this condition, Ω_PPO for the specific regularization in PPO(-RB) can be analytically derived (see Appendix .1). However, its heuristic design does not correspond to a divergence between two probabilities like the KL divergence.
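As a sketch, the clipping condition of eqs. (7) and (8) can be read as follows, where η = 0 recovers the flat clip of the original PPO and η > 0 gives a rollback slope as in PPO-RB; the exact PPO-RB constants are an assumption here:

```python
def rho_ppo(rho, A, eps=0.2, eta=0.0):
    """Surrogate density ratio: flat clip for eta = 0 (original PPO),
    negative rollback slope for eta > 0 (a reading of PPO-RB; assumed form)."""
    sigma = 1.0 if A >= 0 else -1.0   # sign of the advantage
    bound = 1.0 + sigma * eps         # clip only on the side that inflates the objective
    if sigma * (rho - bound) > 0:     # ratio exceeded the trust boundary
        # eta = 0: constant (the clip); eta > 0: slope that actively rolls
        # the ratio back toward the baseline
        return -eta * rho + (1.0 + eta) * bound
    return rho                        # inside the trust region: untouched

def surrogate(rho, A, eps=0.2, eta=0.0):
    """Negative loss term inside the expectation: rho_dagger * A."""
    return rho_ppo(rho, A, eps, eta) * A
```

Note that the gradient of the surrogate vanishes wherever the clip is active with η = 0, which is exactly the "ambiguous regularization" the text criticizes: the policy is not pulled back, only no longer pushed further.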

Relative Pearson divergence
To inherit the abilities of PPO (eqs. (7) and (8)) and overcome its issues, the following conditions are considered.
1. When ρ = 1, the term inside the expectation of the loss function should be ρA.
2. Only when ρ is on the threshold should the gradient be zero, and there should be only one local maximum.
3. For the explicit constraint between π and b, one of the f-divergences, which is the generalized divergence (Liese and Vajda, 2006), is desired to be utilized.
4. A symmetrical shape of ρ† is desired, to clip it symmetrically and to adjust the threshold easily according to the error scale of ρ† from its center.
KL divergence, which is utilized in the prior version of PPO (Schulman et al., 2017), cannot satisfy all of them (especially the symmetry). Thus, an alternative among the f-divergences that satisfies all the above conditions is desired, even if selected heuristically. This study finds that the relative Pearson (RPE) divergence (Sugiyama et al., 2013) has the potential to satisfy them.
Specifically, the standard Pearson (PE) divergence between the two policies, π and b, is given first.
Due to ∫ b(a | s)ρ(s, a)da = ∫ π(a | s)da = 1, the mean of ρ is equal to 1. That is, this PE divergence is the expected squared error of ρ from its mean. It is nonnegative and vanishes if and only if π = b.

Figure 1: At π = b, both density ratios are equal to one; although the raw density ratio diverges as b decreases and π increases (i.e. ρ increases), the relative density ratio has the finite upper bound 1/β; in particular, if β = 0.5, the shape of the relative density ratio becomes symmetric around one, as shown in (b).
Instead of b, the RPE divergence introduces the relative density function with a mixture ratio β ∈ [0, 1] as follows: π_β(a | s) = βπ(a | s) + (1 − β)b(a | s). In addition, the relative density ratio ρ_β = π/π_β is defined using π_β.
where the term on the right side can be derived by multiplying by 1 = b^{−1}/b^{−1}. As important properties, ρ_β is finite within [0, 1/β), and its mean is still 1. Therefore, if β = 0.5, ρ_β obtains the symmetry, as shown in Fig. 1. By replacing ρ in eq. (9) with ρ_β, the RPE divergence is defined as follows:
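The relative density ratio and its properties can be checked numerically; a minimal sketch:

```python
def rho_beta(pi, b, beta=0.5):
    """Relative density ratio rho_beta = pi / (beta * pi + (1 - beta) * b).

    Bounded within [0, 1/beta); beta = 0 recovers the raw ratio pi / b.
    At beta = 0.5, the deviation from the center 1 is antisymmetric under
    swapping pi and b, which is the symmetry exploited by PPO-RPE.
    """
    return pi / (beta * pi + (1.0 - beta) * b)
```

With β = 0.5 this reduces to 2π/(π + b), so ρ_β − 1 = (π − b)/(π + b), making the symmetry around one explicit.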

Proposed method
Proximal policy optimization with relative Pearson divergence: PPO-RPE

This paper proposes proximal policy optimization with relative Pearson divergence, so-called PPO-RPE. The explicit f-divergence minimization can softly constrain the latest policy π to the baseline one b. In particular, by employing the RPE divergence, the gain C of this regularization can be adaptively tuned as a PPO-like (but adaptive, not fixed) threshold, as introduced in the next section. The main process of PPO-RPE, except for the adaptive threshold determining the gain C, is summarized in Alg. 1.

Minimization target
To make PPO-RPE a sub-class of policy-regularized RL described in eq. (3), the regularization term for PPO-RPE, Ω_RPE, is derived. With π_β in eq. (10) and ρ_β in eq. (11), Ω_RPE is given as follows: where C again denotes the gain of this regularization. The expectation w.r.t. π of this regularization is equivalent to eq. (12) amplified by the fixed C, as shown below.
Note that if C depends on π_β (or π), this derivation is not exact but is sufficiently relevant. By substituting eq. (13) into eq. (3), the following regularized loss function is derived for PPO-RPE.
where Ω_RPE is included in A_RPE, and it can be represented as ρ_RPE·A with ρ_RPE = ρA_RPE/A.

Policy gradient
Here, the policy gradient for eq. (15) is analytically derived. According to eq. (5), Ã_RPE = A_RPE + ρ∇_ρ A_RPE is computed analytically as follows (its derivation is in Appendix .2). This is substituted into eq. (5) and used for updating the parameter set θ in eq. (6).

Algorithm 1 PPO-RPE: the value function Q and/or V will be trained in parallel according to the Bellman equation.
12: Main formulas of PPO-RPE
13: Get C from Alg. 2 given A and ρ_β as inputs
    Update of policies
19: Update b according to π with θ
22: t ← t + 1
23: end while
The second term in the square bracket is consistent with the regularization, and vanishes if and only if ρ = ρ_β = 1 (i.e. π = b), regardless of C and β. Namely, with this policy gradient, π has the potential to converge to π*, which maximizes A (more specifically, the action value function Q), while being regularized toward b during the policy updates.

Adaptive threshold for designing gain of regularization
Actually, the gain for regularization, C, is hard to tune since the ratio between A for the original purpose and Ω_RPE for the regularization is task- and algorithm-dependent. In contrast to C, PPO utilizes the threshold ε for tuning the strength of regularization. The threshold allows for a more intuitive design, and in fact, the original paper (Schulman et al., 2017) shows a recommended range of ε for any task.
Inspired by this, C is converted into a PPO-like threshold, which removes the task dependency. In addition, by applying it to the relative density ratio ρ_β instead of the raw density ratio ρ, the symmetric threshold works in the same symmetric domain. Such symmetry enables PPO-RPE to easily estimate the algorithm-dependent error scale of ρ_β from its center 1, which is utilized for the adaptive threshold.

Algorithm 2 Adaptive threshold design for PPO-RPE
1: Set β ∈ [0, 1] (0.5 is the default value)
2: Set λ ∈ (0, 1) (0.999 is the recommended value)
3: Set κ ∈ (0, 1) (0.5 is the recommended value)
4: Set ∆ ∈ (0, 0.5) (0.1 is the recommended value)
5: Initialize ∆_max = 0 (i.e. no prior information)
6: Initialize ∆ = 1 (i.e. theoretical maximum value)
7: while True do
8:   Computation of C
9:   Get A and ρ_β from Alg. 1
10:  Yield C to Alg. 1

Conversion from gain to threshold
We focus on the fact that the density ratio ρ is asymmetric around 1 and is not consistent with the symmetric threshold in PPO, causing unbalanced regularization. Instead, PPO-RPE can serve the threshold for the relative density ratio ρ_β with β = 0.5, which is in a symmetric domain (see again Fig. 1). Given ε ∈ (0, 1) (to be adapted later), the following condition is considered.
where the corresponding ρ can be inversely derived from eq. (11). Note that σε/(1 − β(1 + σε)) yields an asymmetric threshold in the ρ domain, and only if β = 0 does it revert to being symmetric. Only at this threshold, in order to softly constrain the latest policy π to the baseline b, Ã_RPE is desired to be zero, which makes L_RPE convex. Using this condition, C can be derived from eq. (16) as follows: where A/σ = |A| is utilized. Indeed, C designed above is regarded as a function of A, and therefore yields adaptability to various tasks with different scales of the reward function. By substituting eq. (19) into eq. (16), PPO-RPE achieves the following policy gradient term with ε instead of C.
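The mapping between the symmetric threshold in the ρ_β domain and the asymmetric one in the ρ domain can be verified numerically; a sketch based on inverting eq. (11):

```python
def rho_from_rho_beta(rho_b, beta=0.5):
    """Invert eq. (11), rho_beta = rho / (beta * rho + 1 - beta), for rho."""
    return rho_b * (1.0 - beta) / (1.0 - beta * rho_b)

def rho_threshold(eps, sigma, beta=0.5):
    """Threshold in the rho domain induced by the symmetric threshold
    1 + sigma * eps in the rho_beta domain (sigma = +/-1, the sign of A).

    The deviation from one is sigma*eps / (1 - beta*(1 + sigma*eps)),
    matching the expression in the text; beta = 0 makes it symmetric again.
    """
    return 1.0 + sigma * eps / (1.0 - beta * (1.0 + sigma * eps))
```

For ε = 0.1 and β = 0.5, the upper deviation in ρ is 0.1/0.45 ≈ 0.222 while the lower one is 0.1/0.55 ≈ 0.182, i.e. a symmetric clip in ρ_β is indeed asymmetric in ρ.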
Using this, the parameter set θ is updated according to SGD with the policy gradient −ρÃ_RPE ∇_θ ln π(a_t | s_t).
Here, all the examples of the negative loss functions (i.e. ρA† or ρ†A) for the proposed and conventional methods are illustrated in Fig. 2. As can be seen in this figure, PPO-RPE has an asymmetric threshold in the ρ domain, which is symmetric in the ρ_{β=0.5} domain.

Adaptive threshold using estimated error scale
In eq. (20), we can see that the threshold-based gain for regularization can adapt to differences in reward scales across tasks. However, the mean divergence between π and b depends on the learning algorithm. For example, when b is given by the target network, when experience replay (Lin, 1992) is not used, or even when the learning rate α is changed, the mean would vary. Algorithm-dependent design of the threshold is therefore required, but as can be imagined, hand-tuning is burdensome since recent DRL methods are constructed as patchworks of various modules, with many possible combinations. In addition, it is difficult to completely eliminate task dependency by the threshold-based design alone. Beyond the earlier work (Kobayashi, 2021a), this paper additionally contributes the design of the adaptive threshold ε.
Specifically, thanks to the symmetry of ρ_β (with β = 0.5) and its consistency with the threshold, its maximum error scale from its center (i.e. one) is revealed to be one: |0 − 1| = |1/0.5 − 1| = 1. With this fact, ε can be interpreted as a threshold defined for the algorithm-independent (i.e. absolute) maximum error scale, although the algorithm-dependent (i.e. relative) maximum error scale may differ from it. It is therefore expected that if ε is given for the relative maximum error scale, the algorithm dependency would be removed.
Here, ∆_max stores the recent maximum value, which is then smoothly transferred into ∆. Since the max operator switches the output value discretely, this two-step update provides a smooth estimate of ∆ and suppresses regularization fluctuations. By giving ε in this way, the adaptive threshold in PPO-RPE is obtained against the relative maximum error scale that depends on the learning algorithm, easing algorithm-specific tuning. As a remark, although new hyperparameters have been introduced here, they can be given regardless of other implementations: for λ, 0.999, which is also used in most SGD optimizers (Ziyin et al., 2020; Kobayashi, 2021b; Ilboudo et al., 2020) for relatively slow updates, would be the appropriate value; κ can be fixed to place the local maximum of the regularization at half of the relative maximum error scale, i.e. 0.5; and ∆ can be designed based on the minimum recommended value of ε in PPO, i.e. 0.1.
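The two-step update described above can be sketched as follows; the exact decay schedule of ∆_max and the clamping order are assumptions for illustration, not taken verbatim from Alg. 2:

```python
class AdaptiveThreshold:
    """Sketch of the adaptive threshold (Alg. 2). The decay of the recent
    maximum and the clamping order are assumptions, not the paper's exact update."""

    def __init__(self, lam=0.999, kappa=0.5, d_lower=0.1, d_upper=1.0):
        self.lam = lam          # smoothing factor (0.999 recommended)
        self.kappa = kappa      # places the regularization peak at kappa * Delta
        self.d_lower = d_lower  # lower clamp for Delta (0.1 recommended)
        self.d_upper = d_upper  # theoretical maximum error scale (1)
        self.delta_max = 0.0    # recent maximum error scale (no prior information)
        self.delta = 1.0        # smoothed error scale, initialized at the maximum

    def update(self, rho_b):
        err = abs(rho_b - 1.0)  # error scale of rho_beta from its center 1
        # recent maximum, slowly forgotten (assumed decay), then smoothly
        # transferred into Delta to avoid the discrete switching of max
        self.delta_max = max(self.lam * self.delta_max, err)
        self.delta = self.lam * self.delta + (1.0 - self.lam) * self.delta_max
        self.delta = min(max(self.delta, self.d_lower), self.d_upper)
        return self.kappa * self.delta  # adaptive threshold epsilon
```

With policies staying close (err ≈ 0), ε drifts down toward κ·0.1 = 0.05; with persistently distant policies, it tracks κ times the observed error scale.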

Implementations
In this paper, the proposed method with its hyperparameters is implemented in almost the same way as in (Kobayashi and Ilboudo, 2021). The details are introduced below. Note that all the hyperparameters commonly used are listed in Table 1.
First, the policy π is parameterized by a student-t distribution (Kobayashi, 2019) with neural networks. A d_a-dimensional student-t distribution has three model parameters: µ ∈ R^{d_a}, σ ∈ R^{d_a}_+, and ν ∈ R_+. They are given by the networks with d_s-dimensional state space inputs. The value function V is approximated by corresponding networks with the same architecture as that of π. The networks contain L = 5 fully connected layers with N = 100 neurons and pairs of layer normalization (Ba et al., 2016) and the Swish activation function (Elfwing et al., 2018) for nonlinearity (also see Fig. 3). These implementations are built on PyTorch (Paszke et al., 2017). As an optimizer of the networks, a robust SGD, i.e. LaProp (Ziyin et al., 2020) with t-momentum (Ilboudo et al., 2020) and d-AmsGrad (Kobayashi, 2021b) (so-called td-AmsProp), is employed with the default parameters except the learning rate.

Figure 3: Network architecture of the actor for the policy or the critic for the value function: the state is input to L series-connected modules; each module contains a fully connected layer, layer normalization (Ba et al., 2016), and a swish activation function (Elfwing et al., 2018); an additional fully connected layer shapes the features given by the L modules into the output.
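A minimal numpy sketch of such a trunk, assuming plain fully connected modules without learnable layer-norm gain/bias; the parameter shapes and initialization are illustrative only:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize features over the last axis (Ba et al., 2016); the learnable
    gain and bias of the full layer normalization are omitted in this sketch."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def swish(x):
    """Swish/SiLU activation: x * sigmoid(x) (Elfwing et al., 2018)."""
    return x / (1.0 + np.exp(-x))

def forward(x, weights, biases, w_out, b_out):
    """L series-connected modules (fully connected -> layer norm -> swish),
    then one fully connected output layer shaping the features."""
    h = x
    for W, b in zip(weights, biases):
        h = swish(layer_norm(h @ W + b))
    return h @ w_out + b_out

# illustrative shapes: d_s = 5 state inputs, L = 5 modules of N = 100 neurons, 3 outputs
rng = np.random.default_rng(0)
Ws = [rng.normal(scale=0.1, size=(5, 100))] + \
     [rng.normal(scale=0.1, size=(100, 100)) for _ in range(4)]
bs = [np.zeros(100) for _ in range(5)]
y = forward(rng.normal(size=(7, 5)), Ws, bs,
            rng.normal(scale=0.1, size=(100, 3)), np.zeros(3))
```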
To train the networks stably, a target network with t-soft update (Kobayashi and Ilboudo, 2021) is employed for stable and efficient improvement. Here, the baseline policy b is regarded as the output of the target network for π; that is, b is a kind of old policy. In addition, policy entropy regularization based on SAC (Haarnoja et al., 2018) with a regularization weight β_DE and TD regularization (Parisi et al., 2019) with a regularization weight β_TD are combined.
For acceleration of learning, either or both of the adaptive eligibility traces (Kobayashi, 2022) and experience replay are employed. Although the adaptive eligibility traces use the same hyperparameters as (Kobayashi and Ilboudo, 2021; Kobayashi, 2021a), the buffer size of the experience replay is reduced due to memory limitations. According to these configurations, the proposed method should modify the threshold adaptively and appropriately.

Environments
In this experiment, seven benchmark tasks are simulated on the Pybullet dynamics engine (Coumans and Bai, 2016) wrapped by OpenAI Gym (Brockman et al., 2016) (see Table 2). They can be divided into three simple and four complex tasks. That is, the first three tasks (i.e. InvertedPendulum, Swingup, and DoublePendulum) are the simple ones with only a one-dimensional action space, although they are nonlinear systems. The remaining tasks (i.e. Hopper, Walker2D, HalfCheetah, and Ant) are complex locomotion control tasks with multivariate action spaces and more than 10-dimensional state spaces. In total, 20 trials for the simple tasks and 10 trials for the complex tasks are conducted for each condition with different random seeds. Note that the robot's power in Walker2D was insufficient to walk; hence, it is amplified by a factor of 3.75 (i.e. from 0.4 to 1.5).

Results for simple tasks
To verify the value of the adaptive threshold, the first three tasks are conducted with the following conditions, where {e,r} denotes the use of the adaptive eligibility traces or the experience replay, respectively. For the fixed threshold, ε = 0.1 is set as one of the recommended values and the same value as the earlier work (Kobayashi, 2021a), which used the adaptive eligibility traces. For the adaptive threshold, all the recommended values in Alg. 2 are utilized. Note that the other PPO variants introduced in the introduction (Hämäläinen et al., 2020; Imagawa et al., 2019; Libardi et al., 2021) were omitted because they are not improvements of regularization and are difficult to compare fairly with the proposed method.

Table 2: Benchmark tasks (Brockman et al., 2016; Coumans and Bai, 2016)
ID                                   Name             State space d_s  Action space d_a  Episode E
InvertedPendulumBulletEnv-v0         InvertedPendulum  5               1                 200
InvertedPendulumSwingupBulletEnv-v0  Swingup           5               1                 200
InvertedDoublePendulumBulletEnv-v0   DoublePendulum    9               1                 2000
HopperBulletEnv-v0                   Hopper           15               3                 2000
Walker2DBulletEnv-v0                 Walker2D         22               6                 2000
HalfCheetahBulletEnv-v0              HalfCheetah      26               6                 2000
AntBulletEnv-v0                      Ant              28               8                 2000

First of all, all the learning curves are illustrated in Fig. 4. With the adaptive eligibility traces, InvertedPendulum and Swingup could be completely accomplished under all the conditions since the threshold was already tuned in the earlier work (Kobayashi, 2021a). However, only PPO-RPE(-A) could accomplish DoublePendulum sometimes; in particular, only the adaptive threshold enabled reaching a score around 5000, namely, it succeeded in accomplishing DoublePendulum over 50 % of the time. This is because online learning using the eligibility traces cannot maintain sufficient sample efficiency in a failure-prone task such as DoublePendulum, whereas PPO-RPE limits the exploration range of the policy by regularizing it well, making it easier to find the optimal solution.
With the experience replay, PPO-RB, PPOS, and PPO-RPE failed to accomplish the tasks. Since their performances generally decreased with stronger regularization, these results are probably due to too strong regularization of π toward b for the replayed data. As pointed out in (Zhu and Rosendo, 2021), PPO has no capability to softly constrain π to b even with a very small threshold, and therefore it yielded a near-optimal policy in the tasks except Swingup. In contrast, only PPO-RPE-A stably learned all the tasks. This is probably due to the relaxation of regularization (ε larger than 0.1).
To confirm the adaptability of the proposed method, Fig. 5 additionally shows the average thresholds in {e,r}PPO-RPE-A. Basically, we can see that the adaptive eligibility traces and the experience replay required small and large thresholds (i.e. around the lower and upper bounds of κ∆), respectively. Note that clamping by the bounds of ∆ worked only in a limited number of scenes, such as the second half of learning in InvertedPendulum. In addition, most of the thresholds did not converge to completely constant values, indicating adaptive behavior according to the learning progress. The benefits of such adaptive behavior need to be investigated.

Results for complex tasks
For further investigation of the performance of the proposed method, the following experiment is also conducted on the remaining complex locomotion tasks, where all the conditions use both the adaptive eligibility traces and the experience replay. Due to the bad performance of PPO-RB in the previous results, it is omitted here. To evaluate the value of the adaptive behavior during the learning progress, as indicated in the previous results, PPO-RPE (and PPO) with ε = 0.3, which is one of the recommended values closest to the approximate mean of the adaptive threshold (see Fig. 6), are also compared.
All the learning curves are illustrated in Fig. 7. In addition, after learning the policy for each task under each condition, the agent performed the task 100 times using the learned policy to compute the median of the scores. These test results are summarized in Fig. 8.
The wide confidence intervals of the learning curves and test scores suggest that Hopper and Walker2D are very unstable tasks, where the agent often falls over and resets the episode. In such environments, the policy should be updated carefully; in fact, only the proposed method (i.e. PPO-RPE-A) accomplished the task multiple times in 10 trials. Note that PPO-RPE without the adaptive threshold could not obtain the same results because the appropriate regularization strength was not served.
On the other hand, in HalfCheetah, PPO with both thresholds outperformed the proposed method. This is probably because this task is relatively stable but easily falls into a local solution (i.e. an upright state); as a result, sufficient exploration is required to get out of such a local solution. As pointed out in (Zhu and Rosendo, 2021), PPO does not softly constrain the latest policy to the baseline, and therefore it may retain enough exploration capability. Nevertheless, it is clear that, given the regularization strength adaptively, the task accomplishment rate of PPO-RPE-A is better than that of PPO-RPE. Although Ant also seems to be a stable task, it only requires local exploration to improve performance; hence, the proposed method obtained the maximum performance.
Figure 4: Learning curves for the simple tasks: in (a)-(c), the adaptive eligibility traces (Kobayashi, 2022) are employed, and due to the appropriate threshold, all the conditions succeeded in accomplishing InvertedPendulum and Swingup; however, due to the limitation of eligibility traces, DoublePendulum was accomplished only by PPO-RPE, and especially, the adaptive threshold enhanced its success rate; in (d)-(f), the experience replay method is employed, and due to too strict a threshold, PPO-RB and PPO-RPE could not accomplish any task, although PPO could do so, probably because of its lack of capability to softly constrain the latest policy to the baseline; by relaxation of the threshold, PPO-RPE-A achieved the best performance in all the tasks.

Finally, to reveal the gap between the latest and baseline policies, the numerically approximated Pearson divergence between them is depicted in Fig. 9. Clearly, PPO fails to softly constrain the latest policy to the baseline and is inadequate as a policy regularization method, although it does not impair the exploration capability. In contrast, PPO-RPE succeeded
in softly constraining the policy to the same degree, with or without the adaptive threshold. This suggests that the adaptive threshold, adjusted at key points, contributes to the performance improvement, even though it statistically gives the same level of regularization.

Conclusion and future work
This paper proposed PPO-RPE, a variant of PPO integrated with RPE divergence regularization, to constrain the policy to its baseline in a symmetric manner. As in the standard PPO, a threshold-based gain was derived for PPO-RPE, which makes tuning this regularization task-invariant. In addition to the earlier work (Kobayashi, 2021a), this paper further contributed the design of an adaptive threshold based on the symmetric property of the relative density ratio in RPE divergence. The conventional threshold can be interpreted as being set based on the theoretical error scale of the relative density ratio from its center, but the actual error scale is algorithm-dependent. Accordingly, the adaptive threshold was designed to consider the algorithm-dependent error scale, which is heuristically estimated from experience. This design can cancel the algorithm dependency of the regularization tuning. In the numerical simulations with two types of learning algorithms, only the proposed method could accomplish all the tasks with both learning algorithms, although the standard PPO was also robust to the change of learning algorithm due to its lack of capability to softly constrain the latest policy to the baseline. In the additional simulations on the four complex locomotion tasks, only the proposed method could achieve high scores even in the unstable tasks (i.e. Hopper and Walker2D), although the firm regularization of the policy updates resulted in lower performance than PPO on the task requiring sufficient exploration (i.e. HalfCheetah).
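As a loose illustration of estimating an algorithm-dependent error scale from experience: the class name, the exponential-moving-average rule, and all constants below are assumptions made for this sketch, not the paper's exact heuristic. The idea is to track how far the relative density ratio ρ_β = ρ/(1 − β + βρ) actually deviates from its center (one) during training, and cap the threshold accordingly:

```python
import numpy as np

class AdaptiveThreshold:
    """Illustrative sketch, not the paper's exact heuristic: keep an
    exponential moving average (EMA) of |rho_beta - 1| observed during
    training as the algorithm-dependent error scale for the threshold."""

    def __init__(self, eps_nominal=0.2, beta=0.5, decay=0.99):
        self.eps_nominal = eps_nominal  # nominal (maximum) threshold
        self.beta = beta                # mixing coefficient of rho_beta
        self.decay = decay              # EMA decay rate
        self.err_scale = eps_nominal    # running error-scale estimate

    def relative_ratio(self, rho):
        # rho_beta = rho / (1 - beta + beta * rho); equals 1 at rho = 1
        return rho / (1.0 - self.beta + self.beta * rho)

    def update(self, rho_batch):
        """Update the EMA from a batch of density ratios pi/b and
        return the adapted threshold (never above the nominal one)."""
        rho_beta = self.relative_ratio(np.asarray(rho_batch, dtype=float))
        observed = float(np.mean(np.abs(rho_beta - 1.0)))
        self.err_scale = self.decay * self.err_scale + (1.0 - self.decay) * observed
        return min(self.eps_nominal, self.err_scale)
```

With this kind of rule, an algorithm whose baseline stays close to the latest policy (small observed deviations) automatically receives a tighter threshold, while one whose baseline lags far behind keeps a looser one.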
The simulation results suggested that the regularization of the policy updates does not necessarily lead to improved performance of RL. By appropriately adjusting κ (e.g. decaying it like (Farsang and Szegletes, 2021)), we can expect a better balance between the strength of the regularization and the exploration performance. Alternatively, the existence of a flat region, as in PPO and PPOS, may be useful in order not to deteriorate the exploration performance and/or not to constrain the latest policy to a too old baseline. In the future, therefore, the proposed method will be improved in terms of the exploration performance while retaining the regularization capability. In addition, the proposed method can be integrated with the other PPO variants (Hämäläinen et al., 2020; Imagawa et al., 2019; Libardi et al., 2021). Hence, by investigating the integration with them, we can expect to obtain better learning performance in the near future.

[Figure caption: Adaptive thresholds with two learning algorithms: the shaded areas show the 95 % confidence intervals; with the eligibility traces, the adaptive threshold was smoothly decreased to near its minimum because the latest and baseline policies (π and b) are not far enough apart; with the experience replay, it was almost kept at its initial (i.e. maximum) value since b in the buffer is often old and π tends to be far from it; however, in both cases, the threshold mostly did not converge to a constant, and fluctuated adaptively according to the learning progress.]
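The decay of κ mentioned above could, for instance, be an exponential schedule with a floor. This is a generic sketch; the function name, rate, and floor value are placeholders, not values from the cited work:

```python
import math

def kappa_schedule(step, kappa0=1.0, kappa_min=0.1, rate=1e-4):
    """Exponentially decay the regularization weight kappa toward a floor,
    trading regularization strength for exploration as training proceeds."""
    return max(kappa_min, kappa0 * math.exp(-rate * step))
```

Early in training κ stays near its initial value, keeping the policy close to the baseline; as training proceeds it decays toward the floor, loosening the regularization and leaving more room for exploration.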

Appendix .1. Regularization term in PPO and PPO-RB
With eqs. (4), (7), and (8), the regularization term for PPO, which has not been explicitly defined in the previous works (Schulman et al., 2017; Wang et al., 2020), is derived. The changes in ρ_PPO compared to the original ρ should be transferred into the surrogate advantage function A_PPO, which is the sum of the original A and the negative regularization term for PPO, −Ω_PPO. When neither clipping nor rollback is done, Ω_PPO is clearly zero. In contrast, when clipping or rollback is applied, the following relationship is derived.
Therefore, the regularization term in PPO, Ω_PPO, is given as follows. As can be seen from this regularization term, PPO regularizes the policy adaptively according to the advantage function A. Although this seems to be a kind of hinge loss function, which is zero if and only if ρ = 1 + σ or when neither clipping nor rollback occurs, its mathematical meaning is not intuitive, even when converting it to its expectation E_b[ρΩ_PPO] as the regularization target.
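The two displayed equations referenced above do not survive in this excerpt. Under the standard clipping rule ρ_PPO = clip(ρ, 1 − σ, 1 + σ), a plausible reconstruction consistent with the surrounding text (a sketch, not the paper's verbatim equations) is:

```latex
% When clipping is active, rho_PPO is frozen at 1 + sigma (for A > 0)
% or 1 - sigma (for A < 0), and the change is transferred into A_PPO:
\rho_{\mathrm{PPO}} A = \rho A_{\mathrm{PPO}} = \rho \left( A - \Omega_{\mathrm{PPO}} \right)
% Solving for the regularization term:
\Omega_{\mathrm{PPO}} = A \left( 1 - \frac{1 \pm \sigma}{\rho} \right)
  = \frac{A \{ \rho - (1 \pm \sigma) \}}{\rho},
\qquad
\mathbb{E}_b \left[ \rho \Omega_{\mathrm{PPO}} \right]
  = \mathbb{E}_b \left[ A \{ \rho - (1 \pm \sigma) \} \right]
```

This form is zero exactly at ρ = 1 ± σ or when no clipping occurs, matching the hinge-like behavior described in the text.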
[Figure caption: ...in acquiring locomotion multiple times in 10 trials; PPO is suitable for learning HalfCheetah, which is stable and requires high exploration capability, although the proposed method could adjust its regularization strength, yielding multiple successes; in the Ant task, PPO-RPE-A outperformed the others, probably because it is stable and local exploration is enough to find a global solution.]

Appendix .2. Derivation of eq. (16)
With ρ_β = ρ/(1 − β + βρ), the partial differentiation ∂ρ_β/∂ρ is given as follows.
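The displayed derivative is missing from this excerpt. Differentiating ρ_β = ρ/(1 − β + βρ) via the quotient rule yields the following; the final rewriting in terms of ρ_β/ρ is an assumption about the paper's preferred form:

```latex
\frac{\partial \rho_\beta}{\partial \rho}
 = \frac{(1 - \beta + \beta\rho) - \beta\rho}{(1 - \beta + \beta\rho)^2}
 = \frac{1 - \beta}{(1 - \beta + \beta\rho)^2}
 = (1 - \beta) \left( \frac{\rho_\beta}{\rho} \right)^2
```

The last equality follows from ρ_β/ρ = 1/(1 − β + βρ).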
Using this, eq. (16) can be derived in the following steps.