Exploring and Exploiting Conditioning of Reinforcement Learning Agents

The effect of regularizing the Jacobian singular values has been studied for supervised learning problems. In supervised learning settings, for both linear and nonlinear networks, Jacobian regularization allows for faster learning. It was also shown that regularizing the Jacobian condition number can help to avoid the "mode-collapse" problem in Generative Adversarial Networks. In this paper, we try to answer the following question: Can information about the conditioning of the policy network Jacobian help to shape a more stable and general policy for reinforcement learning agents? To answer this question, we study the behavior of Jacobian conditioning during policy optimization. We analyze the conditioning of agents trained with different policies under different sets of hyperparameters and study the correspondence between the conditioning and the achieved rewards. Based on these observations, we propose a conditioning regularization technique. We apply it to the Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) algorithms and compare their performance on 8 continuous control tasks. Models with the proposed regularization outperform other models on most of the tasks. We also show that the regularization improves the agent's generalization by comparing PPO performance on CoinRun environments. In addition, we propose an algorithm that uses the condition number of the agent to form a robust policy, which we call Jacobian Policy Optimization (JPO). It directly estimates the condition number of an agent's Jacobian and adjusts the policy update accordingly. We compare it with PPO on several continuous control tasks in PyBullet environments, and the proposed algorithm provides more stable and efficient reward growth across a range of agents.


I. INTRODUCTION
Reinforcement Learning (RL) is the area of Machine Learning concerned with finding optimal actions for an agent interacting with an environment. Although one can say that an RL algorithm should predict the best action for the agent in the current situation, the RL setting differs from supervised learning: the algorithm has to search for a set of interdependent actions that maximize a reward function. It is thus not surprising that the generalization problem in RL is different from the supervised learning generalization problem [1]. Specific techniques are needed to avoid overfitting of RL algorithms [2].
In RL, an agent's performance on the test data depends on the agent's architecture, because different architectures have different prior algorithmic preferences (inductive biases) [1]. This can lead to situations where agents achieve different scores on the test set even though all of them achieve the same rewards during training. For example, Convolutional Neural Network (CNN) agents are too sensitive to small visual changes and can completely fail due to perturbations [3]. Techniques such as randomizing the first CNN layer can prevent this and help to learn robust representations [3]. However, the question of how to control an agent's sensitivity to small changes in an environment remains.
Policy gradient methods learn the policy directly with a parameterized function [4]. Vanilla policy gradient methods are easy to implement and tune in comparison to other vanilla Deep RL methods [5], which require tuning a wider range of hyperparameters. Modern policy gradient approaches are central to breakthroughs in using deep neural networks for control, from video games [6] to Go [7] and Chess [8]. However, policy gradient methods often have poor sample efficiency and exploration, and agents need millions (or billions) of timesteps to learn tasks. In policy gradient methods, updates occur in small steps, which may lead to exploration problems in some cases. Agents may under-explore their environments or under-develop strategies. If an agent gets trapped in certain states at the beginning of training, the experience obtained later from different locations of the environment may not lead to significant policy changes because the clip range has become too small. In complex environments where the transition from one area to another is not ''trivial'', the agent may be stuck in a local minimum [9]. As a result, the agent has a productive strategy only in a small area of the environment, which leads to unpredictable policy outputs for unseen states.
Restricting policy updates may not always be reasonable. An agent visits profitable states as well, and a new policy based on these data may be more competent than the old one. However, determining whether the new policy is more relevant than the old one is a challenge: the obtained rewards do not always reflect this, and in many environments rewards are unavailable during most training phases. This difficulty is caused by factors such as reward sparsity or highly noisy gradients [10].
To decide how much we need to restrict a policy at each update, we need some indirect policy characteristic that does not depend on the rewards. One such measurement, which can be computed from the state of the neural network alone, is whether the network fulfills the property of Dynamical Isometry [11]. This property is achieved when the mean squared singular value of the network's input-output Jacobian is O(1) [11]. Forcing a network to achieve the Dynamical Isometry property, for example by using orthogonal weight initialization, can dramatically speed up training, especially for very deep networks with dozens of layers and without Batch Normalization [12] or residual connections [13].
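As an illustration (not from the paper), the following minimal sketch, assuming PyTorch, checks how close a small network is to Dynamical Isometry by inspecting the singular values of its input-output Jacobian under orthogonal initialization; all names and sizes are illustrative.

import torch
import torch.nn as nn

def jacobian_singular_values(net, x):
    # Singular values of the input-output Jacobian evaluated at a single input x.
    jac = torch.autograd.functional.jacobian(net, x)   # shape: (out_dim, in_dim)
    return torch.linalg.svdvals(jac)

in_dim, hidden, out_dim = 32, 64, 8
net = nn.Sequential(
    nn.Linear(in_dim, hidden), nn.Tanh(),
    nn.Linear(hidden, hidden), nn.Tanh(),
    nn.Linear(hidden, out_dim),
)
# Orthogonal initialization pushes the squared singular values toward O(1).
for m in net.modules():
    if isinstance(m, nn.Linear):
        nn.init.orthogonal_(m.weight)
        nn.init.zeros_(m.bias)

sv = jacobian_singular_values(net, torch.randn(in_dim))
print("mean squared singular value:", (sv ** 2).mean().item())
print("condition number:", (sv.max() / sv.min()).item())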
The role of the distribution of Jacobian singular values was also studied for Generative Adversarial Networks (GANs). It was shown that the conditioning of the generator Jacobian is causally related to generator performance, and that a conditioning regularization can help to avoid the ''mode-collapse'' problem, in which GAN generators represent only a few modes of the true distribution [14]. Here, conditioning (the condition number) is a measure of how much the output of a function can change in response to a small change in its input.
Motivated by these powerful effects of Jacobian well-conditioning, we raise the following research questions:
• Q1: Does Jacobian conditioning influence RL agents' performance?
• Q2: Can Jacobian conditioning regularization help to shape a more stable and general policy of RL agents?
Our studies show positive answers to both questions. We conducted a study of the relationship between policy performance and conditioning, described in Section III, and found that in many cases better policies are better conditioned. Based on this, we propose a conditioning regularization technique aimed at improving the performance of policy optimization agents, which is described and studied in Section IV. We also propose an algorithm that clips policy changes with respect to conditioning, described and studied in Section V. To the best of our knowledge, this is the first work that studies the conditioning of reinforcement learning agents.

II. POLICY OPTIMIZATION
A. VANILLA POLICY GRADIENT
Williams et al. [4] proposed the commonly used estimator of the policy gradient:

ĝ = Ê_t[ ∇_θ log π_θ(a_t | s_t) Â_t ],

where π_θ is a stochastic policy with parameters θ, a_t and s_t are the action and state at timestep t with reward r_t, and Â_t is an estimator of the advantage at timestep t. The corresponding surrogate objective, whose gradient is the estimator above and which is optimized with a method closely related to stochastic gradient descent, is

L^PG(θ) = Ê_t[ log π_θ(a_t | s_t) Â_t ].

The expectation Ê_t is taken across several timesteps up to a finite horizon with reward r_t. There are many methods for estimating a policy gradient. For example, Actor-Critic methods use a value function approximation to get a lower-variance advantage estimate [15]. One of the main disadvantages of the vanilla policy gradient method is that the variance of its gradients is vast, and large policy updates can cause policy performance degradation.
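A minimal sketch of this surrogate, assuming PyTorch, a policy that outputs a torch.distributions object, and a precomputed advantage estimate; the names are illustrative:

import torch

def vanilla_pg_loss(dist, actions, advantages):
    # L^PG(θ) = Ê_t[ log π_θ(a_t|s_t) Â_t ], negated so an optimizer can minimize it;
    # the gradient of this loss is (minus) the policy-gradient estimator above.
    log_probs = dist.log_prob(actions)
    return -(log_probs * advantages.detach()).mean()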
During the optimization of policy gradient algorithms, we search over a sequence of policies Π = {π_i}. In this approach, we do not have direct control over the policy, because the policy is updated by applying changes in the space of the parameters θ ∈ Θ ⊂ R^m. One of the problems is that the policy and parameter spaces do not always map congruently. Taking a step with a gradient method in the parameter space, we have no handle on the relation between the parameters and the probability distribution that controls the agent's actions. This is why small changes in the parameter space can yield large changes in the probability distribution. If two pairs of parameters are equidistant in the parameter space, d_θ(θ_1, θ_2) = d_θ(θ_2, θ_3), the corresponding policies are not necessarily equidistant: d_θ(θ_1, θ_2) = d_θ(θ_2, θ_3) does not imply d_π(π_θ_1, π_θ_2) = d_π(π_θ_2, π_θ_3). This is a significant problem, since it is unclear how a policy will change after the parameters are updated. The issue is especially important in RL, because the agent has control over the data that it will collect at future steps. As a result, an update that makes the policy worse is risky: with a non-optimal policy, the observations collected in the next episode will be less useful. A downward spiral can appear, where the policy starts to collect inferior rewards at every time step and therefore no longer visits states that could enhance it. If the policy receives such poor updates, it needs new useful observations to recover. This problem is well known as performance collapse [16]. It can be avoided by choosing a proper update step size, but this is often challenging. Furthermore, even if the model overcomes performance collapse, poor sample efficiency can be unavoidable.
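To make this concrete, a small numerical illustration (not taken from the paper): for one-dimensional Gaussian policies, the same Euclidean step of 0.1 in the mean parameter corresponds to wildly different KL distances between the resulting action distributions, depending on the current standard deviation.

from torch.distributions import Normal, kl_divergence

for sigma in (1.0, 0.01):
    p = Normal(0.0, sigma)            # old policy
    q = Normal(0.1, sigma)            # new policy: identical parameter-space distance
    print(sigma, kl_divergence(p, q).item())
# sigma = 1.0  -> KL = 0.005  (negligible policy change)
# sigma = 0.01 -> KL = 50.0   (drastic policy change)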
To solve these problems, several policy gradient methods were proposed. Methods such as Trust Region Policy Optimization (TRPO) [17], Proximal Policy Optimization (PPO) [18], and Kronecker-Factored Approximate Curvature (K-FAC) [19] use penalty or clipping techniques to avoid performance collapse. In these methods, in particular in PPO, policy updates occur in small steps, and the magnitude of these steps often decreases towards the end of the learning process. These methods are relatively complicated but outperform approaches such as deep Q-learning [6] or vanilla policy gradient methods [15] on Atari games, MuJoCo, and other RL environments.

B. TRUST REGION POLICY OPTIMIZATION
In Trust Region Policy Optimization (TRPO), a surrogate objective for the local approximation of the expected return of the policy was introduced:

L^IS_{θ_old}(θ) = Ê_t[ (π_θ(a_t | s_t) / π_θ_old(a_t | s_t)) Â_t ],

where IS stands for ''importance sampling'' and θ_old denotes the parameters of the old policy. With a recurrent policy, this gradient method is run for T timesteps and the advantage estimate Â_t is then computed from the collected trajectory:

Â_t = −V(s_t) + r_t + γ r_{t+1} + ... + γ^{T−t+1} r_{T−1} + γ^{T−t} V(s_T).

The truncated version of the generalized advantage estimation, which reduces to the expression above when λ = 1, is as follows:

Â_t = δ_t + (γλ) δ_{t+1} + ... + (γλ)^{T−t+1} δ_{T−1}, where δ_t = r_t + γ V(s_{t+1}) − V(s_t).

This surrogate is differentiable in the same way as the vanilla policy gradient, and by the chain rule its gradient matches the vanilla policy gradient when θ = θ_old:

∇_θ L^IS_{θ_old}(θ) |_{θ=θ_old} = ∇_θ L^PG(θ) |_{θ=θ_old}.

Due to the locality of the approximation, TRPO forces the policy to stay in a ''trust region'' and provides a theoretical justification that leads to a guaranteed monotonic improvement in policy performance [17]. Such a restriction is achieved by computing each policy step as the solution of the following constrained optimization problem:

maximize_θ  Ê_t[ (π_θ(a_t | s_t) / π_θ_old(a_t | s_t)) Â_t ]
subject to  Ê_t[ KL[π_θ_old(· | s_t), π_θ(· | s_t)] ] ≤ δ,

where KL is the Kullback-Leibler divergence and δ is a positive constant. Since this is a hard problem to solve, TRPO uses a linear approximation for the objective and a quadratic approximation for the constraint. Furthermore, the problem can be reformulated by replacing the constraint with a penalty:

maximize_θ  Ê_t[ (π_θ(a_t | s_t) / π_θ_old(a_t | s_t)) Â_t − β KL[π_θ_old(· | s_t), π_θ(· | s_t)] ],

where β is a penalty coefficient. Nonetheless, TRPO typically uses a constraint rather than a penalty, mainly because a single choice of β that works well across different tasks is hard to find. It is also unsafe to rely on β in problems where environmental characteristics change during the learning process.
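For illustration, a minimal sketch of the importance-sampled surrogate and its KL-penalized variant, assuming PyTorch and precomputed per-timestep log-probabilities, advantages, and mean KL; real TRPO instead solves the constrained problem with a conjugate-gradient step and a line search, which this sketch does not attempt:

import torch

def trpo_penalty_loss(log_probs_new, log_probs_old, advantages, mean_kl, beta):
    # Importance weights π_θ(a_t|s_t) / π_θ_old(a_t|s_t).
    ratio = torch.exp(log_probs_new - log_probs_old.detach())
    surrogate = (ratio * advantages.detach()).mean()   # L^IS_{θ_old}(θ)
    # Maximize L^IS - β·KL, i.e. minimize its negative.
    return -(surrogate - beta * mean_kl)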

C. PROXIMAL POLICY OPTIMIZATION
Unlike TRPO, Proximal Policy Optimization (PPO) is a method that strikes a balance between ease of implementation, sample complexity, and ease of tuning. In the PPO variant featuring a KL penalty, the penalty coefficient β is dynamically scaled to adapt the trust region. The PPO method represents an alternative approach to the natural gradient. While TRPO enforces locality assumptions through constraints, PPO clips the probability ratio between two consecutive policies:

r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t),
L^CLIP(θ) = Ê_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1 − η, 1 + η) Â_t ) ].

The expectation here is taken over the minimum of two terms. The first term is the unconstrained TRPO objective. The second one is also based on the TRPO objective, but the probability ratio is additionally clipped to the interval [1 − η, 1 + η] [18]. The purpose of this objective is to keep this minimum as high as possible while allowing the probability ratio to decrease but not to exceed the fixed rate 1 + η.
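A minimal sketch of the clipped surrogate, assuming PyTorch; the default η = 0.2 and the negation for use with a minimizer are conventions assumed here, not prescribed by the text:

import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, eta=0.2):
    ratio = torch.exp(log_probs_new - log_probs_old.detach())          # r_t(θ)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eta, 1.0 + eta) * advantages
    return -torch.min(unclipped, clipped).mean()                        # -L^CLIP(θ)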
An important note is that the PPO objective is equal to the TRPO objective near θ_old. PPO approximates the TRPO trust region by clipping policy updates to some desired range. Choosing a clipping parameter value for PPO that efficiently bounds the policy updates is a challenging task. In many cases, PPO uses a constant clipping parameter. Moreover, it was recently shown that PPO does not effectively approximate the trust region, as it fails to bound the maximum of the policy ratios [20].
Based on that, the authors argued that there is a need for either a technique that enforces trust regions more strictly or a rigorous theory of trust region relaxations.

III. CONDITIONING OF REINFORCEMENT LEARNING AGENTS
A. CONDITIONING ESTIMATION
Evaluating the mean squared singular value of the network's input-output Jacobian using Singular Value Decomposition (SVD) [21] is time-consuming, because SVD would have to be applied at every training timestep. For faster learning, we adapted the Jacobian Clamping (JC) technique, originally designed to assess GAN models, to the estimation of the Jacobian conditioning of RL agents [14]. JC penalizes the condition number of the generator's Jacobian to bring it inside the interval where the Dynamical Isometry property can be achieved. The authors also proposed a simple and efficient approach to estimate the singular values of the Jacobian of a deep neural network:

Q = ||G(z) − G(z + ε)|| / ||ε||,

where G is a generator, z ∼ p(z), ε ∼ U_s(0, 1), and U_s is the uniform distribution over a unit sphere. Their experiments showed that JC can stabilize generator behavior and help it to avoid mode collapse, which means that, in general, it can help networks to preserve a wide coverage of the desired distribution.
To compute the condition number of RL agents, we feed two mini-batches to the agent at a time. The first batch consists of the real environment states S_t at timestep t. The second batch consists of the same states with an added disturbance δ. We then estimate how differently the agent responds to the two batches: J_t = ||π_θ(S_t) − π_θ(S_t + δ)|| / ||δ||. After this, we compute the value ψ_t that characterizes how close J_t is to the range (λ_min, λ_max). These values approximately set the desirable range for model conditioning. We set them equal to the range defined previously for GANs, namely (1, 20).
More details are presented in Algorithm 1.
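Complementing Algorithm 1, the following is a minimal sketch of this estimate, assuming PyTorch and a policy network that maps a batch of states to a batch of action means or logits; the quadratic form of the penalty ψ_t (how the distance to the range (λ_min, λ_max) is measured) mirrors the Jacobian Clamping loss used for GANs and is an assumption here, as are the names and defaults.

import torch

def conditioning_penalty(policy, states, eps=1.0, lam_min=1.0, lam_max=20.0):
    # Perturb every state with a random direction of fixed norm eps.
    delta = torch.randn_like(states)
    delta = eps * delta / delta.norm(dim=-1, keepdim=True)
    out, out_pert = policy(states), policy(states + delta)
    # Finite-difference Jacobian quotient J_t per state.
    J = (out - out_pert).norm(dim=-1) / delta.norm(dim=-1)
    # ψ_t: quadratic distance of J_t from the target range (λ_min, λ_max).
    psi = (torch.clamp(J - lam_max, min=0.0) ** 2
           + torch.clamp(lam_min - J, min=0.0) ** 2)
    return psi.mean()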

B. RESEARCH ON CONDITIONING AND POLICY PERFORMANCE
To examine the relation between policy performance and the condition number, we run PPO with different hyperparameters and random seeds on four continuous control PyBullet [22] environments: Humanoid-v0, Hopper-v0, Ant-v0, and Reacher-v0. Through these trials, we examine whether ineffective policies are less well-conditioned. We use the standard PPO parameters as the optimal configuration and make three adjustments to those parameters to produce less effective policies.
In each configuration, we use the same minibatch size, number of timesteps T, number of PPO epochs, policy learning rate (LR), and η. The parameters that we tune are: the value function (VF) coefficient, the VF LR, the number of VF epochs, the GAE parameter, and the discount γ [17], [18]. The sets of hyperparameters are presented in Table 1. We test each setting on 4 PyBullet environments with 3 random seeds and 10 agents for each seed. Results of the experiments are presented in Fig. 1. They show that the conditioning follows behavior patterns similar to those of the received rewards.

Algorithm 1 Conditioning Estimation
Input: policy π_θ, perturbation norm ε, target quotients λ_max and λ_min, minibatch size M, number of epochs K, state size n
for iteration 1, 2, ..., K do
  for actor 1, 2, ..., A do
    δ ∈ R^{M×n} ∼ N(0, 1)
    δ := (δ / ||δ||) ε
    for timestep t = 1, ..., T do
      Make action with policy π_θ at state S_t
      J_t := ||π_θ(S_t) − π_θ(S_t + δ)|| / ||δ||
      Compute ψ_t, the distance of J_t from the range (λ_min, λ_max)
    end for
  end for
end for
On the Humanoid task, we found that the most effective policy has the lowest condition number. Furthermore, the drop in the condition number for Params-2 corresponds to the moment of a sharp increase in the agent's rewards. This shows that it is important to control sensitivity to input changes for the Humanoid task agent, and that better conditioning can lead to a more stable policy.
The contribution of conditioning to the achieved rewards is highly task-dependent because of differences in environment structure and dynamics. For some tasks, it is more important to have long-term stable policies than for others. The connection between policy performance and conditioning is not clearly evident in the Ant task. However, Params-2 and Params-3, which obtain smaller reward values, are further away from the low conditioning values. Furthermore, it is worth noting that policies which perform well and gain higher reward values at the end of training are better conditioned, often even from the first training steps.
Because of environment dynamics, a linear relationship between the reward curves and the condition number is difficult to establish. However, based on these experiments, a general pattern can be observed: a policy that receives fewer rewards is less well-conditioned. Also, recalling the benefits that Dynamical Isometry provides for deep nonlinear networks in classification and generation tasks, we assume that if an agent is closer to Dynamical Isometry, it will find a more stable and efficient policy.

IV. CONDITIONING REGULARIZATION
A. CONDITIONING REGULARIZATION TECHNIQUE
It was shown that in supervised learning settings, for linear and nonlinear networks, Jacobian conditioning regularization allows for faster learning [11], [23]. In GANs, Jacobian conditioning regularization [14] allows more stable training of the generator network and helps to avoid the ''mode-collapse'' problem [24]. To regularize the policy, we simply use the condition number as a penalty. An example of regularized PPO is presented below. We use the PPO algorithm and add the value ψ to the surrogate policy loss:

L_t(θ) = Ê_t[ L_t^CLIP(θ) − c_1 ψ_t + c_3 S[π_θ](s_t) ],

where L^CLIP is the PPO policy loss, c_1 is the coefficient for the conditioning penalty ψ_t, and S[π_θ](s_t) is the policy entropy for state s_t, multiplied by the entropy coefficient c_3. The conditioning penalty can be applied to other algorithms too; in our experiments, we used it for TRPO as well. The condition number used for the penalty is computed on the new policy in both the PPO and TRPO algorithms.
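A minimal sketch of the regularized surrogate, assuming PyTorch; ψ is the conditioning penalty from Section III, c_1 = 0.001 follows the experiments below, while the entropy coefficient value and all names are illustrative assumptions:

import torch

def regularized_policy_loss(log_probs_new, log_probs_old, advantages,
                            psi, entropy, eta=0.2, c1=0.001, c3=0.01):
    ratio = torch.exp(log_probs_new - log_probs_old.detach())
    clipped = torch.clamp(ratio, 1.0 - eta, 1.0 + eta) * advantages
    l_clip = torch.min(ratio * advantages, clipped).mean()   # L^CLIP(θ)
    # Maximize L^CLIP - c1·ψ + c3·S, i.e. minimize its negative.
    return -(l_clip - c1 * psi + c3 * entropy)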

B. COMPARISON ON CONTINUOUS CONTROL TASKS
We conduct experiments with the regularization technique on the PPO and TRPO algorithms. We optimize 30 agents for each task (10 agents per random seed) over 2500 updates (5 million timesteps). We test the algorithms on the Humanoid-v0, Hopper-v0, Ant-v0, Reacher-v0, Double Inverted-Pendulum-v0, Humanoid-Flag-v0, Walker-v0, and Half-cheetah-v0 environments. In these tests, the hyperparameter values are equal to the optimal ones reported in the PPO and TRPO literature [17], [18] for continuous control tasks. For the TRPO algorithm, we also use the mean conditioning over a trajectory as a penalty on the surrogate policy loss. In all experiments, the model named ''reg'' is the conditioning-regularized model. We use the penalty multiplied by a coefficient c_1 equal to 0.001. The results of comparing PPO with its regularized version are presented in Fig. 2, and the results of comparing TRPO with its regularized version are presented in Fig. 3.
Both basic TRPO and its regularized version show better results than PPO. The average rewards for the last 100 updates are shown in Table 2.

C. COMPARISON ON GENERALIZATION
Our continual learning problem was set up without explicitly separated training and testing stages. In the generalization experiments, we train models on a fixed large-scale set of 500 levels of CoinRun [25] and test them on unseen levels. In this experiment, we run PPO with L2 and Dropout [26] regularization. We then run the same methods with the proposed conditioning penalty added. For this experiment, we use the NatureCNN architecture proposed for the tests in [25]. We also test the PPO method without L2 and Dropout regularization but based on the IMPALA [27] architecture.
We noticed a high variance in scores during tests. Due to that, we increased the number of runs at the evaluation stage from 5, as used in [3], to 20. We trained models over 50M timesteps, but only on one random seed; all other settings were the same as described in [3] (Section 4.2).

TABLE 2. Mean reward over the last 100 optimization steps for TRPO, PPO, PPO reg, and TRPO reg. The mean was computed over 3 random seeds and 10 agents for each seed using optimal policy hyperparameters.
Results are presented in Figure 4 and Table 3. Our method outperforms PPO in all 4 training scenarios.

V. PROBABILITY RATIO CLIPPING
A. JACOBIAN POLICY OPTIMIZATION ALGORITHM
Experiments in [18] verify that PPO with a clipped probability ratio performs best. However, the authors reported that it was difficult to choose the right clipping interval size. This observation has been confirmed in other studies.
Ilyas et al. [20] showed that the maximum policy ratios of PPO variants regularly violate the ratio trust region. However, in our experiments, the maximum ratios of some PPO variants typically stayed inside the trust region, see Fig. 1. Conversely, low maximum ratio values do not reflect a policy's success, and rapidly decreasing maximum ratios are common for poor policies. In our opinion, this behavior is related to exploration problems: agents keep visiting identical states that do not change their policies.
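As a diagnostic in the spirit of [20], one can track the largest probability ratio in each update batch and compare it with 1 + η; a hedged sketch, assuming PyTorch, not code from either paper:

import torch

def max_policy_ratio(log_probs_new, log_probs_old):
    # Largest importance ratio max_t π_new(a_t|s_t) / π_old(a_t|s_t) in the batch;
    # values far above 1 + η indicate that the clip-induced trust region is violated.
    return torch.exp(log_probs_new - log_probs_old).max().item()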
In contrast to the standard PPO implementation, which clips the probability ratio within a fixed interval, we propose a method that checks how close the condition numbers of the old and new policies are to the desired range and shapes the clipping range based on that. The idea is that we trust the better-conditioned policy. If the old policy is better conditioned than the new one, we decrease the clip parameter by a specific value τ. If the new policy is better conditioned, the clip parameter does not change, and the policy can be updated more radically.
More specifically, at each timestep we estimate the value ψ of the old policy π_old and the new policy π. We define a function φ that can take two values, 0.0 or τ. This function takes the ψ values of the old and new policy as input: it returns τ if the old policy is better conditioned than the new one, and 0.0 otherwise. This value is then used as a penalty on the clip parameter η:

φ(ψ_old, ψ) = τ if ψ_old < ψ, and 0.0 otherwise.
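A minimal sketch of this rule, assuming PyTorch; the helper names, the default clip range η = 0.2, and the exact way the adjusted range enters the clipped surrogate (what the text calls L^CCLIP) are illustrative assumptions rather than a reference implementation:

import torch

def phi(psi_old, psi_new, tau=0.01):
    # Return τ if the old policy is better conditioned (smaller ψ), else 0.0.
    return tau if psi_old < psi_new else 0.0

def jpo_clip_loss(log_probs_new, log_probs_old, advantages,
                  psi_old, psi_new, eta=0.2, tau=0.01):
    eta_adj = eta - phi(psi_old, psi_new, tau)                 # penalized clip range
    ratio = torch.exp(log_probs_new - log_probs_old.detach())
    clipped = torch.clamp(ratio, 1.0 - eta_adj, 1.0 + eta_adj) * advantages
    return -torch.min(ratio * advantages, clipped).mean()      # -L^CCLIP(θ)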
Using the Jacobian clipping technique, the total loss is then formed by taking into account the squared-error loss of the value function V_θ with value loss coefficient c_1, the entropy S[π_θ](s_t) for state s_t with entropy coefficient c_2, and L^CCLIP:

L_t(θ) = Ê_t[ L_t^CCLIP(θ) − c_1 (V_θ(s_t) − V_t^targ)^2 + c_2 S[π_θ](s_t) ],

where L^CCLIP is the clipped surrogate computed with the adjusted clip range η − φ(ψ_old, ψ). We performed a comparison between PPO and JPO. The parameters for these tests correspond to the values in Table 1.
The value of τ is set to 0.01. The results are presented in Fig. 5. According to the plots, the proposed algorithm provides more stable and efficient reward growth. The average rewards over the last 100 updates for the optimal parameters are shown in Table 4. In all environments, JPO outperformed PPO, demonstrating the importance of finding the right clip parameter for efficient policy shaping. Even though the JPO clipping rule favors the better-conditioned policy, the resulting conditioning values can be larger. In our opinion, there are several reasons for this. First, as noted above, conditioning is very sensitive to the size of the steps with which we update the policy; with narrower clip values, network parameters change more slowly. Second, we always update the policy, but the clip range of the update is smaller in the less conditioned direction. At the same time, the plots clearly show that the use of regularization brings conditioning closer to the given range.

TABLE 4. PPO, PPO with conditioning regularization, and JPO mean reward over the last 100 optimization steps. The mean was computed over 3 random seeds and 10 agents for each seed with optimal policy hyperparameters.

VI. CONCLUSION
In this work, we propose a simple and computationally inexpensive optimization method for Deep RL. We adapted a technique called Jacobian Clamping to approximately estimate the conditioning of an agent. We tested our approach on the PyBullet and CoinRun domains. In our opinion, extending RL algorithms by conditioning regularization is a promising research direction. The condition number can provide important information about the policy, such as the correctness of hyperparameters or stability.
Our experiments show that conditioning regularization produces different results for different architectures. We plan to test the contribution of conditioning for other architectures as well and to run them on environments such as DeepMind Lab [28]. We also plan to compare conditioning regularization with other methods, such as the information bottleneck [29]-[31]. Estimating the squared singular values of the agent's Jacobian matrix using SVD would also be a very interesting experiment for examining the role of Dynamical Isometry in RL agents.
Our experiments demonstrate that conditioning influences an RL agent's efficiency and can help to shape a more stable policy. Agent conditioning can provide important information about the policy, such as the correctness of hyperparameters or its stability, and can indicate problems with environment exploration. Our results show that the PPO algorithm is sensitive to the value of the clip parameter. The selection of the clip parameter is critical, and the condition number allows this value to be determined more accurately.
A promising research direction is the development of techniques that make the clip parameter more sensitive to the difference between old and new policy conditioning. Finally, an important direction is the use of the proposed techniques in state-of-the-art methods based on PPO.