Evaluating the Effectiveness of Deep Reinforcement Learning Algorithms in a Walking Environment

Deep Reinforcement Learning algorithms have been shown to perform well on complex tasks such as video games and chess. However, for locomotion tasks, picking the right algorithm and hyperparameters remains a challenge for many researchers. This project addressed that issue by determining which of three reinforcement learning algorithms worked most effectively to help a computer learn to walk, without any external supervision or guidance, in a simulated environment. In addition, the project determined the best learning rate for the algorithms by testing six learning rates. A walking environment was used because it is considered a good representative of a large class of reinforcement learning problems. Proximal policy optimization was found to be the most effective, followed by trust-region policy optimization and the vanilla policy gradient. The algorithms worked best with a learning rate of 1e-3.


Introduction
Reinforcement learning algorithms, especially those that incorporate neural networks, called Deep Reinforcement Learning algorithms, have shown powerful learning abilities, as demonstrated by AlphaGo and similar complex tasks Mnih et al., 2015, Sutton and Barto, 2015, Silver et al., 2017. However, choosing the right algorithm and set of hyperparameters for locomotion tasks remains difficult Peng et al., 2016. A walking environment was used because it is considered a good representative of a large class of reinforcement learning problems.

Basic Terminology
Reinforcement learning is a class of machine learning algorithms used to help computers learn to make decisions in an environment with little or no external guidance during the learning process Sutton and Barto, 2015. The agent learns to make decisions solely from the states of the environment, the rewards that the environment returns to the agent, and the actions that the agent takes. The goal of reinforcement learning is to maximize a numerical reward by learning what to do and mapping situations to actions Lu, 2017. In the absence of existing training data, the agent learns from experience, updating its weights to maximize the reward.
An artificial neural network (ANN) is a computational method, inspired by neurological systems, for representing complex functions Schmidhuber, 2014. ANNs can be represented by neurons and axons, which together form a large net. Neurons are organized in layers, and the organization of neurons in a layer determines the purpose of that layer. In this project, only fully-connected (FC) layers, in which every neuron in one layer is connected to every neuron in the next, were used.
ANNs update their weights to improve the accuracy of the function using a method known as gradient descent or ascent (depending on the type of problem), which computes the difference between the expected and actual value and updates each weight by propagating that error backwards through the network Ruder, 2017. What the "value" is differs by application; in this case it refers to a reward value given by the environment. Gradient descent also uses a hyperparameter called the learning rate (LR) that controls the speed of convergence Goodfellow et al., 2017. Because neural networks represent extremely complex functions, gradient descent has to take small steps toward the optimal set of weights to avoid overshooting a local optimum Goodfellow et al., 2017, Baird and Moore, 1999. Off-the-shelf implementations of gradient descent, such as Adam (ADAptive Moment estimation), which adaptively adjusts the learning rate as the algorithm nears convergence, are easier to use in applications Kingma and Ba, 2015. As the goal of the algorithm was to maximize the reward, gradient ascent was used rather than gradient descent. However, the underlying methodology of gradient descent and ascent is the same.
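As an illustration of the update rule described above, the following is a minimal sketch (not the project's actual training code) of gradient ascent on a single weight with a fixed learning rate; the reward function and its gradient are hypothetical placeholders chosen so that the optimum is known.

def reward(theta):
    # Hypothetical smooth reward surface with a single optimum at theta = 2.0
    return -(theta - 2.0) ** 2

def reward_gradient(theta):
    # Analytic gradient of the hypothetical reward above
    return -2.0 * (theta - 2.0)

theta = 0.0           # initial weight
learning_rate = 1e-1  # step size controlling the speed of convergence

for step in range(100):
    # Gradient ASCENT: move in the direction that increases the reward
    theta += learning_rate * reward_gradient(theta)

print(theta)  # approaches 2.0, the optimum of the hypothetical reward

With a larger learning rate the updates would overshoot and oscillate around the optimum, which is the behaviour the small-step requirement above is meant to avoid.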

Vanilla Policy Gradient
The general goal of policy gradient methods is to create a policy, or strategy on which the agent can rely to make decisions in a virtual environment, that maximizes the possible reward Sutton and Barto, 2015, Williams. Policy gradient methods differ from other reinforcement learning methods, such as value-iteration and actor-critic methods, in that they directly optimize the policy rather than optimizing a value estimate for each action and state (as in Q-learning) Li et al., 2017. As a result, they tend to have better convergence properties and can work in environments with continuous action spaces (infinitely many available actions at each step), such as the one used in this experiment; however, they are also computationally intensive and often have high variance. Nevertheless, policy gradients have become the state-of-the-art algorithms for locomotion tasks Silver et al., 2017. More specifically, the underlying principle behind the vanilla policy gradient (VPG) method is to maximize the expected future discounted reward in the environment by performing gradient ascent on the policy directly to reach the optimal weights Sutton and Barto, 2015. Policy gradients of this form do not receive the reward at each timestep; rather, they use the total reward at the end of the episode to optimize the policy. Although this may seem inefficient, since certain actions taken in an episode may have contributed to the reward more than others, this disparity has little effect in the end: through exploration of actions, the policy eventually learns which actions give a better reward.
The policies that are iterated through are formally defined by $\Pi = \{\pi_\theta : \theta \in \mathbb{R}^m\}$, a set of policies $\Pi$ containing policies $\pi_\theta$ parametrized by weights $\theta$. Since ANNs are typically used to represent policies, each policy $\pi_\theta$ can be thought of as a neural network parametrized by weights $\theta$ that outputs an action at each state. The value of each policy is

J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t} \gamma^t r(t)\right],

and the goal of the VPG method is to create a policy that maximizes this value. Here $t$ represents the timestep, $\gamma$ represents the discount factor, and $r(t)$ represents the reward given at each step Li et al., 2017. A discount factor is used to lower the weight that the algorithm gives to future rewards (in future states).
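As a concrete illustration of the value defined above, the following minimal sketch computes the discounted return of a single episode from a list of per-step rewards; the reward values are arbitrary examples, not data from the experiment.

rewards = [1.0, 0.5, 2.0, 0.0, 1.5]   # arbitrary per-step rewards from one episode
gamma = 0.99                          # discount factor

# J = sum over timesteps of gamma^t * r(t)
discounted_return = sum((gamma ** t) * r for t, r in enumerate(rewards))
print(discounted_return)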
The goal of the algorithm is to reach a set of parameters $\theta^*$ such that $\theta^* = \arg\max_\theta J(\theta)$ by performing gradient ascent on the policy directly to reach the optimal weights Sutton and Barto, 2015. The basic process that VPGs follow can be represented as pseudocode, shown below:

Initialize parameters θ
For iteration 1, 2, 3, ... do:
- Using policy π_θ, interact with the environment by taking the actions output by the policy until the episode ends
- At the end of the episode, obtain the total reward for that episode from the environment
- Update the policy π_θ by using gradient ascent on parameters θ
End for
The gradient estimator used to update the weights is

\hat{g} = \mathbb{E}_{\tau \sim \pi_\theta}\left[\left(\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right) r(\tau)\right],

where $\theta$ represents the weights that parametrize policy $\pi_\theta$, $t$ represents the timestep, $a_t$ and $s_t$ represent the action and state taken at timestep $t$, respectively, and $r(\tau)$ represents the total reward at the end of the episode (with $\tau$ a trajectory of states and actions from one episode). Again, the policy gradient is only given the cumulative reward at the end of the episode, rather than individual rewards per timestep. The purpose of this algorithm is to update the weights $\theta$ by increasing or decreasing the probabilities of actions (represented by $\pi_\theta(a_t \mid s_t)$). Given a high episodic reward, the algorithm assumes that all the actions taken in that episode were good actions and pushes up their probabilities, and vice versa given a low episodic reward Li et al., 2017. Again, while this may seem simplistic, the method works because the policy, by trying out different actions, eventually learns which actions are good and which are not.
However, determining which rewards are better than others is also a challenge: the gap between the reward of a bad episode and that of a good episode can be very small, and consequently the updates will also be small, even though the actions taken in the good episode should receive a higher weight. To solve this problem, a baseline function is used to compare rewards against, determining which episodes are better and which are worse Li et al., 2017.
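The following is a minimal sketch of the update described above for a linear Gaussian policy with a fixed standard deviation, using a running average of episodic rewards as the baseline; the environment interaction is replaced by a random stub, and all dimensions and constants are illustrative assumptions rather than the project's actual settings.

import numpy as np

STATE_DIM, ACTION_DIM = 17, 6
SIGMA = 0.5           # fixed action standard deviation (assumption)
LEARNING_RATE = 1e-3

theta = np.zeros((STATE_DIM, ACTION_DIM))   # policy weights: mean action = theta.T @ state

def run_episode(theta, horizon=400):
    """Stub environment loop: returns the summed log-probability gradient and the total reward."""
    rng = np.random.default_rng()
    grad_sum = np.zeros_like(theta)
    total_reward = 0.0
    for _ in range(horizon):
        state = rng.standard_normal(STATE_DIM)            # placeholder observation
        mean = theta.T @ state
        action = mean + SIGMA * rng.standard_normal(ACTION_DIM)
        # gradient of log N(action; mean, sigma^2) with respect to theta
        grad_sum += np.outer(state, action - mean) / SIGMA ** 2
        total_reward += rng.standard_normal()              # placeholder reward
    return grad_sum, total_reward

baseline = 0.0
for episode in range(10):
    grad_log_pi, episode_reward = run_episode(theta)
    # push action probabilities up or down in proportion to (reward - baseline)
    theta += LEARNING_RATE * (episode_reward - baseline) * grad_log_pi
    baseline = 0.9 * baseline + 0.1 * episode_reward        # running-average baseline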

Trust-Region Policy Optimization
The trust-region policy optimization algorithm (TRPO) builds on the VPG algorithm by using the Kullback-Leibler (KL) divergence to constrain each optimization step to a "trusted region" around the original policy Schulman et al., 2015. This constrained optimization step "guarantees a monotonic improvement" to the policy, essentially making the ascent to convergence more controlled Schulman et al., 2015. Below is the derivation for the TRPO algorithm Lu, 2017. An MDP is defined as a tuple $(S, A, P_{sa}, \gamma, r, p_0)$, where:
- $S$ is a finite set of $N$ states
- $A$ is a set of $k$ actions, $a_1, a_2, \ldots, a_k$
- $P_{sa}(s')$ is the probability of landing at state $s'$ upon taking action $a$ at state $s$
- $\gamma \in [0, 1)$ is the discount factor
- $r : S \to \mathbb{R}$ is the reward function (defined by the environment)
- $p_0 : S \to \mathbb{R}$ is the initial state distribution
- $p_\pi : S \to \mathbb{R}$ is the discounted visitation frequency (the discounted probability of landing at each state)
The expected discounted cumulative reward of policy $\pi$ is

\eta(\pi) = \mathbb{E}_{s_0, a_0, \ldots}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t)\right],

where $s_0 \sim p_0(s_0)$, $a_t \sim \pi(a_t \mid s_t)$, and $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$.
The action-value function

Q_\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \ldots}\left[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l})\right]

expresses the expected value of taking action $a_t$ at state $s_t$ and then following the policy $\pi$ afterwards.
The value function

V_\pi(s_t) = \mathbb{E}_{a_t, s_{t+1}, \ldots}\left[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l})\right]

expresses the expected value of following the policy $\pi$ from state $s_t$ onwards.
The advantage function

A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s)

expresses the "advantage" of taking action $a$ over following the policy $\pi$ at state $s$. As with the VPG, the TRPO does not have an advantage value for each timestep, instead using a cumulative advantage estimated from the whole episode.
The expected reward of a new policy can be written in terms of an old policy as

\eta(\pi) = \eta(\pi_0) + \sum_s p_\pi(s) \sum_a \pi(a \mid s) A_{\pi_0}(s, a),

where $\pi$ is the new policy and $\pi_0$ is the old policy. However, the new policy $\pi$ is not yet known, so $p_\pi$ cannot be computed. The TRPO algorithm instead uses $p_{\pi_0}$ as an approximation of $p_\pi$. Hence, the objective function becomes

L_{\pi_0}(\pi) = \eta(\pi_0) + \sum_s p_{\pi_0}(s) \sum_a \pi(a \mid s) A_{\pi_0}(s, a).

Schulman et al. (2015) then used KL divergence to create a surrogate objective by penalizing this loss function Schulman et al., 2015. KL divergence, which measures the distance between two probability distributions, here measures the distance between the old policy and the new policy, and the loss function is penalized by that value.
Therefore, the base TRPO optimization problem is

\max_\pi \left[ L_{\pi_0}(\pi) - C \cdot D_{KL}^{\max}(\pi_0, \pi) \right].

However, in practice, TRPO does not use the penalty term or the penalty coefficient $C$, as the resulting step sizes would be very small. Instead, a hard limit on the KL divergence is used Schulman et al., 2015:

\max_\pi \; L_{\pi_0}(\pi) \quad \text{subject to} \quad \bar{D}_{KL}(\pi_0, \pi) \le \delta.
The constrained problem can then be optimized using the conjugate gradient method Schulman et al., 2015. Because of this constrained optimization step, the TRPO algorithm provides a steady, consistent update to the policy. It is fairly computationally expensive, but further implementations of the base TRPO algorithm have been created to simplify the optimization steps Lu, 2017. TRPO uses natural gradient ascent to update the ANN, as it is built directly into the TRPO algorithm; consequently, TRPO is not compatible with other optimizers, such as Adam. Natural gradient ascent uses KL divergence to constrain the optimization step of an ANN Grosse.
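To illustrate the hard KL limit described above, the sketch below shrinks a proposed policy update until the KL divergence between the old and new diagonal Gaussian action distributions falls below a threshold δ; the step direction, standard deviation, and δ are placeholder values, and this is not the project's actual TRPO implementation (which computes the step with the conjugate gradient method).

import numpy as np

def gaussian_kl(mean_old, mean_new, sigma):
    """KL divergence between two diagonal Gaussians that share a fixed std."""
    return float(np.sum((mean_new - mean_old) ** 2) / (2.0 * sigma ** 2))

DELTA = 0.01                   # KL trust-region size (assumption)
sigma = 0.5                    # fixed action std (assumption)
mean_old = np.zeros(6)         # old policy's mean action for some state
full_step = np.full(6, 0.8)    # proposed (deliberately too large) update to the mean

# Backtracking: halve the step until the new policy stays inside the trusted region.
step = full_step.copy()
while gaussian_kl(mean_old, mean_old + step, sigma) > DELTA:
    step *= 0.5

mean_new = mean_old + step
print(gaussian_kl(mean_old, mean_new, sigma))  # <= DELTA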

Proximal Policy Optimization
The Proximal Policy Optimization (PPO) algorithm builds on the base TRPO algorithm, instead using first-order optimization methods to simplify the computation Schulman et al., 2017. PPO does not use KL divergence; rather, it clips the ratio of the new and old policies to a certain range, takes the minimum of that clipped ratio and the original ratio, and multiplies that minimum by the advantage estimate. Eliminating KL divergence from the surrogate objective function makes the PPO algorithm much simpler to implement.
The surrogate objective function for the PPO algorithm is

L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t, \; \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right],

given parameters $\theta$, the advantage function estimator $\hat{A}_t$, the policy ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{old}}(a_t \mid s_t)$ (the probability ratio between the new and old policy), and a hyperparameter $\epsilon$ used to clip the probability ratios Schulman et al., 2017. Clipping $r_t(\theta)$ to $[1 - \epsilon, 1 + \epsilon]$ prevents the ratio between the old policy and the new policy from growing too large, which keeps the improvement controlled and consistent. Furthermore, Schulman et al.'s tests determined that the best value for $\epsilon$ was 0.2 Schulman et al., 2017.
Intuitively, PPO uses a simpler calculation to create a lower bound on which the policy can be optimized, similar to a minorize-maximization (MM) algorithm Lange, 2007. This can be thought of as a "soft limit" as opposed to the hard limit of TRPO.
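The clipped surrogate described above can be written in a few lines; the sketch below computes it for a small batch of (probability ratio, advantage estimate) pairs with ε = 0.2, using arbitrary example numbers rather than values from the experiment.

import numpy as np

def ppo_clipped_surrogate(ratios, advantages, epsilon=0.2):
    """L^CLIP: the mean of the minimum of the unclipped and clipped ratio-times-advantage terms."""
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return np.mean(np.minimum(unclipped, clipped))

ratios = np.array([0.7, 1.0, 1.5])       # pi_new / pi_old for three sampled actions
advantages = np.array([1.0, -0.5, 2.0])  # advantage estimates for the same actions
print(ppo_clipped_surrogate(ratios, advantages))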

Experimental Design
The project tested the effectiveness of the three algorithms on a simulated walking environment in which the computer had control over 6 joints and received 17 observations describing its state, giving a 17x1 input vector and a 6x1 output vector (see Figure 1). The goal was to move forward as far as possible, and rewards were based on the change in position as well as other metrics (see below for the reward calculation); a high reward indicates good performance. Six learning rates, 0.1 (1e-1), 0.01 (1e-2), 0.001 (1e-3), 0.0003 (3e-4), 0.0001 (1e-4), and 0.00001 (1e-5), were tested on all of the agents to determine the best one. The computer ran each algorithm for 40,000 episodes, or 16,000,000 timesteps, since each episode was 400 timesteps long.
Other variables, including the network size and type, activation function, number of training episodes, and hardware, were kept constant between the algorithms. However, the Adam optimizer was used for the VPG and PPO algorithms but not for the TRPO algorithm, because TRPO already has a built-in optimization technique that limits the search to a certain region, producing a similar effect to Adam (see the Trust-Region Policy Optimization section above). A two-layer ANN with tanh activations, consisting of 2 fully-connected layers with 32 nodes each, was used; a minimal sketch of this network shape follows.
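For reference, the sketch below reproduces the shape of the network described above (17-dimensional input, two fully-connected layers of 32 tanh units, 6-dimensional output) as a plain forward pass; the weight initialization is arbitrary, and this is not the TensorForce network specification actually used in the project.

import numpy as np

rng = np.random.default_rng(0)

# Weight matrices and biases for a 17 -> 32 -> 32 -> 6 fully-connected network
W1, b1 = rng.standard_normal((17, 32)) * 0.1, np.zeros(32)
W2, b2 = rng.standard_normal((32, 32)) * 0.1, np.zeros(32)
W3, b3 = rng.standard_normal((32, 6)) * 0.1, np.zeros(6)

def policy_forward(observation):
    """Map a 17-dimensional observation to a 6-dimensional action (one value per joint)."""
    h1 = np.tanh(observation @ W1 + b1)
    h2 = np.tanh(h1 @ W2 + b2)
    return h2 @ W3 + b3            # linear output layer

action = policy_forward(np.zeros(17))
print(action.shape)                # (6,)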
The effectiveness of the algorithms was judged on the following criteria:
- The 100-episode average reward after training was used to judge the performance of an algorithm for a given learning rate. A higher reward meant that the agent performed better, and vice versa.
- The algorithm had to show consistent improvement across episodes for its results to be considered. This ensured that the agent was not just randomly attaining a certain result.
- Even if an algorithm did not attain the highest reward for a learning rate, it could still receive the highest rank for that learning rate if its learning curve had a higher slope, or more "momentum", near the end of training. This was determined from the "learning curve", the progression of per-episode rewards over episodes, for each algorithm (all learning curves are in the appendix).
A minimal sketch of how these two measurements can be computed from a learning curve is shown below.
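The sketch takes a list of per-episode rewards (here a synthetic curve, not data from the experiment) and reports the 100-episode average reward after training together with the slope of the final portion of the curve as a rough "momentum" measure.

import numpy as np

# Synthetic learning curve: per-episode rewards over 40,000 episodes
episodes = np.arange(40_000)
rewards = 0.05 * episodes ** 0.7 + np.random.default_rng(0).normal(0, 5, size=episodes.size)

# Criterion 1: average reward over the final 100 episodes
final_average = rewards[-100:].mean()

# Criterion 2: slope of the learning curve near the end of training ("momentum")
tail = slice(-5_000, None)
slope, _ = np.polyfit(episodes[tail], rewards[tail], deg=1)

print(final_average, slope)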
The program itself used TensorForce, a reinforcement learning library built on top of TensorFlow (a common machine learning library). The OpenAI Gym environment (an open-source platform for creating, evaluating, and benchmarking agents in game environments) provided the walking simulation and was used to render the agent after the completion of the learning phase Brockman, 2016. Matplotlib and NumPy were used to process the data and graph the results.
In addition, the final code modified the existing TensorForce examples code base, found on GitHub. However, the code used in this project bears little resemblance to the original due to the extensive modifications made to suit the goals of the experiment. Functions were added to render and record the models every 1,000 episodes and to save the model every 100 episodes, which helped in evaluating and comparing the performance of the models.
In the game environment itself, 17 state observations were given to the agent; these included information about the position of the agent relative to the center, its balance, and other metrics. There were 6 available actions at each state, one for each joint. Rewards were calculated based on the agent's distance from the starting point (see below for the reward calculation).
The reward calculation (given by the environment) is a function of α, the starting position of the simulated walker, β, the ending position, A, the set of actions taken, and T, the total number of timesteps in that episode.

Table 1 shows all of the results from the experiment. For reasons detailed in the Analysis section (Section 4), the results from learning rates 1e-1, 1e-2, and 1e-5 were dropped. Furthermore, TRPO tended to perform well across many learning rates, while PPO performed very well for a select range of learning rates and VPG performed poorly for most learning rates. Table 2 shows the modified results in numerical form, while Figure 2 shows the results as a line graph. The six graphs below (Figures 3-8) show the learning curves for the algorithms at each learning rate.

Analysis and Discussion
The hypothesis was partially supported, as the Proximal Policy Optimization (PPO) algorithm outperformed the Trust-Region Policy Optimization (TRPO) for two of the three learning rates examined and consistently had higher momentum in the learning curves. VPG performed significantly worse than TRPO and PPO for all three learning rates.

Learning Rate 1e-1
None of the three agents learned with learning rate 1e-1; their rewards oscillated randomly (refer to Figure 3). Hence the results from this learning rate were disregarded.

Learning Rate 1e-2
With a learning rate of 1e-2, VPG appeared to attain a better result, but its rewards fluctuated randomly and did not show consistent improvement (refer to Figure 4). Both PPO and TRPO had flat learning curves showing no consistent improvement in the rewards. Hence the results from this learning rate were disregarded.

Learning Rate 1e-3
With the learning rate of 1e-3, all three agents performed well (refer to Figure 5). In fact, this learning rate provided the highest reward for all three agents across all learning rates and was hence chosen as the best learning rate. PPO outperformed TRPO and VPG with the highest reward.

Learning Rate 3e-4
With this learning rate, PPO outperformed TRPO and VPG (refer to Figure 6). Even though all three algorithms showed good performance, the rewards were still lower than with 1e-3. PPO also had the best momentum for this learning rate.

Learning Rate 1e-4
With the learning rate of 1e-4, VPG did not perform well, as can be seen from its fairly flat reward curve (refer to Figure 7). TRPO outperformed PPO by a relatively small margin. However, in the learning curve graph, PPO had a much higher slope as training neared completion (around 30,000 episodes). This indicated that, although PPO attained a smaller reward, it had higher learning momentum and therefore performed the best for learning rate 1e-4.

Learning Rate 1e-5
With the learning rate of 1e-5, both PPO and VPG did poorly, as can be seen from the flat growth of their rewards over episodes (refer to Figure 8). Even though TRPO showed a fairly good learning curve, its reward at 40,000 episodes was almost half that attained with other learning rates, such as 1e-3 and 3e-4. For this reason the results from this learning rate were disregarded.

Final Rankings
Therefore, PPO was determined to be the most effective algorithm for this task, as it not only outperformed TRPO for two of the three learning rates and VPG for all three, but also consistently showed higher learning momentum throughout the learning process. TRPO was given the second-highest rank, as it consistently obtained better rewards and a much better learning curve than VPG. Learning rate 1e-3 was chosen as the best learning rate, as all the algorithms performed best with it.

Conclusion
The purpose of this project was to determine which reinforcement learning algorithm would perform best at learning to walk, a simple locomotion task, in a simulated environment. The hypothesis, based on previous studies, stated that the Proximal Policy Optimization algorithm would perform the best, followed by the Vanilla Policy Gradient and the Trust-Region Policy Optimization. Algorithms were graded against a set of criteria, which included the 100-episode average reward after training, the speed of the learning process, and the consistency of improvement across episodes, among others.
The hypothesis was partially supported, as the PPO algorithm outperformed TRPO and VPG, in that order. The results from learning rates 1e-1, 1e-2, and 1e-5 were disregarded because they failed to meet the criteria. All algorithms performed best with learning rate 1e-3, and with that learning rate, PPO outperformed TRPO and VPG, in that order. With learning rates 1e-3 and 3e-4, PPO outperformed TRPO, and even though it performed marginally worse with learning rate 1e-4, it consistently showed higher momentum throughout the learning process. VPG performed significantly worse than PPO and TRPO for all three learning rates examined.