Mirror Descent Search and Acceleration

In recent years, attention has been focused on the relationship between black box optimization and reinforcement learning. Black box optimization is a framework for the problem of finding the input that optimizes the output represented by an unknown function. Reinforcement learning, by contrast, is a framework for finding a policy to optimize the expected cumulative reward through trial and error. In this research, we propose a reinforcement learning algorithm based on the mirror descent method, which is a general optimization algorithm. The proposed method is called Mirror Descent Search. The contribution of this research is roughly twofold. First, we show that extension methods for mirror descent can be applied to reinforcement learning. Second, we clarify the relationship between existing reinforcement learning algorithms. Based on these, we propose Mirror Descent Search and its derivative methods. The experimental results show that learning with the proposed method progresses faster.


Introduction
In recent years, as stated in [1], attention has focused on the relationship between black box optimization and reinforcement learning. Black box optimization is a framework for the problem of finding the input $x^* \in X$ that optimizes the output of an unknown function $f(x): X \to \mathbb{R}$. Because the objective function is unknown, we solve the black box optimization problem without gradient information. Reinforcement learning, by contrast, is a framework for finding a policy to optimize the expected cumulative reward through trial and error. Consequently, a solution to the black box optimization problem can be used as a solution for reinforcement learning.
In this research, we propose a reinforcement learning algorithm based on the mirror descent (MD) method [2]. MD is a general optimization algorithm that employs a Bregman divergence as an alternative to the Euclidean distance, which is the metric of gradient descent. The derivation is detailed in Section 2. We call our proposed method Mirror Descent Search (MDS). In addition, MDS is expected to generalize some existing reinforcement learning algorithms. This research shows (1) that extension methods for MD can be applied to reinforcement learning, and (2) that the relationship between existing reinforcement learning algorithms can be clarified.

Related works
In this section, we describe previous research and its relation to the research in this paper.
Relative Entropy Policy Search (REPS) [? ] and its derived methods focus on the information loss incurred during policy search. This information loss is the relative entropy between the distribution of the observed data and the new policy, and it is bounded by an upper limit value. This is equivalent to defining an upper limit on the Kullback-Leibler (KL) divergence between the two distributions. Episode-based REPS [4] is a derived method formalized by considering the upper-level policy. Although the equations for episode-based REPS and the proposed method are similar, our method can naturally consider distance metrics other than the KL divergence. Consequently, we can apply extension methods developed for MD.
In [1,5], the authors focus on the relationship between reinforcement learning and black box optimization. Specifically, [1] reviews the history of black box optimization and reinforcement learning, and proposes PI^BB, a variant of Policy Improvement with Path Integrals (PI^2) [6,7]. PI^BB is considered a black box optimization method, derived on the basis of the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [8], a black box optimization algorithm. Through a comparison with PI^2, the authors discuss the connection between reinforcement learning and black box optimization. We further discuss the connection between PI^2 and the proposed method in the Appendix, below.
Previous studies have proposed solutions to reinforcement learning based on MD. Indeed, [9] is strongly associated with our research insofar as the authors propose a method based on MD. However, the details are different. We adopt the exponentiated gradient method (EG), which uses the KL divergence as the regularization term. By contrast, [9] argues that using the Minkowski distance, which includes the Euclidean distance, is preferable to the KL divergence, because it offers flexibility when updating the gradient.

Derivation of proposed algorithm: MDS and G-MDS

MDS
A reinforcement learning algorithm aims to obtain an optimal policy that maximizes reward (i.e., minimizes cost). Consider the problem of minimizing the objective function $J(\theta)$. Rather than dealing with the policy parameters $\theta \in \Theta$ directly, we consider a probability distribution $p(\theta)$ over them. Therefore, we search the following domain:
$$\min_{p \in \mathcal{P}} \; \mathbb{E}_{p(\theta)}[J(\theta)],$$
where $\mathcal{P}$ is the probability simplex. The decision variable is $p(\theta)$, and the objective function is the expectation of the cost $J(\theta)$.
Therefore, the optimal generative probability is
$$p^*(\theta) = \operatorname*{argmin}_{p \in \mathcal{P}} \; \mathbb{E}_{p(\theta)}[J(\theta)].$$
Next, we consider obtaining the optimal policy by updating $p(\theta)$. As a means of updating $p(\theta)$, we use MD, given as follows:
$$\beta_{t+1} = \operatorname*{argmin}_{\beta \in \mathcal{P}} \left\{ \eta \langle g_t, \beta \rangle + B_\phi(\beta, \beta_t) \right\}. \quad (4)$$
The parameter $\beta_t$ in (4) is the probability distribution $p_t(\theta)$ of the policy parameter $\theta$ at update step $t$. Thus,
$$p_t(\theta) := \beta_t. \quad (5)$$
Substituting the above equation into Eq. (4), we obtain the following:
$$p_{t+1}(\theta) = \operatorname*{argmin}_{p \in \mathcal{P}} \left\{ \eta \langle g_t, p \rangle + B_\phi(p, p_t(\theta)) \right\}, \quad (6)$$
where $B_\phi$ is the Bregman divergence, which has an arbitrary smooth convex function $\phi$ and is defined as
$$B_\phi(x, y) = \phi(x) - \phi(y) - \langle \nabla \phi(y), x - y \rangle.$$
The domain of the decision variable is the simplex $\mathcal{P}$. We can select the Bregman divergence as the KL divergence by choosing $\phi(x_t) = \sum_{j=1}^{N} x_{t,j} \log x_{t,j}$, but we can also use the Euclidean distance on the simplex [10]. Moreover, we can select a different Bregman divergence, as discussed in [10,11]. Note that $g_t$ in Eq. (6) is the gradient of the objective function, $\nabla_{p(\theta)} J$. We derive it as follows: since $J = \mathbb{E}_{p(\theta)}[J(\theta)] = \sum_i p(\theta_i) J(\theta_i)$, we have $\partial J / \partial p(\theta_i) = J(\theta_i)$. That is, $\nabla_{p(\theta)} J$ is a value obtained without using derivatives of $J$. From the above, $p_t(\theta)$ can be updated using Eq. (6), and learning can proceed.
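The role of $\phi$ can be checked numerically: with the negative entropy as the mirror map, the general Bregman definition above collapses to the KL divergence on the simplex. A minimal sketch (function names are illustrative):

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """Bregman divergence B_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    return phi(x) - phi(y) - np.dot(grad_phi(y), x - y)

# With phi(x) = sum_j x_j log x_j, the Bregman divergence reduces to
# the KL divergence between points of the probability simplex.
phi = lambda x: np.sum(x * np.log(x))
grad_phi = lambda x: np.log(x) + 1.0

x = np.array([0.2, 0.5, 0.3])
y = np.array([1.0 / 3.0] * 3)
kl = np.sum(x * np.log(x / y))
print(np.isclose(bregman(phi, grad_phi, x, y), kl))  # prints: True
```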
In a typical reinforcement learning problem, we employ the expected cumulative reward, derived with Eq. (8), as the objective function $J(\theta)$:
$$J(\theta_j) = \mathbb{E}_{\tau}[r_\tau] = \int_{\mathcal{T}} p(\tau_{\theta_j}) \, r_\tau \, d\tau,$$
where $\tau_{\theta_j} \in \mathcal{T}$ is the trajectory generated from the policy parameter $\theta_j$, $p(\tau_{\theta_j})$ is the generating probability of the trajectory $\tau_{\theta_j}$, and $r_\tau$ is the reward along the trajectory. We can approximate this using a Monte Carlo integral, as follows:
$$J(\theta_j) \simeq \frac{1}{N} \sum_{n=1}^{N} r_{\tau_{\theta_j}^{(n)}}.$$
To solve the problem this way, we must generate $N$ trajectories for each of the $M$ policy parameters, and make $M \times N$ attempts for a single update. Here, we instead use the concept of online learning.
Considering this as online learning, the gradient of the objective function is derived as follows:
$$\nabla_{p(\theta)} J \simeq r_{\tau_{\theta_i}},$$
where $r_{\tau_{\theta_i}}$ is the vector of cumulative rewards before the expected value is calculated. Thus, $\nabla_{p(\theta)} J \simeq r_{\tau_{\theta_i}}$ can be used as the MD gradient $g_t$. Because the derived algorithm is a policy search based on MD, we call it MDS.
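The resulting MDS loop is short: the per-parameter rollout costs serve directly as $g_t$, and the KL-divergence MD step becomes a multiplicative (exponentiated-gradient) update. A toy sketch over a fixed discrete set of policy parameters (the cost function, step size, and iteration count are illustrative):

```python
import numpy as np

def mds(thetas, rollout_cost, eta=0.1, iters=100):
    """Mirror Descent Search over a discrete set of policy parameters.
    The gradient g_t is simply the vector of rollout costs; no
    derivative of J is ever computed."""
    p = np.full(len(thetas), 1.0 / len(thetas))  # uniform p_0(theta)
    for _ in range(iters):
        g = np.array([rollout_cost(th) for th in thetas])  # g_t ~ r_tau
        w = p * np.exp(-eta * g)   # exponentiated-gradient MD step
        p = w / w.sum()            # project back onto the simplex
    return p

# Toy problem: the "rollout cost" of a scalar parameter is its
# squared distance to 0.7; probability mass concentrates there.
thetas = np.linspace(0.0, 1.0, 11)
p = mds(thetas, lambda th: (th - 0.7) ** 2)
print(p.argmax())  # prints: 7  (the index of theta = 0.7)
```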

G-MDS
For the experiment, we considered the case where the Bregman divergence $B_\phi$ in Eq. (6) is the KL divergence; that is, $\phi(x_t) = \sum_{j=1}^{N} x_{t,j} \log x_{t,j}$ with $x \in \mathbb{R}^N$, $x_{t,j} > 0$. Then, Eq. (6) can be rewritten as follows:
$$p_{t+1}(\theta_i) = \frac{p_t(\theta_i) \exp(-\eta g_{t,i})}{\sum_{j} p_t(\theta_j) \exp(-\eta g_{t,j})}.$$
In this paper, we treat $p_t(\theta_i)$ as a Gaussian distribution with mean $\mu_{t-1}$ and variance $\Sigma^\epsilon_{t-1,i}$, from which $\theta_i$ is generated. Here, we consider the mean $\mu_t$ of the Gaussian distribution. From Eq. (12), $\mu_t$ is calculated by the following procedure:
1. Sample from the continuous distribution $p_t(\theta)$;
2. Calculate the discrete distributions $p^x_{t-1}(\theta)$ and $p^z_{t-1}(\theta)$ from the continuous distribution $p_t(\theta)$, using the obtained samples;
3. Evaluate the objective value for each obtained sample;
4. Calculate the discrete distributions $p^x_t(\theta)$ and $p^z_t(\theta)$ based on Eqs. (19) and (20);
5. Fit a continuous distribution (e.g., a Gaussian) to the discrete distributions $p^x_t(\theta)$ and $p^z_t(\theta)$;
6. Calculate the continuous distribution $p_t(\theta)$ for the next sampling.
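For a one-dimensional parameter, one iteration of this sample-weight-refit cycle can be sketched as follows. This is a simplified illustration, not the paper's exact procedure: the function names, the fixed search variance, and the toy cost are all assumptions, and only the mean of the Gaussian is refit.

```python
import numpy as np

def g_mds_update(mu, sigma, cost_fn, n_samples=50, eta=1.0, rng=None):
    """One simplified G-MDS-style iteration: sample from the Gaussian
    p_t(theta), weight each sample by exp(-eta * cost) (the KL-divergence
    MD step on the induced discrete distribution), then refit the
    Gaussian mean to the weighted samples."""
    rng = rng or np.random.default_rng(0)
    thetas = mu + sigma * rng.standard_normal(n_samples)  # sample p_t(theta)
    costs = np.array([cost_fn(th) for th in thetas])      # evaluate objective
    w = np.exp(-eta * (costs - costs.min()))              # multiplicative weights
    w /= w.sum()                                          # discrete p_{t+1}(theta)
    return np.sum(w * thetas)                             # refit Gaussian mean

mu = 0.0
for _ in range(30):
    mu = g_mds_update(mu, sigma=0.5, cost_fn=lambda th: (th - 2.0) ** 2)
print(mu)  # drifts toward the minimizer theta* = 2.0
```

Subtracting `costs.min()` before exponentiating does not change the normalized weights but avoids numerical underflow, a standard trick in reward-weighted updates.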

G-AMDS
We derive the same procedure as for G-MDS in the case of the KL divergence. Let the Bregman divergence $B_\phi$ in Eq. (19) be the KL divergence, and let the regularization term $R = B_\omega$ in Eq. (20) be a Bregman divergence with function $\omega$. Accordingly, this method is referred to as G-AMDS. The result cannot be calculated analytically; however, an efficient numerical calculation is known to be available.
Finally, we approximate the distributions $p^x(\theta)$ and $p^z(\theta)$ with Gaussian distributions.
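The two-sequence structure behind the accelerated update can be sketched on the simplex. This is a simplified discrete accelerated-mirror-descent scheme in the spirit of the construction cited above, not the paper's exact G-AMDS update: the coupling weights, step sizes, and the entropy mirror map for the $z$ sequence are illustrative choices.

```python
import numpy as np

def amd_simplex(grad, n, iters=200, s=0.1, r=3.0):
    """Simplified accelerated mirror descent on the probability simplex:
    a mirror descent sequence z (KL mirror step with a growing step
    size) is coupled with an averaged sequence x."""
    x = np.full(n, 1.0 / n)
    z = np.full(n, 1.0 / n)
    for k in range(1, iters + 1):
        lam = r / (r + k)                  # coupling weight lambda_k
        x_tilde = lam * z + (1 - lam) * x  # query point for the gradient
        g = grad(x_tilde)
        w = z * np.exp(-(k * s / r) * g)   # KL mirror step on z
        z = w / w.sum()
        x = lam * z + (1 - lam) * x        # averaging step on x
    return x

# Minimizing a linear cost <c, x> over the simplex: the iterates
# concentrate on the lowest-cost coordinate.
c = np.array([3.0, 2.0, 1.0])
x = amd_simplex(lambda x_: c, n=3)
print(x.argmax())  # prints: 2
```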

2DOF Via-point task
We performed a 2-DOF via-point task to evaluate the proposed method. The agent is represented as a point on the x-y plane and learns to pass through the point (0.5, 0.2) at 250 ms. Before learning, an initial trajectory from (0, 0) to (1, 1) is generated. The reward function evaluates how closely the agent passes through the via-point at 250 ms. DMPs [12] are used for the parameterization of the policy, and the agent seeks a policy for each of the x-axis and the y-axis. The parameter settings are as follows: 1000 updates, 15 rollouts, and 10 basis functions.
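A cost of this shape can be written down directly. The following is an illustrative via-point cost, not the paper's exact reward function: the quadratic form, the 1 ms sampling step, and the function name are assumptions.

```python
import numpy as np

def via_point_cost(trajectory, dt=0.001, via=(0.5, 0.2), t_via=0.250):
    """Squared distance between the 2-DOF trajectory (an array of
    (x, y) points sampled every dt seconds) and the via-point at
    t = 250 ms; zero iff the agent passes exactly through it."""
    idx = int(round(t_via / dt))
    x, y = trajectory[idx]
    return (x - via[0]) ** 2 + (y - via[1]) ** 2

# The initial straight line from (0, 0) to (1, 1) over 1 s misses the
# via-point, so it incurs a positive cost.
t = np.linspace(0.0, 1.0, 1001)
line = np.stack([t, t], axis=1)   # (x(t), y(t)) = (t, t)
print(via_point_cost(line) > 0)   # prints: True
```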

Experimental Results
In this section, we describe the experimental results. We summarize the results for G-MDS and G-AMDS in Figure 2. In the figure, the thin lines represent one standard deviation. Table 1 shows the average and the variance of the cost at convergence. The variance $\Sigma^\epsilon$ of the search noise was set to 1.0. From the above, we confirm that G-AMDS learns at a faster rate than G-MDS. Therefore, it is effective to apply the proposed extension of MD to reinforcement learning.

Conclusions
In this study, we proposed MDS. We explained the theoretical derivations of MDS, G-MDS, AMDS, and G-AMDS. According to the experimental results, learning progressed faster with the proposed G-AMDS. Moreover, because AMD is a generalization of Nesterov's acceleration method, we expect the acceleration to be effective for objective functions with saddle points.

Figure 1: AMD as RL procedures

Figure 2: Cost with G-MDS and G-AMDS

Table 1: Convergence cost of G-MDS and G-AMDS

Table A.2: PI^2 from G-MDS

Prerequisites:
- Time steps $t_0, \ldots, t_N$
- Semi-definite matrix $R$
- Immediate cost function $r_t = q_t + \theta_t^T R \theta_t$
- Terminal cost term $\phi_{t_N}$
- Probabilistic policy parameters $a_t = g_t^T (\theta + \epsilon)$
- Basis function of the system dynamics $g_{t_i}$