Distributional Reinforcement Learning with Ensembles

It is well-known that ensemble methods often provide enhanced performance in reinforcement learning. In this paper we explore this concept further by using group-aided training within the distributional reinforcement learning paradigm. Specifically, we propose an extension to categorical reinforcement learning, where distributional learning targets are implicitly based on the total information gathered by an ensemble. We empirically show that this may lead to much more robust initial learning, a stronger individual performance level and good efficiency on a per-sample basis.


Introduction
The fact that ensemble methods may outperform single-agent algorithms in reinforcement learning has been demonstrated numerous times [Singh, 1992, Sun and Peterson, 1999, Wiering and Van Hasselt, 2008, Faußer and Schwenker, 2015]. These methods can involve combining several algorithms into one agent and then taking actions by a weighted aggregation scheme or rank voting. However, most conventional ensemble methods in reinforcement learning are based on expected returns. Perhaps the simplest example is the average joint policy $\bar{\pi}$ derived from an ensemble of $k$ independently trained agents. That is, if we let $Q^{(i)}(x, a)$ denote an estimate of the expected future discounted return of agent $i$ taking action $a$ at state $x$, then $\bar{\pi}$ selects the action $a^*$ at state $x$ by
$$a^* := \arg\max_{a} \frac{1}{k} \sum_{i=1}^{k} Q^{(i)}(x, a).$$
Thus $\bar{\pi}$ is entirely dictated by the mean function $\bar{Q} := (1/k) \sum_{i=1}^{k} Q^{(i)}$ of Q-value estimates.
An alternative view to Q-values, the distributional perspective on state-action returns, is discussed in [Bellemare et al., 2017a]. This paradigm represents a shift of focus towards estimating or using underlying distributions of random return variables instead of learning expectations. This in turn paints a more complex and informationally dense picture, and there exists overwhelming empirical evidence that the distributional perspective is helpful in deep reinforcement learning. That is, apart from the possibility of overall stronger performance, algorithmic benefits may also involve reduction of prediction variance, more robust learning with additional regularization effects, and a larger set of auxiliary goals such as learning risk-sensitive policies [Morimura et al., 2010, Bellemare et al., 2017a, Dabney et al., 2018b, Dabney et al., 2018a, Lyle et al., 2019]. Moreover, there has recently been important theoretical work on understanding observed improvements and providing results on convergence [Bellemare et al., 2017a, Rowland et al., 2018, Lyle et al., 2019].
A successful approximation procedure which implements distributional reinforcement learning is categorical distributional reinforcement learning (CDRL), which was introduced by [Bellemare et al., 2017a] and made explicit by [Rowland et al., 2018]. The procedure considers distributions $\eta^{(x,a)}$ defined on a fixed discrete support $z = \{z_1, z_2, \dots, z_K\}$, with projections onto the support for all possible categorical distributions arising internally in the algorithm.
In this paper we propose an extension of CDRL, where we merge the distributional perspective with an ensemble method involving agents learning in separate environments. Specifically, for each agent we replace the target generation of CDRL by targets generated by the ensemble mean mixture distribution $\bar{\eta} = (1/k) \sum_{i=1}^{k} \eta_i$. We argue that this implies an implicit sharing of information between agents during learning, where the distributional paradigm gives us more robust targets and an arguably more nuanced aggregated picture which preserves multimodality. As an initial study, we also provide empirical results where the extension was tested on a subset of Atari 2600 games [Bellemare et al., 2013]. The experiments support the validity of the approach: in all cases the extension generated stronger individual agents and good efficiency when regarded as an ensemble.

Background
We consider agent-environment interactions. For each observed state, the agent selects an action, whereby the environment generates a reward and a next state. Following the framework of [Rowland et al., 2018], we let $\mathcal{X}$ and $\mathcal{A}$ denote the sets of states and actions respectively, and let $p : \mathcal{X} \times \mathcal{A} \to \mathcal{P}(\mathbb{R} \times \mathcal{X})$ be a transition kernel which maps state-action pairs to joint distributions of immediate rewards and next states. Then we can model this interaction by a Markov Decision Process (MDP) $(\mathcal{X}, \mathcal{A}, p, \gamma)$, where $\gamma \in [0, 1)$ is a discount factor of future rewards. Moreover, an agent can sample its actions through a stationary policy $\pi : \mathcal{X} \to \mathcal{P}(\mathcal{A})$, which maps a current state to a distribution over available actions.
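Throughout the rest of this paper we consider MDPs where $\mathcal{X} \times \mathcal{A}$ is a countable state-action space. We denote by $\mathcal{D} = \mathcal{P}(\mathbb{R})^{\mathcal{X} \times \mathcal{A}}$ the set of functions where $\eta \in \mathcal{D}$ maps each state-action pair $(x, a)$ to a distribution $\eta^{(x,a)} \in \mathcal{P}(\mathbb{R})$. Similarly we put $\mathcal{D}_n = \mathcal{P}_n(\mathbb{R})^{\mathcal{X} \times \mathcal{A}}$, where $\mathcal{P}_n(\mathbb{R})$ is the set of probability distributions with finite $n$th moments. For a given $\eta \in \mathcal{D}$, we let $Q_\eta : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ denote the function which maps state-action pairs $(x, a)$ to the corresponding first moments of $\eta^{(x,a)}$, i.e.,
$$Q_\eta(x, a) := \int_{\mathbb{R}} z \, \eta^{(x,a)}(dz).$$
To fully appreciate the subsequent summary of distributional reinforcement learning theory, we also need to make the following definitions explicit.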
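Definition 1. For a Borel measurable function $g : \mathbb{R} \to \mathbb{R}$ and $\nu \in \mathcal{P}(\mathbb{R})$ we let $g_\# \nu$ denote the push-forward measure defined by
$$g_\# \nu(A) := \nu(g^{-1}(A))$$
on all Borel sets $A \subseteq \mathbb{R}$. In particular, given $r, \gamma \in \mathbb{R}$ we let $(f_{r,\gamma})_\# \nu$ be the push-forward measure where $f_{r,\gamma}(x) := r + \gamma x$.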
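Definition 2. The $n$th Wasserstein metric $d_n$, $n \ge 1$, defined on $\mathcal{P}_n(\mathbb{R})$ is given by
$$d_n(\nu_1, \nu_2) := \Big( \inf_{(X, Y)} \mathbb{E}\big[ |X - Y|^n \big] \Big)^{1/n},$$
where the infimum is taken over all pairs of random variables with marginals $X \sim \nu_1$ and $Y \sim \nu_2$. For more details on Wasserstein distances see [Villani, 2008]. Another relevant metric on distributions is the Cramér metric, which in the univariate case is related to the statistical energy distance [Rizzo and Székely, 2016].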
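Definition 3. The Cramér metric $\ell_2$ defined on $\mathcal{P}(\mathbb{R})$ is given by
$$\ell_2(\nu_1, \nu_2) := \Big( \int_{\mathbb{R}} \big( F_{\nu_1}(x) - F_{\nu_2}(x) \big)^2 \, dx \Big)^{1/2}$$
for any two probability distributions $\nu_1, \nu_2$ with distribution functions $F_{\nu_1}$ and $F_{\nu_2}$ respectively. It follows that we can endow $\mathcal{D}$ with the supremum Cramér metric
$$\bar{\ell}_2(\eta, \mu) := \sup_{(x,a) \in \mathcal{X} \times \mathcal{A}} \ell_2\big(\eta^{(x,a)}, \mu^{(x,a)}\big).$$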
Suppose further that we have a set $\mathcal{P}$ of categorical distributions supported on a fixed set $z = \{z_1, z_2, \dots, z_K\}$ of equally spaced numbers. Then the following projection operator minimizes the distance between any categorical distribution $\nu = \sum_{i=1}^{n} p_i \delta_{y_i}$ and elements in $\mathcal{P}$ with respect to the Cramér metric given in Definition 3 [Lyle et al., 2019].
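Definition 4. The Cramér projection $\Pi_z$ maps any Dirac measure $\delta_y$ to a distribution in $\mathcal{P}$ by
$$\Pi_z(\delta_y) := \begin{cases} \delta_{z_1}, & y \le z_1, \\ \dfrac{z_{i+1} - y}{z_{i+1} - z_i} \, \delta_{z_i} + \dfrac{y - z_i}{z_{i+1} - z_i} \, \delta_{z_{i+1}}, & z_i < y \le z_{i+1}, \\ \delta_{z_K}, & y > z_K. \end{cases}$$
Moreover, the projection is defined to be linear over mixture distributions such that
$$\Pi_z\Big( \sum_{i} p_i \delta_{y_i} \Big) = \sum_{i} p_i \, \Pi_z(\delta_{y_i}).$$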
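As an illustration of Definition 4, the following is a minimal Python sketch of the projection, assuming its standard form: mass at a point between two atoms is split linearly between them, and mass outside the support is moved to the nearest endpoint. The function name `cramer_projection` and the list-based representation are our own choices for the example.

```python
from bisect import bisect_left

def cramer_projection(atoms, probs, support):
    """Project sum_i probs[i] * delta_{atoms[i]} onto the sorted grid `support`.

    Returns the list of probabilities assigned to each support point.
    """
    out = [0.0] * len(support)
    for y, p in zip(atoms, probs):
        if y <= support[0]:
            out[0] += p                      # clamp to the lowest atom
        elif y > support[-1]:
            out[-1] += p                     # clamp to the highest atom
        else:
            # i is the index with support[i - 1] < y <= support[i]
            i = bisect_left(support, y)
            lo, hi = support[i - 1], support[i]
            w_hi = (y - lo) / (hi - lo)
            out[i - 1] += p * (1.0 - w_hi)   # share given to the lower neighbour
            out[i] += p * w_hi               # share given to the upper neighbour
    return out

# Half of the mass sits at 0.25 (inside the grid), half at 5.0 (above it).
projected = cramer_projection([0.25, 5.0], [0.5, 0.5], [0.0, 1.0, 2.0])
```

Linearity over mixtures is built in, since each Dirac component is projected separately and weighted by its probability. For atoms inside the support the projection preserves the mean of that component.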

Expected Reinforcement Learning
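Before we go into the distributional perspective, let us first recall some value function fundamentals, here stated in operator form. Let $(\mathcal{X}, \mathcal{A}, p, \gamma)$ be an MDP. Given $(x, a) \in \mathcal{X} \times \mathcal{A}$, we define the return of a policy $\pi$ as the random variable
$$Z^\pi(x, a) := \sum_{t=0}^{\infty} \gamma^t R_t, \qquad X_0 = x, \ A_0 = a, \tag{1}$$
where $(R_t)_{t=0}^{\infty}$ is a random sequence of immediate rewards, indexed by time step $t$ and dependent on random state-action pairs $(X_t, A_t)_{t=0}^{\infty}$ under $p$ and $\pi$. In an evaluation setting of some fixed policy $\pi$, let $Q^\pi : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ be the expected return function, which by definition has values
$$Q^\pi(x, a) := \mathbb{E}\big[ Z^\pi(x, a) \big].$$
If we consider distributions dictated by $p$ and $\pi$, and let $R(x, a)$ and $(X', A')$ denote the random reward and subsequent random state-action pair given $(x, a) \in \mathcal{X} \times \mathcal{A}$, then we recall the Bellman operator $T^\pi$ defined by
$$(T^\pi g)(x, a) := \mathbb{E}\big[ R(x, a) \big] + \gamma \, \mathbb{E}\big[ g(X', A') \big] \tag{2}$$
on bounded real functions $g \in B(\mathcal{X} \times \mathcal{A}, \mathbb{R})$. Moreover, in the search for values attained by optimal policies, we also recall the optimality operator $T^*$, where
$$(T^* g)(x, a) := \mathbb{E}\big[ R(x, a) \big] + \gamma \, \mathbb{E}\Big[ \max_{a' \in \mathcal{A}} g(X', a') \Big]. \tag{3}$$
It is readily verified that both operators are contraction maps on the complete metric space $(B(\mathcal{X} \times \mathcal{A}, \mathbb{R}), d_\infty)$. In addition, their unique fixed points are given by $Q^\pi$ and $Q^*$ respectively, where $Q^*$ is the optimal function defined by
$$Q^*(x, a) := \sup_{\pi} Q^\pi(x, a)$$
for all $(x, a)$ [Bertsekas and Tsitsiklis, 1996].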
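The contraction property of the optimality operator can be illustrated on a small example. The sketch below assumes a hypothetical deterministic two-state MDP (action 0 stays, action 1 switches state, and only staying in state 1 is rewarded); the names `bellman_optimality`, `reward` and `nxt` are ours and not part of any referenced implementation.

```python
def bellman_optimality(q, reward, nxt, gamma):
    """One application of T*: (T*q)(x, a) = r(x, a) + gamma * max_a' q(x', a')."""
    return [
        [reward[x][a] + gamma * max(q[nxt[x][a]]) for a in range(len(q[x]))]
        for x in range(len(q))
    ]

def sup_distance(q1, q2):
    """Supremum distance between two tabular Q-functions."""
    return max(abs(u - v) for r1, r2 in zip(q1, q2) for u, v in zip(r1, r2))

# Toy deterministic MDP (hypothetical): reward[x][a] and next state nxt[x][a].
reward = [[0.0, 0.0], [1.0, 0.0]]
nxt = [[0, 1], [1, 0]]
gamma = 0.5

# Iterating T* from the zero function approaches the fixed point Q*.
q = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(60):
    q = bellman_optimality(q, reward, nxt, gamma)
```

For this toy problem the fixed point can be computed by hand: the value of state 1 is $1/(1-\gamma) = 2$, which gives $Q^* = [[0.5, 1.0], [2.0, 0.5]]$, and the iterates approach it at the geometric rate $\gamma^n$.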

Distributional Reinforcement Learning
We now proceed by presenting some of the main ideas of distributional reinforcement learning in a tabular setting. We will first look at the evaluation problem, where we are trying to find the state-action value of a fixed policy π. Second, we consider the control problem, where we try to find the optimal state-action value. Third, we consider the distributional approximation procedure CDRL used by agents in this paper.

Evaluation
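We consider a distributional variant of (2), the distributional Bellman operator given by $\mathcal{T}^\pi : \mathcal{D} \to \mathcal{D}$,
$$(\mathcal{T}^\pi \eta)^{(x,a)} := \mathbb{E}\Big[ (f_{R(x,a),\gamma})_\# \, \eta^{(X',A')} \Big], \tag{4}$$
where the expectation is understood as a mixture of distributions over the joint law of $R(x, a)$ and $(X', A')$. Here $\mathcal{T}^\pi$ is a $\gamma$-contraction in $(\mathcal{D}_n, d_n)$ for all $n \ge 1$ with a unique fixed point [Bellemare et al., 2017a, Lemma 3]. Moreover, by [Lyle et al., 2019, Proposition 2], $\mathcal{T}^\pi$ is expectation preserving when we have an initial coupling with the $T^\pi$-iteration given in (2). That is, given an initial $\eta_0 \in \mathcal{D}$ and a function $g$ such that $g = Q_{\eta_0}$, the identity $(T^\pi)^n g = Q_{(\mathcal{T}^\pi)^n \eta_0}$ holds for all $n \ge 0$. Thus, if we let $\eta_\pi \in \mathcal{D}$ be the function of distributions of $Z^\pi$ in (1), then $\eta_\pi$ is the unique fixed point satisfying the distributional Bellman equation
$$\eta_\pi = \mathcal{T}^\pi \eta_\pi.$$
It follows that iterating $\mathcal{T}^\pi$ on any starting collection $\eta_0$ with bounded moments eventually solves the evaluation task of $\pi$ to an arbitrary degree.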

Control
Recall the Bellman optimality operator $T^*$ of (3). If we define a corresponding distributional optimality operator $\mathcal{T}^* : \mathcal{D} \to \mathcal{D}$,
$$(\mathcal{T}^* \eta)^{(x,a)} := \mathbb{E}\Big[ (f_{R(x,a),\gamma})_\# \, \eta^{(X', a^*(X'))} \Big], \tag{5}$$
where $a^*(x') = \arg\max_{a' \in \mathcal{A}} Q_\eta(x', a')$, then expectation values generated by iterates under $\mathcal{T}^*$ will behave as expected. That is, if we put $Q_n := Q_{(\mathcal{T}^*)^n \eta_0}$, then we have an exponentially fast uniform convergence $Q_n \to Q^*$ as $n \to \infty$. However, $\mathcal{T}^*$ is not a contraction in any metric over distributions and may lack fixed points altogether in $\mathcal{D}$ [Bellemare et al., 2017a].

Categorical Evaluation and Control
In most real applications the updates of (4) and (5) are either computationally infeasible, or impossible to fully compute due to p being unknown. It follows that approximations are key to defining practical distributional algorithms. This could involve parametrization over some selected set of distributions along with projections onto these distributional subspaces. It could also involve stochastic approximations with sampled transitions and gradient updates with function approximation.
A structure for algorithms making use of such approximations is categorical distributional reinforcement learning (CDRL). What follows is a short summary of the CDRL procedure fundamental to the single-agent implementations in this paper.
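Let $z = \{z_1, z_2, \dots, z_K\}$ be an ordered fixed set of equally spaced real numbers such that $z_1 < z_2 < \cdots < z_K$ with $\Delta z := z_{i+1} - z_i$. Let
$$\mathcal{P} := \Big\{ \sum_{i=1}^{K} p_i \delta_{z_i} \ : \ p_i \ge 0, \ \sum_{i=1}^{K} p_i = 1 \Big\}$$
be the subset of categorical distributions in $\mathcal{P}(\mathbb{R})$ supported on $z$. We consider parameterized distributions by using $\mathcal{D} = \mathcal{P}^{\mathcal{A} \times \mathcal{X}}$ as the collection of possible inputs and outputs of an algorithm. Moreover, for each $\eta \in \mathcal{D}$ with $\eta^{(x,a)} = \sum_{i=1}^{K} p_i^{(x,a)} \delta_{z_i}$ we have
$$Q_\eta(x, a) = \sum_{i=1}^{K} p_i^{(x,a)} z_i$$
as its Q-value function.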
In preparation for the subsequent treatment of our extension of CDRL, we first reproduce the steps of the general procedure here (see [Rowland et al., 2018, Algorithm 1]).

Categorical Distributional Reinforcement Learning
1. At each iteration step $t$ and input $\eta_t \in \mathcal{D}$, sample a transition $(x_t, a_t, r_t, x'_t)$.
2. Select $a^*$ to be either sampled from $\pi(x'_t)$ in the evaluation setting, or taken as $a^* = \arg\max_a Q_{\eta_t}(x'_t, a)$ in the control setting.
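3. Recall the Cramér projection $\Pi_z$ given in Definition 4 and put $\hat{\eta}_t^{(x_t,a_t)} := \Pi_z (f_{r_t,\gamma})_\# \, \eta_t^{(x'_t, a^*)}$.
4. Take the next iterated function as some update $\eta_{t+1}$ such that $\mathrm{KL}\big( \hat{\eta}_t^{(x_t,a_t)} \,\|\, \eta_{t+1}^{(x_t,a_t)} \big) \le \mathrm{KL}\big( \hat{\eta}_t^{(x_t,a_t)} \,\|\, \eta_t^{(x_t,a_t)} \big)$.
Consider first a finite MDP and a tabular setting. Define $\hat{\eta}_t^{(x,a)} := \eta_t^{(x,a)}$ whenever $(x, a) \ne (x_t, a_t)$. Then by the convexity of $-\log(z)$ it is readily verified that updates of the form
$$\eta_{t+1} := (1 - \alpha_t) \, \eta_t + \alpha_t \, \hat{\eta}_t$$
satisfy step 4. In fact, if there exists a unique policy $\pi^*$ associated with the convergence of (3), then this update yields almost sure convergence, with respect to the supremum Cramér metric, to a distribution in $\mathcal{D}$ with $\pi^*$ as the greedy policy (under some additional assumptions on the step sizes $\alpha_t$ and sufficient support; see [Rowland et al., 2018, Theorem 2] for details).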
In practice we are often forced to use function approximation of the form $\eta^{(x,a)} = \phi(x, a; \theta)$, where $\phi$ is parameterized by some set of weights $\theta$. Gradient updates with respect to $\theta$ can then be made to minimize the loss
$$\mathrm{KL}\big( \hat{\eta}_t^{(x_t,a_t)} \,\|\, \phi(x_t, a_t; \theta) \big),$$
where $\hat{\eta}_t^{(x_t,a_t)} = \Pi_z (f_{r_t,\gamma})_\# \, \phi(x'_t, a^*; \theta_{\text{fixed}})$ is the computed learning target of the transition $(x_t, a_t, r_t, x'_t)$. However, convergence with the Kullback-Leibler loss and function approximation is still an open question. Theoretical progress has been made when considering other losses, although we may lose the stability benefits coming from the relative ease of minimizing KL [Bellemare et al., 2017b, Lyle et al., 2019]. An algorithm implementing CDRL with function approximation is C51 [Bellemare et al., 2017a]. It essentially uses the same neural network architecture and training procedure as DQN [Mnih et al., 2015]. To increase stability during training, this also involves sampling transitions from an experience buffer and maintaining an older, periodically updated copy of the weights for target computation. However, instead of estimating Q-values, C51 uses a finite support $z$ of 51 points and learns discrete probability distributions $\phi(x, a; \theta)$ over $z$ via a softmax output layer. Training is done by using the KL divergence as the loss function over batches with computed targets $\hat{\eta}$ of CDRL.
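The target computation and the KL loss for a single transition can be sketched in a few lines of Python. This is an illustrative sketch only, assuming an equally spaced support and plain lists of probabilities in place of network outputs; the names `project_shifted` and `kl_divergence` are hypothetical. It also checks numerically that the tabular mixture update does not increase the KL divergence to the target, as required by step 4 of CDRL.

```python
import math

def project_shifted(probs, support, r, gamma):
    """Compute Pi_z (f_{r,gamma})_# nu for a categorical nu on an equally spaced support."""
    K = len(support)
    dz = support[1] - support[0]
    out = [0.0] * K
    for z, p in zip(support, probs):
        # Shift and scale the atom, then clamp it to [z_1, z_K].
        y = min(max(r + gamma * z, support[0]), support[-1])
        b = (y - support[0]) / dz          # fractional index of y on the grid
        lo = min(int(b), K - 2)            # lower neighbour (b is non-negative)
        w = b - lo                         # weight of the upper neighbour
        out[lo] += p * (1.0 - w)
        out[lo + 1] += p * w
    return out

def kl_divergence(target, pred, eps=1e-12):
    """KL(target || pred) for categorical distributions, with 0 * log 0 = 0."""
    return sum(t * math.log(t / max(q, eps)) for t, q in zip(target, pred) if t > 0.0)

support = [0.0, 1.0, 2.0, 3.0, 4.0]
current = [0.2] * 5        # stands in for eta_t at (x_t, a_t)
next_dist = [0.2] * 5      # stands in for eta_t at (x'_t, a*)
target = project_shifted(next_dist, support, r=1.0, gamma=0.5)

# Tabular mixture update with step size alpha.
alpha = 0.1
updated = [(1 - alpha) * c + alpha * t for c, t in zip(current, target)]
loss_before = kl_divergence(target, current)
loss_after = kl_divergence(target, updated)
```

In this example the shifted atoms are 1, 1.5, 2, 2.5 and 3, so the projected target puts mass 0.3, 0.4 and 0.3 on the support points 1, 2 and 3. With function approximation, `current` would be replaced by the softmax output of the online network and `next_dist` by the output of the fixed target network.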

Ensembles
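Ensemble methods have been widely used in both supervised learning and reinforcement learning. In supervised learning this can involve bootstrap aggregating predictors for better accuracy when given unstable processes such as neural networks, or using "expert" opinion mixtures for better estimators [Breiman, 1996, Goodfellow et al., 2016]. A simple example which demonstrates the possible benefits of aggregation is the following average pool of $k$ regression models: Given a sample to predict, assume that the models draw prediction errors $\varepsilon_i$, $i = 1, \dots, k$, from a zero-mean multivariate normal distribution with $\mathbb{E}[\varepsilon_i^2] = \sigma^2$ and correlations $\rho_{ij} = \rho$. Then the error made by averaging their predictions is $\bar{\varepsilon} := (1/k) \sum_{i=1}^{k} \varepsilon_i$, with
$$\mathbb{E}\big[ \bar{\varepsilon}^2 \big] = \frac{1}{k^2} \, \mathbb{E}\Big[ \sum_{i=1}^{k} \Big( \varepsilon_i^2 + \sum_{j \ne i} \varepsilon_i \varepsilon_j \Big) \Big] = \frac{\sigma^2}{k} + \frac{k-1}{k} \, \rho \sigma^2.$$
It follows that the mean squared error goes to $\sigma^2 / k$ as $\rho \to 0$, whereas we get $\sigma^2$ and no benefit when the errors are perfectly correlated. Under the assumption of independently trained agents, we have a reinforcement learning variant of the average pool in the following definition.
Definition 5. Given an ensemble of $k$ agents with Q-value estimates $Q^{(i)}$, $i = 1, \dots, k$, the average joint policy $\bar{\pi}$ is the greedy policy with respect to the mean function $\bar{Q} := (1/k) \sum_{i=1}^{k} Q^{(i)}$, i.e., $\bar{\pi}(x) := \arg\max_{a} \bar{Q}(x, a)$ at every $x \in \mathcal{X}$.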
Thus, $\bar{\pi}$ represents an aggregation strategy where we consider the information provided by each agent as equally important. Moreover by the linearity of expectations and in view of (3), if we have initial functions Q n−1 of each agent yields Q n = T * Q n−1 for the ensemble. Assume further that learning is done with a single algorithm in separate environments. If we take $Q^{(i)}(x, a)$ as estimates of $Q^{(i)}_n(x, a)$ for some step $n$, with errors $\varepsilon_i$ distributed as multivariate Gaussian noise, then we should expect $\bar{Q}(x, a)$ to have a smaller expected error variance in its estimation of $Q_n(x, a)$, similar to the regression models above. This implies more robust performance when given an unstable training process far from convergence, but it also implies diminishing improvements when the algorithm is close to converging to a unique policy.
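The variance reduction argument for averaged estimates can be checked with a short calculation. The sketch below assumes errors with common variance $\sigma^2$ and common pairwise correlation $\rho$, and compares the closed form $\sigma^2/k + \frac{k-1}{k}\rho\sigma^2$ against a direct sum over the covariance matrix; the function names are ours.

```python
def mse_of_average(k, sigma2, rho):
    """Closed form for E[(mean of errors)^2] under equal variance and correlation."""
    return sigma2 / k + (k - 1) / k * rho * sigma2

def mse_from_covariance(k, sigma2, rho):
    """Same quantity as (1/k^2) * sum over all entries of the covariance matrix."""
    total = sum(
        sigma2 if i == j else rho * sigma2
        for i in range(k)
        for j in range(k)
    )
    return total / k**2

# Five agents with unit error variance at three correlation levels.
for rho in (0.0, 0.5, 1.0):
    assert abs(mse_of_average(5, 1.0, rho) - mse_from_covariance(5, 1.0, rho)) < 1e-12
```

With $k = 5$ and $\sigma^2 = 1$, uncorrelated errors give a mean squared error of $0.2$, a correlation of $0.5$ gives $0.6$, and perfectly correlated errors give $1.0$, i.e., no benefit from averaging.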
However, in real applications, and in particular with function approximation, there may be instances where the improved performance of $\bar{\pi}$ does not vanish because agents converge to distinct sub-optimal policies. An illustration of this phenomenon can be seen in Figure 1. It shows evaluations during learning in the LunarLander-v2 environment [Brockman et al., 2016]. The single agents used CDRL on a 29-atom support. To approximate distributions the agents used small neural networks with 3 encoding layers consisting of 16 units each. The architecture was purposely chosen to make it harder for the optimizer to converge to an optimal policy, possibly due to lack of capacity. At each evaluation point the models were tested with $\varepsilon = 0.001$. The figure also includes evaluations of average joint policies of five agents having the same evaluation $\varepsilon$. We can see that the joint information provided by an ensemble of five agents transcends individual capacity, indicating that some agents settle on distinct sub-optimal solutions.
Figure 1: Low capacity CDRL implementations in the LunarLander-v2 environment. We can see that the enhanced performance of an average joint policy of five agents may not vanish due to agents settling on distinct sub-optimal policies.

Categorical Q-Learning with Ensembles
We consider an ensemble of $k$ agents, each independently trained with the same distributional algorithm, where $\eta_i$, $i = 1, \dots, k$, are their respective distributional collections. There are several ways to aggregate distributional information provided by the ensemble with respect to forecasts and risk-sensitivity [Clemen and Winkler, 1999, Casarin et al., 2016]. Perhaps the simplest is a distributional variant of the average joint policy, where we consider the mean function $\bar{\eta}$ of mixture distributions:
$$\bar{\eta}^{(x,a)} := \frac{1}{k} \sum_{i=1}^{k} \eta_i^{(x,a)}. \tag{6}$$
Since $\bar{\eta}^{(x,a)}$ is a linear pool, it preserves multimodality during aggregation. Hence it maintains an arguably more nuanced picture of estimated future rewards compared to methods that generate unimodal aggregations around unrealizable expected values. In addition, expectations under $\bar{\eta}$ yield the Q-function used by the average joint policy in Definition 5, with all the performance benefits that this entails during learning. The finite support of the CDRL procedure may provide another reason to aggregate by $\bar{\eta}$: Under the assumption that $\eta_i^{(x,a)}$, $i = 1, \dots, k$, are drawn as random vectors from some multivariate normal population with mean $\mu(x, a)$ and covariance $\Sigma(x, a)$, $\bar{\eta}^{(x,a)}$ is a maximum likelihood estimate of the mean categorical distribution $\mu(x, a)$ induced by the algorithm over all possible training runs [Johnson and Wichern, 2014]. It follows that $\bar{\eta}$ may provide more robust estimates in reflecting mean $t$-step capabilities of the procedure in terms of distributions found by sending $k \to \infty$.
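A small numerical example may help to see why a linear pool preserves multimodality while still reproducing the averaged Q-value. The sketch below assumes two hypothetical agents whose categorical return distributions on the support $\{-1, 0, 1\}$ are concentrated at opposite ends; the names `mean_mixture` and `expectation` are ours.

```python
def mean_mixture(dists):
    """Equal-weight mixture of categorical distributions on a shared support."""
    k = len(dists)
    return [sum(d[j] for d in dists) / k for j in range(len(dists[0]))]

def expectation(probs, support):
    """First moment of a categorical distribution."""
    return sum(p * z for p, z in zip(probs, support))

support = [-1.0, 0.0, 1.0]
agent_dists = [
    [1.0, 0.0, 0.0],   # agent 1 predicts a return of -1
    [0.0, 0.0, 1.0],   # agent 2 predicts a return of +1
]

mixture = mean_mixture(agent_dists)
q_mixture = expectation(mixture, support)
q_average = sum(expectation(d, support) for d in agent_dists) / len(agent_dists)
```

Here the mixture keeps both modes, with mass 0.5 at each end and none at 0, whereas its expectation equals the average of the individual Q-values, which is 0 and is not a return that either agent considers possible.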
It then stands to reason that (6) should help accelerate learning by providing better and more robust targets in the control setting of CDRL. This implies implicitly sharing information gained between agents and following accelerated learning trajectories closer to the true expected capability of an algorithm. We can summarize this as an extension of the CDRL control procedure.
For a fixed support $z$, we parameterize individual distribution functions $\eta_{i,t}$, $i = 1, \dots, k$, at time step $t$ by using $\mathcal{D} = \mathcal{P}^{\mathcal{A} \times \mathcal{X}}$ as possible inputs and outputs of the algorithm. Let $\bar{\eta}_t$ be the mean function of $\{\eta_{i,t}\}_{i=1}^{k}$ according to (6).

Ensemble CDRL Control (ECC)
1. At each iteration step $t$ and for each agent input $\eta_{i,t}$, sample a transition $(x, a, r, x')$.
2. Select $a^* = \arg\max_{a'} Q_{\bar{\eta}_t}(x', a')$.
3. Recall the Cramér projection $\Pi_z$ given in Definition 4 and put $\hat{\eta}_{i,t}^{(x,a)} := \Pi_z (f_{r,\gamma})_\# \, \bar{\eta}_t^{(x', a^*)}$.
4. For each agent, follow step 4 of CDRL with target $\hat{\eta}_{i,t}^{(x,a)}$.
We note that if updates are done in full or on the same transitions, then the algorithm trivially reduces to CDRL by the linearity of $(f_{r,\gamma})_\#$, hence we lose the benefits of the ensemble. To avoid premature convergence to correlated errors, we would ideally want the agents to have the freedom to explore different trajectories during learning. In the case of function approximation, this can involve maintaining a separate experience buffer for each agent. It can also involve periodic updates of ensemble target networks in the hope of generating sufficiently diverse policies until convergence. The latter is in practical terms the only way to minimize overhead costs induced by inter-thread ensemble queries in simulations. Update periods that are too short imply fast initial learning, but with correlated errors, high overhead costs and instability [Mnih et al., 2015]. Long periods would imply the possibility of more diverse policies but with slower learning.
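The target generation of the procedure above can be sketched as follows. This is a minimal illustration and not the implementation used in the experiments: it assumes that the greedy action at the next state is selected with respect to the ensemble mean, that the support is equally spaced, and that each agent's next-state distributions are given as plain lists. The names `ecc_target` and `project_shifted` are hypothetical.

```python
def project_shifted(probs, support, r, gamma):
    """Compute Pi_z (f_{r,gamma})_# nu for a categorical nu on an equally spaced support."""
    K = len(support)
    dz = support[1] - support[0]
    out = [0.0] * K
    for z, p in zip(support, probs):
        y = min(max(r + gamma * z, support[0]), support[-1])  # clamp to the support
        b = (y - support[0]) / dz
        lo = min(int(b), K - 2)
        w = b - lo
        out[lo] += p * (1.0 - w)
        out[lo + 1] += p * w
    return out

def ecc_target(ensemble_next, support, r, gamma):
    """Shared learning target built from the ensemble mean at the next state.

    ensemble_next[i][a] is agent i's categorical distribution at (x', a).
    """
    k = len(ensemble_next)
    n_actions = len(ensemble_next[0])
    K = len(support)
    # Mean mixture distribution per action, as in the ensemble mean function.
    mean_next = [
        [sum(agent[a][j] for agent in ensemble_next) / k for j in range(K)]
        for a in range(n_actions)
    ]
    # Greedy action with respect to the Q-values of the mean mixture (assumption).
    q_mean = [sum(p * z for p, z in zip(dist, support)) for dist in mean_next]
    a_star = max(range(n_actions), key=lambda a: q_mean[a])
    return a_star, project_shifted(mean_next[a_star], support, r, gamma)

support = [0.0, 1.0, 2.0]
ensemble_next = [
    [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],   # agent 1, actions 0 and 1
    [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]],   # agent 2, actions 0 and 1
]
a_star, target = ecc_target(ensemble_next, support, r=0.0, gamma=0.5)
```

In this toy case the mean mixture for action 1 is $(0, 0.5, 0.5)$ with Q-value 1.5, so action 1 is selected, and the shifted and projected target is $(0.25, 0.75, 0)$. Every agent would then regress its own distribution at $(x, a)$ towards this shared target while sampling transitions from its own experience.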

Empirical Results on a Subset of Atari 2600 Games
As a first step in understanding the properties of the extension ECC discussed in Section 3.2, we now evaluate an implementation of the procedure on four Atari 2600 environments found in ALE [Bellemare et al., 2013]. Specifically, we look at ensembles of k = 5 agents. We employ for all agents essentially the same architecture, hyperparameters and training procedure as C51 in [Bellemare et al., 2017a]; except for a slightly smaller replay buffer size at 900K. This yields an implicit ensemble buffer size of 4.5M for ECC. In addition, we employ for each ECC-agent a larger ensemble target network. The network consists of copied weights from all ECC-networks and is updated periodically at every 10K steps with negligible overhead.
We train $k$ agents on the first 40M frames (roughly 185h of Atari-time at 60Hz). Agent models are saved every 400K frames. For each save we evaluate the performance of the individual agents (ECC-Agent) and the ensemble with an average joint policy (ECC-Ensemble). Moreover, we take an ensemble of $k = 5$ independently trained agents using $\bar{\pi}$ as our baseline (Average Joint Policy). For comparison we also evaluate each such single agent (CDRL-Agent). In all performance protocols, we start an episode under the 30 no-op regime [Mnih et al., 2015] with an exploration epsilon set to $\varepsilon = 0.001$. The evaluation period is 500K frames with episodes truncated at 108K frames (30min).

Online Performance
To get a sense of algorithmic robustness and speed of learning we report the online performance of agents and ensembles [Dabney et al., 2018b]. Under this protocol we record the average return for each evaluation point during learning.
We can see in Figure 2 that the extension ensemble is on par with or outperforms the baseline in online performance over all four environments. Moreover, in three out of four games single ECC agents have similar performance to the joint policy of $k$ independently trained agents, which is the main training objective of the extension algorithm. We also note that in all environments, except possibly Breakout, extension agents seem to be uncorrelated enough to generate a boost in performance by their joint information, while in all cases they have a better individual performance than single CDRL agents.

Relative Ensemble Sample Performance
Although ensembles will digest frames at nearly $k$ times the rate of a single CDRL algorithm, we consider here relative sample performance, where we look at performance versus the total information accumulated by an algorithm. Under this protocol we measure the relative ratio of mean evaluation scores as a function of the total number of frames seen by each learning system. This gives us an idea of how efficiently an ensemble algorithm can translate experience to performance on a per-sample basis compared to single CDRL. Note that if single CDRL agents all converge to correlated errors, then the joint policy should eventually converge to $1/k$-efficiency in relative sample performance. Thus in general we should expect the relative performance to degrade as training progresses with diminishing ensemble benefits. Table 1 shows measured relative performance of the two ensemble methods, averaged over the first 40M samples. We note that initial learning with ensembles may generate performance much higher than $1/k$-efficiency. We also note that the extension ensemble came close to full efficiency in Berzerk and Breakout, i.e., it displayed a near $k$-factor increase in learning rate. However, depending on the environment, the actual speed-up may vary wildly during learning, as shown in Figure 2.

Discussion
In this paper we proposed and studied an extension of categorical distributional reinforcement learning, where we employ averaged learning targets over an ensemble. This extension implies an implicit sharing of information between agents during learning, where under the distributional paradigm we should expect a richer and more robust set of predictions while preserving multimodality during aggregation. To test these assumptions, we did an initial empirical study on a subset of Atari 2600 games, where we employed essentially the same architecture and hyperparameter set as the C51 algorithm in [Bellemare et al., 2017a]. In all cases we saw that the single-agent performance objective of the extension was accomplished. We also studied the effects of keeping extension-amplified agents in an ensemble, where in some cases the performance benefits were present and stronger than those of an averaged ensemble of independent agents.
We note that unlike massively distributed approaches such as Ape-X [Horgan et al., 2018], the extension represents a decentralized distributed learning system with minimal overhead. As such, it naturally comes with poor scalability but with greater efficiency on a per-sample basis. An interesting idea here would be to somewhat counteract the poor scalability by choosing agents with successively lower capacity as the ensemble size increases. We should then expect to see better performance with increasing size until a cutoff point is reached, hinting at the minimum capacity needed to effectively find and represent strong solutions.
We leave as future work the matter of convergence analysis and hyperparameter tuning, in particular the update period for a target ensemble network. It is quite possible that the update frequency of C51 is too aggressive when using ensemble targets. This may lead to premature convergence to correlated agents upon reaching difficult environmental plateaus with rarely seen transitions to more abundant rewards. Some interesting ideas here would be scheduled update periods or eventually switching to CDRL from a much stronger and more robust level of individual performance. However, to fully gauge these matters we would need a more comprehensive empirical study.