Context Meta-Reinforcement Learning via Neuromodulation

Meta-reinforcement learning (meta-RL) algorithms enable agents to adapt quickly to tasks from few samples in dynamic environments. Such a feat is achieved through dynamic representations in an agent's policy network (obtained via reasoning about task context, model parameter updates, or both). However, obtaining rich dynamic representations for fast adaptation beyond simple benchmark problems is challenging due to the burden placed on the policy network to accommodate different policies. This paper addresses the challenge by introducing neuromodulation as a modular component that augments a standard policy network and regulates neuronal activities in order to produce efficient dynamic representations for task adaptation. The proposed extension to the policy network is evaluated across multiple discrete and continuous control environments of increasing complexity. To demonstrate the generality and benefits of the extension in meta-RL, the neuromodulated network was applied to two state-of-the-art meta-RL algorithms (CAVIA and PEARL). The results demonstrate that meta-RL augmented with neuromodulation produces significantly better results and richer dynamic representations in comparison to the baselines.


Introduction
Human intelligence, though specialized in some sense, is able to generally adapt to new tasks and solve problems from limited experience or few interactions. The field of meta-reinforcement learning (meta-RL) seeks to replicate such a flexible intelligence by designing agents that are capable of rapidly adapting to tasks from few interactions in an environment. Recent work in the field (Rakelly et al., 2019; Finn et al., 2017; Duan et al., 2016b; Wang et al., 2016; Zintgraf et al., 2019; Gupta et al., 2018) has showcased state-of-the-art results. Studies with agents endowed with such adaptation capabilities are a promising avenue for developing much desired and needed artificial intelligence systems and robots with lifelong learning dynamics.
When an agent's policy for a meta-RL problem is encoded by a neural network, neural representations are adjusted from a base pre-trained point to a configuration that is optimal to solve a specific task. Such dynamic representations are a key feature to enable an agent to rapidly adapt to different tasks. These representations can be derived from gradient-based approaches (Finn et al., 2017), context-based approaches such as memory-based (Mishra et al., 2018; Wang et al., 2016; Duan et al., 2016b) and probabilistic (Rakelly et al., 2019) methods, or hybrid approaches (i.e., a combination of gradient and context methods) (Zintgraf et al., 2019). The hybrid approach obtains a task context via gradient updates and thus dynamically alters the representations of the network. Context approaches such as CAVIA (Zintgraf et al., 2019) and PEARL (Rakelly et al., 2019) are more interpretable because they disentangle the task context from the policy network; the task context is then used to achieve optimal policies for different tasks.
One limitation of such approaches is that they do not scale well as the problem complexity increases, because many diverse policies must be reachable within a single network. In particular, it is possible that, as tasks grow in complexity, the similarity between tasks decreases and thus the network representations required to solve each task optimally become dissimilar. We hypothesize that standard policy networks are not likely to produce diverse policies from a trained base representation because all neurons have a homogeneous role or function: thus, significant changes in the policy require widespread changes across the network. From this observation, we speculate that a network endowed with modulatory neurons (neuromodulators) has a significantly higher ability to modify its policy.
Our approach to overcome this limiting design factor in current meta-RL neural approaches is to introduce a neuromodulated policy network to increase its ability to encode rich and flexible dynamic representations. The rich representations are measured based on the dissimilarity of the representations across various tasks, and are useful when the optimal policy of an agent (input-to-action mapping) is less similar across tasks. When combined with the CAVIA and PEARL meta-learning frameworks, the proposed approach produced better dynamic representations for fast adaptation as the neuromodulators in each layer serve as a means of directly altering the representations of the layer in addition to the task context. Several designs exist for neuromodulation (Doya, 2002), either to gate plasticity (Soltoggio et al., 2008;Miconi et al., 2020), gate neural activations (Beaulieu et al., 2020) or alter high level behaviour (Xing et al., 2020). The proposed mechanism in this work focuses on just one simple principle: modulatory signals alter the representations in each layer by gating the weighted sum of input of the standard neural component.
The primary contribution of this work is a neuromodulated policy network for meta-reinforcement learning for solving increasingly difficult problems. The modular design allows the proposed layer to be used with other existing layers (such as standard fully connected layers, convolutional layers and so on) when stacking them to form a deep network. The experimental evidence in this work demonstrates that neuromodulation is beneficial to adapt network representations with more flexibility in comparison to standard networks. Experimental evaluations were conducted across high-dimensional discrete and continuous control environments of increasing complexity using the CAVIA and PEARL meta-RL algorithms. The results indicate that the neuromodulated networks show an increasing advantage as the problem complexity increases, while they perform comparably on simpler problems. The increased diversity of the representations from the neuromodulated policy network is examined and discussed. The open source implementation of the code can be found at: https://github.com/dlpbc/nm-metarl

Related Work
Meta-reinforcement learning. This work builds on the existing meta-learning frameworks (Bengio et al., 1992; Schmidhuber et al., 1996; Thrun and Pratt, 1998; Schweighofer and Doya, 2003) in the domain of reinforcement learning. Recent studies in meta-reinforcement learning (meta-RL) can be largely classified into optimization and context-based methods. Optimization methods (Finn et al., 2017; Li et al., 2017; Stadie et al., 2018; Rothfuss et al., 2019) seek to learn good initial parameters of a model that can be adapted with a few gradient steps to a specific task. In contrast, context-based methods seek to adapt a model to a specific task based on few-shot experiences aggregated into context variables. The context can be derived via probabilistic methods (Rakelly et al., 2019; Liu et al., 2021), recurrent memory (Duan et al., 2016b; Wang et al., 2016), recursive networks (Mishra et al., 2018) or a combination of probabilistic and memory methods (Zintgraf et al., 2020; Humplik et al., 2019). Hybrid methods (Zintgraf et al., 2019; Gupta et al., 2018) combine optimization and context-based methods, whereby task-specific context parameters are obtained via gradient updates.
Neuromodulation. Neuromodulation in biological brains is a process whereby a neuron alters or regulates the properties of other neurons in the brain (Marder, 2012). The altered properties can be either the cellular activities or the synaptic weights of the neurons. Well known biological neuromodulators include dopamine (DA), serotonin (5-HT), acetylcholine (ACh), and noradrenaline (NA) (Bear et al., 2020; Avery and Krichmar, 2017). Such neuromodulators were described in Doya (2002) within the reinforcement learning computation framework, with dopamine loosely mapped to the reward signal error (like TD error), serotonin representing the discount factor, acetylcholine representing the learning rate and noradrenaline representing randomness in a policy's action distribution. Several studies have drawn inspiration from neuromodulation and applied it to gradient-based RL (Xing et al., 2020; Miconi et al., 2020) and neuroevolutionary RL (Soltoggio et al., 2007, 2008; Velez and Clune, 2017) for dynamic task settings. In broader machine learning, neuromodulation has been applied to goal-driven perception, and also in the continual learning setting (Beaulieu et al., 2020), where it was combined with meta-learning to sequentially learn a number of classification tasks without catastrophic forgetting. The neuromodulators used in these studies have different designs or functions: plasticity gating (Soltoggio et al., 2008; Miconi et al., 2020), activation gating (Beaulieu et al., 2020), and direct action modification in a policy (Xing et al., 2020).

Problem Formulation
In a meta-RL setting, tasks are sampled from a task distribution $p(\mathcal{T})$. Each task $\mathcal{T}_i$ is a Markov Decision Process (MDP), which is a tuple $M_i = \{S, A, q, r, q_0\}$ consisting of a state space $S$, an action space $A$, a state transition distribution $q(s_{t+1} \mid s_t, a_t)$, a reward function $r(s_t, a_t, s_{t+1})$, and an initial state distribution $q_0(s_0)$. When presented with a task $\mathcal{T}_i$, an agent (with a policy $\pi$) is required to quickly adapt to the task from few interactions. Therefore, the goal of the agent for each task is to maximize the expected reward in the shortest time possible:
$$J(\pi) = \mathbb{E}_{q_0, q, \pi}\left[\sum_{t=0}^{H-1} \gamma^{t}\, r(s_t, a_t, s_{t+1})\right],$$
where $H$ is a finite horizon and $\gamma \in [0, 1]$ is the discount factor.
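As a small illustration of the objective, the expected discounted return can be estimated by averaging the discounted sum of rewards over sampled trajectories. The sketch below is not taken from the released code; the function names and the reward sequences are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one finite-horizon trajectory."""
    total = 0.0
    # Accumulate backwards so that each step is a single multiply-add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def expected_return(trajectories, gamma):
    """Monte Carlo estimate of the objective: mean discounted return."""
    returns = [discounted_return(rewards, gamma) for rewards in trajectories]
    return sum(returns) / len(returns)


# Hypothetical reward sequences from two rollouts with horizon H = 3.
rollouts = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
estimate = expected_return(rollouts, gamma=0.5)
print(estimate)
```

With gamma set to 1 the same function gives the undiscounted return, i.e. the plain sum of rewards.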

Context Adaptation via Meta-Learning (CAVIA)
The CAVIA meta-learning framework (Zintgraf et al., 2019) is an extension of the model-agnostic meta-learning algorithm (MAML) (Finn et al., 2017) that is interpretable and less prone to meta-overfitting. The key idea in CAVIA is the introduction of context parameters in a policy network. Therefore, the policy π θ,φ contains the standard network parameters θ and the context parameters φ. During the adaptation phase for each task (the gradient updates in the inner loop), only the context parameters are updated, while the network parameters are updated during the outer loop gradient updates. There are different ways to provide the policy network with the context parameters. In Zintgraf et al. (2019), the parameters were concatenated to the input.
In the meta-RL framework, an agent is trained for a number of iterations. For each iteration, a batch of $N$ tasks $\mathbf{T}$ is sampled from the task distribution $p(\mathcal{T})$. For each task $i$, a batch of trajectories $\tau_i^{\text{train}}$ is obtained using the policy $\pi_{\theta,\phi}$ with the context parameters set to an initial condition $\phi_0$. The obtained trajectories for task $i$ are used to perform a one-step inner loop gradient update of the context parameters to new values $\phi_i$, shown in the equation below:
$$\phi_i = \phi_0 + \alpha \nabla_{\phi} J_{\mathcal{T}_i}\left(\tau_i^{\text{train}}, \pi_{\theta,\phi_0}\right),$$
where $J_{\mathcal{T}_i}(\tau_i, \pi_{\theta,\phi})$ is the objective function for task $i$ and $\alpha$ is the inner loop learning rate. After the one-step gradient update of the policy, another batch of trajectories $\tau_i^{\text{test}}$ is collected using the updated task-specific policy $\pi_{\theta,\phi_i}$. After completing the above procedure for all tasks in the sampled batch $\mathbf{T}$, a meta gradient step (also referred to as the outer loop update) is performed, updating $\theta$ to maximize the average performance of the policy across the task batch.
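The inner loop can be sketched as follows. This is a toy illustration under stated assumptions, not the CAVIA implementation: the objective is a hypothetical quadratic stand-in for the task objective, the gradient is taken by finite differences instead of backpropagation, and the learning rate `lr` is an assumed hyper-parameter. The point it shows is that only the context parameters change during adaptation, while the shared parameters `theta` are left untouched.

```python
def finite_difference_grad(f, phi, eps=1e-5):
    """Central finite-difference gradient of f at phi (stand-in for backprop)."""
    grad = []
    for k in range(len(phi)):
        up = list(phi)
        down = list(phi)
        up[k] += eps
        down[k] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def inner_loop_update(objective, theta, phi0, lr):
    """One gradient-ascent step on the context parameters only."""
    grad = finite_difference_grad(lambda phi: objective(theta, phi), phi0)
    return [p + lr * g for p, g in zip(phi0, grad)]


def make_task(goal):
    """Hypothetical task whose objective peaks when theta[0] * phi equals goal."""
    def objective(theta, phi):
        return -sum((theta[0] * p - g) ** 2 for p, g in zip(phi, goal))
    return objective


theta = [1.0]        # shared parameters, updated only in the outer loop
phi0 = [0.0, 0.0]    # context parameters, reset to phi0 for every task
task = make_task(goal=[1.0, -1.0])
phi_i = inner_loop_update(task, theta, phi0, lr=0.25)
print(phi_i, task(theta, phi0), task(theta, phi_i))
```

In a full training loop, the outer update would then adjust `theta` using the post-adaptation objective averaged over the task batch.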

Probabilistic Embeddings for Actor-Critic Meta-RL (PEARL)
PEARL (Rakelly et al., 2019) is an off-policy meta-RL algorithm that is based on the soft actor-critic architecture (Haarnoja et al., 2018). The algorithm derives the context of the task to which an agent is exposed through probabilistic sampling. Given a task, the agent maintains a prior belief of the task, and as the agent interacts with the environment, it updates the posterior distribution with the goal of identifying the specific task context. The context variables $z$ are concatenated to the input of the actor and critic neural components of the setup. To estimate this posterior $p(z \mid c)$, an additional neural component called an inference network $q_\phi(z \mid c)$ is trained using the trajectories $c$ collected for tasks sampled from the task distribution $p(\mathcal{T})$. In the objective functions for the actor, critic and inference neural components, $\bar{V}$ is a target network, $\bar{z}$ means that gradients are not being computed through it, $p(z)$ is a unit Gaussian prior over $Z$, $B$ is the replay buffer and $\beta$ is a weighting hyper-parameter.
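To make the role of the unit Gaussian prior and the weight β concrete, the sketch below computes the closed-form KL divergence between a diagonal Gaussian posterior and a unit Gaussian prior, and adds it to a task loss with weight β. The diagonal Gaussian form of the posterior, the additive combination, and all function names are assumptions made for illustration; this is not PEARL's implementation.

```python
import math


def kl_to_unit_gaussian(mean, var):
    """KL( N(mean, diag(var)) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * sum(v + m * m - 1.0 - math.log(v) for m, v in zip(mean, var))


def regularised_loss(task_loss, mean, var, beta):
    """Task loss plus the beta-weighted KL term on the context posterior."""
    return task_loss + beta * kl_to_unit_gaussian(mean, var)


# A posterior equal to the prior incurs no penalty.
no_penalty = kl_to_unit_gaussian([0.0, 0.0], [1.0, 1.0])
# Shifting the mean of one latent dimension by 1 costs 0.5 nats.
shifted = kl_to_unit_gaussian([1.0], [1.0])
# Hypothetical task loss of 2.0 with beta = 0.1.
loss = regularised_loss(2.0, [1.0], [1.0], beta=0.1)
print(no_penalty, shifted, loss)
```

A larger β pulls the inferred context towards the prior, while a smaller β lets the context carry more task-specific information.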

Neuromodulated Network
This section introduces the extension of the policy network with neuromodulation. A graphical representation of the network is shown in Figure 1a. The neuromodulated policy network is a stack of neuromodulated fully connected layers.

Computational Framework
A neuromodulated fully connected layer contains two neural components: standard neurons and neuromodulators (see Figure 1b). The standard neurons serve as the output of the layer (i.e., the layer's representations) and they are connected to the preceding layer via standard fully connected weights $W_s$. The neuromodulators serve as a means to alter the output of the standard neurons. They receive input via standard fully connected weights $W_g$ from the preceding layer in order to generate their neural activity, which is then projected to the standard neurons via another set of fully connected weights $W_m$. The function of the projected neuromodulatory activity defines the representation altering mechanism. For example, it could gate the plasticity of $W_s$, gate the neural activation $h$, or do something else based on the designer's specification. While different types of neuromodulators can be used (Doya, 2002), in this particular work we employ an activity-gating neuromodulator. Such a neuromodulator multiplies the activity of the target (standard) neurons before a non-linearity is applied to the layer. Formally, the structure can be described with three parameter matrices: $W_s$ defines weights connecting the input to the standard neurons, $W_g$ defines weights connecting the input to the neuromodulators and $W_m$ defines weights connecting the neuromodulators to the standard neurons. The step-wise computation of a forward pass through the neuromodulatory structure is given below:
$$h_s = W_s\, x \tag{8}$$
$$g = \mathrm{ReLU}(W_g\, x) \tag{9}$$
$$h_m = \tanh(W_m\, g) \tag{10}$$
$$h = \mathrm{ReLU}(h_s \odot h_m) \tag{11}$$
where $x$ is the layer's input, $h_s$ is the weighted sum of input of the standard neurons, $g$ is the activity of the neuromodulators derived from the weighted sum of input, $h_m$ is the neuromodulatory activity projected onto the standard neurons, and $h$ is the output of the layer. The key modulating process takes place in the element-wise multiplication of $h_s$ and $h_m$.
The tanh non-linearity is employed to enable positive and negative neuromodulatory signals, and thus gives the network the ability to affect both the magnitude and the sign of target activation values. When ReLU is used as the non-linearity for the layer's output h, h m has the intrinsic ability to dynamically turn on or off certain output in h.
A simpler version of the proposed model can be achieved by only considering the sign, and not the magnitude, of the neuromodulatory signal, using the following variation of Equation 10:
$$h_m = \mathrm{sign}(\tanh(W_m\, g)) \tag{12}$$
This variation is shown to be suited for discrete control problems.
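The forward pass described in this section can be sketched in plain Python as below. This is a minimal illustration rather than the released implementation: biases are omitted, ReLU is assumed both for the neuromodulator activity and for the layer output, and the weight values are hypothetical. The `sign_only` flag selects the simpler variant that keeps only the sign of the neuromodulatory signal.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    return [max(0.0, a) for a in v]


def sign(a):
    return float((a > 0) - (a < 0))


def neuromodulated_layer(x, W_s, W_g, W_m, sign_only=False):
    h_s = matvec(W_s, x)                          # weighted input of the standard neurons
    g = relu(matvec(W_g, x))                      # neuromodulator activity (ReLU assumed)
    h_m = [math.tanh(a) for a in matvec(W_m, g)]  # signal projected onto standard neurons
    if sign_only:
        h_m = [sign(a) for a in h_m]              # keep the sign, drop the magnitude
    # Element-wise gating followed by the output non-linearity (ReLU assumed).
    return relu([s * m for s, m in zip(h_s, h_m)])


# Hypothetical layer: 2 inputs, 2 standard neurons, 1 neuromodulator.
x = [1.0, 2.0]
W_s = [[1.0, 0.0], [0.0, 1.0]]
W_g = [[1.0, 1.0]]
W_m = [[1.0], [-1.0]]

gated = neuromodulated_layer(x, W_s, W_g, W_m)
gated_sign = neuromodulated_layer(x, W_s, W_g, W_m, sign_only=True)
print(gated, gated_sign)
```

In this example the second standard neuron receives a negative modulatory signal, so its positive pre-activation is flipped and then switched off by the ReLU, which is the on/off behaviour described above.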

Results and Analysis
In this section, the results of the neuromodulated policy network evaluations across high-dimensional discrete and continuous control environments with varying levels of complexity are presented. The continuous control environments are the simple 2D navigation environment, the MuJoCo-based (Todorov et al., 2012) half-cheetah direction and velocity environments (Finn et al., 2017), and the Meta-World ML1 and ML45 environments (Yu et al., 2020). The discrete action environment is a graph navigation environment that supports configurable levels of complexity, called the CT-graph (Ladosz et al., 2021; Ben-Iwhiwhu et al., 2020).
The experimental setup focused on investigating the beneficial effect of the proposed neuromodulatory mechanism when augmenting existing meta-RL frameworks (i.e., neuromodulation as a complementary tool to meta-RL rather than a competing one). To this end, using the CAVIA meta-RL method (Zintgraf et al., 2019), a standard policy network (SPN) is compared against the neuromodulated policy network (NPN) across the aforementioned environments. Similarly, SPN is compared against NPN using the PEARL method (Rakelly et al., 2019), but only in the continuous control environments, because the soft actor-critic architecture employed by PEARL is designed for continuous control. We present the analysis of the learned dynamic representations from a standard and a neuromodulated network in Section 5.2. Finally, the policy networks were evaluated in an RGB autonomous vehicle navigation domain in the CARLA driving simulator using CAVIA; the results and discussions are presented in Appendix D.

Performance
The experimental setups for CAVIA and PEARL followed those of Zintgraf et al. (2019) and Rakelly et al. (2019), respectively. For PEARL, neuromodulation was applied only to the actor neural component. The details of the experimental setup and hyper-parameters are presented in Appendix A. The reported performance is the meta-testing result of the agents in the evaluation environments after meta-training has been completed (Figures 2, 3, 4 and 5). During meta-testing in CAVIA, the policy networks were fine-tuned for 4 inner loop gradient steps. Lastly, depending on the evaluation environment, the metric used to judge evaluation performance was either return¹ or success rate².

2D Navigation Environment
The first simulations are in the 2D point navigation experiment introduced in Finn et al. (2017). An agent is tasked with navigating to a randomly sampled goal position from a start position. A goal position is sampled [...] Figure 3a for PEARL. The results show that both policy networks performed relatively well. Such performance is expected from both policies, as the environment is simple and the dynamic representations required for each task are not very distinct.
¹ Return is a standard metric in RL that is computed as the cumulative sum of the rewards acquired by the agent.
² Success rate is a metric introduced in Meta-World, having a value of 1 if the agent has solved or is close to solving the task (i.e., if the distance between the current position of the task-relevant object and the goal position is smaller than some value); otherwise, it is set to 0.
Figure 3: Adaptation performance across tasks of the standard policy network (SPN) and the neuromodulated policy network (NPN) in continuous control environments using the PEARL meta-RL framework. Across three seed runs, the performance was measured based on average return from the rewards acquired during evaluation.

Half-Cheetah
The half-cheetah is an environment based on the MuJoCo simulator (Todorov et al., 2012) that requires an agent to learn continuous control locomotion. We employ two standard meta-RL benchmarks using the environment, as proposed in Finn et al. (2017): (i) the direction task, which requires the cheetah agent to run either forward or backward, and (ii) the velocity task, which requires the agent to run at a certain velocity sampled from a distribution of velocities. Although challenging (due to their high-dimensional nature) in comparison to the 2D navigation task, these benchmarks are still simplistic, as the direction benchmark contains only two unique tasks and the velocity benchmark samples from a small range of velocities ([0, 2.0) or [0, 3.0)). Therefore, the optimal policies across tasks in these benchmarks possess similar representations. The results of the experiments for both benchmarks are presented in Figures 2c and 2b for CAVIA, and Figures 3c and 3b for PEARL. Unsurprisingly, the results show a comparable level of performance between the standard policy network and the neuromodulated policy network across CAVIA and PEARL. These benchmarks are of medium complexity and the optimal policy for each task is similar to the others.

Meta-World
The neuromodulated policy network was evaluated in a complex high-dimensional continuous control environment called Meta-World (Yu et al., 2020). In Meta-World, an agent is required to manipulate a robotic arm to solve a wide range of tasks (e.g., pushing an object, picking and placing objects, opening a door and more). Two instances of the benchmark, ML1 and ML45, were employed. In the ML1 instance, the robot is required to solve a single task that contains several parametric variations (e.g., pushing an object to different goal locations). The parametric variations of the selected task are used as the meta-train and meta-test tasks. ML45 is a more complex instance that contains a wide variety of tasks (each task with parametric variations). It consists of 45 distinct meta-train tasks and 5 distinct meta-test tasks. The standard policy network and neuromodulated policy network were evaluated in the ML1 and ML45 instances using CAVIA and PEARL. The results³ are presented in Figures 2d and 2e for CAVIA, and Figures 3d and 3e for PEARL. In these complex benchmarks, the results show that the neuromodulated policy network outperforms the standard policy network in both CAVIA and PEARL, highlighting the advantage neuromodulation offers in complex problem settings. In addition to judging the performance based on reward, results are also presented using the success rate metric (introduced in Yu et al. (2020) as a metric to judge whether or not an agent is able to solve a task) in Figure 4. The results again show that the neuromodulated policy network achieved a significantly higher average success rate in both CAVIA and PEARL in comparison to the standard policy network.

Configurable Tree graph (CT-graph) Environment
The CT-graph is a sparse-reward discrete control graph environment whose complexity is specified via parameters such as branch b and depth d. An environment instance consists of a set of states including a start state and a number of end states. An agent is tasked with navigating to a randomly sampled end state from the start state. See Appendix B for more details about the CT-graph. The three CT-graph instances used in this work were set up with a varying depth parameter: with increasing depth, the sequence of actions grows linearly, but the search space for the policy network grows exponentially. The simplest instance has d set to 2 (CT-graph depth2), the next has d set to 3 (CT-graph depth3), and the most complex instance has d set to 4 (CT-graph depth4). The meta-testing results are presented in Figure 5. The results show a significant difference in performance between the standard and neuromodulated policy networks. The better adaptation performance of the neuromodulated policy network stems from the rich dynamic representations needed for adaptation, as discussed in Section 5.2.
Figure 5: Adaptation performance across tasks of the standard policy network (SPN) and the neuromodulated policy network (NPN) in three discrete control environments using the CAVIA meta-RL framework. Across three seed runs, the performance was measured based on the success rate metric from the evaluation.
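The linear versus exponential growth can be made concrete with a short calculation. The sketch assumes, purely for illustration, that the graph behaves like a full tree in which each of the d decision points offers b branches, so that reaching an end state takes d decisions while the number of candidate end states is b to the power d. The branch value of 2 is a hypothetical choice, and the function name is not from the CT-graph code.

```python
def ct_graph_sizes(branch, depth):
    """Decisions per episode and number of end states in a full b-ary tree."""
    decisions = depth              # grows linearly with depth
    end_states = branch ** depth   # grows exponentially with depth
    return decisions, end_states


for d in (2, 3, 4):
    decisions, end_states = ct_graph_sizes(branch=2, depth=d)
    print(f"depth={d}: {decisions} decisions, {end_states} end states")
```

Under this assumption, each extra level adds one decision per episode but multiplies the number of possible goals by b.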

Analysis
In this section, we analyse the learnt representations of the standard and neuromodulated policy networks for tasks in the 2D Navigation and CT-graph environments. The policy networks trained using CAVIA were chosen for the analysis because the single neural component in CAVIA (i.e., the policy network) is easier to analyse than PEARL, which contains multiple neural components. Furthermore, PEARL experiments were conducted only in continuous control environments (similar to the original paper), whereas CAVIA experiments covered both discrete and continuous control environments. Hence, analysis in CAVIA allowed for more coverage across benchmarks.
To measure representation similarity across tasks, we use the centered kernel alignment (CKA) (Kornblith et al., 2019) similarity index, comparing per-layer representations of both standard and neuromodulated policy networks across different tasks. There exist several similarity index measures, such as canonical correlation analysis (CCA) (Morcos et al., 2018), representation similarity analysis (RSA) (Kriegeskorte et al., 2008), the Hilbert-Schmidt Independence Criterion (HSIC) (Gretton et al., 2005) and more. The principle behind CKA is the generation of a similarity measure between two representations by comparing the similarity structure of both representations. Each similarity structure is produced from the measure of similarity between pairwise examples or data points in a representation. Furthermore, CKA is a generalised extension of HSIC, with the inclusion of normalisation, which introduces the property of isotropic scaling invariance. Describing the formulation of CKA is outside the scope of this work and we refer readers to the original paper for detailed theoretical formulations.
[Figure: task-by-task representation similarity heatmaps (Task1 to Task6 on both axes; colour scale 0.0 to 1.0) with panels "hidden 1, before update" and "hidden 1, grad 1" to "hidden 1, grad 4". (a) standard policy network, first hidden layer.]
While the RSA similarity index measure employed in Goerttler and Obermayer (2021) is a valid alternative, we chose CKA because it is robust to random initialization and enables comparison both within layers of the same network and across networks. Furthermore, CKA has been employed previously in the meta-RL setting, e.g., in Raghu et al. (2020).
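For orientation only, a linear-kernel form of CKA can be sketched in a few lines: build the Gram matrix of each representation, centre it, and normalise the HSIC between the two by the HSIC of each with itself. The sketch follows the general description given above and in Kornblith et al. (2019); it is not the analysis code used for the figures, and the activation values are hypothetical.

```python
import math


def gram_linear(X):
    """Linear-kernel Gram matrix K[i][j] = <x_i, x_j> over the rows of X."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]


def center(K):
    """Double-centre a Gram matrix (subtract row and column means, add grand mean)."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + total for j in range(n)] for i in range(n)]


def hsic(K, L):
    """Unnormalised HSIC: element-wise inner product of two centred Gram matrices."""
    return sum(k * l for rk, rl in zip(K, L) for k, l in zip(rk, rl))


def linear_cka(X, Y):
    """CKA between representations X and Y (rows are the same examples)."""
    K, L = center(gram_linear(X)), center(gram_linear(Y))
    return hsic(K, L) / math.sqrt(hsic(K, K) * hsic(L, L))


# Hypothetical activations: 4 examples, 2 units each.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
scaled = [[3.0 * a for a in row] for row in X]   # isotropic scaling of X
other = [[0.0], [1.0], [0.0], [3.0]]             # an unrelated representation

same = linear_cka(X, scaled)
different = linear_cka(X, other)
print(same, different)
```

The first comparison illustrates the isotropic scaling invariance mentioned above: rescaling a representation leaves its CKA with the original at 1. In the per-task analysis, `X` and `Y` would hold the activations of the same layer under two different tasks.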
[Figure: task-by-task representation similarity heatmaps (Task1 to Task4 on both axes) with panels "hidden 1, before update" and "hidden 1, grad 1" to "hidden 1, grad 3".]
[...] the figure. After gradient updates, some dissimilarities between tasks begin to emerge. Additional analysis plots are presented in Appendix F.
2D Navigation. For the simple 2D Navigation environment, the plots for the first hidden layer of the standard policy network shown in Figure 6a depict good dissimilarity between tasks, thus highlighting the fact that the learnt representations are sufficient to produce distinct task behaviours. The same is true for the first hidden layer of the neuromodulated policy network (see Figure 6b). This further explains why both policies obtained roughly comparable performance in this environment. The simplicity of the problem enables task-distinct representations to be obtained easily. Appendix F.1 contains the plots of the representation similarity for the second hidden layer of both policy networks.
CT-graph. In Figures 7a and 7b, we compare the representation similarity of the first hidden layer of the standard and neuromodulated policy networks in the CT-graph depth2 environment. We see that the representations of the neuromodulated policy are more dissimilar between the tasks than those of the standard policy. Due to the complexity of the environment, the task-specific representations required to solve each task are distinct from one another. Therefore, adaptation by fine-tuning the representations of a base network via a few gradient steps of parameter updates would require a significant jump in the solution space. A standard policy network struggles to enable such a jump in the solution space. However, by incorporating neuromodulators that dynamically alter the representations, such a jump becomes possible. Appendix F.1 contains the plots of the representation similarity for the second hidden layer of both policy networks.

Analysis: Representation similarities of the neuromodulatory units across tasks
Now we ask where the representational diversity (dissimilarity in representations across tasks) comes from. Is the neuromodulatory layer effectively contributing to rich representations, as Figure 7b appears to suggest? The analysis we present here shows task representation similarities measured more specifically across the neuromodulatory layers of the proposed architecture. From Figure 7b, it appears that such dissimilarity is enhanced by the neuromodulatory activities in the NPN. Again, the centered kernel alignment (CKA) was employed and we compare the neuromodulatory activities per layer across different tasks. Figures 8, 9 and 10 present the heat map plots for the 2D navigation, CT-graph depth2 and ML45 environments (additional plots for other environments are presented in Appendix F.4). The non-uniformity in the heat map plots, in contrast to those of Figure 7a, indicates that those layers encode diverse or dissimilar representations across tasks. We can therefore conclude that the neuromodulatory activities, when projected onto a layer's standard neurons, produce the desired dissimilar representations across tasks.
[Figure: task-by-task similarity heatmaps of neuromodulatory activity (Task1 to Task6 on both axes) with panels "hidden 1 nm, before update" and "hidden 1 nm, grad 1" to "hidden 1 nm, grad 3".]

Control Experiments: larger SPN, equaling the number of parameters of the NPN
Since the inclusion of neuromodulators increases the number of parameters in a neuromodulated policy network, a set of control experiments was conducted in which the number of parameters in a standard policy network was configured to approximately match that of a neuromodulated policy network. This was achieved by increasing the size of each hidden layer of the standard policy network (called SPN larger width) in one experiment, and by increasing the depth, i.e., adding one hidden layer (SPN larger depth), in another experiment. Using CAVIA, experiments were conducted in the CT-graph depth4 and the ML45 Meta-World environments, comparing the standard policy network (i.e., the original size), its larger variants and a neuromodulated policy network. The results are presented in Figure 11. We observe from the results that increasing the size of the standard policy network does not match the performance of the neuromodulated policy network.
[Figure: task-by-task similarity heatmaps of neuromodulatory activity (Task1 to Task6 on both axes) with panels "hidden 1 nm, before update" and "hidden 1 nm, grad 1" to "hidden 1 nm, grad 3".]
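The parameter matching in the control experiments can be illustrated with a simple count. The sketch assumes a hypothetical architecture with two hidden layers and a standard output layer, ignores biases, and uses made-up layer sizes; it only shows how a wider standard network could be sized to reach the parameter count of a neuromodulated one, and does not reproduce the sizes used in the experiments.

```python
def spn_params(n_in, width, n_out):
    """Weights of a standard network: two hidden layers plus an output layer."""
    return n_in * width + width * width + width * n_out


def npn_params(n_in, width, n_out, n_mod):
    """Weights of a network whose two hidden layers are neuromodulated.

    Each neuromodulated layer holds W_s (width x fan_in), W_g (n_mod x fan_in)
    and W_m (width x n_mod); the output layer is assumed to be standard.
    """
    hidden1 = width * n_in + n_mod * n_in + width * n_mod
    hidden2 = width * width + n_mod * width + width * n_mod
    return hidden1 + hidden2 + width * n_out


def matching_width(n_in, width, n_out, n_mod):
    """Smallest standard hidden width with at least as many weights as the NPN."""
    target = npn_params(n_in, width, n_out, n_mod)
    w = width
    while spn_params(n_in, w, n_out) < target:
        w += 1
    return w


# Hypothetical sizes: 10 inputs, 100 hidden units, 4 outputs, 32 neuromodulators.
print(spn_params(10, 100, 4))          # standard network, original size
print(npn_params(10, 100, 4, 32))      # neuromodulated network
print(matching_width(10, 100, 4, 32))  # width of a parameter-matched SPN
```

The same counting could be adapted to the larger-depth control by adding a further `width * width` term instead of increasing `width`.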

Discussions
Neuromodulation and gated recurrent networks: The neuromodulatory gating mechanism introduced in this work is reminiscent of the gating in recurrent/memory networks (LSTMs (Hochreiter and Schmidhuber, 1997)   and GRUs (Cho et al., 2014)). In this respect (with the observation of improved performance as a consequence of neuromodulatory gating in this work), the noteworthy performance demonstrated by meta-RL memory approaches (Duan et al., 2016b;Wang et al., 2016) could also be a consequence of such gating mechanisms 4 . Nonetheless, the present study aims to highlight the advantage of a simpler form of gating (i.e., neuromodulatory gating) on a MLP feedforward network, and thus could help to pinpoint the advantage of such dynamics in isolation. Furthermore, the advantage of our approach over gated recurrent variants is somewhat similar to the advantages derived from decoupling attention mechanism from recurrent models (where it was originally introduced) and applying it to MLP networks (i.e., Transformer models) (Vaswani et al., 2017). By decoupling neuromodulatory (gating) mechanism from recurrent models and applying it to MLP models (as in our work), the advantages of faster training and better parallelization were achieved while maintaining the benefit of neuromodulatory gating. Therefore, our proposed approach is faster to train and more parallelizable in comparison to memory variants, while maintaining the advantages that neuromodulatory gating offers. Memory based approaches will still be required for problems where memory is advantageous such as sequential data processing and POMDPs.
Task similarity measure and robust benchmarks: Increasing task complexity was presented in this work by moving from the simple 2D point navigation environment to half-cheetah locomotion and then to the complex robotic arm setup of the Meta-World environment. Furthermore, by exploiting the configurable parameters of the CT-graph environment, we were able to control the complexity of the environment. Overall, task complexity was viewed through the perspective of task similarity (i.e., environments with dissimilar tasks were viewed as more complex and vice versa). Despite these efforts, a precise measure of task complexity and similarity was not clearly outlined in this work, as is widely the case in the meta-RL literature. There is a need for precise metrics for measuring task similarity and complexity in the field. The CT-graph, with its configurable parameters, allows tasks to be mathematically defined, which is a first step towards alleviating this issue. However, a separate research investigation would be necessary to develop explicit metrics that can be incorporated into meta-RL benchmarks.
We hypothesize that such a task similarity metric should be able to capture the precise change points in a task relative to other tasks. For example, a useful metric could be one that captures task change as a function of a change in the reward function, the state space, the transition function, or a combination of these factors. Most benchmarks in meta-RL have focused on task change as a change in the reward function. However, a more robust benchmark could include all of the aforementioned change points in order to further control the complexity. The CT-graph, Meta-World, and the recently developed Alchemy (Wang et al., 2021) environment are examples of benchmarks with early-stage work in this direction, albeit implicitly. Therefore, the development of a precise measure of task similarity and complexity, as well as robust benchmarks with configurable change points (i.e., reward, state/input, and transition), would be highly beneficial to the meta-RL field.

Conclusion and Future Work
This paper introduced an architectural extension of standard meta-RL policy networks to include a neuromodulatory mechanism, investigating the beneficial effect of neuromodulation when augmenting existing meta-RL frameworks (i.e., neuromodulation as a complementary tool to meta-RL rather than a competing one). The aim is to implement richer dynamic representations and facilitate rapid task adaptation in increasingly complex problems. The effectiveness of the proposed approach was evaluated in the meta-RL setting using the CAVIA and PEARL algorithms. Across environments of increasing complexity, the neuromodulated policy network significantly outperformed the standard policy network in complex problems while showing comparable performance in simpler problems. The results highlight the usefulness of neuromodulators in enabling fast adaptation via rich dynamic representations in meta-RL problems. The architectural extension, although simple, presents a general framework for extending meta-RL policy networks with neuromodulators that expand their ability to encode different policies. The projected neuromodulatory activity can be designed to perform functions other than the one introduced in this work, e.g., gating the plasticity of weights, or including different neuromodulators in the same layer. The neuromodulatory extension could also be tested with a recurrent meta-RL policy, with the goal of enhancing the memory dynamics of the policy. Our analysis indicates that this framework is best suited to problems that require rapid change in optimal representations across tasks, while its advantage is reduced when tasks can be solved using similar representations.

Appendix A. Experimental Configurations
All experiments were conducted using machines containing Tesla K80 and GeForce RTX 2080 GPUs. Also note that across all experiments, the output layer in the neuromodulated policy network (in CAVIA and in PEARL) employed a regular fully connected linear layer while the preceding layers were neuromodulated fully connected layers.

Appendix A.1. CAVIA
Following the experimental setup of the original CAVIA paper (Zintgraf et al., 2019), the context variables were concatenated to the input of the policy network and were reset to zero at the beginning of each task across all experiments. Also, during each training iteration, the policy was adapted using one gradient update in the inner loop as employed in Zintgraf et al. (2019); Finn et al. (2017). After training, the iteration with the best policy performance or the final policy at the end of training was used to conduct meta-testing evaluations to produce the final result. During meta-testing, the policy was evaluated using a number of tasks sampled from the task distribution and it was adapted (fine-tuned) for each task using 4 inner loop gradient updates. All policy networks employed ReLU non-linearity across all experiments.
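The adaptation procedure described above (context variables concatenated to the input, reset to zero for each task, then updated by a few inner loop gradient steps while the shared weights stay fixed) can be sketched on a toy problem. The sketch below is a simplification under stated assumptions: it uses a linear model and a squared-error loss as a stand-in for the policy network and the policy gradient objective, and the weights, data, and learning rate are hypothetical values chosen for illustration.

```python
def predict(weights, x, context):
    """Linear model over the concatenation [x, context]."""
    inputs = list(x) + list(context)
    return sum(w * v for w, v in zip(weights, inputs))


def mse(weights, data, context):
    """Mean squared error over (input, target) pairs."""
    return sum((predict(weights, x, context) - y) ** 2 for x, y in data) / len(data)


def context_gradient(weights, data, context):
    """Gradient of the mean squared error with respect to the context only."""
    n_x = len(weights) - len(context)
    grad = [0.0] * len(context)
    for x, y in data:
        err = predict(weights, x, context) - y
        for i in range(len(context)):
            grad[i] += 2.0 * err * weights[n_x + i] / len(data)
    return grad


def adapt(weights, data, n_context, n_steps, lr):
    """Reset the context to zero, then take n_steps gradient steps on it."""
    context = [0.0] * n_context  # reset at the beginning of each task
    for _ in range(n_steps):
        grad = context_gradient(weights, data, context)
        context = [c - lr * g for c, g in zip(context, grad)]
    return context  # the shared weights are left untouched


# Hypothetical shared weights: two input weights followed by two context weights.
weights = [1.0, -1.0, 0.5, 0.5]
# Hypothetical task: y = x1 - x2 + 1, so only the offset must be adapted.
task_data = [((1.0, 0.0), 2.0), ((0.0, 1.0), 0.0), ((1.0, 1.0), 1.0)]

before = mse(weights, task_data, [0.0, 0.0])
context = adapt(weights, task_data, n_context=2, n_steps=4, lr=0.5)
after = mse(weights, task_data, context)
print("loss before:", before, "loss after 4 updates:", after)
```

Here `n_steps=4` mirrors the four inner loop updates used during meta-testing; during meta-training the text above uses a single update.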
The CAVIA experimental configurations across all environments are presented in Table A.1, with 2D Nav denoting the 2D navigation benchmark, Ch Dir and Ch Vel denoting the Half-Cheetah direction and Half-Cheetah velocity benchmarks, ML1 and ML45 denoting the meta-world ML1 and ML45 benchmarks, and CT d2, d3, d4 denoting the CT-graph depth2, depth3 and depth4 benchmarks respectively.
Across all experiments in CAVIA, Trust Region Policy Optimization (TRPO) (Schulman et al., 2015) was employed as the outer loop update algorithm. Vanilla policy gradient (Williams, 1992) with generalized advantage estimation (GAE) (Schulman et al., 2016) was employed as the inner loop update algorithm, with a learning rate of 0.5 for the 2D navigation and CT-graph experiments, and 10.0 for the half-cheetah and meta-world experiments. Both the inner and outer loop training employed the linear feature baseline introduced in Duan et al. (2016a). The hyperparameters for TRPO are presented in the table below. Finite differences were used to compute the Hessian-vector product for TRPO in order to avoid computing third-order derivatives, as highlighted in Finn et al. (2017). During sampling of data for each task, multiprocessing was employed using 4 workers.
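For reference, the GAE estimator used in the inner loop can be written as a short backward recursion over one trajectory. The sketch below follows the standard definition from Schulman et al. (2016); the discount factor, the GAE parameter, and the example rewards and value estimates are hypothetical, and in the experiments the value estimates would come from the linear feature baseline rather than being supplied by hand.

```python
def gae_advantages(rewards, values, gamma, lam, last_value=0.0):
    """Generalized advantage estimation for a single trajectory.

    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    A_t     = delta_t + gamma * lam * A_{t+1}

    `values[t]` is the baseline estimate V(s_t); `last_value` is the estimate
    for the state after the final step (0.0 for a terminated episode).
    """
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# Hypothetical three-step episode with a sparse terminal reward.
rewards = [0.0, 0.0, 1.0]
values = [0.2, 0.5, 0.8]
print(gae_advantages(rewards, values, gamma=0.99, lam=0.95))
```

With `lam=0` the estimator reduces to the one-step temporal-difference error, and with `lam=1` and a zero baseline it reduces to the discounted return.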

Name                                          Value
maximum KL-divergence                         1 × 10^-2
number of conjugate gradient iterations       10
conjugate gradient damping                    1 × 10^-5
maximum number of line search iterations      15
backtrack ratio for line search iterations    0.8

Appendix A.2. PEARL
Similar to CAVIA, the original experimental configurations in PEARL were followed for the half-cheetah benchmarks. Also, most of the configurations of PEARL in the original meta-world experiments were followed.
Across all PEARL experiments in this work, the learning rate across all neural components (policy, Q, value and context networks) was set to 3e-4, with the KL penalty (KL lambda) set to 0.1. Furthermore, for experiments that involved the use of neuromodulation, the neuromodulator was employed only in the policy (actor) neural component.

[...] goal state (one of the end states designated as the goal). A wait state is found between decision states (the tree graph splits at decision states). A wait state requires the agent to take the wait (forward) action, whereas a decision state requires the agent to take one of the decision (turn) actions. Any decision action at a wait state, or a wait action at a decision state, leads to a crash, where the agent is punished with a negative reward of -0.01 and returns to the start. When the agent navigates to the correct end state (the goal location), it receives a positive reward of 1.0. Otherwise, the agent receives a reward of 0.0 at every time step. An episode terminates either at a crash state or when the agent navigates to any end state. Each observation is a 1D vector (with full observability of each state) whose length depends on the environment instance configuration. The environment's complexity is defined via a number of configuration parameters that specify the graph size (using the branch b and depth d parameters), sequence length, reward function, and level of state observability. The three CT-graph instances used in this work were set up with varying depth: the simplest instance has d set to 2 (CT-graph depth2), the next has d set to 3 (CT-graph depth3), and the most complex has d set to 4 (CT-graph depth4). Figure B.12 depicts a graphical view of CT-graph depth2 and depth3.
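The reward and termination rules described above can be summarized in a small sketch. This is a simplified, hypothetical stand-in and not the CT-graph implementation: it assumes that every episode starts in a wait state, that each decision state is preceded by exactly one wait state, and that the final decision leads directly to an end state. The class name, the action encoding, and the representation of the goal as a list of decisions are assumptions made for illustration.

```python
CRASH_REWARD = -0.01
GOAL_REWARD = 1.0
STEP_REWARD = 0.0


class TinyCTGraph:
    """Simplified CT-graph: `depth` decision states, each preceded by a wait state."""

    def __init__(self, depth, branch, goal_path):
        if len(goal_path) != depth or not all(1 <= a <= branch for a in goal_path):
            raise ValueError("goal_path must hold one valid decision per level")
        self.depth, self.branch, self.goal_path = depth, branch, list(goal_path)
        self.reset()

    def reset(self):
        self.path = []       # decisions taken so far
        self.at_wait = True  # assumption: the episode starts in a wait state
        self.done = False

    def step(self, action):
        """Action 0 is wait (forward); actions 1..branch are decisions (turns).

        Returns (reward, done).
        """
        if self.done:
            raise RuntimeError("episode finished; call reset()")
        if not 0 <= action <= self.branch:
            raise ValueError("unknown action")
        if self.at_wait != (action == 0):
            # Decision action at a wait state, or wait action at a decision state.
            self.done = True
            return CRASH_REWARD, True
        if self.at_wait:
            self.at_wait = False
            return STEP_REWARD, False
        self.path.append(action)
        if len(self.path) == self.depth:
            # An end state was reached; only the goal end state is rewarded.
            self.done = True
            reward = GOAL_REWARD if self.path == self.goal_path else STEP_REWARD
            return reward, True
        self.at_wait = True
        return STEP_REWARD, False


env = TinyCTGraph(depth=2, branch=2, goal_path=[1, 2])
print([env.step(a) for a in (0, 1, 0, 2)])
```

A task in this setting corresponds to a choice of `goal_path`, so changing the goal changes only the reward function while the graph stays fixed.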

Appendix C. Implementation
A code snippet demonstrating the extension of the fully connected layer with neuromodulation is presented below using PyTorch code style.
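As a dependency-free illustration of the idea (plain Python lists rather than PyTorch tensors), the sketch below shows one plausible forward pass of a neuromodulated fully connected layer. It assumes that a small set of neuromodulatory units reads the layer input, that their activity is projected to the layer width and squashed with tanh, and that the result multiplicatively gates the standard units. The class name, the activation choices, and the initialization are assumptions for illustration and may differ from the authors' implementation.

```python
import math
import random


def linear(weights, bias, x):
    """Dense map: `weights` is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(values):
    return [max(0.0, v) for v in values]


class NeuromodulatedLayer:
    """Fully connected layer whose standard units are gated by neuromodulators."""

    def __init__(self, n_in, n_out, n_nm, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.w_std, self.b_std = init(n_out, n_in), [0.0] * n_out    # standard units
        self.w_nm, self.b_nm = init(n_nm, n_in), [0.0] * n_nm        # neuromodulators
        self.w_proj, self.b_proj = init(n_out, n_nm), [0.0] * n_out  # projection to layer width

    def forward(self, x):
        standard = relu(linear(self.w_std, self.b_std, x))
        neuromod = relu(linear(self.w_nm, self.b_nm, x))
        # Assumed gating signal in (-1, 1), one value per standard unit.
        gate = [math.tanh(v) for v in linear(self.w_proj, self.b_proj, neuromod)]
        return [s * g for s, g in zip(standard, gate)]


layer = NeuromodulatedLayer(n_in=3, n_out=4, n_nm=2, seed=1)
print(layer.forward([1.0, -2.0, 0.5]))
```

In a full policy network, such layers would be stacked for the hidden layers and followed by a regular linear output layer, as noted in Appendix A.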

Appendix D.1. CARLA Environment
Additional experiments were conducted in an autonomous driving environment called CARLA (Dosovitskiy et al., 2017) to provide preliminary evidence on whether the method scales to complex RGB input distributions such as those in autonomous driving. Given the limited nature of these experiments and of their analysis, they are not included in the main paper, but they provide additional validation of the robustness of the proposed approach. CARLA (see Figure D.13a) is an open source experimentation platform for autonomous driving research. It contains a host of configuration parameters that are used to specify an environment instance (for example, weather). MACAD (Palanisamy, 2020), a wrapper on top of CARLA with an OpenAI gym interface, was employed to run the experiments. In this work, the environment was configured to use RGB observations (images of size 64x64x3), 9 discrete actions (coast, turn left, turn right, forward, brake, forward left, forward right, brake left, and brake right), and clear (sunny) noon weather.
The agent (vehicle) is presented with the goal of navigating from a start position to an end position. The start and end points are randomly set from a pre-defined list of coordinates. We set up two distinct tasks in the environment (drive aggressively and drive passively), defined by their reward functions and sampled from a uniform task distribution. Although the tasks are quite similar, they are challenging due to the domain of the problem (learning to drive) and the RGB pixel observations from the environment. Therefore, it is a suitable environment to further scale up meta-RL algorithms.
Each experiment processes the environment's observations through a variational autoencoder (VAE) (Kingma and Welling, 2013; Rezende et al., 2014) that was pre-trained using samples collected by taking random actions in the environment. Using CAVIA, the latent features from the VAE were concatenated with the context parameters and then passed as input to the policy network. Only the policy network was updated during meta-training and meta-testing, while the VAE was kept fixed.
Due to the computational load of the environment, both the standard and the neuromodulated policy networks were evaluated for 300 iterations, with 4 sampled tasks per iteration and a context parameter size of 10. For each task, 2 episodes were collected before and after one step of inner loop gradient update. The results are presented in Figure D.13b, with the neuromodulated policy network showing an advantage over the standard policy network. In general, the results show promise towards scaling meta-RL algorithms to even more challenging problem domains.

Appendix E. Discussions on Meta-World Performance
The performance difference between the baselines in the original Meta-World paper and the present submission is due to an update of the reward function in the recent version of the Meta-World environment (see footnotes 5 and 6). Issues with the reward function of the originally released Meta-World were reported and discussed in https://github.com/rlworkgroup/metaworld/issues/226. This led the environment developers to rewrite many of the reward functions in the environment, which are now part of the current version (informally referred to as v2 environments). The results reported in the original Meta-World paper used the old (and now replaced) reward function, while the results in the present submission are based on the recent version cited above. It is nevertheless possible that the baseline results could be improved with better hyperparameter tuning, although the same is true for the novel approach that we propose. As we aim to observe performance differences between neuromodulated meta-RL and standard meta-RL, we did not perform hyperparameter search and tuning.

Appendix F. Analysis Plots
This section presents additional analysis plots of the representation similarity across tasks for the standard and neuromodulated policy networks in the various evaluation environments employed in this work. The additional plots further highlight the usefulness of neuromodulation in facilitating efficient (distinct) representations across tasks in problems of increasing complexity, as showcased earlier in Section 5.2.
[Figure: task-by-task heatmap panels (Task1 to Task4 on both axes); panel titles: hidden 2, before update; hidden 2, grad 1; hidden 2, grad 2; hidden 2, grad 3.]
2D Navigation: Figure F.14 presents the representation similarity between tasks, across inner loop gradient updates, for the second hidden layer of both policy networks in the 2D navigation environment. Again, patterns similar to those highlighted for the first hidden layer of the policy networks in Figure 6 emerge. Both networks are able to learn good (dissimilar) representations between tasks after a few steps of inner loop gradient update. Also, both networks, before any gradient update, already have some level of representation dissimilarity between tasks. This further highlights the fact that the 2D Navigation environment has low complexity and requires very little adaptation of network parameters.
CT-graph depth2: Figure F.15 presents the representation similarity between tasks, across inner loop gradient updates, for the second hidden layer of both policy networks in the CT-graph depth2 benchmark. With increased problem complexity in comparison to the 2D navigation, only the neuromodulated policy network succeeds in learning distinct representations across tasks. The distinct representations thus allow the neuromodulated policy network to adapt optimally across tasks while the standard policy network struggles, as indicated by the performance plot in Figure 5a.
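A task-by-task similarity matrix of the kind shown in these plots can be computed from one hidden activation vector per task. The sketch below uses cosine similarity, which is an assumption made for illustration: the similarity measure behind the figures is the one defined in Section 5.2, and the activation values here are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two activation vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def similarity_matrix(task_activations):
    """Pairwise similarity between per-task hidden activation vectors.

    `task_activations` maps a task name to the activations of one hidden layer,
    for example averaged over the states visited in that task.
    """
    names = sorted(task_activations)
    matrix = [
        [cosine_similarity(task_activations[a], task_activations[b]) for b in names]
        for a in names
    ]
    return names, matrix


# Hypothetical activations of one hidden layer for three tasks.
activations = {
    "Task1": [1.0, 0.0, 2.0],
    "Task2": [2.0, 0.0, 4.0],  # same direction as Task1: highly similar
    "Task3": [0.0, 3.0, 0.0],  # orthogonal to Task1: dissimilar
}
names, matrix = similarity_matrix(activations)
for name, row in zip(names, matrix):
    print(name, [round(value, 3) for value in row])
```

Under this reading, off-diagonal entries close to zero correspond to the distinct (dissimilar) representations discussed above, and computing the matrix before and after each inner loop update gives one panel per update.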

Appendix F.3. Half-Cheetah and Meta-World Environments
The analysis plots for the CAVIA policies in the half-cheetah and meta-world benchmarks are presented in this section.
[Figures: task-by-task heatmap panels with a colour scale from 0.0 to 1.0, over Task1 to Task8 in one set and Task1 to Task6 in the others. Recoverable panel titles: hidden 1 nm (before update, grad 1 to grad 4), hidden 2 nm (before update, grad 1 to grad 3), and hidden 2. Recoverable sub-caption: (a) first hidden layer.]