Shared communication for coordinated large-scale reinforcement learning control

Deep Reinforcement Learning (DRL) has recently emerged as a way to control complex systems without modelling them mathematically. In contrast to classical controllers, DRL alleviates the need for constant parameter tuning, tedious design of control laws, and re-identification procedures in the event of performance degradation. However, the adoption of DRL algorithms remains fairly modest, and they have not yet established a significant position in the process industries. One major obstacle has been their sample inefficiency in tasks featuring large state-action spaces. In this work, we show that it is possible to use DRL for plant-wide control by decentralizing and coordinating reinforcement learning. Namely, we express the global policy as a collection of local policies. Every local policy receives local observations and is responsible for controlling a different region of the environment. To enable coordination among local policies, we present a mechanism based on message passing. Messages are encoded by a shared communication channel, which is equipped with a model-based stream to capture the dynamics of the system and enable effective pre-training. The proposed method is evaluated on a set of robotic tasks and a large-scale vinyl acetate monomer (VAM) plant. Experimental results highlight that the proposed model yields drastic improvements over baselines in terms of mean scores and sample efficiency.


Introduction
Industrial process control is a large and diverse field; its broad range of applications includes chemical, power, and semiconductor plant control. For example, chemical plants consist of several processing units that cooperatively produce chemical products. Such plants often comprise a huge number of sensors and control units. With the ever-increasing demand for products in the process industries, it is necessary to maintain optimal production in a variety of situations, including when the system encounters disturbances or drifts in process characteristics.
Current approaches to process control are either classic control approaches or optimization-based. The former obeys rules specified for the controller by experts. However, such conventional approaches require extensive knowledge from an expert to be transferred to the controller via control laws and other mathematical derivations. The latter, optimization-based controllers, include numerous methods such as proportional-integral-derivative (PID) control or model predictive control (MPC). For instance, MPC looks ahead into the future and takes actions considering future errors. Nevertheless, these methods suffer from the fact that the optimization step takes time to compute the optimal control input, especially for complex high-dimensional systems. In addition, many deployed controllers achieve robustness at the expense of performance. In detail, the overall performance of a controlled process depends on the characteristics of the process itself and on the tuning parameters that are employed. Even if a controller is well-tuned at the time of installation, drift in process characteristics or deliberate set-point changes can cause performance to deteriorate over time. Finally, current approaches are not inherently robust, as they generally cannot handle novel situations such as disturbances.
An appealing alternative to classic controllers is reinforcement learning (RL), since RL does not need an explicit model of the plant for control design, can learn a control policy through trial-and-error interactions with the environment without supervision from experienced operators, and is often able to generalize to unseen situations. A few prior studies have explored the use of RL for controlling a chemical plant [1][2][3][4][5] and for dealing with disturbances [6,7]. However, plant-wide control remains a challenge for current research. In particular, the sample inefficiency of RL in tasks that feature large state-action spaces and costly computations has been a barrier to its adoption by industries.
As mentioned above, applying standard RL approaches in the process industries is challenging, as they do not scale well to high-dimensional problems and are typically expensive in terms of both learning time and computational resources. A prior study attempted to employ a multi-agent framework to deal with large action spaces and reduce the problem complexity [3]. This approach was later extended to alleviate the curse of dimensionality by introducing Fastfood kernel approximation [2] and to reduce the parameter-tuning burden [9]. Tavakoli et al. [8] address the problem of large action outputs by allocating a separate sub-network to each of the agent's continuous actions. Although these methods have achieved successes under restricted settings, successfully applying RL to large-scale and complex chemical plants remains unresolved. One possible drawback of these methods is the necessity to discretize the action space and train a set of agents on subsets of the action space without explicit coordination, which may weaken the control capability of the overall model. By contrast, we aim at learning a collection of local policies, each receiving a subset of observations and controlling a subset of actions, and we coordinate them via message passing. Message passing has been employed in a multi-agent framework [10] where each agent models its influence on other agents' actions. Huang et al. [11] have proposed a modular architecture that shares parameters among tasks and learns messages by maximizing cumulative rewards. Instead, we employ a collection of local policies with restricted observations and actions together with a shared communication channel. The presented method employs a different architecture and objective for aggregating and encoding messages, which aims to enable sample-efficient RL in tasks featuring large state-action spaces. Besides, we propose to encode dynamics-aware messages.
This study is an extended version of the paper presented at the SICE 2022 Annual Conference [12]. We introduce a novel way of generating messages that reduces the computational burden on local policies by employing a shared communication channel, which encodes a specific message for each local policy. We further augment the communication channel with a model-based stream to encode messages that capture the dynamics of the system and enable effective pre-training. Additionally, in Section 3 we compare the novel approach with a number of baselines, including our original method. We also conduct additional ablation analyses to measure the effect of each component of our method.
To overcome the above-mentioned limitations, we propose sdcRL (shared communication for decentralized and coordinated RL), a new DRL approach that can learn the control policy for large-scale complex plants in a model-free and adaptive manner. The central idea is to decompose the control problem into a set of local control tasks, where each control task is handled by one local policy. Each local policy only perceives local information from the local sensors and controls a small number of actions, significantly reducing the training time and the search space. To avoid being trapped in local optima due to the lack of coordination within the local policies, we introduce a communication channel that coordinates local policies via message passing. The messages encode historical observations and actions with a recurrent neural network so that policies have access to all historical information. Namely, we use a shared communication channel that produces a different message for each local policy. To encode dynamics-aware messages and enable pre-training of the communication channel, we append a model-based stream to the communication channel. By coordinating the local policies, the present framework can learn coordinated behaviours such as optimizing the profit of a large-scale vinyl acetate monomer (VAM) plant [13], while significantly improving sample efficiency and final performance compared to a plain agent that observes and controls the entire plant. Overall, this scheme constitutes a novel perspective on large-scale control and could help to expand the possible applications of DRL.

Method
In the standard RL paradigm, a single controller receives all available input information and selects all actions as outputs, which leads to high-dimensional input and output spaces in complex tasks. Therefore, as mentioned above, it is often impractical to solve tasks featuring large state-action spaces and complex dynamics, such as those found in the process industries. By contrast, the proposed framework consists of N local policies, each receiving a set of local observations and controlling a (small) set of actions. Thus, each local policy controls a different region of the environment. This local control structure is inspired by the organization of motor control in biological systems [14]. Note that the present approach is comparable to multi-agent reinforcement learning in that each local policy controls a different set of actions. However, the proposed framework employs a single agent (i.e. the global policy) composed of multiple local policies that interact with the same environment. By doing so, the problem of controlling a complex large-scale environment is reduced to a set of easier local control problems, reducing the training time and improving sample efficiency.
Under this setting, each local policy only perceives local information and controls a subset of the action space, which clearly weakens the control capability of the resulting global policy. To overcome this drawback, we propose a mechanism for coordination and information sharing among the local policies via message passing. This involves a communication channel trained to generate messages. This idea is central to giving local policies access to meaningful contextual information received by other local policies and to the emergence of complex coordination within the local policies.
The overview of the system is shown in Figure 1. We consider a framework with a natural two-level hierarchy. The high level handles communication, and the low level corresponds to a set of local controllers. The hierarchical learning problem is to simultaneously learn a communication channel as well as the low-level policies, called local policies. The aim of the learner is to achieve a high reward when its communication channel and local policies are run together.

Decentralized control
Consider a global policy π_G and a set of N local policies {π^1, . . ., π^N}. The parameters θ^n of the n-th local policy π^n are obtained by maximizing local rewards under the following constraint for action and message generation:

$$a_t^n = \pi^n\left(o_t^n, m_t^n; \theta^n\right), \qquad (1)$$

where a_t^n is the (local) action produced by π^n, o_t^n is the local observation perceived by the local policy π^n, and m_t^n is the message encoded by the communication channel for π^n.
At every time step t, the final action that the global policy π_G performs in the environment, a_t, is defined as the concatenation of the actions selected by the local policies:

$$a_t = \mathrm{concat}\left(\pi^1(o_t^1, m_t^1), \ldots, \pi^N(o_t^N, m_t^N)\right), \qquad (2)$$

where "concat" denotes the concatenation operation, o_t^1 is the local observation received by the local policy π^1, and m_t^1 is the message received by π^1 from the communication channel. Note that since the forward passes of the local policies can be carried out concurrently, the method has a running time similar to that of a single policy controlling the entire environment.
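For concreteness, the following sketch illustrates Equation (2): the global policy concatenates the actions proposed by the local policies, each conditioned on its local observation and message. The class name, the `split_obs` helper, and the `act` interface of the local policies are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

class GlobalPolicy:
    """Minimal sketch of Equation (2): the global action is the concatenation
    of the actions chosen by the N local policies. Interfaces are assumed for
    illustration only."""

    def __init__(self, local_policies, split_obs):
        self.local_policies = local_policies  # [pi_1, ..., pi_N]
        self.split_obs = split_obs            # maps the global observation to local observations

    def act(self, observation, messages):
        local_obs = self.split_obs(observation)            # o_t^1, ..., o_t^N
        local_actions = [
            pi.act(o_n, m_n)                               # a_t^n = pi^n(o_t^n, m_t^n)
            for pi, o_n, m_n in zip(self.local_policies, local_obs, messages)
        ]
        return np.concatenate(local_actions)               # a_t = concat(a_t^1, ..., a_t^N)
```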

Communication channel
While decentralized control is effective for reducing the training time, the resulting global policy is likely to be suboptimal due to the lack of coordination among local policies. To overcome this pitfall, we propose to share messages among local policies. Specifically, we design a communication channel to enable communication and coordination between the local policies (Figure 2). The central concept is to aggregate the observations and actions of the local policies, together with the previous messages, in order to encode new messages.
In detail, the communication channel consists of two streams: a message stream and a model-based stream. The message and model-based streams share weights in the shallow layers of the channel. We use the message stream to encode the messages that are broadcast to local policies. In contrast, we do not directly use the output of the model-based stream. The model-based stream is primarily designed to: (1) make the communication channel aware of the environment dynamics and (2) pre-train the communication channel.
Namely, at step t, the communication channel C takes the local observations o_{t-1}^n and actions a_{t-1}^n of the local policies, together with the previous messages m_{t-1}^n, and produces a suitable message for each local policy:

$$\{m_t^1, \ldots, m_t^N\} = C\left(\{o_{t-1}^n, a_{t-1}^n, m_{t-1}^n\}_{n=1}^{N}; \theta_H\right), \qquad (3)$$

where m_t^n is the message for the local policy π^n and θ_H is the set of trainable parameters.
In the following section, we describe the key components of our communication channel.

Message stream
Messages are learned vectors encoded by the message stream. They are used to condition the actions of the local policies (see Equation 1). We find that message passing enables the emergence of complex coordination within the local policies. For instance, messages can express which local policies influence the behaviour of others, or a local policy's intention. A detailed description of message encoding and message stream training is provided below.

Message Encoding. Messages are encoded by the communication channel. This involves a recurrent neural network (RNN) f that takes as input local observations, local actions, and previous messages. The recurrent neural network f (i.e. an LSTM network) is followed by a feature encoder φ (a set of fully connected layers). To produce N different messages, the output of the fully connected layers φ(·) is passed through N heads g^1, . . ., g^N.
As such, the embedding h_t produced by the RNN given the received observations, actions, and messages is defined as follows:

$$h_t = f\left(\{o_{t-1}^n, a_{t-1}^n, m_{t-1}^n\}_{n=1}^{N}, h_{t-1}\right),$$

where h_t is the hidden state at time t, o_{t-1}^n is the local observation perceived by the local policy π^n, and m_{t-1}^n is the message encoded by the communication channel at time t−1.
The embedding h_t is then passed through the fully connected layers and N heads; each head encodes a message for a different local policy. The message m_t^n can therefore be expressed as follows:

$$m_t^n = g^n\left(\phi(h_t)\right), \qquad n = 1, \ldots, N.$$

As such, N messages are produced by passing the output of φ(·) through the N different heads g^1, g^2, . . ., g^N. Messages are normalized before being broadcast to the local policies. The message m_t^n is then used to condition the input of the local policy π^n.
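A minimal PyTorch-style sketch of this message stream is given below. The layer sizes follow the values reported later in the experiments (128 LSTM units, 64-unit encoder, message size 8); the exact module layout and interface are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CommunicationChannel(nn.Module):
    """Sketch of the message stream: an LSTM f, a feature encoder phi
    (fully connected layers), and N heads g^1..g^N, one per local policy."""

    def __init__(self, input_dim, num_policies, lstm_dim=128, enc_dim=64, msg_dim=8):
        super().__init__()
        self.rnn = nn.LSTM(input_dim, lstm_dim, batch_first=True)    # f
        self.encoder = nn.Sequential(                                # phi
            nn.Linear(lstm_dim, enc_dim), nn.Tanh(),
            nn.Linear(enc_dim, enc_dim), nn.Tanh(),
        )
        self.heads = nn.ModuleList(                                  # g^1, ..., g^N
            [nn.Linear(enc_dim, msg_dim) for _ in range(num_policies)]
        )

    def forward(self, obs_actions_msgs, hidden=None):
        # obs_actions_msgs: (batch, 1, input_dim), the concatenation of
        # o_{t-1}, a_{t-1}, and m_{t-1} for all local policies
        rnn_out, hidden = self.rnn(obs_actions_msgs, hidden)
        features = self.encoder(rnn_out[:, -1])                      # phi(h_t)
        # one message per local policy, normalized before broadcasting
        messages = [F.normalize(g(features), dim=-1) for g in self.heads]
        return messages, hidden
```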
Message Stream Training. The message stream parameters θ_C are optimized by minimizing the loss reduction of the local policies, defined as

$$\mathcal{L}_C(\theta_C) = \sum_{j=1}^{N}\left[\mathcal{L}_P^j\left(\theta_{t+1}^j; m_t^j\right) - \mathcal{L}_P^j\left(\theta_t^j; m_t^j\right)\right],$$

where L_P^j is the policy loss of the local policy π^j and m_t^j is the message received by π^j at time t. θ_t^j and θ_{t+1}^j denote, respectively, the local policy's parameters before and after learning from an experience resulting from the message advice. In other words, the communication channel parameters are trained to encode messages that improve the policy loss of the local policies that receive them.
In the present experiments, the local policies are trained using Proximal Policy Optimization (PPO) [15], a widely used RL technique for optimizing policies. Therefore, the policy loss at time t for π^j is defined as

$$\mathcal{L}_P^j = -\,\mathbb{E}_t\left[\min\left(r_t(\theta)\,A_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) A_t\right)\right],$$

where r_t(θ) is the probability ratio, r_t(θ) = π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t), and ε is a hyperparameter. The ratio r_t is clipped to fall between (1 − ε) and (1 + ε). A_t represents the advantage function. Thus, in the case of PPO, the objective is to minimize the value estimation error while maximizing the expected clipped surrogate objective.
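The clipped surrogate above is the standard PPO objective; a compact sketch of how it can be computed as a loss to minimize is shown below. The function name and tensor layout are illustrative, and clip_eps matches the value used in the experiments.

```python
import torch

def ppo_policy_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped PPO surrogate used as the local-policy loss L_P^j.
    Inputs are per-timestep log-probabilities under the new and old
    policies and the advantage estimates A_t."""
    ratio = torch.exp(log_probs_new - log_probs_old)                  # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                      # loss to minimize
```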

Model-based stream
We propose augmenting the novice communication channel with a forward model-based stream. We therefore represent the dynamics function as a multi-layer neural network f_m, parameterized by θ_M, and introduce an unsupervised mean absolute error (MAE) auxiliary loss that trains the network to predict the next state s_{t+1} given the current state s_t and action a_t.
As the message stream and the model-based stream share some common weights, this architecture allows gradients from the auxiliary loss to contribute to improving the messages. Therefore, cues from the forward model provide gradients that allow the novice channel to improve its representation of the world. Moreover, this architecture also implicitly improves the communication channel's ability to model the local policies, since it must correctly predict the next state s_{t+1} given the current state s_t and the actions a_t. Since in this study we rely on local policies, the reconstruction problem is expressed as

$$\hat{s}_{t+1} = f_m\left(\{o_t^n, a_t^n\}_{n=1}^{N}; \theta_M\right),$$

where \hat{s}_{t+1} is the predicted estimate of the state s_{t+1}, and the model-based network parameters θ_M are trained to optimize the MAE reconstruction loss

$$\mathcal{L}_M(\theta_M) = \left\lVert \hat{s}_{t+1} - s_{t+1} \right\rVert_1 .$$

The learned model f_m is also known as the forward dynamics model. The overall optimization problem that is solved to learn the communication channel's parameters can be written as

$$\min_{\theta_C,\, \theta_M}\ \beta\, \mathcal{L}_C(\theta_C) + (1 - \beta)\, \mathcal{L}_M(\theta_M),$$

where β is a scalar that weights the message stream loss and (1 − β) weights the importance of the model-based loss.
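A minimal sketch of the model-based stream and the combined objective is given below, assuming the stream consumes the channel's shared features together with the current actions; the two-layer tanh architecture follows the experimental settings, while the exact wiring and default β = 0.85 are stated assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModelBasedStream(nn.Module):
    """Sketch of the forward-dynamics stream f_m: predict s_{t+1} from the
    shared channel features and the current actions, trained with an MAE loss."""

    def __init__(self, feature_dim, action_dim, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim + action_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, features, actions):
        return self.net(torch.cat([features, actions], dim=-1))       # \hat{s}_{t+1}


def channel_loss(message_loss, predicted_next_state, next_state, beta=0.85):
    """Combined objective: beta weights the message-stream loss and
    (1 - beta) the model-based MAE loss (beta = 0.85 in the experiments)."""
    model_loss = F.l1_loss(predicted_next_state, next_state)
    return beta * message_loss + (1.0 - beta) * model_loss
```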
Pre-Training with Human Demonstrations. One key question is how to generate high-quality messages from the onset of training. Since directly collecting messages is impractical, a solution is to pre-train the communication channel on demonstrations from an expert or an existing controller. Rather than pre-training the message stream on expert data, we leverage the fact that gradients from the model-based stream contribute to improving the communication channel. Therefore, we propose to pre-train the auxiliary model on a small set of demonstrations. Given a dataset D of transitions (s_t, a_t, s_{t+1}) collected by observing an expert attempting to achieve the goal pursued in the task, the model is trained to predict the expert's intention. Specifically, given D = {(s_i, a_i, s_{i+1})}_{i=1}^{|D|}, the communication channel is trained offline to predict the correct next state. By doing so, the expert provides gradients that allow the novice to improve its representation of the world even though it does not receive any reward from the demonstrations, enabling better messages from the onset of the training phase.
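The offline pre-training loop can be sketched as follows, assuming the demonstrations are stored as tensors and that `channel.encode` is a hypothetical helper exposing the channel's shared shallow layers; the optimizer, epoch count, and demo format are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

def pretrain_model_based_stream(channel, model_stream, demos, epochs=10, lr=7e-4):
    """Sketch of offline pre-training on expert transitions (s_t, a_t, s_{t+1}).
    Only the forward-dynamics loss is used; gradients flow into the shared
    shallow layers of the communication channel."""
    states, actions, next_states = demos               # tensors of shape (|D|, ...)
    loader = DataLoader(TensorDataset(states, actions, next_states),
                        batch_size=128, shuffle=True)
    params = list(channel.parameters()) + list(model_stream.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)

    for _ in range(epochs):
        for s_t, a_t, s_next in loader:
            features = channel.encode(s_t, a_t)        # hypothetical shared-layer helper
            pred_next = model_stream(features, a_t)    # \hat{s}_{t+1}
            loss = F.l1_loss(pred_next, s_next)        # MAE auxiliary loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```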

Experiments
We evaluate the performance of our algorithm on two separate environments: (i) a small-scale robot control environment, and (ii) a large-scale VAM plant environment that reflects the characteristics and practical problems of real process plants. We further conducted ablation studies to measure the effect of each component of our method.

Robot control environment
We first assess our method on a continuous control task in the MuJoCo simulator through the OpenAI Gym interface: Ant-v2 [16]. In Ant-v2, the goal is to make a four-legged creature walk forward as fast as possible. This set of experiments aims to determine whether the present method remains effective on small-scale problems; the task is also used for ablation analysis because of the low amount of computation required to train on it.

Chemical plant environment
The second environment is a VAM plant. Since the VAM process includes processes typical of chemical plants, it can serve as a robust benchmark environment. The simulator comprises eight components for material feeding, reaction, and recycling. The process is observed via 109 sensors that measure the volume, flux, temperature, concentration, and pressure of the chemical substances. The agent aims to optimize the profit of the plant while avoiding equipment failures and maintaining the process in a steady state. This environment is particularly challenging due to: (1) large state and action spaces, (2) computationally expensive simulations, which highlight the need for sample-efficient agents, and (3) complex dynamics. Thus, because of the large amount of computation required to train on these difficult tasks, the cost of exploration in this space is high.
Concretely, the state space consists of the sensor readings. The action space consists of all 26 PID controllers in the plant [13]. The range of each action is defined as [−x_1%, +x_2%] around its initial value. In our experiments, we set x_1 = 0.60 and x_2 = 1.35. The agent interacts with the environment once every minute for 60 virtual minutes, which corresponds to one episode. The reward is defined as the profit of the plant. In the absence of domain knowledge, the local rewards are also defined as the profit of the plant.
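One plausible reading of these action ranges is sketched below: each of the 26 PID set-points is allowed to move within a percentage band around its initial value. The function name and list-based interface are illustrative assumptions.

```python
def action_bounds(initial_values, x1=0.60, x2=1.35):
    """Sketch of the VAM action ranges: each PID set-point may move within
    [-x1%, +x2%] of its initial value (x1 = 0.60, x2 = 1.35 in the experiments).
    `initial_values` holds the 26 initial PID set-points."""
    lows = [v * (1.0 - x1 / 100.0) for v in initial_values]
    highs = [v * (1.0 + x2 / 100.0) for v in initial_values]
    return lows, highs
```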

Experimental details: sdcRL
As our policy learning method, we rely on PPO [15] with Generalized Advantage Estimation [15]. We use PPO with clipping parameter ε = 0.2. The actor and critic networks consist of three fully connected layers with 128 hidden units each. Tanh is used as the activation function, and the output of the actor network is scaled to the range of each action dimension. Training is carried out with a fixed learning rate of 0.0007 using the Adam optimizer [17], with a batch size of 128. The policy is trained for 4 epochs after each episode.
As the communication channel, we employ an LSTM with a hidden size of 128 and tanh as the activation function. The learned embedding is then passed through two fully connected layers with tanh non-linearity and 64 hidden units. The message stream is a multi-head network in which the number of heads equals the number of local policies. The dimension of the message vectors is 8. The model-based stream is a two-layer fully connected neural network with tanh as the activation function. For pre-training the model-based stream, we used 1.0 × 10^3 transitions recorded by observing an expert attempting to solve the task. Local actions are normalized before being passed to the communication channel. In our experiments we set β = 0.85. In the robot control environment, we employ two local policies {π^1, π^2}, each controlling two opposite legs of the robot. Each local policy receives the actuator values of the controlled legs and selects actions for the corresponding legs. In the chemical plant environment, our method comprises four local policies that observe and control the actions of components 1-2 (π^1), 3-4 (π^2), 5-6 (π^3), and 7-8 (π^4), respectively. Each action is controlled by a single local policy.
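For reference, the hyperparameters reported above can be gathered in one place as in the sketch below; the dictionary structure and key names are ours, while the values are those stated in the text.

```python
# Hedged summary of the reported hyperparameters; names are illustrative.
SDCRL_CONFIG = {
    "policy": {
        "algorithm": "PPO",
        "clip_eps": 0.2,
        "hidden_layers": [128, 128, 128],
        "activation": "tanh",
        "learning_rate": 7e-4,
        "optimizer": "Adam",
        "batch_size": 128,
        "epochs_per_episode": 4,
    },
    "communication_channel": {
        "lstm_hidden_size": 128,
        "encoder_layers": [64, 64],
        "message_size": 8,
        "beta": 0.85,                    # message-stream vs. model-based loss weight
        "pretraining_transitions": 1_000,
    },
}
```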
We compare our method (sdcRL) against multiple baselines, including PPO [15], SAC [18], A2C [19], and dcRL [12]. Table 1 summarizes the settings used by our method and dcRL, and the type of information shared. In the chemical plant task, we also compare against SAC-GP [5] and the PID controllers [13] that originally controlled the chemical plant.

Robot control
We first perform experiments on Ant-v2, a robot control task built on top of MuJoCo. As shown in Figure 3, the proposed method, sdcRL, outperforms plain PPO and dcRL. Note that the parameters of PPO and sdcRL were not fine-tuned on this task, because this study does not aim to achieve state-of-the-art performance on Ant-v2 but, rather, to analyse the performance improvements of our approach over the standard PPO baseline. In future work, we anticipate combining sdcRL with the state of the art on these tasks.
We can observe that the use of decentralized control improves convergence speed compared to plain PPO. One reason is that each local policy receives a small set of observations and controls a small number of actions, effectively reducing the problem complexity. As a result, our method achieves a higher average reward in a shorter time than the comparative algorithms. Figure 3 further demonstrates that employing a shared communication channel improves training speed compared to the multiple communication streams used in dcRL. This is because sdcRL reduces the computational burden on the local policies. Moreover, the present method encodes a different message for every local policy, improving its relevance to the target local policy.

Chemical plant
Second, we evaluate the proposed method on a set of tasks from the VAM environment, which replicates the features and problems of real-world process control plants. The training curves are shown in Figure 4, and analysing them allows us to draw a few conclusions. First, the proposed approach enables the training of an RL agent on a large-scale and complex process task such as plant-wide control, whereas plain PPO achieves poor performance due to the size of the state space. Second, in terms of convergence speed and final score, our algorithm is significantly faster than plain PPO and the baseline methods, including dcRL. Third, the analysis of the generated trajectories reveals that our model quickly learns to avoid failures in the plant by coordinating the local policies. Finally, in such a complex environment we found that taking the dynamics of the system into account produces drastically better messages than ignoring them, as is done in dcRL (see Section 3.6.6). Overall, this set of experiments demonstrates that sdcRL can be used as an alternative to conventional controllers, as our method outperforms the PID controllers that originally controlled the environment.

Ablation analysis
We also present ablation studies to investigate: (1) the message size in sdcRL, (2) the impact of the number of local policies on performance, (3) the quality of the learned messages, (4) the learned behaviours, (5) the impact of the model-based stream on performance, (6) the importance of model-based pre-training, and (7) the effect of fine-tuning on performance.

Size of messages
In addition to the original Ant-v2 task, we designed a task (Ant-v2 special) that allows us to evaluate the significance of the messages generated by the communication channel. We use Ant-v2 because of its relative simplicity compared to a real chemical plant. We employ two local policies, where one policy observes all actuators while the other does not observe any actuators and only receives messages from the communication channel. Therefore, the overall performance can be expected to depend on the quality and task-relevance of the messages. To measure the effect of each component of our method, we compare the proposed method sdcRL with dcRL. Namely, dcRL was evaluated under four settings: without the coordination mechanism "dcRL(no-comm)", with action sharing "dcRL(action)", with message passing "dcRL(message)", and with both action sharing and message passing "dcRL".
We now report evaluations showing the effect of the message size. Table 2 shows that agents trained with messages of size 8 or 10 obtain higher mean returns after similar numbers of updates. However, we can also observe that an excessively large value (i.e. > 16) tends to hurt performance. One reason is that a large message induces a higher computational load during message generation and aggregation (i.e. lower sample efficiency). Generally, the message size does not need to be fine-tuned for each task (i.e. 8 ≤ K ≤ 16), since the agent maintains acceptable performance across tasks.

Number of local policies
To quantify the impact of the number of local policies on learning and performance, we evaluate sdcRL with different numbers of local policies. Table 3 reports the mean episode returns obtained in different scenarios. For simplicity, we assume that on Ant-v2 the number of legs is the maximum number of local policies. We notice that, in general, an increased number of local policies improves the final performance. However, on Ant-v2, employing more than two local policies does not necessarily improve performance. One reason is the low task complexity, which does not require further decomposition to be solved effectively.
On the other hand, in complex tasks such as plant-wide control, increasing the number of local policies can significantly boost learning compared to plain DRL algorithms. Besides, for every number of local policies (i.e. N > 1), the present method outperforms the PPO baseline.

Analysis of learned messages
As depicted in Figure 5, even though sdcRL performs slightly worse under this setting, it still improves performance compared to the plain methods, including PPO, SAC, A2C, and SAC-GP. On the other hand, the absence of cooperation among local policies, "dcRL(no-comm)", leads to poor performance. Note that since dcRL and sdcRL employ similar policy architectures, sdcRL without communication achieves the same performance as "dcRL(no-comm)". Besides, we can observe that the global policy can effectively control the plant even though one of the local policies only has access to messages. A reason is that messages carry enough relevant knowledge about the observations, actions, and dynamics of the plant. Thus, cooperation via message passing significantly improves the final performance and convergence speed. We can further observe that sdcRL improves coordination within the local policies by encoding better messages than dcRL. In addition, Figure 5 highlights that the action sharing used in dcRL can be replaced by directly aggregating actions within the communication channel. The results demonstrate that our method produces relevant messages that can be used to improve coordination and information sharing among local policies.

Analysis of learned behaviours
The performance of the learned global policy was also analysed from the process control perspective. Namely, we report in Figure 6 the product quality, VAM loss, and production rate for our plant-wide control task. As can be seen, sdcRL successfully keeps the product quality below 100 ppm and learns to reduce the VAM loss. Besides, the production rate steadily increases over the course of training. After trial-and-error learning, our framework learns to control the plant's production by coordinating the local policies, resulting in an increased production rate.

Importance of model-based stream
A legitimate question is the importance of the model-based stream for the agent's performance. Ideally, it should improve the final performance by enabling the messages to capture the dynamics of the system. Therefore, we evaluate the proposed method with and without the model-based stream on the three control tasks. Table 4 shows that sdcRL's final performance increases significantly when the communication channel is equipped with a model-based stream. Observing the agent's behaviour reveals that, in the absence of the model-based stream, the coordination of local policies is less effective during late training. A reason is that the model-based stream enables the encoding of messages that not only reduce the loss of the local policies but also carry information related to the dynamics of the system. Overall, this experiment confirms that using a model-based stream is beneficial for generating task-relevant messages.

Impact of model-based pre-training
An important idea in this work is that pre-training the model-based stream improves coordination at the onset of the training process. This could be particularly useful when the computation budget is low and/or the cost of exploration is high. Therefore, we investigate the impact of model-based pre-training on performance by comparing the proposed method with and without pre-training. Figure 7 demonstrates that pre-training the model-based stream improves sample efficiency at the onset of the training process. Namely, sdcRL with model-based pre-training achieves higher returns after 50 and 100 episodes. This is because our agent can benefit from the expert's cues to improve early message generation. Moreover, during late training, although the performance gap between the two methods is reduced, the final performance can still be improved by leveraging the expert demonstrations. A possible explanation for this phenomenon is that, since we pre-train the model-based stream with a small number of examples, there is a gap between those examples and the states encountered during exploration. As a result, sdcRL with pre-training learns fast at the beginning but soon needs to adapt the model-based stream to the agent's own experience, whereas sdcRL without pre-training gradually acquires knowledge that matches the true state distribution. We hypothesize that this would no longer be an issue if the model-based stream were pre-trained with a large number of examples. Another interesting observation is that encoding dynamics-aware messages from the onset of training enhances the reward, as the examples used for pre-training the model can be distilled into easy-to-understand messages, reducing the learning workload.

Effect of fine-tuning on the performance
In this section, we perform an ablation to analyse the effect of fine-tuning PPO on performance. It is commonly accepted that fine-tuning improves task performance, but does this always hold? Table 5 presents the fine-tuning results after a hyperparameter search that included, among others, the number of training epochs after each episode (3, 4, or 5). We observe that after fine-tuning PPO, performance improves across the board. Namely, both PPO and sdcRL achieve higher average returns after fine-tuning. This suggests that sdcRL can also benefit from fine-tuning of the control policy.

Limitations and future work
The proposed method takes a step towards large-scale reinforcement learning by decentralizing control. The core idea is to leverage a set of local policies that each control a small number of actions, effectively reducing the complexity of the learning problem. To enable coordinated behaviours within the local policies, we present a communication channel that encodes messages. We further equip the communication channel with a model-based stream in order to produce messages that capture the dynamics of the environment. Another advantage of the proposed method lies in its ability to generate task-relevant messages from the onset of training by pre-training the communication channel. The experiments demonstrate the effectiveness of this approach through improvements on notoriously difficult tasks such as controlling a chemical plant. Notably, sdcRL could scale to large-scale environments, including tasks featuring complex dynamics and large state-action spaces.
That being said, we acknowledge that our approach has certain limitations, which suggest avenues for future research. One issue with neural networks is that they are largely opaque. As a result, the messages being broadcast are difficult for humans to interpret. Although the experiments demonstrated their relevance, it is not yet clear how to interpret them. In future work, we anticipate using techniques from explainable AI [20,21] to explain the messages encoded by the communication channel.
Another question is how to scale the method to a large number of local policies. We employed a multi-head message stream, which allowed us to generate all messages simultaneously. However, if the number of local policies becomes very large, training the message stream may become costly. One solution is to use a single-head message stream and condition the generation on the target local policy. This research direction is left for future work. Another avenue for research is to leverage the model-based stream to produce a curiosity signal. A previous study [22] has shown that forward and inverse models can be employed to generate a curiosity signal that drives the agent's training when rewards are sparse. We argue that the present study could be augmented with a similar technique. Namely, we could use the difference between the predicted next state and the observed next state as a curiosity signal, accelerating the local policies' training when extrinsic rewards are sparse or delayed, as sketched below.
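The sketch below illustrates this prediction-error-based curiosity idea [22] using the forward model already present in the model-based stream; the function name, the use of a squared error, and the `scale` parameter are illustrative assumptions rather than part of the proposed method.

```python
import torch
import torch.nn.functional as F

def curiosity_bonus(predicted_next_state, next_state, scale=1.0):
    """Sketch of an intrinsic reward: the forward model's prediction error
    serves as a curiosity signal when extrinsic rewards are sparse or delayed."""
    error = F.mse_loss(predicted_next_state, next_state,
                       reduction="none").mean(dim=-1)   # per-sample prediction error
    return scale * error                                 # added to the extrinsic reward
```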

Conclusion
In this work, we have investigated whether decentralized control can accelerate the training process in large-scale tasks. The central concept is to employ a set of local policies and coordinate them via message passing, improving sample efficiency and enabling RL in large-scale scenarios. Namely, the task complexity is significantly reduced by leveraging a set of local policies that control a small number of actions and observe different regions of the environment. Messages are encoded by a communication channel that employs an RNN to capture temporal information. The communication channel is equipped with a model-based stream to encode dynamics-aware messages. We further show that the communication channel can be pre-trained with expert data to generate task-relevant messages from the onset of the training process. Experimental results on a robot control task and a large-scale VAM chemical plant demonstrate that our method significantly improves sample efficiency and performance. Therefore, this study enables RL in complex tasks that feature very large state-action spaces, such as controlling a chemical plant.

Figure 1. An example of the proposed framework. In this example, the system employs a collection of local policies {π^1, . . ., π^N}, each controlling a subset of the action space {a^1, . . ., a^K}. Each local policy receives a message m generated by the communication channel C. The communication channel sends a different message to each local policy. The global policy π_G selects the final action based on the actions selected by the local policies.

Figure 2. Overview of the communication channel. The communication channel takes as input a set of local observations o, local actions a, and previous messages m. The input is passed through an LSTM layer, which produces an embedding h_t. The embedding is passed through fully connected layers. The output is fed to a multi-head message stream that produces N messages. The output is also fed to a model-based stream that reconstructs the next state s_{t+1}.

Figure 4. Performance of our method and several baselines on the VAM plant (left). We also report the average episode duration (right). A duration of 60 steps indicates that no equipment failure was triggered by the controller. Results are averaged over 5 runs (± std): (a) average return and (b) episode duration.

Figure 5. Performance of our method (sdcRL) and several baselines on the communication task built upon Ant-v2. Results are averaged over 5 runs (± std).

Figure 6. Analysis of the controlled behaviour using the learned sdcRL policy. The left column (ppm) shows the quality sensor (the value should remain below 100 ppm), the middle column shows the VAM loss (kg/h), and the right column shows the production rate (t/h). The top row shows the results at t = 0, the middle row at t = 150, and the bottom row at t = 300. Results are averaged over 20 runs (± std).

Figure 7. Performance of our method (sdcRL) with and without pre-training of the model-based stream. The vertical lines depict the standard deviation across 10 runs of each experiment.

Table 1. Summary of the information shared by the methods described in the manuscript.

Table 2. Reward (± std) for different values of the message size on Ant-v2 and Ant-v2 special.

Table 3. Reward (± std) for different numbers of local policies on the VAM plant, Ant-v2, and Ant-v2 special.

Table 5. Average return (± std) without and with fine-tuning of PPO on Ant-v2 and Ant-v2 special.