Towards practical reinforcement learning for tokamak magnetic control

Reinforcement learning (RL) has shown promising results for real-time control systems, including the domain of plasma magnetic control. However, there are still significant drawbacks compared to traditional feedback control approaches for magnetic confinement. In this work, we address key drawbacks of the RL method: achieving higher control accuracy for desired plasma properties, reducing the steady-state error, and decreasing the time required to learn new tasks. We build on top of \cite{degrave2022magnetic}, and present algorithmic improvements to the agent architecture and training procedure. We present simulation results that show up to 65\% improvement in shape accuracy, achieve a substantial reduction in the long-term bias of the plasma current, and reduce the training time required to learn new tasks by a factor of 3 or more. We present new experiments using the upgraded RL-based controllers on the TCV tokamak, which validate the simulation results and point the way towards routinely achieving accurate discharges using the RL approach.


Introduction
Feedback control is vital for the operation of tokamak devices. A control system actively manages the magnetic coils to tame the instability of elongated plasmas (Lazarus et al., 1990), preventing damaging vertical disruption events (Lehnen et al., 2015). Furthermore, precise control of the plasma current, location, and shape enables management of heat exhaust and plasma energy (Leonard and the DIII-D Divertor Team, 2005; Silburn et al., 2017). In research contexts, scientists study the effects of changes in the plasma configuration on these quantities of interest (Anand et al., 2017, 2020; Hofmann et al., 2001; Moret et al., 1997; Pesamosca, 2021), requiring control systems for novel configurations and rapid variation around the nominal scenario. A flexible toolset that enables rapid iteration and delivers precise magnetic control is thus a boon to tokamak research and development.
Traditionally, accurate control of the plasma is achieved through successive loop closures of the plasma current, shape, and location (De Tommasi, 2019). In this paradigm, the control designer pre-computes a set of feedforward coil currents (Blum et al., 2019; Hofmann, 1988), and then builds feedback loops for each of the controlled quantities. These quantities (e.g. plasma shape and location) cannot be directly measured, and must be indirectly estimated in real-time from magnetic measurements. In particular, the shape of the plasma must be estimated in real-time using equilibrium reconstruction codes (Blum et al., 2008; Ferron et al., 1998; Moret et al., 2015). Such systems have successfully stabilized a wide range of discharges, but design can be challenging and time-consuming (in researcher and experimental time), especially for novel plasma scenarios.
Reinforcement learning (RL) has recently emerged as an alternative paradigm for building real-time control systems. While historically high-profile successes of reinforcement learning were confined to simulated systems such as StarCraft (Vinyals et al., 2019) and DOTA (Berner et al., 2019) or highly structured systems such as Go (Silver et al., 2017), more recent successes include systems with high amounts of hidden knowledge like Stratego (Perolat et al., 2022) and Diplomacy (FAIR Team et al., 2022), and large unstructured spaces such as language (Christiano et al., 2017; Ouyang et al., 2022). RL is also being deployed for increasingly sophisticated real-world applications such as data center cooling (Luo et al., 2022) and memory management (Wang et al., 2023).
Alongside these more general successes, reinforcement learning is becoming increasingly used for plasma control. While early works focused on supervised learning, for example to perform real-time estimation of coarse plasma shapes (Bishop et al., 1995; Lister and Schnurrenberger, 1991), recent efforts have expanded to using RL, for example to construct feedforward trajectories of plasma parameters (Seo et al., 2021, 2022), to control plasma properties (Char et al., 2020, 2021, 2022), and to directly control the vertical instability (Dubbioso et al., 2023). For a recent overview of ML applications to fusion research, including the use of RL, please see (Pavone et al., 2023).
Recent work by Degrave et al. (2022) demonstrated the ability of an RL-designed system to perform the main functions of tokamak magnetic control. In particular, this work presented a system where an RL "agent" learns to control the Tokamak à Configuration Variable (TCV) (Hofmann et al., 1994; Reimerdes et al., 2022) by interacting with the FGE tokamak simulator (Carpanese, 2021). The control policy learned by the agent was then integrated into the TCV control system, whereby the policy observed TCV's magnetic measurements and output control commands for all 19 magnetic control coils. Degrave et al. (2022) demonstrated the capability of RL agents to control a wide variety of scenarios, including plasmas that are highly elongated, snowflakes (Anand et al., 2019), and even a novel stabilization of a "droplet" configuration with two separate plasmas in the vacuum chamber simultaneously. This work presents a strong case for RL-designed control systems, where the control designer expresses a final goal (quantified using a reward function) that is maximized by the agent. This shifts the focus away from the exact specifics of "how" to achieve such goals, and toward "what" goals should be achieved. RL approaches, however, have a number of drawbacks which have limited their uptake as a practical solution for the control of tokamak plasmas. Here we address and begin to alleviate three of these challenges: the difficulty of specifying a scalar reward function that is both learnable and provokes accurate controller performance; the steady-state bias in tracking errors; and long training times. First, in Reward Shaping, we propose reward shaping as an intuitive and simple way to improve control precision. We then address the issue of steady-state error in Integrator Feedback, by providing explicit signals for the error and integrated error to the agent. This reduces the accuracy gap between classical and reinforcement-learned controllers. Finally, in Episode Chunking and Transfer Learning, we address the training time required to generate control policies. RL algorithms are known to have high computational cost and low sample efficiency (Cabi et al., 2019), a problem exacerbated for tokamaks, where even low-fidelity plasma simulators are significantly more computationally expensive than the simulators used in traditional RL applications. We address this by using a multi-start approach for complex discharges, and show substantial reductions in the training time of new policies. Furthermore, we show that warm-starting training with existing control strategies can be very effective when the new scenario of interest is close to a previous scenario. In combination, these techniques lead to a significant reduction in training time and improvement in accuracy, making substantial strides towards enabling RL to be a routinely usable technology for plasma control.

Background Reinforcement Learning
Reinforcement learning (Sutton and Barto, 2018) is a subset of machine learning that represents control problems as the interaction between an agent and an environment. The agent can be seen as the control mechanism, and the environment as the system to be controlled. The agent receives a set of signals, called observations, and sends a set of control signals, known as actions. In standard RL settings, the agent-environment loop operates at discrete intervals. The state of the environment at time t is denoted by s_t, and the measurements observed are a function of this state, o_t = O(s_t). The agent chooses actions according to a control strategy, also known as a policy, which is a potentially non-deterministic function a_t = π(o_t). The environment is influenced by the action taken, and evolves according to a (potentially non-deterministic) dynamics function s_{t+1} = f(s_t, a_t). In traditional control techniques, the goal is generally to keep the error small between a desired reference value and the actual measured value (or measurement-estimated state). Reinforcement learning, in contrast, works with the more general concept of reward. While optimal control also uses the concept of reward (or, respectively, cost), practical algorithms typically place restrictions on the formulation of the system and/or reward. The reward function r_t = R(s_t, a_t, s_{t+1}) is a real-valued function whose output indicates the quality of executing action a_t in state s_t, where higher is better. Note that in general the reward is a generic function of the environment state and action, though often rewards will depend only on s_{t+1}, the state reached as a result of the action. While the reward can relate to an error signal (with increasing reward for decreasing error), it can also be a function combining multiple different error signals, or include terms that penalize certain actions or states (e.g. control coil currents being too large).
An episode of interaction between the agent and the environment begins with the environment in an initial state s_0. The agent observes o_0 = O(s_0) and takes an action a_0 = π(o_0). The environment then advances to state s_1 = f(s_0, a_0), and the agent receives a reward r_0. The agent then takes action a_1, the environment proceeds to s_2, and the agent receives r_1. This continues iteratively until a termination is triggered (e.g. a limit of T steps, or an off-nominal condition). In simulation, this discrete-time approximation is natural, though it is of course only an approximation to the continuous-time evolution of the environment in physical experiments. The goal in RL (and of RL algorithms) is to find the optimal policy that maximizes the discounted accumulated reward over an episode when starting from an initial state s_0:

    J(π) = E[ Σ_{t=0}^{T} γ^t r_t ],    (1)

where 0 < γ ≤ 1 is a discount factor controlling the myopia of the agent.
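As a concrete illustration of the episode loop and discounted return described above, the following toy sketch rolls out a policy in a one-dimensional error-damping environment. It is our own construction for illustration only (the environment, observation function, and policy here are stand-ins, not the FGE simulator or the TCV control policy):

```python
def run_episode(policy, env_step, env_reset, gamma=0.99, max_steps=100):
    """Roll out one episode and return the discounted accumulated reward."""
    state = env_reset()
    discounted_return = 0.0
    for t in range(max_steps):
        observation = state            # the observation map O is the identity here
        action = policy(observation)
        state, reward, done = env_step(state, action)
        discounted_return += (gamma ** t) * reward
        if done:                       # termination condition, e.g. error small enough
            break
    return discounted_return

# Toy 1-D tracking environment: the state is a scalar error, actions damp it,
# and the reward is higher (less negative) when the error is smaller.
def toy_reset():
    return 1.0

def toy_step(state, action):
    next_state = state - action * state     # an action in [0, 1] shrinks the error
    reward = -abs(next_state)
    return next_state, reward, abs(next_state) < 1e-3

ret = run_episode(policy=lambda obs: 0.5, env_step=toy_step, env_reset=toy_reset)
```

A damping policy accumulates a much higher discounted return than a do-nothing policy, which is exactly the signal an RL algorithm exploits when improving π.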

RL for Plasma Magnetic Control
We follow the work of Degrave et al. (2022) in translating the challenge of magnetic control on TCV into a reinforcement learning problem. During a discharge, the environment state, s_t, which for the present purpose we can consider as the complete state of the tokamak, includes the plasma, the electric currents in the control coils and passive structures, and derived quantities such as the resulting magnetic field (as will be discussed, this state is simplified in simulation). The agent acts at a 10 kHz control rate, sending a voltage command to each of the 19 magnetic control circuit power supplies at every time step (2 ohmic circuits, 8 each of high-field-side and low-field-side poloidal field coils, and one in-vessel vertical control circuit consisting of up-down anti-series-connected windings).
The agent observation, o_t, consists of two components: the real-time sensor measurements and the control targets, known as the references. There are 92 real-valued sensor measurements: 34 wire-loop magnetic flux measurements, 38 magnetic field probes, 19 control coil current measurements, and one (redundant) measurement of the difference in currents between the ohmic coils. The references represent the desired plasma configuration, including the location of the plasma, the limit point, and any desired X-points. The plasma boundary is defined by the last closed flux surface (LCFS), which is the outermost closed surface of the iso-flux contours. The desired LCFS is represented in the references by the R and Z coordinate locations of 32 control points. The references may additionally include explicit targets for properties of the LCFS, such as reaching a desired elongation or triangularity. X-points (Grad and Rubin, 1958) are saddle points of the magnetic flux, corresponding to points with B_R = B_Z = 0, where B is the magnetic field; again, the desired X-point locations are represented by their R and Z coordinates. The limiting point is also represented with an R and Z coordinate: either a location along the limiter for limited plasmas, or the location of the active X-point for diverted plasmas. Finally, we additionally provide a desired value for the plasma current, I_p.
For most tasks, the references are not static, but are instead functions of time. We pre-determine a desired evolution of the plasma during the discharge, an example of which is shown in Fig. 1. During execution of the experiment, the measurements and references at time t are combined to create the observation, o_t, from which the voltage commands are determined. Note that the RL agent receives the raw magnetic measurements, not the direct state or a reconstruction of it. No explicit error measurements are provided to the agent, and no observers are needed to convert measurements into error values.

Figure 1 | A simplified set of references (desired plasma configurations) for the showcase_xpoint task. The time (measured from the start of the discharge) and the plasma current (I_p) are displayed above each plasma shape. X-points are marked with a red "X". The limit point is displayed in orange. For the full reference set, see Fig. 12a.
In this work, we use the same basic experimental design as in Degrave et al. (2022). We learn a control policy, π, for a specific experiment through interaction with a simulated environment, and then deploy the resulting policy for a discharge on TCV. Specifically, the dynamics are modeled with the free-boundary simulator, FGE, with additional stochasticity added to model the noise in sensor values and power supplies, and to vary the parameters of the plasma. The sensor noise is applied per environment step, while the plasma parameter variation (the plasma resistivity, the normalized plasma pressure, and the safety factor at the plasma axis) is simplified so that the values are constant within an episode, but are randomly sampled between episodes.
Like Degrave et al. (2022), we use the Maximum-a-Posteriori Optimization (MPO) algorithm (Abdolmaleki et al., 2018) to develop control policies. MPO relies on two neural networks: an "actor" network that outputs the current policy, π, and a "critic" network that approximates the expected accumulated reward of that policy. The agent interacts with 1000 copies of the FGE environment, collecting the observations seen, the actions taken, and the rewards received. The reward received at each step is computed based on how close the plasma state is to the target values contained in the references, augmented by other factors such as avoiding undesirable plasma states. A straightforward translation from the optimal control paradigm to reinforcement learning would be to have a reward component for each error term to be minimized, where each component i is mapped to a scalar value r_i. These values are then combined into a single scalar reward value. Based on recorded sequences of observations, actions, and rewards, the agent alternates policy updates and critic updates using gradient descent on a regularized loss function (see Abdolmaleki et al. (2018) and the Reward Shaping section for details). The updated actor network parameters are used in future interactions with the environment. For plasma discharges, the actor network is restricted to a small architecture that can execute at 10 kHz, but the critic network is only used during training, and so can be sophisticated enough to learn the environment dynamics (which the actor network then uses to learn what is needed for control). The major implementation differences between the setup in Degrave et al. (2022) and the work presented here are:
• All experiments in our setup use JAX (Bradbury et al., 2018) and Haiku (Hennigan et al., 2020) for neural network construction and parameter training.
• The error tolerance on the residual of FGE's Newton-Krylov solver was relaxed from 1e-8 to 1e-4. This led to a training speedup of 2-3x without significantly affecting the simulated control outcome.
As noted above, we define a task, i.e. a desired evolution of the plasma state during a discharge, through a time-varying set of references. In this work, we consider the following tasks:
• shape_70166: a simple stabilization task where the plasma state is maintained as a low-elongation limited plasma. The reference values are constant throughout the episode.
• showcase_xpoint: a time-varying discharge where the agent needs to elongate the plasma and ramp up the plasma current, then shift its vertical position, then divert the plasma (create an active X-point), and finally ramp down the plasma current and reduce the elongation. Fig. 1 illustrates a subset of states for this task; for all references, see Fig. 12a. This corresponds to the task in Fig. 2 of Degrave et al. (2022).
• snowflake_to_perfect: the snowflake configuration (Ryutov, 2007) creates a hexagonal saddle-point structure (a second-order null) by placing two X-points close together. In this task, we initially establish these two X-points at a distance, and then bring them together. The references are shown in Fig. 13. This corresponds to the task in Fig. 3d of Degrave et al. (2022).

Towards Practical RL Controllers
In this section, we detail the contributions of this paper with regard to the agent training process. First, we discuss improving control accuracy through reward shaping. This is followed by our work on reducing the steady-state error through integral observations. We then discuss episode chunking, which is used to improve wall-clock training time. Finally, we explore transfer learning as a further means of reducing training time.

Reward Shaping
Where traditional control algorithms take actions to minimize the error of an actively measured (or estimated) quantity, RL algorithms instead seek to maximize a generically-defined reward signal.This reward maximization objective drives the evolution of the agent's behavior during training.However, the reward value is not computed during deployment.
In classical control algorithms, the performance of the controller can be adjusted by explicitly tuning control gains (e.g. to modify responsiveness or disturbance rejection) and by adjusting trade-off weights for multi-input multi-output (MIMO) systems. In RL, by contrast, the reward function is of central importance to the learned controller's behavior, and careful design of the reward function is therefore necessary to adjust that behavior. In this section, we explore how modifying the design of the reward can elicit desired behaviors in the resulting trained agent. We will see that, by adjusting the design of the reward function, we can quickly adapt agent behavior and trade off elements of our objective. Moreover, we demonstrate that shaping the reward function is essential for creating accurate RL control policies. We further show that it is possible to adapt an agent to a new objective by continuing training with an updated reward function.

Reward Design Overview
We modify the reward function designed for magnetic control by Degrave et al. (2022) (detailed in Tables 3, 4 and 5 of their manuscript). The reward function is a combination of separate components, and in this section we consider how the design of these components can be altered to influence the behavior of the trained agent, in particular where there are trade-offs in performance across reward components.

Reward components correspond to different desiderata of an ideal agent (accurate shape, accurate plasma current, etc.). Each reward component is calculated by taking the difference between the desired value and the value reported by the simulated environment. A non-linear scaling and transformation is applied to this difference, giving an effective reward for that component. The overall (scalar) reward is then computed using a non-linear combination of the individual component rewards. It is in designing the reward components that we have the most fine-grained control over the incentives for the agent.
In this work, we combine the reward component values using a weighted SmoothMax function.
In some cases, an individual reward component is built from several related error quantities, such as the shape error at multiple control points.We also utilise the SmoothMax function to combine these errors into a single scalar component reward.The definition of the SmoothMax function is provided below.

    SmoothMax(x; w, α) = ( Σ_i w_i x_i e^{α x_i} ) / ( Σ_i w_i e^{α x_i} )    (2)
The choice of the weights w_i directly sets the (relative) importance of each component, and these weights are themselves carefully designed. The value α controls the trade-off between "easy" and "hard" (to satisfy) components. A value of α much less than zero (say, α < −5) means that the reward received by the agent is nearly equal to the component on which it is performing least well, while a value of α close to zero means that all components are emphasized equally. A value of α greater than zero is unsuitable, since it would accentuate components that are already controlled well at the expense of components where the agent is doing poorly.
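The effect of α can be seen in a few lines of Python. This is a minimal sketch of the weighted SmoothMax combiner described above (the function name is ours, and the exact form used in our implementation may differ in detail):

```python
import math

def smooth_max(values, weights, alpha):
    """Weighted SmoothMax: an exponentially weighted mean of component rewards.
    alpha << 0 approaches the weighted minimum; alpha -> 0 approaches the
    plain weighted mean."""
    exps = [w * math.exp(alpha * x) for w, x in zip(weights, values)]
    return sum(e * x for e, x in zip(exps, values)) / sum(exps)

rewards = [0.9, 0.2]          # one well-satisfied and one poorly satisfied component
w = [1.0, 1.0]
near_min = smooth_max(rewards, w, alpha=-20.0)   # dominated by the worst component
near_mean = smooth_max(rewards, w, alpha=-0.01)  # close to the plain average
```

With α = −20 the combined reward is essentially the worst component (≈ 0.2), while with α near zero it is essentially the mean (≈ 0.55), matching the trade-off discussed above.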
Many of the individual components that feed into the SmoothMax are constructed similarly to those of classical controllers (e.g. keep the plasma current close to a desired value). However, reward components are not constrained to be (easily) measurable from sensor measurements, providing additional flexibility in their construction. Reward components can also be multi-modal, for example to encourage behavior away from regions of state-space that are undesirable or less well modeled by the simulator.
In this work, we use a SoftPlus transformation to arrive at scalar reward components:¹

    reward = 2 / (1 + exp(−β (x − good) / (bad − good))),    (3)

where x is the raw error quantity. The good and bad parameters act to scale the reward signal into a region of interest. If the true value is worse than bad, the reward rapidly decays toward a value of 0, while if it is at or better than good, the reward saturates to a value of 1. The parameter β affects the "sharpness" of the reward scaling between the good and bad reference points.² The set of scalar reward components is then combined with a (weighted) SmoothMax operator to obtain a final scalar reward. These functions, as well as the influence of the hyperparameter choices, are depicted in Figure 2.
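A minimal Python rendering of this transform follows. The function name and the explicit clipping at 1 are our assumptions for illustration; the behavior at the good and bad reference points matches the description above:

```python
import math

def component_reward(error, good, bad, beta=-math.log(19.0)):
    """Map a raw error to a [0, 1] reward using the lower half of a logistic
    curve: error == good gives 1, error == bad gives 0.1 (with the default
    beta = -log(19)), and the reward decays toward 0 beyond bad."""
    scaled = (error - good) / (bad - good)
    reward = 2.0 / (1.0 + math.exp(-beta * scaled))
    return min(1.0, reward)   # errors better than `good` saturate at 1

r_good = component_reward(0.005, good=0.005, bad=0.05)   # ≈ 1.0
r_bad = component_reward(0.05, good=0.005, bad=0.05)     # ≈ 0.1
```

Tightening good and bad narrows the band of errors over which the reward changes appreciably, which is exactly the lever used in the experiments below.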
Components with tight good and bad parameters must be controlled well to receive a high component reward, while components with looser constraints are more easily satisfied.
In theory, many parameter choices should be approximately equivalent, as they are monotonic adjustments to the reward and should not strongly affect the optimal policy. In practice, however, we rely on gradient descent and do not have a perfect global optimizer, and we must explore a global space in the face of stochastic returns. Tight values of good and bad make it difficult to find any region with appreciable reward (or an appreciable gradient on how to improve). On the other hand, a loose value of bad makes it easier to find a reward signal, but harder to discover precise control, as there is a smaller change in reward upon improvement. Intuitively, "tight" reward parameters may therefore be more appropriate where the initial conditions are close to the goal state, and thus the reward does not need to shape goal discovery as well as fine-grained accuracy.
Relatedly, a value of α near zero provides a reward gradient on all components, but a more diluted signal for each. In contrast, a substantially negative value of α provides a strong incentive to improve the worst component, but removes the incentive to improve the others, and thus harms exploration. These trade-offs are especially prevalent in the plasma domain, where reward components are often complementary (or at least orthogonal) rather than truly in conflict; e.g. accurate X-point control assists accurate LCFS control, and does not detract from accurate plasma current control.

Reward Shaping in a Simple Setting
For our initial experiments, we consider three training approaches focusing on minimizing shape error by modifying the hyperparameters of the reward component for shape error in the shape_70166 task:
1. Baseline Reward: the parameters used in Degrave et al. (2022): good = 0.005, bad = 0.05.
2. Narrow Reward: updating the parameters to good = 0 and bad = 0.025. These reference values produce a more exacting reward function. This concentrates the reward signal at lower error values and offers a guiding signal even for small errors, providing an incentive for increased accuracy in controlling the shape.
3. Reward Schedule: scheduling the values of good and bad to become progressively more peaked as training progresses, with good = 0 and bad decreasing from 0.1 to 0.025 over 6 million policy update steps. This schedule provides a wider reward basin at the beginning of training to help exploration, gradually tightening the reward to encourage accuracy as training progresses. Historical data is not relabelled as the reward function evolves; however, stale data does eventually drop out of the learning agent's replay buffer.

1 Our SoftPlus implementation is based on the lower half of the logistic function instead of the standard SoftPlus, since we want it to be bounded from 0 to 1, with the good value being exactly 1.
2 In our experiments we set β = −log(19) ≈ −2.9444, such that the bad reference point corresponds to a reward of 0.1.
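The Reward Schedule in item 3 can be sketched as a simple annealing of the bad parameter. This is an illustrative sketch only: the interpolation shape is our assumption (we use linear annealing), and the function name is ours:

```python
def scheduled_bad(step, start=0.1, end=0.025, horizon=6_000_000):
    """Anneal the `bad` reference point linearly over training, holding it
    fixed once the horizon is reached (as described for the Reward Schedule)."""
    frac = min(step / horizon, 1.0)
    return start + frac * (end - start)

early = scheduled_bad(0)            # wide basin at the start of training
mid = scheduled_bad(3_000_000)      # halfway through the schedule
late = scheduled_bad(10_000_000)    # held at the tight value after 6M steps
```

The schedule value would then be passed to the component-reward transform at each learner update; past transitions in the replay buffer keep the reward they were recorded with.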
The results from this first set of experiments, shown in Table 1, demonstrate that the reward chosen for training can have a significant influence on the performance of the final trained agent. Focusing on the shape error, we note that the greatest impact came from the Narrow Reward, with its highly demanding static reward function. In this simple task, the more precise reward function provides a strong incentive for the controller to improve accuracy. While such a sharp reward signal can harm policy discovery, as discussed above, the task here is to maintain the handover position, and so exploration is not a major challenge. With little need for exploration to find the highly rewarding states, the agent can focus on satisfying the demanding reward signal. Furthermore, the simplicity of the task means that there is little-to-no trade-off in accurate control between the reward components.
The agent trained with the Reward Schedule, progressively reducing the value of bad, shows no significant improvement over the baseline agent in terms of the targeted reward component (LCFS RMSE). We hypothesize that the poor performance of the agent trained with a reward schedule, as compared to one trained with a fixed reward, is due to the non-stationarity that a changing reward function induces during the more exploratory phase of learning. The agent is able to learn a 'good enough' policy with the initial loose reward function, but the value function, which estimates the expected sum of future rewards and is in turn used to update the policy, struggles to keep pace with the reward function as it evolves. In essence, the value function receives different learning signals for the same states at different points in training, without any explicit engineered ability to adapt to this, which makes the learning problem much more challenging. We note this behavior despite the schedule becoming stationary approximately 60% of the way through training, so that the final 40% of training uses a fixed reward function. An additional factor is that the Gaussian noise added to the actions during training generally decreases throughout training, as is natural under the MPO learning algorithm (Abdolmaleki et al., 2018). This leads to less exploration as training continues, and therefore a potentially reduced capacity to adapt to an evolving reward. The results from this simple setting of maintaining and refining the plasma's handover shape suggest that a more exacting reward can improve agent performance.

Reward Shaping for Complex Tasks
We now turn to the snowflake_to_perfect task where training is more costly and the reward-tuning more complex, due to time-varying objectives and a greater number of metrics of interest.In particular, we seek to improve X-point location accuracy through reward shaping.
We consider the following training configurations for improving X-point location accuracy:
1. Baseline: the agent trained as in Degrave et al. (2022), with the original reward function.
2. Additional Training: the baseline agent, allowed to continue training under the same conditions.
3. X-Point Fine Tuned: the baseline agent, with training continued under an updated reward emphasizing X-point location accuracy.
4. Narrow X-Point Reward: an agent trained from scratch with the more exacting X-point reward.
This allows us to distinguish between the effects of additional training and of changing the reward function.
We compare the performance of the four training configurations listed above; the results are summarized in Table 2. This set of experiments demonstrates the value of a flexible reward specification. We see in the X-Point Fine Tuned result that this two-stage training procedure emphasizing X-point accuracy leads to a 57% reduction in X-point location error. This is not merely due to the extra data: we see a significant improvement in X-point accuracy when compared with the Additional Training experiment, demonstrating the importance of reward shaping rather than extra training cycles in this result. Note, however, that if we compare the X-Point Fine Tuned agent with the Additional Training agent, we see that accuracy on other quantities, such as I_p and shape error, degrades slightly: the fine-tuned agent performs less well in these respects than the agent that continued training with the original reward. While these differences are not statistically significant, they suggest a possible trade-off that arises from reward shaping. In such cases, a control designer can assess the trade-offs of shaping different components and search for the most desirable policy. Reward shaping acts as an intuitive tool for performing this policy search in the space of reward functions.

Table 2 | The baseline agent was trained as in Degrave et al. (2022). This agent was then used as a base for the Additional Training agent (which was allowed to continue training under the same conditions) and for the fine-tuned agent, which continued training with an updated reward emphasizing X-point location accuracy. The Narrow X-Point Reward agent was trained with the shaped reward from scratch. The values are averages over 5 training runs, with the 95% confidence interval indicated.
The agent trained with the more exacting reward throughout training (Narrow X-Point Reward) fails to learn an effective policy. Investigation of the agent's performance shows that it typically struggles to establish the snowflake shape, failing to bring in the second X-point at the desired location. The increased sharpness in the X-point reward shrinks the region in which the agent receives an appreciable positive signal, so when the second X-point is brought in incorrectly the agent receives a low reward, and also little signal on how to improve. Note that the standard deviation is quite high: it is possible for the exploration to succeed and the learning to do well, but the narrow reward decreases the probability of finding the right local minimum. This is compounded by the strongly negative value of α, which emphasizes the worst-performing component and thus further disguises policy improvements. For example, improving LCFS accuracy might help the agent discover how to divert the plasma and thus control its limiting by the X-point location. However, a harsh (i.e. large-magnitude, negative) α and an exacting X-point location reward obfuscate this progress, since the reward calculation is dominated by X-point accuracy, for which the agent receives little to no signal on how to improve.
The failure of the agents trained with the Narrow X-Point Reward from scratch contrasts with the success of the agents trained with a more exacting shape-error component in the simpler shape_70166 task. This demonstrates the importance of matching the design of the reward not only to the desired policy characteristics, but also to the task at hand. In the case of shape_70166, no exploration is needed to find the highly rewarding region of the exacting reward components, since the initial state is the one to be stabilized and persisted. In contrast, the snowflake_to_perfect task requires manipulating the plasma through a series of target shapes and locations; learning from scratch with an exacting reward is therefore difficult. This trade-off motivates our explorations into reward scheduling and agent fine-tuning. It must be said, however, that reward function design is an art, and there may be a better structure that maintains flexibility while mitigating this trade-off.
Our results from the snowflake_to_perfect task demonstrate the benefits of a two-stage training regime: training initially on a more forgiving reward and then switching to a more exacting one. We have demonstrated this using RL for both stages, though one could, for example, use RL to find the right "local minimum" for the control strategy and then, in the second stage, use an optimal-control-based algorithm to fine-tune on the more exacting reward. This two-stage training regime also suggests the potential for a general base controller to be rapidly adapted to a precisely defined set of targets. This in turn is an intermediate step towards a general and precise controller for a multitude of plasma control tasks.

Figure 3 | Simulated errors in the plasma current and shape on the shape_70166 task over a 1 second (s) control window. We compare policies with and without the average-error feedback for three random seeds. The figures demonstrate that incorporating the additional signal considerably reduces the bias in controlling the current, while the plasma shape errors are slightly higher but comparable.

Integrator Feedback
Continuous control trajectories typically include a transient phase, where the agent state is rapidly changing towards the target values, and a steady-state phase, where the agent is close enough to the desired target and control consists of reacting to disturbances to remain close to the target. In traditional proportional-integral-derivative (PID) control, the policy includes linear feedback on the control error, its integral, and its derivative. The integral term in particular is designed to reduce or eliminate the bias in the steady-state error, while the derivative term helps to dampen the response to transient disturbances and reference changes. The feed-forward neural network policies used by Degrave et al. (2022) cannot compute or construct an error integral, as the commanded actions are purely a function of the current inputs. An approximation of the error integral could be computed by a recurrent neural network; however, recurrent networks carry a greater risk of overfitting to the simulation dynamics. In this work, we implement a simpler solution: rather than having a policy learn the error integral, we manually compute it and append it to the set of observations seen by the feed-forward policy. We focus in particular on reducing the steady-state error in the plasma current ($I_p$), for which policies trained in Degrave et al. (2022) exhibited significant bias and which can easily be computed. Diverging slightly from the traditional approach, we provide the network with the average plasma current error at time $t$, $\bar{e}_{I_p}(t)$, defined as
$$\bar{e}_{I_p}(t) = \frac{1}{t}\sum_{\tau=1}^{t} e_{I_p}(\tau),$$
where $e_{I_p}(\tau)$ is the difference between the plasma current measurement and reference values at time $\tau$. This choice of average keeps the numerical inputs better conditioned. Other choices are of course possible, for instance using the integrated error directly, or using an exponentially decaying average to put greater focus on the recent past.
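As a sketch, the running average can be maintained incrementally and appended to the observation vector before it is fed to the feed-forward policy. The class and argument names below are illustrative, not the actual control-system interface.

```python
import numpy as np

class AverageIpErrorObservation:
    """Maintain the running average of the plasma-current error and
    append it to the policy observation (integrator-style feedback)."""

    def __init__(self):
        self.steps = 0
        self.error_sum = 0.0

    def augment(self, observation, ip_measured, ip_reference):
        self.steps += 1
        self.error_sum += ip_measured - ip_reference
        # Dividing by the step count keeps this input well conditioned,
        # unlike a raw integral, which grows with episode length.
        avg_error = self.error_sum / self.steps
        return np.append(np.asarray(observation, dtype=float), avg_error)
```

An exponentially decaying average would replace the division by `self.steps` with a fixed decay factor, trading long-term memory for responsiveness to recent errors.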
We evaluate the benefits of incorporating the average error signal on the shape_70166 task, where the reference values for the plasma current and shape are constant and the environment is initialized so that the actual values are close to the references. Thus the main goal of the agent is to control the steady-state error. Figure 3 shows the simulated plasma current error trajectories for policies trained with and without the integrator feedback, with three random runs for each case. We note that the integrator feedback considerably reduces the plasma current bias, as expected.

The experiments on TCV last 1-2s, which corresponds to 10,000-20,000 time steps at the 10kHz control rate. The FGE simulator (Carpanese, 2021) (used to train the agents, as discussed above) takes around 2 seconds for a typical simulation step during training with stochastic actions on one core of an AMD EPYC 7B12 CPU. Thus, FGE generates an episode with 10,000 steps in approximately 5 hours. This means that in the best-case scenario, when the agent knows the best policy before the first trial, the training time would still be ∼5 hours (to observe the high-quality result).
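The back-of-the-envelope cost of a single simulated episode follows directly from these numbers (the 2 s/step figure is the approximate FGE cost quoted above):

```python
control_rate_hz = 10_000      # 10 kHz control loop on TCV
discharge_seconds = 1.0       # a 1 s control window
sim_seconds_per_step = 2.0    # approximate FGE cost per training step

steps_per_episode = int(control_rate_hz * discharge_seconds)
episode_hours = steps_per_episode * sim_seconds_per_step / 3600.0
# steps_per_episode = 10,000 and episode_hours ≈ 5.6, i.e. roughly
# 5 hours even if the very first trial were already optimal.
```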
In practice, RL agents need to explore the action space to find the best policy. Thus, depending on the task complexity, training times vary from days to weeks. Moreover, our tasks are structured such that the agent needs to learn somewhat independent "skills" sequentially. For example, in the showcase_xpoint task, the agent must first elongate the plasma, then shift its vertical position, then divert it, and finally restore the original shape (see Fig. 1). We observe that learning for this task happens in two distinct stages (see Fig. 4a). First, the agent learns to manipulate limited plasmas, understanding how to elongate, move and hold the plasma, which corresponds to the smooth reward curve from 0 to around 80. At this stage, it is trying (but failing) to generate a diverted shape; instead it obtains a round LCFS with an inactive X-point, as demonstrated in Fig. 4b. The reward plateaus at this level until, finally, the agent discovers how to successfully divert the plasma, where we see a transition in reward from 80 to near 100. In our standard setup, it takes several days of training for the agent to discover this transition. This plateau lasts for an extended period because it is hard to learn to divert the plasma. This is compounded by the fact that the agent needs to divert at the specified time, so the right exploration needs to happen at the right time. On top of this, the long sequential episodes require the agent to go through the entire initial discharge phase before attempting to divert the plasma. On a slow simulator, these effects compound to create training times ranging from days for simple tasks to weeks for more complicated ones. Furthermore, Degrave et al. (2022) show that increasing the number of actors yields diminishing returns, so one cannot speed up training further by simply using more resources.

Figure 5 | FBT takes shape references and reconstructs the entire plasma state. Two different plasma shapes on the left are constructed via FBT and shown on the right. The initial state is visualized via the isolines of the magnetic flux.
Our observations above (see Fig. 4 and the related discussion) suggest that training can be accelerated by splitting the single long episode into a set of shorter episodes ("chunks") and letting individual actors explore each chunk separately from the very beginning of training. This effectively parallelizes exploration. Fig. 12 demonstrates different training setups with two and three chunks for the showcase_xpoint task. To achieve this in practice, we divide actors into groups, where one group trains on the full-length discharge and the other groups are each assigned a chunk. Each group starts the episode with the simulator in a state relevant to the selected chunk. The agent gets many more attempts at exploration from the shorter episodes. The specific division for any task is a hyperparameter; in general we have found success assigning 25-50% of actors to the full episode and dividing the rest equally among the chunks. More complex schemes could also be employed, for instance adjusting the number of actors based on difficulty. The simulator state at the beginning of each chunk is specified by a given plasma shape (as in Fig. 1) and is computed using the FBT algorithm (Hofmann, 1988), which calculates the poloidal field coil currents needed to achieve a given plasma shape. This calculation neglects eddy-current effects as well as the time-varying Ohmic coil currents required to induce the plasma current, hence it is not guaranteed that this initial condition matches the end point of the previous chunk, leading to chunk "discontinuities". Fig. 5 demonstrates how two desired plasma shapes on the left are constructed with FBT and presented on the right. Plasma states are visually represented via the isolines of the magnetic flux (lines at which the magnetic flux is constant). We also display the last closed flux surface (solid black line), X-points (red crosses) and the limiter points (magenta circles).
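One simple way to implement the actor split described above is sketched below; the group structure and the 25-50% guideline come from the text, while the function and dictionary names are illustrative.

```python
def assign_actor_groups(num_actors, num_chunks, full_fraction=0.25):
    """Assign actors to the full-length episode and to shorter chunks.

    `full_fraction` of the actors train on the full discharge (the text
    suggests 25-50%); the remainder is divided equally among the chunks,
    with any leftover actors added to the full-episode group.
    """
    full = max(1, int(num_actors * full_fraction))
    per_chunk, leftover = divmod(num_actors - full, num_chunks)
    groups = {"full_episode": full + leftover}
    for i in range(num_chunks):
        groups[f"chunk_{i}"] = per_chunk
    return groups
```

For 100 actors and three chunks this yields 25 actors on the full episode and 25 per chunk; each chunk group would reset its simulator to the FBT-computed state for that chunk's starting shape.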
In theory, misalignment between the tokamak state at the end of one chunk and the starting state of the next (i.e., discontinuities) could create problems. This is particularly, but not exclusively, worrisome with regard to the control coil currents. For example, the agent may learn to aggressively ramp the coils, as coil currents "reset" between chunks and so the risk of current saturation is reduced. Alternatively, the agent may learn to produce large swings in coil currents to "smooth out" these discontinuities, which is undesirable and could lead to problems not necessarily captured in our model. In practice, we do not observe this to be a problem, as shown below. The agent learns to naturally resolve these discontinuities through training on the full episode, and so we do not see discontinuities in simulation in practice. This issue could also be addressed through more principled strategies, for example by widening the distribution of starting states for the chunks to provide greater overlap.
Application of the chunking technique to the showcase_xpoint task with two or three chunks (depicted in Fig. 12) leads to significantly faster training, as shown in Fig. 6. The two-chunked setup (orange curve) is already faster than the baseline (blue). The three-chunked setups (3_chunks and 3_chunks_eq_weights) not only provide a further training speed-up but also a much smoother learning curve. The agent achieves a reward of 96 (out of 100) in ∼10 hours, compared to 40 hours for the baseline. Here we try two different three-chunked setups: in 3_chunks_eq_weights, all actors are split into equally sized groups; in 3_chunks, three times more actors are used for the whole episode than for each individual chunk. Both setups give similar results.

Figure 6 | Results for episode chunking applied to the showcase_xpoint task. Two- and three-chunked setups are not only faster than the baseline (default), but also have smoother learning curves. The two three-chunked setups considered are: equal actor distribution among all chunks (3_chunks_eq_weights) and more actors for the whole episode (3_chunks). In all experiments the results are averaged over three random seeds.

Transfer learning
When attempting to reduce training time, a natural question is whether training from previous discharges can be re-used, that is, to what extent the knowledge accumulated by the agent while solving an initial task transfers to a related target task.
Tokamak operators often experiment with different variations around a base task.Thus, we examine the transfer learning question when the target task is a variation on the initial task.Specifically, we examine performance when adjusting the reference plasma current and also shifting the location of the plasma.
We examine the performance of transfer learning in two forms:
1. Zero-shot: We run the policy learned for the initial task on the target task, without any additional data collection or policy parameter updates.
2. Fine-tuning: We initialize the policy and value function with the weights of the model learned on the original task, and then train on the adjusted task by interacting with the environment with the new task as the reward. Note that this requires the same architecture (actor and critic networks) to be used for both tasks.
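A minimal warm-start sketch of the fine-tuning setup, assuming parameters are stored as name-to-array mappings (the checkpoint layout and names here are illustrative): copying requires matching parameter names and shapes, which is why the actor and critic architectures must be identical across tasks.

```python
import numpy as np

def warm_start(target_params, source_params):
    """Copy base-task weights into a new agent before fine-tuning.

    Raises if the architectures differ, since transfer as described
    here requires identical actor and critic networks on both tasks.
    """
    for name, weights in source_params.items():
        if name not in target_params or target_params[name].shape != weights.shape:
            raise ValueError(f"architecture mismatch for parameter {name!r}")
        target_params[name] = np.array(weights, copy=True)
    return target_params
```

After warm-starting, training proceeds exactly as from scratch, only with the target-task reward and the initialized weights.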
In both cases, we use the parameters from an agent trained on the showcase_xpoint task as the initial parameters for the transfer. In the first experiment, we examine transfer when the reference plasma current is adjusted to new reference levels. Concretely, we choose three variations where the target $I_p$ is adjusted from the baseline −150kA to −160kA, −170kA, and −100kA (specifically, we adjust the reference current in all time slices in Figure 1 except for the initial handover level and the final rampdown level). We test the policy trained on showcase_xpoint, first without any additional training on the target task, and then allowing new training on the target task. The zero-shot results for reward and $I_p$ error are shown in Table 3, where we see the agent does well for small changes to $I_p$, though it struggles for larger shifts, in particular in diverting the plasma. The fine-tuning results can be seen in Figs. 7a, 7b and 7c, and show that the fine-tuned agents converge to a near-optimal policy faster than agents trained from scratch in all cases, although the difference is less pronounced for the largest, 50kA, shift.
We see that transfer learning is, in general, very effective when modifying the target plasma current. For small shifts, the unadjusted baseline agent performs almost as well as a specially trained agent. As the adjustments to the current get larger, zero-shot performance suffers, but performance can be recovered with a small amount of fine-tuning. Assuming these results generalize, this implies that a small range of plasma currents can be tested with the same base agent, and a larger range can be quickly trained from the base agent.
The second experiment examines variations in the plasma target location. Specifically, we adjust the target shape downward along the z axis, shifting by 2cm, 10cm and 20cm. For this experiment we observed the following:
1. Zero-shot: Results are shown in Table 4. We see that zero-shot transfer works very well for the smallest shift (2cm), achieving over 97% of the best achievable performance of 100 for the task and a small shape error. For the larger 10cm shift, the performance is mediocre, with a reward of only 85 and a much larger error in the shape location. For the largest shift (20cm), the performance is poor, with a reward of only 35, due to a failure to successfully divert the plasma.
2. Fine-tuning: The fine-tuning results can be seen in Figs. 7d, 7e and 7f, and show that transfer learning is clearly effective for the 2cm shift, and effective for two out of three seeds for the 10cm shift. For the larger 20cm shift, transfer learning seems detrimental rather than beneficial.
As with the plasma current, we see that the baseline agent is capable of performing well for small adjustments to the shape. This similarly provides promise that an agent trained for a single condition can perform well for a variety of small adjustments without further training. However, the story is more mixed for fine-tuning, where using a baseline agent dramatically speeds up training for small adjustments, but is harmful for larger adjustments.
Overall, the results indicate that transfer learning can be useful, but also has limitations in its current form. As expected, the further the target task is from the initial task, the more the performance of transfer learning degrades, particularly for zero-shot transfer. However, note that it is relatively cheap (in CPU-hours) to run a simulated zero-shot evaluation to test performance before running a hardware experiment. We also observed that some types of task changes allow for easier transfer than others: in our experiments, relatively large $I_p$ shifts seemed more suited to transfer learning than large position shifts, which is understandable given the relative complexity of the tasks. Further study is needed to understand which tasks are amenable to transfer learning, and how to expand the regions of effective transfer, both zero-shot and fine-tuning.

Tokamak Discharge Experiments on TCV
The previous sections have focused solely on simulation, training and assessing control policies with the FGE simulator. Given the complexity and challenges of tokamak modeling, it is important not to blindly assume that performance improvements in simulation translate identically into performance improvements in physical discharges. While better simulation results may be necessary for improved results on the actual tokamak, they are not always sufficient, and model-mismatch errors may begin to dominate without additional explicit work to reduce sim-to-real gaps. This is especially the case for policies obtained using RL, which are known to overfit to imperfect simulators (Zhang et al., 2018). We thus tested several of the aforementioned simulation enhancements in dedicated discharges on the TCV tokamak. This way we can assess the strengths and limitations of our current work, and provide direction for the next set of improvements.

Reward Shaping for Plasma Shape Accuracy
We examine the accuracy improvements from reward shaping for two different configurations and objectives: reducing the LCFS error in a shape-stabilization task, and improving X-point accuracy in the snowflake_to_perfect task configuration. We compare the results seen in simulation with those on TCV, and with comparable TCV experiments from Degrave et al. (2022). Like Degrave et al. (2022), we deploy the control policy by creating a shared-library object out of the actor network (defined by a JAX graph), where the commanded action is taken as the mean of the output Gaussian distribution.
We first test a control policy trained to reduce the LCFS error in the shape_70166 stabilization task using the reward shaping approach discussed above in the Reward Shaping section.
For this stabilization task, we use TCV's standard breakdown procedure and initial plasma controller. At 0.45s, control is handed over to the learned control policy, which then tries to maintain a fixed plasma current and shape for a duration of 1s. After the discharge, we compute the reconstructed equilibria using the LIUQE code (Moret et al., 2015). At each 0.1ms slice during the 1s discharge, we compute the error in the plasma shape. We compare the accuracy of three experiments, measuring the shape error from a simulated discharge and a TCV discharge: (a) a baseline RL controller that pre-dates this work ("Previous"); (b) an updated baseline agent using the updated training infrastructure used in this work ("Updated"); (c) an agent trained using reward shaping, as in the Fixed Reward setup described in the Reward Shaping section ("Reward-Shaped").
The results of these runs are reported in Table 5.
Both recent policies, the updated baseline and the reward-shaped policy, significantly outperform the pre-existing baseline for the goal of reducing LCFS error, as shown in Table 5. This reduction is due to improvements in the training infrastructure. These improvements are also seen in the TCV experiments: both the updated baseline and the reward-shaped agent outperform the previous baseline. However, comparing the two contemporary experiments, entitled Updated and Reward-Shaped respectively, we see that the shaped policy performs worse on TCV relative to the updated baseline, despite achieving better results in simulation. One hypothesis for this difference is that the plasma resistivity during the TCV discharge was near the edge of the range of variation used during training, and that the shaped controller was less robust to these variations. Over the course of the discharge, the difference in tokamak state caused by the higher plasma resistivity (coil currents, etc.) compounds. This could explain the fact that the error is low during the initial stage of the discharge, but then grows in time as the discharge proceeds. Overall, we see that improvements to performance in simulation are beneficial: the accuracy of the updated infrastructure is higher than that of the previous baseline. However, there is a limit to optimizing simulation performance. Indeed, it seems that for this case there is little to be gained by further reducing the simulation RMS error, and we should now focus instead on addressing the sim-to-real gap.

Table 5 | Comparison of policies on LCFS shape error for the shape_70166 stabilization task.

Reward Shaping for X-Point Location Accuracy
We next compare the effects of reward shaping on the more complicated "snowflake" configuration, shown in Figure 9. The training reward for this policy was shaped to increase the accuracy of X-point control. As in the stabilization experiment, the plasma is created and initially controlled by the standard TCV procedures, handing over to the RL controller at 0.45s. In this experiment, the RL-trained policy successfully established a snowflake with two X-points at a distance of 34cm. The policy then successfully brought the two X-points to a targeted distance of 6.7cm, coming close to establishing a so-called "perfect snowflake". However, at 1.0278s (0.5778s after handover), the plasma disrupted due to vertical instability. Upon inspection, it seems that the controller struggled to keep a consistent shape: the vertical oscillations increased, and the active X-point switched between the two X-points, leading to a loss of control. Table 6 shows the accuracy of X-point tracking during the window in which the plasma was successfully controlled. The performance during this experiment is compared with the equivalent snowflake experiment reported in Degrave et al. (2022). Similar to above, we compute the error from the plasma states reconstructed by LIUQE. We see that the substantial improvements to X-point accuracy seen in simulation do indeed lead to significant improvements in X-point accuracy on hardware. The improvements from reward shaping result in a 59.7% reduction in RMSE tracking distance over the control window compared to the previous TCV experiment. Other metrics, such as the LCFS error, show a minimal decrease in accuracy, which is expected, as described in the Reward Shaping section. Here, we do indeed see notable benefits from reward shaping, though work remains on bridging the sim-to-real gap for maintaining highly accurate perfect snowflakes.

Table 6 | Comparison of policies for X-point tracking for the snowflake configuration.
Figure 9 | Evolution of TCV equilibrium reconstructed post-shot using LIUQE based on magnetic measurements, for Snowflake TCV shot (77505).

Validation of Accelerated Training via Episode Chunking
Finally, we validate the use of episode chunking to reduce training time, and especially verify that the possible "discontinuities" from episode chunking do not show up in TCV discharges. We ran an experiment for the showcase configuration trained using 3 chunks. The time-trace of reconstructed equilibria for this experiment can be seen in Figure 11. We find that the experiment went as expected, with no noticeable artifacts due to the episode chunking. This demonstrates that there is no loss of quality from this training-acceleration approach.

Conclusions and Future Work
There is excitement around the potential impact of reinforcement learning for magnetic control of tokamaks, but drawbacks limit its uptake within the community. In this paper, we significantly improved upon several key limitations, with a focus on policy accuracy and overall training speed.
We first addressed the issue of controller accuracy. In the Reward Shaping section, we showed that reward shaping and tuning can significantly improve controller accuracy, reaching a 65% reduction in LCFS error in simulation. We further showed that providing an integral observation to the agent significantly reduces its long-term bias. Combined, these results show promise for the ability of RL to generate highly accurate controllers.
We then demonstrated the effectiveness of episode chunking in alleviating exploration challenges and helping the agent discover control policies for complex configurations. Dividing training episodes in this way significantly reduced the training time for an example diverted plasma. We additionally showed that transfer learning, by warm-starting training, can allow for the rapid generation of policies when minor adjustments are made to the training task. These are two powerful tools that provide significant reductions in the amount of time needed to train new policies.
While these results significantly reduce the limitations of reinforcement learning controllers, there is still substantial room for improvement. Going forward, there will need to be an increased focus not only on improving performance in simulation, but also on matching that level of performance during actual plasma discharges on hardware. In particular, the experiments in Table 6 show that the gap in accuracy between simulation and hardware is now close to dominating any remaining improvements in simulation.
There are a number of promising directions for improving hardware transfer, for example improving the modeling of plasma parameter variation (to expose the agent to more realistic scenarios), and improving the real-time knowledge of the agent, for instance by integrating real-time plasma observers directly into the agent's observations. More ambitiously, substantial improvement could be gained by using fine-tuning to update policies in response to experimental data. Such data could be used to directly fine-tune the weights of a policy for a specific experiment, or alternatively to improve simulation capability, thus indirectly improving agent quality. In either case, this will be challenging given the paucity of data.
Similarly, there are many opportunities for continued reduction of training-time requirements. Our results on episode chunking suggest that exploration is a significant bottleneck to training time. Explicit exploration techniques, for example Taïga et al. (2019), could overcome the bottleneck of taking the right action at critical moments, and thus significantly reduce training time. Relatedly, using training data from previous experiments combined with offline RL approaches (Levine et al., 2020) could provide key demonstrations, avoiding the need for the agent to 'unlock' each difficult moment anew during training. One could also use pre-computed feed-forward coil current trajectories from existing optimizers as a starting point for policy creation.
Alternative model architectures to the existing MLPs and LSTMs could also provide significant benefits. State-space models (Gu et al., 2021) are one promising approach for modeling long-range dependencies without sacrificing inference speed. Another promising direction is foundation models (Bommasani et al., 2021), which have shown impressive generalization and fine-tuning capabilities (Brohan et al., 2022; Team et al., 2023). Potentially, a single large-scale model could learn to control many plasma discharges, and adapt to specific scenarios after a few trials. One significant challenge for this direction, however, is generating a policy that can execute at the high frequencies required for plasma control.
Overall, reinforcement learning remains an attractive alternative for plasma control. This work has begun to alleviate some of the remaining blockers to its adoption for magnetic control, and there are many promising directions for continued enhancement.
Plot depicting the impact of the good and bad reference values on reward component values. Note that the bad reference points correspond to a reward of 0.1.

1. Baseline: train with default parameters taken from Degrave et al. (2022): good = 0.005, bad = 0.05.
2. X-Point Fine Tuned: first train with default parameters and then perform a second phase of training with a more exacting reward which emphasizes X-point accuracy: good = 0, bad = 0.025.
3. Narrow X-Point Reward: train with a more exacting reward function from the inception of training: good = 0, bad = 0.025.
4. Additional Training: perform the additional phase of training without updating the reward.
(a) Learning curves for the "showcase_xpoint" task for 3 different seeds. (b) References and the actual plasma state during the first plateau phase.

Figure 4 | The RL agent learns the showcase_xpoint task in two distinct stages. (a) A smooth reward curve from 0 to 80 (out of 100), then a plateau and another smooth transition to a reward of 100. (b) Before the transition, the LCFS (black solid line) is not aligned with the references (green circles), which means that the agent knows how to hold the plasma but cannot yet generate the desired diverted shape.

Figure 7 | Fine-tuning results from showcase_xpoint to target tasks with different reference current (a, b, c) and vertical plasma position (d, e, f). The plots show total reward computed by the deterministic evaluator. The blue lines correspond to training on the target task from scratch (baseline), while the orange lines correspond to transfer learning by initializing the agent with the model trained on showcase_xpoint. All experiments are run three times with different seeds.

Figure 10 | X-point tracking distance RMSE comparison against a previous comparison TCV discharge.

Figure 11 | Evolution of TCV equilibrium reconstructed post-shot using LIUQE based on magnetic measurements, for showcase TCV shot (77620).

Table 1 |
Figure 2 | Visualisation of the influence of reward hyperparameters. Results from reward adjustment in the simple stabilization task, shape_70166. The values are averages over 5 training runs with the 95% confidence interval indicated. The $I_p$ target is -120kA.

Table 2 | Results from reward adjustment in the snowflake_to_perfect task. The baseline agent was trained with the reward used by Degrave et al. (2022).

Table 3 | Reward and $I_p$ error for zero-shot transfer with changes to the plasma current targets on the showcase_xpoint task. Values represent change magnitudes; a change from −150kA to −160kA is a 10kA increase. The maximum reward for this task is 100. The $I_p$ tracking error is computed during the phase of constant $I_p$ (0.25s to 0.85s).

Table 4 | Reward and shape error for zero-shot transfer shifting the location of the plasma on the showcase_xpoint task. Values represent shifts downward in the domain. The maximum reward for this task is 100. The shape RMS error is computed between 0.25s and 0.85s.