Distributed deep reinforcement learning for simulation control

Several applications in the scientific simulation of physical systems can be formulated as control/optimization problems. The computational models for such systems generally contain hyperparameters, which control solution fidelity and computational expense. The tuning of these parameters is non-trivial, and the general approach is to manually "spot-check" for good combinations. This is because optimal hyperparameter configuration search becomes impractical when the parameter space is large and when the parameters may vary dynamically. To address this issue, we present a framework based on deep reinforcement learning (RL) to train a deep neural network agent that controls a model solve by varying parameters dynamically. First, we validate our RL framework on the problem of controlling chaos in chaotic systems by dynamically changing the parameters of the system. Subsequently, we illustrate the capabilities of our framework for accelerating the convergence of a steady-state CFD solver by automatically adjusting the relaxation factors of the discretized Navier-Stokes equations during run-time. The results indicate that run-time control of the relaxation factors by the learned policy leads to a significant reduction in the number of iterations for convergence compared to random selection of the relaxation factors. Our results point to the potential benefits of learning adaptive hyperparameter strategies across different geometries and boundary conditions, with implications for reduced computational campaign expenses.


Introduction
In recent years, there has been a proliferation in the use of machine learning for fluid mechanics problems that exploit the vast amount of data generated from experiments, field measurements, and large-scale high-fidelity simulations [1][2][3]. A key ML algorithm is the artificial neural network, which has been widely used to tackle challenges in fluid mechanics. The artificial neural network is popular due to its ability to approximate complex nonlinear functions from large amounts of data, although this comes at the expense of reduced interpretability. Different types of neural networks have been adopted for a variety of tasks such as turbulence closure modeling [4][5][6][7][8], reduced order modeling [9][10][11][12][13], super-resolution and forecasting [14][15][16], solving partial differential equations [17,18], flow control [19,20], etc. Apart from fluid mechanics, there is an ever-increasing trend in the use of machine learning in materials science [21,22], computational chemistry [23][24][25], and medical imaging [26,27]. A common thread linking a large majority of these investigations is that machine learning has been deployed in a supervised fashion. Supervised learning relies on labeled data to learn relationships from observables to targets and is often limited by the lack of pre-existing data.
In contrast, reinforcement learning (RL) is another machine learning paradigm, in which an agent interacts with an environment and learns a policy that maximizes a predefined long-term reward [28]. Multiple methods can be employed as the agent in RL, but the most successful is a neural network used in conjunction with RL, known as deep RL. Deep RL is gaining widespread popularity due to its success in robotics, playing Atari games, the game of Go, and process control [29][30][31][32][33]. Recently, the field of quantum reinforcement learning has emerged, which aims at developing quantum agents that can interact with the outer world and adapt to it with the strategy of achieving a certain final goal [34][35][36]. Deep RL is also making inroads in fluid mechanics as a means to solve fundamental problems in flow control. RL has been utilized to learn the collective hydrodynamics of self-propelled swimmers to optimally achieve a specified goal [37,38], to train a glider to autonomously navigate in a turbulent atmosphere [39,40], for controlled gliding and perching of ellipsoid-shaped bodies [41], and for navigation of smart microswimmers in complex flows [42]. RL has also been used for closed-loop control of the drag of a cylinder flow [43], control of a two-dimensional convection-diffusion partial differential equation [44], designing control strategies for active flow control [45,46], discovering computational models from data [47], and shape optimization [48]. RL's attractiveness stems from its ability to perform both exploitation and exploration to learn about a particular environment.
Many real-world system optimization problems can be formulated as RL problems and offer potential for the application of recent advances in deep RL [49]. Several problems in scientific simulations share similarities with system optimization problems. For example, the iterative solver in numerical simulations of fluid flows may be considered a plant to be controlled. The iterative solver involves different parameters, such as relaxation factors, smoothing operations, type of solver, and residual tolerance, and all of these parameters could be controlled during run-time with RL to achieve faster convergence. Another example is the turbulence closure model employed in computational fluid dynamics (CFD) simulations to represent unresolved physics. The closure coefficients of the turbulence model can be calibrated with RL similarly to the optimization of hyperparameters in neural networks [50]. Indeed, multi-agent RL was recently introduced as an automated turbulence model discovery tool in the context of sub-grid scale closure modeling in large eddy simulations of homogeneous isotropic turbulence [51].
To illustrate the application of RL for the control of the parameters in computational models employed in scientific simulation, we consider the problem of accelerating the convergence of steady-state iterative solvers in CFD simulations by dynamically controlling the under-relaxation factors of the discretized Navier-Stokes equations. More concretely, we train the RL agent to control the under-relaxation factors to optimize the computational time, or the number of iterations, needed for the solution to converge. This control problem differs from standard control problems, as the target state (i.e., the converged solution) is not known a priori. There have been some studies on dynamically updating under-relaxation factors with the goal of maximizing the speed of convergence. One method uses a fuzzy control algorithm based on the oscillations of the solution norm observed over a large interval prior to the current iteration [52]. Other methods based on neural networks and fuzzy logic have also been proposed to automate the convergence of CFD simulations [53,54]. Limitations of the fuzzy logic methods are that they rely on certain assumptions about the system that might not always be accurate, and that they do not scale well to larger or more complex systems. On the other hand, a neural network agent trained using RL can discover a more accurate decision-making policy to achieve a given objective and can be applied to complex systems. The problem of controlling under-relaxation factors in CFD simulations is analogous to controlling the learning rate of the stochastic gradient descent (SGD) algorithm, and RL has been demonstrated to achieve better convergence of SGD than human-designed learning rates [55]. In the present work, we introduce a novel method based on RL to accelerate the convergence of CFD simulations applied to turbulent flows and demonstrate the robustness of the method on the backward-facing step test case.
One of the major challenges in using RL for scientific simulations is the computational cost of simulating an environment [38,56]. For example, the cost of a CFD solver depends on several factors, such as mesh refinement, complexity of the problem, and the level of required convergence, and can range from the order of minutes to the order of several hours. RL is known to be sample inefficient, and most popular RL strategies require a large number of interactions with the environment. Therefore, the RL framework should take the computational overhead of simulating an environment into account for scientific computing problems. One approach to accelerate RL training is to run multiple environments concurrently on different processors and then collect their experience to train the agent. We refer to this multi-environment approach as distributed RL in this study. There are many open-source packages, such as RLlib [57], Tensorforce [58], SEED RL [59], and Acme [60], that can be exploited to achieve faster training with distributed RL. In addition to the distributed execution of the environment, the CFD simulations can also be parallelized to run on multiple processing elements [61,62]. The contributions of this work can be summarized as follows:
• A distributed deep RL framework is presented for the optimization and control of system parameters encountered in scientific simulations.
• We apply the deep RL framework to the task of restoring chaos in transiently chaotic systems, and for accelerating the convergence of a steady-state CFD solver.
• We highlight the importance of distributed RL and how it aids in accelerated training for scientific applications.
• We also provide an open-source code and set of tutorials for deploying the introduced framework on distributed multiprocessing systems.
This paper is organized as follows. Section 2 provides background information on the RL problem formulation and presents the proximal policy optimization (PPO) algorithm. In Section 3, we validate our framework using the test case of restoring chaos in chaotic systems. Then, we describe how controlling the relaxation factors in CFD simulations is converted into a sequential decision-making problem and discuss the performance of the RL agent in generalizing to different boundary conditions and a different geometry setup. Finally, we conclude with a summary of our study and outline future research directions in Section 4.

Reinforcement Learning
In this section, we discuss the formulation of the RL problem and briefly describe the proximal policy optimization (PPO) algorithm [63], which belongs to the class of policy-gradient methods. In RL, the agent observes the state of the environment and, based on these observations, takes an action. The objective of the agent is to maximize the expected value of the cumulative sum of rewards. This problem statement can be framed as a Markov decision process (MDP). At each time step t, the agent observes some representation of the state of the system, s_t ∈ S, and based on this observation selects an action, a_t ∈ A. As a consequence of the action, the agent receives a reward, r_t ∈ R, and the environment enters a new state s_{t+1}. Therefore, the interaction of the agent with the environment gives rise to a trajectory

τ = {s_0, a_0, r_0, s_1, a_1, r_1, . . . }.    (1)
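The agent-environment interaction that generates such a trajectory can be sketched as a simple loop. The snippet below is a minimal illustration with a hypothetical toy environment and policy (not the environments used in this work):

```python
def rollout(env_step, policy, s0, horizon):
    """Collect a trajectory tau = {(s_0, a_0, r_0), (s_1, a_1, r_1), ...}
    by repeatedly applying a policy to an environment step function."""
    trajectory = []
    s = s0
    for _ in range(horizon):
        a = policy(s)                # agent observes s_t and selects a_t
        s_next, r = env_step(s, a)   # environment returns s_{t+1} and r_t
        trajectory.append((s, a, r))
        s = s_next
    return trajectory

# Toy environment: state is an integer, the action shifts it,
# and the reward is the negative distance of the new state from zero.
def env_step(s, a):
    s_next = s + a
    return s_next, -abs(s_next)

traj = rollout(env_step, policy=lambda s: -1 if s > 0 else 1, s0=3, horizon=5)
```

Every name here (`rollout`, `env_step`) is illustrative; a real environment would wrap the Lorenz integrator or the CFD solver behind the same step interface.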
The advantage of the MDP framework is that it is flexible and can be applied to different problems in different ways. For example, the time steps in the MDP formulation need not refer to the fixed time step used in CFD simulations; they can refer to arbitrary successive stages of taking the action. The goal of the RL agent is to choose actions that maximize the expected discounted return over the trajectory τ; mathematically, this return can be written as

G_t = Σ_{k=0}^{T} γ^k r_{t+k},    (2)

where γ is a parameter called the discount rate that lies in [0, 1], and T is the horizon of the trajectory. The discount rate determines how much importance is given to long-term rewards compared to immediate rewards. If γ → 0, the agent is concerned with maximizing immediate rewards, and as γ → 1, the agent takes future rewards into account more strongly. RL tasks are also classified based on whether or not they terminate. Some tasks involve agent-environment interactions that break into a sequence of episodes, where each episode ends in a state called the terminal state. On the other hand, there are tasks, such as process control, that go on continually without limit. The horizon T of the trajectory for these continuing tasks goes to infinity in Equation 2.
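The discounted return of Equation 2 can be computed from a recorded reward sequence by the backward recursion G_t = r_t + γ G_{t+1}; a minimal sketch:

```python
def discounted_return(rewards, gamma):
    """Compute G_0 = sum_{k=0}^{T-1} gamma^k * r_k for a finite reward
    sequence, via the backward recursion G_t = r_t + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# gamma -> 0 keeps only the immediate reward; gamma -> 1 sums all rewards.
rewards = [1.0, 1.0, 1.0]
assert discounted_return(rewards, 0.0) == 1.0
assert discounted_return(rewards, 1.0) == 3.0
```

For γ = 0.5 the same sequence gives 1 + 0.5 + 0.25 = 1.75, showing how intermediate discount rates interpolate between the two extremes.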
In RL, the agent's decision-making procedure is characterized by a policy π(s, a) ∈ Π. The RL agent is trained to find a policy that optimizes the expected return when starting in state s at time step t; this quantity is called the V-value function and can be written as

V^π(s) = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k} | s_t = s ].    (3)

Similarly, the expected return starting in a state s, taking an action a, and thereafter following the policy π is called the Q-value function and can be written as

Q^π(s, a) = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k} | s_t = s, a_t = a ].    (4)

We also define an advantage function, which can be considered a lower-variance version of the Q-value function obtained by taking the V-value function as the baseline. The advantage function can be written as

A^π(s, a) = Q^π(s, a) − V^π(s).    (5)

In deep RL, a neural network is used as the RL agent, and therefore the weights and biases of the neural network are the parameters of the policy [64]. We use π_w(·) to denote that the policy is parameterized by w ∈ R^d. The agent is trained with an objective function defined as [28]

J(w) = V^{π_w}(s_0),    (6)

where an episode starts in some particular state s_0 and V^{π_w} is the value function for the policy π_w. The policy parameters w are learned by estimating the gradient of the objective function and plugging it into a gradient-ascent algorithm as follows:

w ← w + α ∇_w J(w),    (7)

where α is the learning rate of the optimization algorithm. The gradient of the objective function can be computed using the policy gradient theorem [64] as follows:

∇_w J(w) = E_{π_w}[ ∇_w log π_w(a_t | s_t) Q^{π_w}(s_t, a_t) ].    (8)

There are two main challenges in using the above empirical expectation. The first is the large number of samples required, and the second is the difficulty of obtaining stable and steady improvement. Different families of policy-gradient algorithms have been proposed to tackle these challenges [65][66][67]. The performance of policy-gradient methods is highly sensitive to the learning rate α in Equation 7; if the learning rate is too large, training can become unstable.
The proximal policy optimization (PPO) algorithm uses a clipped surrogate objective function to avoid excessively large updates to the policy parameters in a simplified way [63]. The clipped objective function of the PPO algorithm is

L^{CLIP}(w) = Ê_t[ min( r_t(w) Â_t, clip(r_t(w), 1 − ε, 1 + ε) Â_t ) ],    (9)

where Â_t is the advantage estimate at time step t, and r_t(w) denotes the probability ratio between the new and old policies,

r_t(w) = π_w(a_t | s_t) / π_{w_old}(a_t | s_t).    (10)

The ε in Equation 9 is a hyperparameter that controls how much the new policy is allowed to deviate from the old one. This is enforced by the function clip(r_t(w), 1 − ε, 1 + ε), which constrains the ratio between the new and old policies, r_t(w), to stay within the interval [1 − ε, 1 + ε].
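The clipped surrogate objective of Equation 9 can be evaluated for a batch of probability ratios and advantage estimates as follows (a NumPy sketch for illustration; actual training differentiates this quantity with automatic differentiation):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate objective of PPO (Equation 9):
    mean over the batch of min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t)."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.mean(np.minimum(ratio * advantage, clipped * advantage))

# A large policy update (ratio = 2.0) on a positive advantage is clipped
# to 1 + eps = 1.2, removing the incentive to move the policy too far.
ratio = np.array([2.0, 1.0])
adv = np.array([1.0, 1.0])
surrogate = ppo_clip_objective(ratio, adv)
```

For a negative advantage the `min` keeps the more pessimistic (clipped) term, so the objective is a lower bound on the unclipped surrogate in both directions.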

Numerical Experiments
In this section, we demonstrate the application of deep RL to two problems where the parameters of the system need to be controlled to achieve a certain objective. The first test case is restoring chaos in chaotic dynamical systems [68].
We utilize this test case as a validation example for our distributed deep RL framework. The second test problem is accelerating the convergence of a steady-state CFD solver through deep RL, which we illustrate for turbulent flow over a backward-facing step.

Chaotic process control
A chaotic state is essential in many dynamical systems, and the phenomenon called crisis can cause the sudden destruction of chaotic attractors [69]. This can be detrimental in applications where chaos is essential, such as enhancing mixing by chaotic advection [70] or preventing the collapse of electric power systems [71]. Some methods to sustain chaos use small perturbations based on knowledge of the dynamical system and a priori information about the escape regions from chaos [72]. Vashishtha et al. [68] proposed an approach using deep RL to determine small perturbations to the control parameters of the Lorenz system [73] such that the chaotic dynamics is sustained despite the uncontrolled dynamics being only transiently chaotic. This method does not require finding escape regions or target points, and is therefore attractive for systems where complete information about the dynamical system is unavailable.
The Lorenz system is described by the following equations:

ẋ = σ(y − x),  ẏ = x(ρ − z) − y,  ż = xy − βz,

where x, y, z are the state of the Lorenz system and σ, ρ, β are the system's parameters. The Lorenz system gives rise to a chaotic trajectory with σ = 10, ρ = 28, and β = 8/3, as shown in Figure 1. However, the Lorenz system exhibits transiently chaotic behavior for ρ ∈ [13.93, 24.06] [74]. For example, if we use ρ = 20 for the Lorenz system, the solution converges to the specific fixed point P− = (−7.12, −7.12, 19), as illustrated in Figure 1.
In order to sustain chaos in the transiently chaotic Lorenz system, we utilize a similar RL problem formulation to Vashishtha et al. [68]. The RL agent observes the solution vector and its time derivative as the state of the system. Therefore, we have

s_t = (x, y, z, ẋ, ẏ, ż),

where the dot represents the time derivative, i.e., the velocity. Based on these observations, the agent chooses an action, which is the perturbation to the control parameters of the Lorenz system:

a_t = (∆σ, ∆ρ, ∆β).

The perturbation to the control parameters is restricted to lie within ±10% of their values; therefore, we have ∆σ ∈ [−σ/10, σ/10], ∆ρ ∈ [−ρ/10, ρ/10], and ∆β ∈ [−β/10, β/10]. We note here that these limits remain fixed throughout the episode based on the initial values of (σ, ρ, β). For a transiently chaotic system, the trajectory converges to a fixed point with time, and the magnitude of the velocity therefore decreases consistently with time. This fact is utilized to design the reward function: when the magnitude of the velocity is greater than a certain threshold velocity V_Th, the agent is rewarded, and when the magnitude of the velocity is less than V_Th, the agent is penalized. In addition to the stepwise reward assigned at each time step, a terminal reward, equal to the average of the stepwise reward over the last 2000 time steps of the episode, is given at the end of each episode. The value of the threshold velocity V_Th is set at 40.
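A minimal sketch of the controlled environment dynamics and the stepwise reward is shown below. The explicit-Euler integrator and the ±1 reward magnitudes are illustrative assumptions, not the exact choices of [68]:

```python
def lorenz_step(state, params, dt=2e-2):
    """One explicit-Euler step of the Lorenz system, with the (possibly
    perturbed) parameters (sigma, rho, beta) supplied by the agent.
    Returns the new state and the velocity (time derivative)."""
    x, y, z = state
    sigma, rho, beta = params
    dx = sigma * (y - x)
    dy = x * (rho - z) - y
    dz = x * y - beta * z
    return (x + dt * dx, y + dt * dy, z + dt * dz), (dx, dy, dz)

def step_reward(velocity, v_th=40.0):
    """Stepwise reward: positive while the speed stays above V_Th,
    negative otherwise (the +/-1 magnitudes are assumed for illustration)."""
    speed = sum(v * v for v in velocity) ** 0.5
    return 1.0 if speed > v_th else -1.0

# Near the fixed point the velocity shrinks and the reward turns negative.
next_state, vel = lorenz_step((1.0, 1.0, 20.0), (10.0, 20.0, 8.0 / 3.0))
```

A proper integrator (e.g., Runge-Kutta) would replace the Euler step in practice; the structure of the agent-environment loop is unchanged.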
For training the agent, we divide each episode into 4000 time steps with a time step size of dt = 2 × 10^−2. We train the agent for 2 × 10^5 time steps, which corresponds to 50 episodes. The RL agent is a fully connected neural network with two hidden layers and 64 neurons in each layer. Additionally, the output of the second hidden layer is further processed by an LSTM layer composed of 256 cells. The hyperbolic tangent (tanh) activation function is used for all hidden layers. Figure 2 displays how the mean reward progresses with training for different numbers of workers. We note here that by different numbers of workers, we mean that the agent-environment interactions are run concurrently on different processors. In distributed deep RL, the environment is simulated on different processors and their experience is collected. These sample batches encode one or more fragments of a trajectory and are concatenated into a larger batch that is then utilized to train the neural network. More concretely, in this particular example, we collect the experience of an agent for 8000 time steps to train the neural network. When we employ two workers, the agent-environment interaction is simulated for 4000 time steps on each worker and this experience is concatenated to form a training batch for the neural network. Similarly, for four workers, the agent-environment interaction is simulated for 2000 time steps on each worker to collect a training batch of 8000 time steps. Figure 2 also reports the computational time required for training. We see that as we increase the number of workers from two to four, the computational time is reduced by approximately 25%. However, as we increase the number of workers to eight, there is only a marginal improvement in CPU time. This is not surprising, as the communication frequency increases with eight workers, so there is no significant further reduction in CPU time. The moving-average window size is set at 5 episodes.
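The batching logic of this distributed setup — each of n workers collects its share of the 8000-step training batch, and the fragments are concatenated — can be sketched as follows. A thread pool stands in for the separate worker processes that libraries such as RLlib manage, and the rollout itself is a stand-in that merely records (worker, step) pairs:

```python
from concurrent.futures import ThreadPoolExecutor

def collect_fragment(worker_id, n_steps):
    """One worker simulating its share of agent-environment interaction.
    (Stand-in for an actual rollout; records (worker, step) pairs.)"""
    return [(worker_id, t) for t in range(n_steps)]

def collect_batch(n_workers, batch_size):
    """Each worker collects batch_size // n_workers steps concurrently;
    the fragments are concatenated into one training batch, mirroring
    the distributed experience collection described in the text."""
    per_worker = batch_size // n_workers
    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        futures = [ex.submit(collect_fragment, w, per_worker)
                   for w in range(n_workers)]
        fragments = [f.result() for f in futures]
    return [step for frag in fragments for step in frag]

# Four workers, each contributing 2000 of the 8000 training time steps.
batch = collect_batch(n_workers=4, batch_size=8000)
```

With a genuinely slow environment (such as a CFD solve), the per-fragment wall time dominates, which is why spreading fragments over processors reduces the time per training batch.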
Figure 3 depicts the performance of the trained RL agent in sustaining chaos in the transiently chaotic Lorenz system. In Figure 3, the black trajectory is the uncontrolled solution of the Lorenz system for parameters (σ, ρ, β) = (10, 20, 8/3). When the policy learned by the RL agent is used for control, the Lorenz system does not converge to the fixed point P−. We can also see that the RL agent trained with different numbers of workers learns the policy effectively and is able to sustain chaos over the period of temporal integration. This observation is very important for employing RL in physical systems where the simulation of a computationally intensive environment is the bottleneck, and distributed deep RL can be employed to address this issue.

Solver convergence acceleration
In CFD, the Navier-Stokes equations are discretized, which leads to a system of linear equations. This system of linear equations is usually solved using iterative techniques such as Gauss-Seidel methods, Krylov subspace methods, multigrid methods, etc. [75]. The main idea behind iterative methods is to start with some approximate solution and then reduce the residual vector at each iteration through so-called relaxation steps. The relaxation steps are performed until the convergence criteria are reached. In a relaxation step, the value of the variable at the next iteration is obtained by adding some fraction of the difference between the predicted value and the current value to the value of the variable at the current iteration. The SIMPLE algorithm [76] is widely used in CFD and utilizes an under-relaxation method to update the solution from one iteration to the next. The under-relaxation method improves convergence by slowing down the update to the solution field. The value of the variable at the k-th iteration is obtained as a linear interpolation between the value from the previous iteration and the value predicted in the current iteration:

x_P^{(k)} = x_P^{(k−1)} + α ( (Σ_nb a_nb x_nb + b) / a_P − x_P^{(k−1)} ),    (19)

where 0 < α < 1 is the relaxation factor, x_P^{(k)} is the value of the variable at node P to be used for the next iteration, x_P^{(k−1)} is the value of the variable at node P from the previous iteration, x_nb are the values of the variable at the neighboring nodes, and a_P, a_nb, b are the coefficients obtained by discretizing the Navier-Stokes equations.
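The structure of Equation 19 can be illustrated with an under-relaxed Jacobi iteration on a small linear system (an illustrative sketch of the relaxation step, not the SIMPLE algorithm itself):

```python
import numpy as np

def relaxed_jacobi(A, b, alpha, tol=1e-10, max_iter=10000):
    """Under-relaxed Jacobi iteration for A x = b, mirroring Equation 19:
    the new nodal value is a linear blend of the previous iterate and the
    value predicted from the neighbouring nodes."""
    x = np.zeros_like(b, dtype=float)
    D = np.diag(A)
    for k in range(1, max_iter + 1):
        x_pred = (b - A @ x + D * x) / D   # (sum a_nb x_nb + b) / a_P
        x_new = x + alpha * (x_pred - x)   # under-relaxed update
        if np.linalg.norm(x_new - x) < tol:
            return x_new, k
        x = x_new
    return x, max_iter

# A small diagonally dominant system; a larger relaxation factor
# converges in fewer iterations here, as discussed in the text.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x, iters = relaxed_jacobi(A, b, alpha=0.9)
```

For stiff or poorly conditioned systems the picture reverses: too large an α destabilizes the iteration, which is exactly the trade-off the RL agent is asked to negotiate.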
The under-relaxation factor α in Equation 19 should be chosen small enough to ensure stable computation and large enough to move the iterative process forward quickly. In practice, a constant value of the relaxation factor is employed throughout the computation, and this value is usually decided based on previous computations. Moreover, the suitable value of the relaxation factor is problem-dependent, and a small change in it can result in a large difference in the number of iterations needed to obtain the converged solution. In Figure 4, the number of iterations required for convergence with different values of the relaxation factors for the momentum and pressure equations is shown for the problem investigated in this study (i.e., the backward-facing step example). It can be noticed that the number of iterations is smaller for α_u = 0.9 (momentum equation relaxation factor) for all values of α_p (pressure equation relaxation factor). However, for α_u = 0.7, the solution converges only for α_p = 0.9, and for other values of α_p the solution does not converge within 3000 iterations. Even though for most CFD simulations a good value of the relaxation factors can be found through trial and error, the process can become intractable for large parameter spaces. The complexity increases even further when the relaxation factors need to be updated on the fly during run-time for multiphysics problems. Therefore, we attempt to automate the dynamic update of the relaxation factors using deep RL to accelerate the convergence of CFD simulations.
We illustrate our RL framework for accelerating the convergence of CFD simulations by considering a two-dimensional backward-facing step example. First, we validate our CFD model using the experimental data provided in [77]. The height of the step in the experimental setup was H = 0.0127 m and the free-stream reference velocity was U_ref = 44.2 m/s. The corresponding Reynolds number based on the step height and reference free-stream velocity is approximately Re_H = 36,000. The turbulence intensity at the inlet is I_t = 6.1 × 10^−4 and the eddy-viscosity-to-molecular-viscosity ratio at the inlet is µ_t/µ = 0.009. We use the computational mesh distributed by NASA [78] for our study. We utilize the steady-state simpleFoam solver in OpenFOAM [79] and apply the k−ω SST turbulence model [80] with the standard values of the closure coefficients to simulate the turbulent flow. Figure 5 displays the contour plot of the velocity magnitude along with the normalized stream-wise velocity at four different axial locations obtained experimentally and numerically. We observe very good agreement between the experimental and numerical results for the normalized stream-wise velocity.
To learn a policy that generalizes across different inlet boundary conditions, we train the RL agent for multiple values of the inlet velocity. For each episode, the inlet velocity is selected randomly from the interval [35, 65] m/s.
We highlight here that, in our study, the frequency at which the relaxation factors are updated differs from the frequency of CFD iterations. In particular, we update the relaxation factors for the momentum and pressure equations after every 100 iterations of the CFD simulation. We use the average value of the square of the velocity magnitude as the characteristic quantity that represents the CFD solution. Thus, the state of the environment can be written as

s_t = ( (1/N_b) Σ_{i=1}^{N_b} |U_b^i|² , (1/N_m) Σ_{j=1}^{N_m} |U_m^j|² ),    (20)

where U_b is the velocity at the inlet boundary, U_m is the nodal value of the velocity on the internal mesh, N_b is the total number of nodes on the inlet boundary, and N_m is the total number of nodes in the internal mesh. At the start of an episode, the velocity in the interior of the domain is initialized to zero, and hence the initial state of each episode depends only on the value of the velocity at the inlet boundary. Figure 4 displays how the state of the environment progresses towards the converged state for two different values of the relaxation factor for the momentum equation. As the solution starts converging, the change in the state of the system from one iteration to the next is also reduced. Based on the observation of the state of the system, the agent decides the relaxation factors for the momentum and pressure equations. The action of the agent can be written as

a_t = (α_u, α_p),    (21)

where α_u and α_p are the relaxation factors for the momentum and pressure equations, respectively. The action space for both relaxation factors lies within the interval [0.2, 0.95]. Even though the action space is composed of only two relaxation factors, this problem is complicated by its dynamic nature, and a simple exploratory computation is not feasible. For more complex multiphysics problems, the action space can easily be augmented in this framework to include relaxation factors for other governing equations. The goal of the RL agent is to minimize the number of iterations required for obtaining the converged solution.
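The state and action computations described above can be sketched as follows. The exact normalization of the state and the mapping of raw network outputs onto the admissible interval are assumptions for illustration:

```python
import numpy as np

def observe_state(U_b, U_m):
    """State of Equation 20 as described in the text: the mean squared
    velocity magnitude over the inlet-boundary nodes and over the
    internal-mesh nodes (arrays of shape (N, 2) for a 2D flow)."""
    s_b = np.mean(np.sum(U_b**2, axis=1))   # (1/N_b) * sum |U_b|^2
    s_m = np.mean(np.sum(U_m**2, axis=1))   # (1/N_m) * sum |U_m|^2
    return np.array([s_b, s_m])

def clip_action(raw_action, lo=0.2, hi=0.95):
    """Constrain the agent output to the admissible relaxation factors
    (alpha_u, alpha_p) in [0.2, 0.95]."""
    return np.clip(raw_action, lo, hi)

# Zero internal field at episode start: the state depends only on the inlet.
U_b = np.array([[44.2, 0.0]])
U_m = np.zeros((10, 2))
s0 = observe_state(U_b, U_m)
```

In the actual pipeline these quantities would be extracted from the OpenFOAM fields after every 100 solver iterations before being passed to the agent.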
We give the negative of the number of CFD iterations performed between two successive time steps as the reward for that time step. Accordingly, the reward at each time step can be written as

r_t = −(K_t − K_{t−1}),    (22)

where K_t is the cumulative iteration count of the CFD simulation at time step t. Since each episode comes to an end in the terminal (converged) state, the task of accelerating CFD simulations can be considered an episodic task. Therefore, we set the discount factor γ = 1.0 for this test case. The stepwise reward computed using Equation 22 with γ = 1.0 gives a cumulative reward equal to the negative of the total number of iterations it took for the CFD simulation to converge. Specifically, if the CFD simulation required 450 iterations to converge and the relaxation factors are updated every 100 iterations, then the reward trajectory is {−100, −100, −100, −100, −50}.
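This reward structure can be sketched as follows; the function reproduces the {−100, −100, −100, −100, −50} trajectory quoted above:

```python
def reward_trajectory(total_iterations, update_interval=100):
    """Stepwise rewards r_t = -(K_t - K_{t-1}): the negative of the CFD
    iterations spent between two consecutive RL steps, so that the
    undiscounted return equals minus the total iteration count."""
    rewards = []
    remaining = total_iterations
    while remaining > 0:
        chunk = min(update_interval, remaining)
        rewards.append(-chunk)
        remaining -= chunk
    return rewards

assert reward_trajectory(450) == [-100, -100, -100, -100, -50]
assert sum(reward_trajectory(450)) == -450
```

Maximizing the return is therefore equivalent to minimizing the number of solver iterations to convergence.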
For this example, the RL agent is a fully connected neural network with 64 neurons in each hidden layer. The RL agent is trained for 2000 episodes and the learning rate is set to 2.5 × 10^−4. We assume that convergence of the CFD simulation is reached once the residual for the momentum equation falls below 1 × 10^−3 and the residual for the pressure equation falls below 5 × 10^−2. Figure 6 shows how the moving-average mean reward progresses with training for different numbers of workers. The moving-average window size is set at 100 episodes, i.e., the average of the reward is taken over the last 100 episodes. At the beginning of the training, the agent is predominantly exploring, and the mean reward at the start of training corresponds to a range of 650-700 iterations for the CFD simulation to converge. The mean reward improves over the course of training, and the final mean reward corresponds to around 450 iterations for convergence. This is approximately a 30% reduction in the number of iterations required for convergence. We highlight here that, for many CFD problems, suitable values of the relaxation factors can be chosen to achieve faster convergence based on prior experience. However, for complex systems, the process of learning a rule to dynamically update relaxation factors through exploratory computations can become unmanageable. For such systems, the proposed RL framework is attractive for achieving accelerated convergence. In Figure 6, the CPU time for training is reported for different numbers of workers. As we increase the number of workers/environments from 4 to 8, we obtain around a 40% improvement in the CPU time for training. The improvement in CPU speed is only marginal as the number of processors is increased to 16, which might be due to increased communication between the processors. We note here that our environment, i.e.,
the CFD simulation, utilizes a single core, as the backward-facing step is a relatively small problem and can be simulated within the order of seconds or minutes depending on the computer architecture. For high-dimensional problems, parallelization through domain decomposition can also be exploited to reduce the CPU time of each CFD simulation run. To test the learned policy, we select three different inlet velocities from the training interval, i.e., [35.0, 65.0] m/s. We run ten different samples of the CFD simulation for these three inlet velocities, and utilize the RL agent to select the relaxation factors after every 100 iterations. Figure 7 shows the box plot depicting the data obtained from the ten samples through their quartiles. It should be noted that the policy learned by the agent is not deterministic, and hence the number of iterations required for convergence for a single inlet velocity is not constant across samples. The mean and standard deviation of these samples are also shown in Figure 7. From both plots, we can see that the mean of the samples for all three velocities is around 450 iterations. Therefore, we can conclude that the policy learned by the RL agent generalizes to inlet boundary conditions different from the training boundary conditions. We plot the variation of the relaxation factors for the momentum and pressure equations in Figure 8 to understand the policy learned by the RL agent. The RL agent has learned to keep the relaxation factor for the momentum equation at its highest value (i.e., 0.95) throughout the CFD simulation. We do not see any specific pattern in the variation of the pressure equation relaxation factor. Since the relaxation factor for the momentum equation is constant, the difference in the number of iterations to convergence for different samples is mainly due to how the pressure relaxation factor is decided by the RL agent at various stages of the CFD simulation.
Furthermore, we test the policy on a different backward-facing step geometry. The test geometry has a step height twice that of the training geometry. In Figure 9, the flow features for the training and test geometries are shown, and it can be seen that there is a substantial difference in the flow in the recirculation region. We run a similar experiment with the test geometry for three different values of the inlet velocity. Figure 10 displays a summary of the RL agent's performance in accelerating the CFD simulation of the test geometry. Overall, the number of iterations required for convergence for the test geometry is higher than for the training geometry. The mean of the samples for the test geometry is around 550 iterations for all three velocities. Interestingly, the variance of the test geometry samples obtained with the policy trained using 8 workers is smaller compared to the policies trained using 4 and 16 workers.

Concluding Remarks
In this study, we have investigated the application of distributed reinforcement learning (RL) to automatically update the computational hyperparameters of numerical simulations, which is a common task in many scientific applications. We illustrate our framework on two problems. The first problem is tasked with sustaining chaos in chaotic systems, and the second is the acceleration of convergence of a steady-state CFD solver. In the first example, we demonstrate that the policy learned by an RL agent trained using the proximal policy optimization (PPO) algorithm can automatically adjust the system's parameters to maintain chaos in transiently chaotic systems. The problem of controlling relaxation factors in steady-state CFD solvers differs from a standard control problem. In a standard control problem, the target value is known in advance, and this target value is used to change the action variables in such a way that the state of the system is guided towards the target state. The problem of accelerating the convergence of steady-state CFD solvers is different in that the target point, i.e., the converged solution, is not known a priori. Therefore, RL is a suitable algorithm for this problem: it discovers a decision-making strategy that dynamically updates the relaxation factors with the objective of minimizing the number of iterations required for convergence. The results of our numerical experiments indicate that the learned policy leads to accelerated convergence of the steady-state CFD solver and that the learned policy generalizes to unseen boundary conditions and a different geometry setup.
Despite the relative simplicity of the RL problem formulation and its application to the backward-facing step example, this framework can be readily extended to more complex multiphysics systems. The state and the action space of the RL agent can be augmented to tackle the complexities of these systems. Our future work is centered around applying this framework to more challenging problems, such as a differentially heated cavity, and improving the computational efficiency of the software pipeline through direct I/O communication between the OpenFOAM solver and RL libraries. One of the major challenges in using RL for scientific simulations is the computational time due to a slow environment, and we believe this challenge can be addressed by utilizing distributed environment evaluation along with high-performance computing to reduce the CPU time of the solver.

(Figure caption: the top plot is a box plot in which the green marker denotes the mean, the red markers denote outliers, and the orange line denotes the median of the data; the bottom plot shows the sample mean with error bars denoting the standard deviation of the samples.)