Deep Reinforcement Learning for Computational Fluid Dynamics on HPC Systems

Reinforcement learning (RL) is highly suitable for devising control strategies in the context of dynamical systems. A prominent instance of such a dynamical system is the system of equations governing fluid dynamics. Recent research results indicate that RL-augmented computational fluid dynamics (CFD) solvers can exceed the current state of the art, for example in the field of turbulence modeling. However, while in supervised learning, the training data can be generated a priori in an offline manner, RL requires constant run-time interaction and data exchange with the CFD solver during training. In order to leverage the potential of RL-enhanced CFD, the interaction between the CFD solver and the RL algorithm thus have to be implemented efficiently on high-performance computing (HPC) hardware. To this end, we present Relexi as a scalable RL framework that bridges the gap between machine learning workflows and modern CFD solvers on HPC systems providing both components with its specialized hardware. Relexi is built with modularity in mind and allows easy integration of various HPC solvers by means of the in-memory data transfer provided by the SmartSim library. Here, we demonstrate that the Relexi framework can scale up to hundreds of parallel environment on thousands of cores. This allows to leverage modern HPC resources to either enable larger problems or faster turnaround times. Finally, we demonstrate the potential of an RL-augmented CFD solver by finding a control strategy for optimal eddy viscosity selection in large eddy simulations.


Introduction
In recent years, there have been increasing efforts to transfer advances in machine learning (ML) to the field of computational fluid dynamics (CFD) in order to enhance simulations for a myriad of different applications [1], which cover the fields of turbulence modeling [2,3,4,5], shock detection [6,7], the formulation of turbulent inflow conditions [8,9] or applications in flow control [10]. As of today, most of these advances are based on the supervised learning (SL) paradigm. In SL, the ML model is trained based on a dataset that is obtained a priori from CFD simulations or experiments. During training, the parameters of the ML model are optimized to approximate the functional relationship between the input and output quantities of the dataset. Since the training dataset is fixed and does not interact with the predictions of the ML model, the training itself is thus, in a sense, static and offline. This oftentimes leads to inconsistencies, if, during inference, the SL-trained model is confronted with a varying and dynamic environment. In this case, the model's prior predictions affect how the system evolves and thus which input states the model will see in the future. Ensuring that all potential states of a non-linear dynamical system are sufficiently represented within the training dataset is by no means trivial and for many applications elusive. The reinforcement learning (RL) paradigm addresses this issue by training ML models not on a static dataset, but instead trains models by letting the model interact with the actual environment it will later be deployed in. Thus, the training process not only aims at predicting outputs, but it does so by taking dynamically generated inputs into account. The goal of RL is to find an optimal strategy for moving forward from the current, observed state. In the course of the training process, predictions of the model change the state of the environment and the model will be rewarded based on how beneficial this transition is. Training based on such sequences of actions and transitions allows incorporation of the long-term implications of the model's predictions into the training process. Thus, the training itself becomes dynamic and requires "online" joint runs of the environment and the model. CFD simulations are typically computationally expensive and thus rely heavily on high-performance computing (HPC) systems. Coupling such HPC flow solvers with the more recently emerged ML libraries is tough, since both rely on different hardware, algorithms and overall programming paradigms.
For instance, most HPC codes are still written and optimized for CPU architectures and are parallelized with the Message Passing Interface (MPI), while ML libraries run most efficiently on GPUs. The HPC environment thus has to provide sufficient hardware resources for both components of the application and efficient communication between them. In addition, most HPC codes are written in compiled languages like C/C++ or Fortran. On the other hand, ML libraries are oftentimes also written in compiled languages but are exposed to the user via a Python interface. As these languages typically lack a well-defined interface, bridging the language gap in an efficient manner for HPC is tough. These problems are less pronounced for SL tasks, since the generation of the training dataset with HPC codes and the training itself are mostly decoupled and can be performed separately on appropriate hardware. The HPC solver and the ML model only have to be coupled for inference, i.e. when the trained ML model is evaluated in actual simulations. For this, Maulik et al. [11] proposed to link the TensorFlow library directly to the flow solver and execute the model via Ten-sorFlow's C-API. Ott et al. [12] proposed the Fortran-Keras-Bridge, which allows convenient execution of trained Keras models from Fortran code, and is based on the Neural-Fortran library by Curcic [13]. For RL however, the integration of ML and HPC workloads is much more involved, since the training process itself requires running simulations and thus needs to run HPC simulations and the optimization of the ML model in parallel. An efficient RL framework for HPC workloads needs to manage the simulations in the HPC environment and has to implement efficient communication between the simulation codes and the employed ML library. In [14], Novati et al. used the smarties library to couple a flow solver and Tensor-Flow in order to train a data-driven turbulence model for large eddy simulation (LES) on an HPC system. Bae and Koumoutsakos applied a similar framework to the wall-modeling of wallbounded flows in [15]. Pawar and Maulik [16] developed a distributed RL framework called PAR-RL, which they applied to optimize the time-stepping of a CFD simulation. Rabault and colleagues developed an RL framework for flow control [17,18,19], which was also coupled with a spectral flow solver by Li and Zhang in [20]. Similarly, Fan et al. [21] coupled a spectral element LES solver with TensorFlow to control the flow around a cylinder.
However, most of those applications are limited in the problem sizes that can be investigated, since the simulations cannot take advantage of the parallel computing resources provided by modern HPC systems. In this paper, we present a scalable RL framework that overcomes the gap between numerical simulation and ML workflows on HPC systems providing both components with its required specialized hardware. Moreover, we demonstrate the prospects of the reinforcement learning paradigm in scientific computing by applying our framework to derive data-driven turbulence models for large eddy simulation.
The paper is structured as follows. Section 2 gives a brief outline of the reinforcement learning paradigm. In Section 3, we present the different software components of our framework and how they are integrated. The hardware environment for our benchmarks is presented in Section 4. In Section 5, we outline how our framework can be applied to derive turbulence models for large eddy simulations. The results of our scaling studies and the performance of the derived data-driven turbulence models are presented in Section 6. Section 7 summarizes and concludes the paper.

Reinforcement Learning -A Brief Outline
The following section gives a brief outline of the general reinforcement learning (RL) paradigm. However, this summary is by no means exhaustive and provides only the bare fundamentals required to motivate our software implementation, which is introduced in Section 3. For a more thorough discussion of RL, the reader is referred to [22]. In the RL paradigm, an agent trains by interacting with an environment, as illustrated in Figure 1. In each point in time t, the environment is in some state s t , based on which the agent's (possibly parametrized) policy π θ (a | s t ) prescribes which action a t the agent should perform. 2 This action causes the environment to change its state to a new state s t+1 , which is determined by the environment's transition function T (s t+1 | a t , s t ). The transition function thus encodes the dynamics of the environment, which could be for instance the spatial and temporal integration of the discretized Navier-Stokes equations. Alongside the new state s t+1 , the agent receives a reward r t+1 = R(s t+1 ), which quantifies how favorable the transition is with respect to some performance metric. The reward function R(s) is highly problem-specific and has to be designed by a domain expert. Based on the new state s t+1 , the agent performs another action, until the environment reaches a final state s n . Eventually, such an episode results in a trajectory τ of states, actions and rewards: τ = {(s 0 , a 0 ) , (s 1 , a 1 , r 1 ) , ...... , (s n , a n , r n )} . ( This problem formulation is typically framed as a Markov decision process (MDP), which again can be interpreted as a discretetime control task. The behavior of the agent is described by its policy π θ , which determines the action the agent performs for each of the possible states of the environment. In deep RL, this policy is represented by an ANN with the weights θ. The key quantity to distinguish between favorable and unfavorable actions is the expected future return along a trajectory τ with n steps Here, γ ≤ 1 is the discount factor, which balances the importance of short-term and long-term rewards. The overall goal of the RL algorithm is then to find the optimal policy, i.e. the set Figure 1: General outline of the Markov decision process (MDP). At step t, the environment is in state s t . Following its policy π θ (a | s t ), the agent performs an action a t . In deep RL, the policy is a deep artificial neural network (ANN) with parameters θ. The action causes the environment to transition from state s t to a new state s t+1 that is prescribed by the environment's transition function T (s|a t , s t ). Based on how desirable the new state is, the agent receives a reward, which is determined by the reward function r t+1 = R(s t+1 ).
of optimal model parameters, which maximizes the expected return on all possible initial states.
The key purpose of an RL algorithm is to state an optimization task that allows optimization of the policy based on sampled interactions of the agent with the environment. RL algorithms thus differ primarily in the way those interactions are sampled and how these interactions are used to update the policy. Throughout this work, we employ the clipping version of proximal policy optmization (PPO) [23] as our RL algorithm of choice. We highly recommend [24] for a clear and concise summary of the PPO method.

Software Architecture
In this work, we present a novel and modular RL framework named Relexi 3 that implements an RL training loop for applications in the field of scientific computing, as illustrated in Figure 2. As discussed beforehand, coupling HPC codes with modern ML libraries is oftentimes tedious. This holds especially for RL, where the interaction between the HPC application and the RL algorithm becomes much more intricate. A considerable amount of these difficulties stem from the differences in the required hardware and the different preferences in terms of programming languages. Many approaches in coupling HPC and ML tried to either rewrite basic ML capabilities in languages used for HPC or writing an HPC solver in Python from scratch. Considering the enormous codebases on both sides and the enormous pace of progress in the ML community, this approach is, from our perspective, doomed to fail. In addition, both families of methods require significant computational effort: While the parameter update of the policy requires access to the collected trajectories to compute the gradient vectors through backpropagation, running CFD simulations mandates high parallel efficiency, in particular for problems of practical interest. Thus, compromising the performance of any one component, e.g. through re-writing, is not advisable. Relexi tries to bridge this gap differently by employing the SmartSim library to couple a state-of-the-art RL library with modern HPC solvers. Special focus is put on the modularity of the framework, which allows the exchange of RL algorithms and simulation environments with only minimal changes of the underlying code. Relexi is designed with application on HPC system in mind, scaling applications up to thousands of cores on modern HPC systems. Relexi comprises the following building blocks, which will be discussed in more detail in the following sections.
• SmartSim: The SmartSim library [25] fulfills two major tasks. Firstly, the library is used to start and manage the MPI-parallelized HPC workloads on the allocated hardware resources. Moreover, it implements the communication between the HPC workload and Relexi by deploying an in-memory database and providing communication clients for several programming languages.
• FLEXI: In this work, the flow solver FLEXI [26] provides simulations of turbulent flow as a training environment for the agent. FLEXI is used here for illustrative purposes only and can be easily replaced by other simulation codes.
• TensorFlow/TF-Agents: Relexi is build upon Tensor-Flow [27], which provides a framework for efficient ML workflows and allows for distributed training on multiple GPUs in parallel. In addition, its RL extension TF-Agents [28] provides implementations of several RL algorithms and allows for custom environments that can be integrated in a straight-forward fashion.

SmartSim
The SmartSim library [25] is a workflow library that simplifies the convergence of traditional HPC workloads and ML. The library comprises two components: the SmartSim Infrastructure Library (IL) and SmartRedis. The IL provides extensive functionalities to start, manage and distribute workloads in HPC environments as well as submitting workloads automatically as batch jobs to the job scheduler. In Relexi, we employ the IL to repeatedly start the MPI-parallelized simulation instances on demand for each training iteration inside of a single allocated batch job. The simulations can be either started individually or multiple simulations can be lauched within a single call by using the multiple-program-multiple-data (MPMD) paradigm. We also use another key aspect of the SmartSim IL, which is its ability to configure and launch an in-memory, Redis-based datastore referred to as the Orchestrator. This database serves as a data intermediary between the simulations and the main program. For Relexi, we used a non-clustered Orchestrator, which is launched on the head node. SmartSim also supports the distribution of the Orchestrator across multiple nodes, but a single instance on the head node was sufficient for our application. The simulation communicates with the Orchestrator using the SmartRedis clients (available in Python, Fortran, C, and C++) which can send and receive data from the orchestrator or trigger actions within the Orchestrator, i.e. running scripts or ML model inference previously loaded into the database. In Figure 2: General architecture of Relexi. Before entering the training loop, the SmartSim IL is used to launch the SmartSim Orchestrator, which provides a database to exchange data between Relexi and the HPC workload. At the beginning of each training loop, the SmartSim IL starts a batch of FLEXI simulations and distributes them to the worker nodes. The simulations are initialized on a randomly drawn initial state and are then evolved in time by the flow solver. For our application in turbulence modeling, FLEXI sends each time interval ∆t RL its current state to the agent and receives a new action, i.e. a new set of parameters for the turbulence model. Since a syncronous RL algorithm is applied here, Relexi waits until all simulations have terminated. The collected episodes τ (i) are then used by the gradient ascent algorithm to improve the model weights θ. Here, ∇ θ J(τ (i) ) is the gradient of the loss function, evaluated on the sampled experience. The training loop is then repeated with the new set of weights, i.e. the new policy, until the policy converges. our case, during training, FLEXI sends its current flow state via the Fortran SmartRedis client to the Orchestrator. In addition, a scalar flag is sent which indicates whether FLEXI has reached its final state and will terminate. Relexi then uses the Python SmartRedis client to read this data from the Orchestrator. Subsequently, the agent's actions are sent by Relexi to the database via its Python SmartRedis Client and are read by FLEXI via its Fortran client. While SmartSim itself employs a Redis database by default, we used the multi-threaded fork of Redis called KeyDB, which provided significantly more performance for our application.

FLEXI
The open-source flow solver FLEXI [26] is based on the discontinuous Galerkin (DG) method, which can be seen as a hybrid of the finite element and the finite volume methods. For the DG method, the computational domain is partitioned into individual elements. In each element, the solution is represented by a polynomial basis of degree N and the individual elements are then coupled by consistent surface fluxes at the element faces. This results in a small communication stencil, which allows FLEXI to scale perfectly on hundreds of thousands of cores using a pure distributed-memory parallelization with MPI. A detailed description of FLEXI can be found in [26]. The communication between FLEXI and the database is implemented by means of the SmartRedis clients provided by the SmartSim library. To this end, SmartRedis is linked to FLEXI during compile time. Implementing the data transfer in FLEXI is straightforward and requires only a few lines of additional code. Since FLEXI is parallelized with a distributed-memory approach, every rank contains only a chunk of the global flow state, which represents the environment's state s t in the RL formulation. To communicate with the database, the flow state is thus first gathered across all ranks of the respective FLEXI instance before it is written to the database by the root rank. Analogously, the predictions sent by Relexi are retrieved from the database by the root rank and are then scattered across the other ranks of the respective FLEXI instance. It seems important to stress once again that FLEXI is used primarily for demonstration purposes and can be easily replaced by other simulation codes by using the SmartRedis Clients.

Relexi
Relexi implements the main RL training loop by means of the TF-Agents [28] library, as illustrated in Figure 2. In TF-Agents, custom environments can be implemented by subclassing the provided environment class and implementing a few mandatory methods like initialization and time-stepping. Since the TF-Agents library interacts with the environment via these pre-defined function interfaces, all intricacies of the coupling with the HPC workload are "hidden" inside of the environment class. This has the advantage that all RL algorithms and tools provided by TF-Agents can interact natively with the custom environment and thus, with the HPC workload. The hardware resources are distributed as follows. Relexi itself is started as a single thread on a node termed head. The head node executes all ML-specific work and thus should be equipped with a GPU. Relexi also supports setups with multiple GPUs on the head node by using the distribution strategies provided by Tensor-Flow. The SmartSim Orchestrator is also started on the head node at the beginning of Relexi. During training, the HPC workload is repeatedly started with MPI on the available worker nodes. To ensure that each MPI rank is placed correctly on the available hardware and to avoid double occupancy, Relexi generates rankfiles on-the-fly based on the available hardware resources. These rankfiles are then passed to MPI to ensure the correct placement of the MPI ranks.
After the simulations are launched, each simulation instance writes its initial states to the database. Relexi reads these states, provides the respective actions based on the current policy and writes them back to the database. Each FLEXI instance reads its prescribed actions from the database and proceeds with its simulation to obtain the next state. In the meantime, Relexi polls until the new state becomes available and reads it from the database. With the new state, the agent can compute the reward and get the new actions from the current policy. These steps are repeated until the required amount of experience is sampled, on which the policy then can be trained. This algorithmic outline of Relexi is also summarized in Algorithm 1. 4 for t = 1, n do Run simulation for n steps 7: a t ← π θ (a | s t ) 8: Write a t 9: Polling for s t+1 Here, the HPC solver runs for j = 1, n epochs do Train ANN for n epochs 15: θ ← θ + α ∇ θ J(τ) 16: end for 17: end for 18: Shutdown SmartSim Orchestrator A potential bottleneck we identified is the overhead introduced by repeatedly starting hundreds of parallel environments with thousands of MPI ranks. For some configurations, the time required for starting the simulations exceeded the actual simulation time. To tackle this issue, we implemented two major improvements. First, we employed the multiple-programmultiple-data (MPMD) functionality provided by OpenMPI's implementation of MPI, which is also supported by SmartSim. With MPMD, all simulations can be started with individual commandline arguments within a single call of MPI. Secondly, we implemented a functionality to copy all files required by the simulation, e.g. parameter files and restart files, to local drives located in the random access memory (RAM) of each node. This reduced the access times compared to using a parallel file system like Lustre significantly. With these improvments in place, the performance penality of launching large amounts of environments became negligible.

Hardware Configuration
All benchmarks and experiments are performed on the HPE Apollo 9000 supercomputer (Hawk) and the Hawk-AI extension, a HPE Apollo 6500 Gen10 Plus at the High-Performance Computing Center Stuttgart (HLRS). In the following, the hardware of both systems is given in detail.

Hawk -HPE Apollo 9000
Hawk consists of 5,632 dual socket nodes with 256 GiB of main memory each. Each node is equipped with two 64core AMD EPYC 7742 (Rome) processors with a base frequency of 2.25 GHz. The nodes are connected via an enhanced 9D-hypercube. For the node to node interconnect, the highperformance interconnect InfiniBand HDR200 is used. This leads to a homogeneous massively parallel system with 720,896 compute cores and approximately 1.37 TiB of main memory. Hawk has a theoretical peak-performance of 25

Hawk-AI -HPE Apollo 6500 Gen10 Plus
The Hawk-AI extension consists of 24 dual socket nodes, which are each equipped with two 64-core AMD EPYC 7702 processors, 1 TiB of main memory, a local hard disk of 15 TiB and eight Nvidia A100 GPUs. 20 Nodes are equipped with Nvidia A100 with 40 GiB of memory and four nodes are equipped with Nvidia A100 with 80 GiB of memory. The nodes are also connected via the InfiniBand HDR200 interconnect. The Hawk-AI extension is connected directly to the enhanced 9D-hypercube of Hawk and to the same Lustre filesystem.

Turbulence Modeling
Turbulent flows are notoriously hard to resolve accurately, since turbulence is a multiscale phenomenon. The resolution requirements for numerical simulations to resolve the wide range of active length scales are usually intractable. Instead, reduced order descriptions of turbulence can be derived, which only resolve the large energy-containing flow scales, while employing a turbulence model to account for the influence of the nonresolved fine scales. From a mathematical standpoint, this is equivalent to applying a low-pass filter to the flow field. This approach is commonly referred to as large eddy simulation (LES). A myriad of different LES turbulence models have been proposed in literature. However, such models typically contain empirical parameters, which have to be tuned to the employed numerical discretization and the specific test case. As a consequence, no universal or overall best model has been found to this date.
The most important property of the LES model is to mimic the energy drain from the large to the small flow scales. Based on this reasoning, a popular modeling strategy is to approximate the unresolved scales by introducing an additional turbulent viscosity ν t that is added to the physical viscosity. The common model by Smagorinsky [29] computes this turbulent viscosity as Here, S i j is the rate-of-strain tensor of the coarse-scale velocity field u i with respect to the coordinates x j with i, j = 1, 2, 3. The filter width ∆ is a measure of the employed spatial resolution and C s is a model coefficient, which has to be tuned for each specific test case. Another common modeling approach is the implicit modeling paradigm, which assumes that the numerical dissipation error introduced by the numerical scheme acts as an implicit LES model. The implicit LES model can obviously be seen as a special case of Smagorinsky's model with C s = 0. With our RL framework, we strive to improve the performance of Smagorinsky's model given in Eq. (3) by employing an RL algorithm to tune the model coefficient C s dynamically in space and time during the simulation. In this regard, we show that RL represents a promising approach to enhance current simulation workloads and that our novel RL framework is capable of handling computationally intensive simulation environments on modern HPC architectures.

Stating the Reinforcement Learning Task
As test case, i.e. as training environment, we perform LES of homogeneous isotropic turbulence (HIT) at a Reynolds number of Re λ ≈ 200 with respect to the Taylor microscale. The cubic domain of side length 2π is equipped with periodic boundary conditions and is discretized by a Cartesian mesh with the resolutions given in Table 1. This test case of turbulence in a box can be seen as the building block of turbulence and describes freely decaying turbulence in the absence of boundaries. In order to obtain a quasi-static solution, an isotropic linear forcing is employed as proposed by [30,31] to balance the dissipation of the turbulence model. This results in a quasistationary distribution of the turbulent kinetic energy in the system, which is mainly characterized by the energy drain from the large to the small scales. Since this energy cascade is a fundamental property of turbulence that the LES should reproduce, we define our optimization target for the RL task in terms of this energy spectrum. The reward for the RL algorithm is thus computed based on the error of the instantaneous energy spectrum of the LES E LES (k) compared to the mean energy distribution of the underlying ground truth solution E DNS (k), which was obtained beforehand from a high-fidelity simulation. For this, we use the mean relative error across the wavenumbers k up to k max , which can be computed as To ensure that the reward is normalized to r t ∈ [−1, 1], the reward is eventually computed as with α as a scaling parameter.
With the reward function in place, the optimization task for the RL algorithm is framed as follows. The state of the environment observed by the agent is the current coarse-scale velocity field u i . The agent predicts as action a single C s coefficient for each element in the computational mesh solely based on the local flow state in the respective element. The environment's state is then evolved with the flow solver for some time ∆t RL before requesting new predictions for the elementwise C s . The time inteval ∆t RL is generally chosen much larger than the computational timestep of the simulation. The reward for the agent is computed based on the differences in energy distribution compared to the ground truth, as given in Eq. (5). This loop is repeated until some final time t end is reached.

Computational Setup
In this work, we perform LES at different resolutions, which are listed in Table 1. All simulations are run up to t end = 5 and actions are performed with in a time interval of ∆t RL = 0.1, which corresponds to 50 predictions per simulation. The initial state for each simulation run is drawn randomly from a set of flow states that are computed by filtering different realizations of the high-fidelity solution. A single initial state is kept hidden to evaluate the model performance on unseen test data. Based on these setups, two different approaches can be followed to reduce the required computational time for the training by using parallel HPC resources. First, the number of MPI ranks per environment can be increased to reduce the simulation time of the environments by exploiting the strong scaling capabilities of FLEXI. Second, the number of simulated environments per training update can be increased, which can be seen as a weak scaling approach for the full framework. Since an increased number of sampled episodes might result in a better estimate of the gradient for the optimizer, this approach can reduce the number of iterations required for convergence and thus the necessary training time. Since both perspectives highlight crucial scaling properties of the underlying framework, both will be investigated in Section 6.1. Here, we employ up to 16 cores per FLEXI environment and up to 1024 parallel environments per training iteration using a maximum of 2,048 cores.
As RL algorithm, we use the clipping variant of the proximal policy optimization (PPO) [23] Table 2: Architecture of the policy ANN for N = 5 with the dimensions of each layer's output. The model's input dimensions follow from the (N + 1) 3 solution points in each element times the three velocity components u i , which are used as input features. The ANN is build from three-dimensional convolutional layers with a specific kernel size and number of filters, each. The first layer uses zero-padding, while the rear layers employ no padding to convolve the highdimensional input to a single scalar. All layers except the last convolutional layer employ the rectified linear unit (ReLU) as activation function. The final scaling layer uses the operation y = 1 2 σ s (x) to scale the input to the interval [0, 0.5] with σ s (x) denoting the sigmoid activation function. means that the algorithm performs the steps of sampling experience and updating the policy in a sequential fashion. This means that first, FLEXI simulations are performed based on the current policy and thereafter, the policy is updated based on the collected experience to maximize the reward on future runs. For all experiments we used a discount factor of γ = 0.995 and a learning rate of 10 −4 with the Adam optimizer [32] to train the policy for 5 epochs per iteration. For the PPO algorithm, we used a clipping parameter of 0.2 and set the entropy coefficient to zero. The employed policy ANN comprises around 3,300 parameters with its architecture given in Table 2.

Scaling
To analyze the scaling behaviour of our framework, we use the HIT test case for the two different configurations listed in Table 1. For the benchmarks, we used a single Hawk-AI node of the HPE Apollo 6500 Gen10 Plus as head node and up to 16 Hawk compute nodes of the HPE Apollo 9000 as worker nodes in a single batch job. This results in up to 2048 compute cores used for generating training data. In the scaling analysis, we investigate two types of scaling. First, we examine how the framework scales if the number of parallel environments is increased, while keeping the amount of MPI ranks per environment constant. This corresponds to a weak scaling approach, since the parallel load and the available resources are increased simultaneously, while keeping the load per rank constant. Since the used resources, and thus also the simulation time, for a FLEXI simulation should theoretically stay unchanged, eventually emerging losses in parallel speedup can be attributed solely to the communication overhead, the limited throughput of the database, the cost of the data managment in Relexi, the increasing amount of policy evaluations and the start as well as the termination of the simulations. The measured executation time includes launching the simulations and running the simulation with the policy until all simulations terminated. As a performance metric, we use the "Speedup", which is computed as the quotient of the time needed for sampling the n envs parallel environments and the time that would be needed to run n envs environments sequentially, i.e. the speedup of running the environments in parallel instead of sequentially. For each configuration, we ran Relexi in two separate jobs for 6 iterations each to account for fluctuations in hardware and communication performance and computed the mean of the 12 measurements. The scaling is performed with either 2, 4, 8 or 16 MPI ranks for each FLEXI run. Further, we start with two parallel FLEXI instances and double them until the 16 Hawk compute nodes are fully occupied. It seems important to note that the results are not compared directly to the performance of standalone FLEXI simulations, since the performance differences in running a single FLEXI instance within Relexi and running it standalone were negligible.
The results in Figure 3 demonstrate that the framework can scale efficiently up to a thousand parallel environments on thousands of cores. Two major trends can be identified in the scaling results. As mentioned before, the observed decrease in parallel efficiency when running more environments in parallel can be attributed mainly to the sequential work done by Relexi. The framework should scale better if the FLEXI instance gets more time-consuming, since then, this sequential work becomes less relevant and the perfect scaling abilities of FLEXI can be recovered. In contrast, if the necessary time to compute the FLEXI simulation decreases, i.e. FLEXI gets more ranks, the sequential work of Relexi becomes more dominant, which decreases the scaling efficiency. This causes the runs with fewer ranks per FLEXI instance to scale better than the runs using more ranks. The second interesting behavior is that the decrease in performance when switching from a single to two parallel environments is most pronounced for the FLEXI environment with only two cores, which is counterintuitive. We attribute this to the limited memory bandwith and the hierarchical architecture of the processors used. 5 If a single FLEXI instance with two ranks is spawned on a compute node, this instance gets all available memory bandwidth. If however, a second FLEXI is spawned, both instances have to share the available bandwidth, which slows down the simulation as well as the interaction with the database. This effect vanishes with an increasing amount of used cores. The observed loss of parallel efficiency especially for the last data points, which correspond to using all 2,048 available cores, can in parts be attributed to single simulations running significant slower than the average. These outliers can probably be attributed to fluctuations in the load of the interconnect. This issue is subject of current investigations.
For the strong scaling, we examine the configurations with 2, 8, 32 and 128 parallel FLEXI instances. The results of the strong scaling given in Figure 4 match the general behavior observed in the weak scaling benchmark. In cases where the computational cost of the HPC workload is dominant, i.e. if only few MPI ranks are used per environment, the optimal scaling of FLEXI can be recovered. If the amount of MPI ranks is increased, the time spent for the simulation decreases and the work done by Relexi becomes the limiting factor. This causes the parallel efficiency to decrease for low simulation loads per core. It seems important to note that for both cases, using 16 MPI runs per simulation falls quite below the optimal load per core for FLEXI. For the more realistic cases of up to 8 ranks per FLEXI, most of the FLEXI performance can be recovered.

Turbulence Modeling
To investigate the performance of the RL-based turbulence model against established analytical closure models, we trained the agent on the 24 DOF configuration for 4,000 Iterations, which corresponds to 20,000 gradient ascent steps for the policy overall. To quantify the benefits of gathering more episodes per policy update, the training was carried out using 8 MPI ranks per environment and 16, 32 and 64 parallel FLEXI instances, respectively. For the 16 and 64 runs configurations, the overall training required 20 and 30 hours, respectively. Sampling the trajectories took 15 and 18 seconds per iteration, respectively, while updating the policy on a single GPU took 0.5 and 2 seconds, respectively. The results in Figure 5 indicate that increasing the number of parallel simulations does indeed improve the training performance in several ways. Firstly, the training runs with more episodes converge faster, i.e. need less training iterations to achieve a given episode return. Moreover, the collected return of the 16 episode configuration increases less consistently after around 2,000 iterations than the training configurations with more parallel episodes. This confirms our prior assumption that the gradient estimator becomes more reliable if more episodes are sampled, which again leads to more efficient training updates and thus to faster convergence. In the same vein, the training runs using more episodes tend to converge to a higher total return. This notion holds especially for the return collected during training, where the return is computed on a variety of different initial states. While the return on the unseen initial state for testing gives similar results, the reward of the 16 episodes configuration fluctuates between 2,000 and 3,000 iterations and temporarily exceeds the return of the 32 episodes configuration before converging to a stable return. This behavior probably stems from the high variance of the 16 episodes training which causes the less consistent improvement on the unseen test state. The results in Figure 5 also clearly demonstrate that for our application, the RL-based model outperforms both Smagorinsky's model and the implicit modeling strategy with regard to the energy spectrum. Especially surprising is that the RL-enhanced model agrees with the high-fidelity data almost up to k max with deviations only at k = 6. This indicates that our novel data-driven model does not only replicate the vital flow physics related to the energy cascade, but also pushes the resolution limits of the underlying discretization. The predictions of the initial untrained model appear to be almost normally distributed, as is expected from the initialization of the policy ANN. Interestingly, the trained model modifies this distribution heavily by predicting tiny C s parameters for the vast majority of the flow field while increasing the parameter only in selective elements. The model also takes advantage of the entire admissable range of C s ∈ [0, 0.5]. The reported results demonstrate that the application of the RL paradigm to turbulence modeling (and control tasks in CFD in general) can contribute to major advancements of the state-of-the-art given that the necessary resources can be used efficiently on massively parallel systems.

Conclusions
Supervised learning is generally suitable for learning in situations when input-output pairs can be defined a priori. In contrast, in reinforcement learning, training data is gathered in an online process, in which the policy is sampled for the current state of the dynamical system. This makes RL more suitable for learning optimal behaviors in dynamical systems, e.g. those described by the equations of fluid dynamics, where an a priori definition of an admissible and complete training data space is illusive and would, at best, be cumbersome. Flow control problems thus lend themselves naturally to an RL approach, however, other modeling tasks can be expressed in this context as well. Here, we have chosen to interpret the task of finding an optimal eddy viscosity in space and time as a control problem, and have the RL-trained agent predict a strategy. Our results show that this approach outperforms existing, established models and indeed returns a near optimal behavior in the chosen reward norm. This highlights the potential of combining CFD and RL into an inclusive optimization framework.
However, before we can explore or leverage this potential, we need to enable training and deploying such algorithms at scale, recognizing that they pose different challenges for software development and HPC. In RL, trajectory data consisting of both actions and states needs to be gathered along the solution evolution during training and made available to the gradient update. The RL agent and the CFD scheme thus need to be coupled during training, and they need to exchange large datasets continuously. In addition, the current policy must be explored efficiently, meaning that many parallel runs of the environments are required. Thus, in this work, we propose Relexi as a novel framework for coupling RL algorithms with essentially generic solvers for partial differential equations on heterogeneous hardware. The framework is made up of three components: The PDE solver of choice, in our case, the LES solver FLEXI, TensorFlow for the ML definition and training, and the SmartSim library, which handles job allocation and man-agement and provides an in-memory database for communication and intermediate storage of solution trajectories and policy commands. The PDE solver and TensorFlow can run independently on their chosen hardware, thus exploiting their full parallel potential. We show two methods for improving the overall time-to-solution of the RL problem with Relexi: First of all, the weak scaling across the gradient estimator, that is increasing the amount of runs of the environment for a given policy. More parallel runs explore the current policy more efficiently and thus allow an overall better gradient update and thus quicker convergence. Our scaling results here indicate that the framework scales up to hundreds of parallel environments and thousands of compute cores with very good performance. For the second approach, we scaled the individual simulation runs across more MPI ranks, i.e. we exploit the strong scaling characteristics of the standalone PDE solver. Here, we recover the expected behavior up until the individual core load becomes too small. In combination, these results demonstrate that the Relexi framework is capable of efficient training on HPC systems at scale and can enable RL-methods for CFD for complex flow cases. Further work will focused on applying Relexi to more complex cases and to other combinations of RL and PDE solvers. by the HLRS through the project "hpcdg" and the support by the Stuttgart Center for Simulation Science (SimTech). Moreover, the authors gratefully acknowledge the SmartSim developers for their support.