University of Birmingham Scalability of an Eulerian-Lagrangian large-eddy simulation solver with hybrid MPI/OpenMP parallelisation

Eulerian-Lagrangian approaches capable of accurately reproducing complex ﬂuid ﬂows are becoming more and more popular due to the increasing availability and capacity of High Performance Computing facilities. However, the parallelisation of the Lagrangian part of such methods is challenging when a large number of Lagrangian markers are employed. In this study, a hybrid MPI/OpenMP parallelisation strategy is presented and implemented in a ﬁnite difference based large-eddy simulation code featuring the immersed boundary method which generally employs a large number of Lagrangian markers. A master-scattering-gathering strategy is used to deal with the handling of the Lagrangian markers and OpenMP is employed to distribute their computational load across several CPU threads. A classical domain-decomposition-based MPI approach is used to carry out the Eulerian, ﬁxed-mesh ﬂuid calculations. The results demonstrate that by using an effective combination of MPI and OpenMP the code can outperform a pure MPI parallelisation approach by up to 20%. Outcomes from this paper are of interest to various Eulerian-Lagrangian applications including the immersed boundary method, discrete element method or


Introduction
The constant evolution of High Performance Computing (HPC) systems has accelerated the development of sophisticated Computational Fluid Dynamics (CFD) codes. This has facilitated expensive Direct Numerical Simulations (DNS) and Large-Eddy Simulations (LES) of turbulent flows at low to moderately high Reynolds numbers in complex geometries in the fields of aeronautics, bio-flows or hydro-environmental engineering [1][2][3] . Recent applications of DNS and LES include complex multiphase flows which often require adopting Eulerian-Eulerian frameworks, e.g. scalar transport [4] or free-surface flows [5] ; or Eulerian-Lagrangian frameworks, e.g. fluid-structure interaction [6] or air-gas interaction [7] . Most of the cited computations were performed at a reduced scale and for validation purposes demonstrating the methods' adequacy to reproduce the relevant physics. There is a fast-growing interest in applying these advanced eddy-resolving techniques to real-life problems, which means that the range of spatial and temporal scales to be resolved increases by one or several orders of magnitude and hence these computations become extremely expensive.
Particle Hydrodynamics (SPH) whose applicability is somewhat limited by the number of Lagrangian particles due to their inherent expensive computations, e.g. neighbour searching. Novel hybrid techniques are under development in order to enhance its computational efficiency, such as an MPI/CUDA scheme presented by Dominguez et al. [11] , which improved load-balancing. In a similar fashion, the most efficient way to parallelise Discrete Element Methods (DEM) depends on the homogeneity of the particles' distribution. Gopalakrishnan et al. [12] presented good performance results for a pure MPI implementation of the open-source DEM code MFIX. A similar approach was followed by Yang et al. [13] achieving good scalability results in simulations with more than 10 6 particles. Liu et al. [14] implemented multi-threading in MFIX showing that a hybrid MPI/OpenMP code could overcome load-balancing problems outperforming the MPI scheme when 5.2 · 10 6 particles were simulated. Additionally, Amritkar et al. [15] showed that for several DEM applications pure OpenMP can be notably faster than MPI especially when a reduced number of processes were used in the simulations.
Yakubov et al. [16] adopted a hybrid MPI/OpenMP strategy in an Euler-Lagrange framework to simulate the flow around an airfoil where bubbles were injected into the cavitation areas, and results proved this scheme features good performance. Shi et al. [17] studied in detail the performance differences between pure MPI and hybrid MPI/OpenMP implementations of a DNS code with application to Taylor-Couette flows. They found that a mixed scheme has many benefits in reducing inter-node communications which is key to scale to hundreds or thousands of cores. Similar results were obtained by Guo et al. [18] for the finite element model Fluidity achieving better performance using a hybrid scheme compared to pure MPI due to lower communication overheads.
Overall the development of hybrid MPI/OpenMP parallelisation strategies is of interest to many CFD research areas: in DNS and LES of single-phase flows the main aim is to reduce inter-node communications between hundreds-to-thousands of cores, or in multiphase flow applications in which load-balancing problems due to Lagrangian computations (e.g. interpolation functions reconstruction or neighbour searching algorithm) need to be overcome.
In the research reported here a refined hybrid MPI/OpenMP parallelisation strategy is implemented in the open-source LES-based code Hydro3D. The code features coarse-grained MPI parallelisation the efficiency of which is studied first for the lid-driven cavity test-case. The performance of the hybrid strategy is then evaluated by comparing MPI/OpenMP to pure MPI computations for two Eulerian-Lagrangian fluid-structure interaction test-problems for which the Lagrangian part of the solution is a known bottleneck to the overall speed of the code.
The paper is organised as follows: the governing equations for fluid flow and the immersed boundary method are described in Section 2 . Section 3 presents the hybrid parallelisation strategy. Section 4 presents the performance results comparing pure MPI and hybrid MPI/OpenMP schemes for the multiphase applications. A discussion based on the overall results towards the application of the hybrid parallelisation approach within Eulerian-Lagrangian applications is presented in Section 5 together with the main conclusions from this study.

Numerical framework
Hydro3D is an in-house open-source [19] large-eddy simulation code that has been well-validated in a number of hydroenvironmental engineering flows such as compound channels [20] , contact tanks [21][22][23] , free-surface flows [24][25][26][27] , aeronautical engineering applications such as flow around pitching airfoils [28] or geophysical flows [29,30] . Hydro3D efficiently solves the filtered Navier-Stokes equations for unsteady, incompressible, viscous flows solved in an Eulerian frame reading, where u ( x, t) represents the velocity field, p ( x, t) is the pressure, ν is the kinematic viscosity of the fluid, and f ( x, t) is a volume force from a source external to the fluid, e.g. Lagrangian particles. The sub-grid scale stress (SGS) tensor, τ , can be calculated in Hydro3D with the Smagorinsky [31] or the Wall-Adapted Local Eddy viscosity (WALE) [32] SGS models. Here the latter is adopted as it is preferred when dealing with moving boundaries as it implicitly calculates the SGS viscosity near solid boundaries [6,9] . Hydro3D is based on finite differences with staggered storage of the three velocity components on three rectangular Cartesian grids and the storage of the pressure in their respective cell centre. Second-and fourth-order Central Differences Schemes (CDS) are available to approximate convective and diffusive velocity fluxes. In this research, second-order CDS is adopted as the direct forcing method used to resolve moving boundaries is also second-order accurate [33] .
Hydro3D employs a refined direct forcing Immersed Boundary (IB) and Lagrangian Particle Tracking (LPT) methods to perform simulations of Eulerian-Lagrangian flow problems, allowing the simulation of moving bodies or particles inside a fixed Eulerian fluid domain. Fig. 1 depicts a fixed Cartesian rectangular grid as the Eulerian together with an unstructured Lagrangian domain moving at a given Lagrangian velocity. As depicted in this figure, lower-case variables (here coordinates, velocities and pressure) belong to the Eulerian framework whereas upper-case variables are used in the Lagrangian framework.

Time integration
The advancement in time of the governing equations is performed using the fractional-step method [34] . This adopts the Helmholtz decomposition to calculate the velocity field from a solenoidal and an irrotational vector fields obtained throughout several steps. The first is to predict the non-divergence free velocity, ˜ u ( x, t), from the explicit computation of convection and diffusion terms using a low-storage third-order Runge-Kutta scheme together with pressure values from the previous time step as, where l = 1,2,3 is the Runge-Kutta sub-step for which l = 1 denotes values from the previous time step t − 1 , and α l and β l stand as the Runge-Kutta coefficients with values: α l = β l = 1/3, 1/6, 1/2.
In Eulerian-Lagrangian simulations using the IB method, the external forces are represented by the forcing term f in the r.h.s of Eq. (2) , which is used to correct the predicted velocity ˜ u obtaining the updated intermediate velocity ˜ u * ( x, t) as indicated in Eq. (4) . This source term f enforces the fluid to have the solid velocity at its location fulfilling the no-slip equation as explained in Section 2.2 .
A projection scalar function ˜ p is obtained in Eq. (5) resolving the Poisson pressure equation, which in Hydro3D is accomplished using an iterative multi-grid technique. This equation is iterated until the predicted intermediate velocity field ˜ u * satisfies the incompressiblity condition.
Here, the corrected velocity field satisfies the divergence-free condition once Eq. (6) achieves a residual lower than a set tolerance ε often set to a value ≤ 10 −7 .
∇ ·˜ u * ≤ ε (6) The predicted velocity field ˜ u * is projected onto the divergencefree field following Eq. (7) to obtain the solution velocity field at the current step u t ( x, t).
Note the latter velocity field differs from that obtained right after the Lagrangian correction ˜ u * , i.e. the final flow velocity is not exactly that enforced by the solid in Eq. (4) . Cristallo et al. [35] showed that the error associated to this step can be deemed negligible. Finally, the pressure field at the current time step, p t , is calculated in Eq. (8) resulting from the value at the previous time step p t−1 and ˜ p field.

Immersed boundary method
The IB method in Hydro3D was successfully validated in a series of applications ranging from fluid-structure interaction [36] , flow around pitching airfoils [28] , multi-chamber tanks [23] , tidal turbines [6,37] and geophysical flows [38] . The numerical method used to accomplish fluid-structure interaction is a refined version presented in Kara et al. [36] based on the direct forcing IB method introduced by Fadlun et al. [39] and refined by Uhlmann [33] . In this approach, the immersed solid is comprised by a collection of N L Lagrangian forcing points conforming the targeted geometry ( Fig. 1 ), which directly enforce a no-slip boundary condition at their location through the forcing term f .
The direct forcing method follows a multi-step predictorcorrector procedure which is detailed in the following. First, the predicted Eulerian velocities ˜ u , calculated according to Eq. (3) , are transferred to the closest Lagrangian markers. This procedure is accomplished using interpolation functions which can feature a different number of neighbours depending on their stencil, i.e. support width. Hereinafter n e stands for the number of Eulerian cells used to transfer the information to each neighbouring Lagrangian marker, the total number of Eulerian cells and Lagrangian markers are N e and N L respectively, and n L is the number of Lagrangian markers used to interpolate solid quantities to the closest Eulerian cells. Therefore, the interpolated Lagrangian velocity U L at the marker L is obtained interpolating ˜ u from its n e closest Eulerian neighbours as, Here x i and X L are the coordinates of the Eulerian cell i and Lagrangian marker L , respectively, and x i = x i x j x k represents the volume of an Eulerian cell. The second step of the direct forcing method is to compute the force F L each Lagrangian marker exerts on the fluid to satisfy the no-slip condition at the marker's position [33] . This force term results as the difference between the desired (or forced) marker velocity, U * L , and the interpolated velocity from the fluid U L , calculated as, The forced velocity U * L is computed depending on the solid body movement pattern, and there are three different possible scenarios. Firstly, when the body is static the forced velocity is zero. In case the body moves as a reaction to the fluid action, a fluidstructure interaction algorithm is needed to compute the forced velocity, which results from the time rate-of-change of the marker position as indicated in Eq. (11) [36,40] . The last case is when the solid body is moving with a prescribed velocity and pattern which permits a straightforward calculation of its coordinates at any time, such as the cases of vertical axis turbines rotating at fixed velocities or pitching airfoils oscillating at a given reduced frequency [6,28] .
The third step constitutes the backwards procedure where the Lagrangian force F L is transferred to all the Eulerian cells in the fluid domain affected by Lagrangian particles obtaining the Eulerian force f as, Here the interpolation of F L from the closest n L Lagrangian markers to each fluid cell adopts the delta functions values from the forwards step, Eq. (9) . This implies performing the neighbour searching just in the forwards step which is the most computationally expensive operation in the direct forcing method when dealing with moving boundaries. Note that the forwards interpolation (from Eulerian to Lagrangian) uses the fluid cell volume x i while the backwards process adopts the volume assigned to each of the Lagrangian markers V L , as represented in Fig. 1 . The direct forcing method needs to satisfy mass and torque conservation requiring the force transferred to all fluid cells ( N e ) in the backwards procedure equals the total force exerted by the solid markers ( N L ). This condition is indicated in Eq. (13) which implies the same total force is interpolated between frameworks.
Delta functions based on discrete kernels do not directly satisfy the partition of unity condition [41] and this condition is then fulfilled whenever the volume of each solid marker approximated that of a fluid cell, i.e. V L ≈ x i .

Reconstruction of interpolation functions
Delta functions [42][43][44] or Moving Least Squares (MLS) [41,45] are commonly used to reconstruct the interpolation functions used in the direct forcing method framework. To date, the use of delta functions in IB applications is more frequent than MLS and thus adopted herein. Note that these two techniques have a similar computational expense as the most time-consuming operation in the interpolation procedure is the neighbour searching, which is accomplished in both methods. Therefore performance results obtained here for the delta functions can be extrapolated to MLS. Delta functions δ are calculated as a result of three onedimensional kernels φ as, The kernel functions of φ 3 by Roma et al. [44] , φ 4 by Peskin [43] , and φ * 4 by Yang et al. [42] as a function of the normalised Fig. 2 a shows the support width of these kernels utilising between 3 and 5 neighbours in each direction yielding a total number of neighbours, n e , of 27, 64 and 125, respectively. Fig. 2 b depicts how the interpolation area is distributed in a two-dimensional plane (bounded by solid, dashed or dotted lines, respectively) increasing with the kernel width. Yang et al. [42] showed that a larger number of neighbours reduces the nonphysical force oscillations during the variable exchange procedures. Nevertheless, the improvement of the interpolations results in additional computational load that is directly proportional to the number of neighbours used.
The computational overhead of the IB method is relatively low if Lagrangian markers are static because the interpolation functions are constructed only once at the first time step. However, it increases significantly for moving bodies, especially in DNS/LES as fine grids are required and a solid body can comprise thousands of markers. Searching for the closest fluid cells in the vicinity of Lagrangian markers is the most time-consuming operation due to the fact that the staggered storage of the three components of velocity entails operations on three different grids, i.e. three interpolation functions for each marker needs to be computed. Hence, two relevant computational aspects in the present efficient implementation of the IB method are: (1) delta functions and indices of the closest neighbours to each Lagrangian marker are computed only during the forwards interpolation and stored to be used during the backwards interpolation avoiding a second neighbour searching (2) a master-scatteringgathering technique (explained in Section 3.2 ) is developed to efficiently deal with moving Lagrangian particles that travel along different sub-domains and computed by different processes throughout the simulation.

Parallelisation strategy
Large eddy simulations require sufficiently fine grids to explicitly resolve the large-scale turbulence in the flow [1,46] , which usually implies an enormous computational load especially when the Reynolds number is relatively high. Hydro3D features a Local Mesh Refinement (LMR) methodology that allows refining the fluid mesh in certain areas of interest or of steep gradients while using coarser grids away from them. The implementation of LMR in Hydro3D is detailed in [9] and allows performing highresolution LES with a reasonable amount of MPI processes. The next Section 3.1 details the way that the Eulerian field is divided and mapped using pure MPI, and Section 3.2 focuses on the Eulerian-Lagrangian computation using a new hybrid MPI/OpenMP environment.

Eulerian field parallelisation using MPI
In most LES and DNS the solution of the Poisson pressure equation often constitutes the most time-consuming operation within the fractional-step method. The concept of the coarse-grained MPI parallelisation is used to divide the computational domain of the Eulerian field into sub-domains or blocks and execute them in parallel using multiple processing units by means of the Single Program Multiple Data (SPMD). These blocks are assigned to different MPI ranks and communication is accomplished via MPI operations. The blocks overlap at their boundaries by one or several layers of ghost cells (or halos) through which information is exchanged. Fig. 3 depicts an example of four sub-domains with a solid body in two of them. Each sub-domain overlaps with its immediate neighbours through two layers of ghost cells and the in-  formation stored in these cells is exchanged, guaranteeing a continuous Eulerian field across domain interfaces. This is a standard parallelisation approach for block-structured grids and increases the calculation speed of the solution of the Poisson equation when using a multigrid solver because a big matrix solution problem is broken down into many small sub-problems [47] .

Hybrid MPI/OpenMP strategy
The use of Eulerian-Lagrangian methods featuring a large number of Lagrangian markers in a relatively confined Eulerian domain [6,7,28,37,48] calls for a parallelisation strategy beyond pure MPI, as most likely Lagrangian markers are not equally distributed amongst the Eulerian sub-domains. For instance, in Fig. 3 the Lagrangian markers comprising the airfoil are all located in domains 2 and 3 and consequently there is a greater computational effort solving the flow in these domains than in domains 0 and 1. Adopting a parallelisation strategy that combines MPI and OpenMP aims at taking advantage of the benefits of each technique [49][50][51] . The proposed hybrid scheme in the Eulerian-Lagrangian solver combines a coarse-grained message-passing parallelism in the Eulerian calculations with a fine-grained multi-thread parallelism for the Lagrangian calculations [52] .
A global layout of the proposed hybrid approach using as example 4 MPI processes with 2 of them spawning 3 OpenMP threads to fork the Lagrangian calculations is depicted in Fig. 4 . Firstly, the predicted Eulerian velocities are calculated by each MPI rank according to Eq. (9) . The next step, denoted as "particle allocation", concerns the distribution of the Lagrangian points among the different MPI processes. This is accomplished with a strat-egy combining the "master-slave" concept from Uhlmann [53] with the "scattering-and-gathering" strategy from Wang et al. [54] . The master processor (hereinafter referred to as master) gathers the general information from all the Lagrangian markers, e.g. their coordinates. Based on these data, the master calculates for each marker in which sub-domain i is located and the MPI process that is assigned to deal with it, and also generates a vector X L i with the indices of the markers contained within the sub-domain i . The latter is distributed via SCATTER together with the integer n L i that indicates the number of markers within the sub-domain i . Depending on whether the marker moves (dynamic) or is static (fixed) the particle allocation and scattering-and-gathering are performed at every time step or just once at the first one, respectively.
The computation of the IB method equations ( Eqs. (9) - (12) ) is performed by all MPI processes whose assigned sub-domains contain Lagrangian markers. This differs from Wang et al.'s [54] strategy in which only the master deals with these equations exchanging the Lagrangian velocities and forces via MPI communications with the other MPI processes once resolved. This is an efficient strategy whenever the number of Lagrangian points is relatively small as the communication of large arrays can lead to significant overheads. In the present cases there is a great number of markers causing an important load-unbalancing and hence the proposed alternative strategy adding multi-threading to the "master-scattering-gathering" strategy aims at improving the code's speedup by (i) reducing inter-node communication between MPI processes, and (ii) improving the locality of the IB method computations. No MPI calls are made within the OpenMP parallelised loops, which simplifies the present implementation and avoids some of the drawbacks associated with such hybrid parallelisation schemes [49,55] .
After performing the Lagrangian forcing correction on the Eulerian field via Eq. (12) , an inverse operation sending the Lagrangian forces from the processors to the master is performed via GATHER . The latter is needed when the Lagrangian forces are physically meaningful, such as in the analysis of tidal turbines [6,37] . The final step is to compute the remaining Eulerian variables of the scalar ˜ p correcting the final velocity u t and pressure p t fields using Eqs. (5) - (8) .
In many hybrid MPI/OpenMP applications the distribution of MPI/OpenMP threads is homogeneous, while in our implementation not every MPI process spawns OpenMP threads. This adds certain flexibility in the hybrid scheme especially when allocating cores dedicated to either MPI or OpenMP within the same node to reduce inter-node communications [49] . Fig. 5 shows a schematic core distribution for 38 MPI tasks among 4 nodes from which 12 ranks uses 2 threads. Different distributions among the processes are combined with nodes dedicated to pure MPI processes, mixed use of pure MPI and hybrid MPI/OpenMP, and exclusively dedicated to hybrid MPI/OpenMP.
In order to ensure a correct sequential mapping of the processes to the compute cores in the mixed MPI/OpenMP scheme, it is necessary to tune the launch configuration of the code in the job scheduler. Considering Slurm as scheduler, the environment variable SLURM _ TASKS _ PER _ NODE is set in the batch script to indicate ensure those MPI processes spawning additional OpenMP threads are placed on the same nodes, which can be also shared with those running pure MPI processes. For instance, in the SMP-style (a sequential) assignation of the resources for the scenario depicted in Fig. 5 , the environment variable would be SLURM _ TASKS _ PER _ NODE = 16 (x1 ) , 12 (x1 ) , 8 (x1 ) , 2 (x1 ) , in the value which is a list of comma-separated items of set as A ( xB ) with A denoting the number of MPI processes on B consecutive nodes. This value would Hence, 16 MPI processes are assigned to the first compute node, 12 MPI processes to the second one leaving room for two of them (MPI processes 26 and 27) to spawn an additional OpenMP thread each, 8 MPI processes to the third node, leaving room for all allowing them to spawn an additional OpenMP thread each, and the remaining MPI processes (36 and 37) to the last node. Note this feature is not exclusive of Slurm as the proposed hybrid parallelisation scheme of Hydro3D could be adopted with other schedulers, e.g. PBS, and without the need for administrative privileges.

Parallel performance assessment
This section presents the scalability study of Hydro3D for three different flow problems. The first one is a lid-driven cavity flow case and the other two are high-resolution large-eddy simulations of complex flows of engineering interest [6,28] . The parallelisation speedup S n is evaluated as, where T 0 corresponds to the wall-time or runtime obtained with the configuration using the lowest number of cores and T n corresponds to the runtime when using n cores. The runtime is obtained

Lid-driven cavity flow
Firstly, the scalability of the MPI parallelisation is analysed using the lid-driven cavity flow [56] , a common benchmark case used for testing incompressible flow solvers [57] . Here a similar setup to that by Wang et al. [54] is adopted. The domain is a threedimensional cube the sides of which are equal to one and the Reynolds number is set to 400. A Dirichlet boundary condition is set at the top lid with an imposed velocity of (u, v , w ) = (1 , 0 , 0) , no-slip conditions are used at the bottom, west and east walls and slip conditions are adopted on the north and south walls. The grid resolution is uniform in the whole domain with an even distribution of Eulerian cells per processor. The time step is set variable with a CFL value of 0.5 in all cases.
The resulting flow field obtained with the mesh resolution comprising 160 grid cells along each spatial direction is shown in Fig. 6 with the contour plot of u-velocity at a transversal plane along the mid-width of the domain. Profiles of u-and w-velocities confirm the accuracy of Hydro3D to predict the flow developed in the cavity achieving a good match with those of Ghia et al. [56] . Scalability of the code is assessed for three grid-resolutions using five different number of cores, namely 1, 8, 64, 125 and 512 (512 is the largest number of cores available on the cluster). Details of mesh resolution ( x i ), number of divisions along each direction ( n i = 1 / x i ) and total number of cells ( N e ) are provided in Table 1 . Results of the code's speedup are presented in Fig. 7 a demonstrating that Hydro3D features a good strong scalability for all grid resolutions (and especially for the finest mesh n i = 320) except when 512 CPUs are used in cases 1 and 2, i.e. those with the least number of grid cells.
As mentioned before the most expensive computation at every time step in LES solvers is the solution of the Poisson pressure equation ( Eq. (10) ). Fig. 7 b shows that in the present cases com-   Table 1 Details of mesh resolution, number of divisions per spatial direction and total number of fluid cells for the three scenarios used in the lid-driven cavity flow. puting the pressure solver always takes more than 50% of the total time and for those cases using less than 8 tasks this increases up to 70%. Overall, good strong scalability is obtained as the mean runtime needed to compute the pressure equation decreases almost linearly when increasing the number of cores.

Simulation of a vertical axis tidal turbine
This section assesses the parallelisation performance of Hy-dro3D for the simulation of the flow past a Vertical Axis Tidal Turbine (VATT). In contrast to the lid-driven cavity flow, this simulation requires the computation of moving Lagrangian markers and the hybrid MPI/OpenMP parallelisation approach of the code is employed. The accuracy of these simulations using the immersed boundary method was successfully validated in previous works [6,58] .
The turbine operates in a hydraulic flume with an incoming velocity U 0 of 2.3m/s and comprises three cambered NACA 0018   [6,58] . The efficiency of the hybrid parallelisation strategy is tested with two different mesh resolutions, namely M1 and M2, which are chosen based on the mesh convergence study of Ouro et al. [6] . Table 2 details the mesh configurations regarding the nor-  Fig. 8 shows a zoom-in of the fluid domain of the horizontal plane at z/c = 1 with contours of normalised mean streamwise velocity. The domain is divided into 57 sub-domains in the x -y plane, meaning that the code runs 57 MPI processes when using one vertical layer of sub-domains, whilst 114 and 171 MPI processes are used when adopting two or three layers in the vertical, respectively. The turbine blades move within the finest LMR level but only 24 of the 36 sub-domains deal with Lagrangian markers during the simulation, which means that 24 MPI processes can require to spawn OpenMP threads.
The scalability of the scheme is assessed for 57, 114 and 171number of sub-domains for the coarse resolution (M1) while for the fine resolution (M2) only 114 and 171number of subdomains are used. Each case is tested with a pure MPI configuration (meaning that no additional OpenMP thread is spawned), and using the hybrid MPI/OpenMP scheme with 2 and 3 OpenMP threads for those MPI processes computing IB markers, i.e. each MPI process spawns 1 or 2 additional OpenMP threads. The procedure to determine the processes computing the IB bodies is straightforward when the body moves with a prescribed motion, e.g. a circular movement described by a VATT, and for this case the sub-domains requiring multi-threading are the ones enclosed within the blue boundary in Fig. 8 .
Results of runtime and speedup obtained with the pure MPI and hybrid MPI/OpenMP schemes on the VATT simulations are presented in Fig. 9 . Note that the OpenMP Dynamic schedule directive with chunk size of 50 is used as this gave the best performance according to the test described in Appendix A . For case 1.a, using 1 and 2 additional OpenMP threads reduces the time to compute the IB method by 43% and 55% respectively, which diminish the overall time to 25% and 32%. Nonetheless, the configuration using 105 threads (57 MPI processes with 48 additional OpenMP threads) in case 1.a takes longer to run than using 114 MPI processes with no OpenMP in case 1.b. Note that having two vertical domains reduces the number of Eulerian cells used in the neighbour searching, thus the computation of the IB method also benefits from a larger domain partitioning.
For case 1.b, using 114 MPI tasks with 2 additional OpenMP threads (162 cores in total) the hybrid scheme outperforms by 12% the results obtained with the pure MPI with 171 cores. In view of these runtimes, the effectiveness of the hybrid scheme is deemed conditioned by the time spent solving the Poisson equation, as shown in Table 3 . For the finer mesh resolution, M2, similar results are obtained. In case 2.a, executing the code with 2 additional OpenMP threads drops by 42% the time spent computing the IB method leading to a smaller runtime than that obtained for case 2.b with 171 MPI processes.
The speedup obtained with every hybrid configuration relative to the pure MPI scenario is presented in Fig. 9 c and d for meshes M1 and M2, respectively. The average speedup gained to compute the IB method with 1 additional thread is about 1.7, which in terms of speedup based on the total runtime is 1.3. However, when using 3 threads the speedup increases up to about 2.2 and 1.4 related to the IB method and total runtime per time step. Note that the slope of the speedup curves related to the total runtime flattens when adopting 3 threads as a result of the time spent solving the Poisson equation again becoming the most expensive computation over the IB method.
The percentage of time spent on the different stages of the LES solver to advance the simulation in time for the pure MPI configurations is summarised in Table 3 . It is appreciated that the IB method consumes from 48% to almost 60% of the computing time whilst resolving the Poisson equation takes about 30-38%, which indicates that the IB method arises as the code bottleneck. This is noticeable when increasing from 114 to 171 cores, i.e. 1.5 times more resources, runtime time drops by 14% and 21% for cases 1 and 2 respectively.

Hybrid MPI/OpenMP performance for different kernel functions
In the direct forcing IB method the accuracy and smoothness of the interpolation can be improved increasing the number of neighbours [42] . However, this is not free-of-charge and brings additional computational overhead due to a larger number of operations in the neighbour searching. Here, three delta functions φ 3 , φ 4 and φ * 4 are examined which use 27, 64 and 125 neighbours respectively. Fig. 10 presents the mean runtime per time step and that corresponding to the IB method with mesh M1. For all cases with pure MPI configuration the computing time spent on the IB method increases about 1.8 times using φ 4 and 3.4 times using φ * 4 when compared to the runtime obtained with φ 3 , which uses the least number neighbours in the interpolation.
The runtime obtained with 114 MPI processes and 1 additional OpenMP thread (162 cores in total) using φ * 4 and φ 4 is 22% and 12% lower, respectively, than that with 171 MPI processes without OpenMP threads. In the case of φ 3 , the hybrid scheme does not outperform the pure MPI performance as the resolution of the    Poisson equation is constantly more time-consuming than the IB method. An effective measure of multi-threading performance is through parallel efficiency evaluated as, E n = nT n mT m (19) which compares the runtime obtained with n and m number of cores. Details of the different hybrid configurations with their speedup and parallel efficiency calculated in reference to the respective pure MPI configurations are given in Table 4 , together with the ratio between time computing the IB method ( T IB ) and pressure solver ( T Press ) for the different kernel functions. N omp represents the number of additional OpenMP threads and N threads the total number of threads, summing the ones running MPI processes and OpenMP threads.
The performance of adding multi-threading improves for φ * 4 as speedup values based on the total time step are above 1.4 for all cases using 2 threads and 1.6 when using 3 threads. This is also reflected in good parallel efficiency values being approx. 1.0 and 0.88 for 2 and 3 threads, respectively. Values of E n for φ 4 decrease to a range between 0.85 and 0.94 when using 2 threads and from 0.71 to 0.80 for 3 threads. Speedup results for φ 3 are relatively low ( < 1.2) as the computational load of the IB method compared to that of Poisson pressure equation, i.e. T IB / T Press is constantly lower than 0.75. The resulting parallel efficiency shows values mostly over 0.80 for 2 OpenMP threads while dropping to 0.60 for 3 threads. Results suggest that it should be borne in mind the balance between the higher-order kernels being more accurate, e.g. φ * 4 , and their computational expense.
Overall good parallel efficiency is obtained with the hybrid MPI/OpenMP scheme whenever the ratio T IB / T Press is greater than one, which seems a good indicator of whether extra computational resources should be added to pure MPI ( T IB / T Press < 1) or multithreading ( T IB / T Press > 1).

Simulation of a pitching airfoil under dynamic stall
The simulation of pitching airfoils undergoing dynamic stall is of interest to many fields, as it is a key phenomenon in the aerodynamics of helicopters, micro-aerial vehicles or wind and tidal turbines. Despite its relevance in such variety of flows, the understanding of the dynamic stall is still not complete due to its remarkably complex nature depending on a large number of flow and kinematic variables, e.g. flow or pitching conditions [59] .
In the simulation of pitching airfoils, the velocity gradients over the airfoil surfaces need to be well-resolved in order to accurately capture flow phenomena such as flow separation, laminarto-turbulent boundary layer transition during upstroke motion, or boundary layer reattachment during pitch down. Therefore, eddyresolving approaches, such as LES, provided with very fine meshes are required to obtain trustworthy results. The computational load of simulating these moving bodies is notably high mainly due to extra computations to re-allocate variables when body-fitted or chimera methods are used [60] or to re-construct the interpolation functions using the immersed boundary method [28] .
Here the hybrid parallelisation scheme is tested in the simulation of a NACA 0012 [61] under a pitching motion using the IB method with kernel function φ 4 and whose accuracy was successfully proven in Ouro et al. [28] . In the present case an airfoil with chord length c = 0 . 15 m oscillates sinusoidally with a constant pitching frequency ( ω) around its gravity centre located at 0.25 c away from the leading edge. The angle described by the solid at any time is calculated as, α(t) = α 0 + α · sin (ωt) (20) where α 0 is the pre-set angle with value 10 °, α is the angle amplitude equal to 6 °, and ω is the oscillation frequency equal to 0.32rad/s. The resulting maximum and minimum pitch angles are 16 °and 4 °, respectively. Fig. 11 shows the flow developed over the moving airfoil at α = 15 . 88 • ↑ and 15.92 °↓ . Fig. 11 a represents the last stage of the upstroke motion with the leading edge vortex dominating the flow over the airfoil's suction side also observed during the experiments [61] . This energetic large-scale structure is shed once the airfoil achieves its maximum angle of attack and a trailing edge vortex is then generated as observed in Fig. 11 b.
The performance of the hybrid MPI/OpenMP scheme is tested against pure MPI runs using two grid resolutions. Details are provided in Table 5 together with total number of Eulerian cells N e and Lagrangian markers N L . Mesh P1 features 320 markers uniformly distributed over both pressure and suction sides of the airfoil while 400 markers are adopted in mesh P2. These are selected based on the mesh sensitivity study performed in [28] that provided an accurate resolution of the flow. The grid is uniform in xand y-directions while in the spanwise direction it is z = 2 x .   Four levels of LMR are adopted to construct the fluid mesh as depicted in Fig. 12 with the airfoil embedded within the finest level. This allows to reduce the total number of fluid cells required to perform these fine-grid simulations with a relatively affordable number of computational resources. The sub-domains dealing with IB markers that may require multi-threading are indicated by the red boundary in Fig. 12 . Analogously to the VATT case, the fluid mesh P1 comprises a single layer of sub-domains in the spanwise direction, so up to 12 MPI processes can spawn OpenMP threads. In mesh P2 this number is doubled as two domains are distributed along the spanwise direction. Note the Dynamic OpenMP scheduling directive with chunk size of 50 is used as it obtained the best performance results in Appendix A . Fig. 13 presents the performance comparison between the hybrid MPI/OpenMP and pure MPI parallelisation schemes with mean runtime and speedup values for meshes P1 and P2. Adding an additional OpenMP thread to the MPI processes computing the Lagrangian markers in the 38 sub-domain configuration results in a reduction of the runtime from 4.9s to 3.4s, about a 30% of the total mean time step. This achieves an almost perfect parallel efficiency as 31% more cores are added. The hybrid scheme takes 3.4s per time step compared to 3.11s from the 78 pure MPI setup, i.e. needs about 9% more time to compute each time step but with 36% less Analogous results are obtained for mesh P2 with the hybrid scheme using 50 cores (38 MPI processes and 12 additional OpenMP threads) performing similarly to 78 pure MPI cores and becoming faster when 2 additional OpenMP threads ( N dLm = 24 ) are adopted. Fig. 13 c and d show that the hybrid MPI/OpenMP features a good weak scalability, as similar speedup is achieved between meshes P1 and P2 comparing analogous hybrid configurations.
Results of the relative time computing the IB method and pressure solver, speedup related to the IB method and total time step, and parallel efficiency are presented in Table 6 . For cases 1.a with 2 and 3 threads, the ratio T IB / T Press is consistently almost equal to or above 1.0 identifying the IB method as the code's bottleneck, and consequently the parallel efficiency for these cases is over 1.0 proving the feasibility of the multi-threading scheme. For case 1.b, the relative time spent on the IB method reduces attaining an E n of 0.954 using 2 and 0.829 using 3 threads as the IB method computation becomes faster than computing the pressure equation, i.e. the Lagrangian part is not the code bottleneck after 2 additional threads are used.
It is noteworthy that Ouro et al. [28] performed almost 30 0,0 0 0 iterations to simulate four full cycles of a pitching airfoil albeit their flow and kinematic conditions were different to the present case. The computational load of their LES was equivalent to 38,0 0 0 CPU hours using a pure MPI scheme with 76 cores. In view of the present results, a hybrid scheme with 38 MPI processes and 2 additional OpenMP threads would lead to approx. 22% lower computational cost as the total number of cores is 62 (38 MPI processes + 24 additional OpenMP threads) and the average runtime per time step reduces from 4.58s to 4,37s, i.e 5% less. An even better performance improvement could be achieved for those cases with large stencil kernels, e.g. φ * 4 .

Conclusions
A hybrid MPI/OpenMP parallelisation methodology designed for Eulerian-Lagrangian simulations using the in-house code Hydro3D has been presented and applied to a series of benchmark cases. The hybrid scheme features multi-threading capabilities by means of OpenMP that has been added to an already existing MPI coarsegrained parallelisation approach. OpenMP has been included into a complex master-scattering-gathering strategy which targets the load imbalance of executing the Lagrangian computations and efficiently solves the movement of dynamic bodies across Eulerian sub-domains during the simulation.
In this work, the performance of the enhanced code has been assessed for three different flow problems. The first one comprises various lid-driven cavity flow simulations with up to 512 CPUs which demonstrate the excellent scalability of the pure MPI parallelisation of Hydro3D. Next, the efficiency of the hybrid strategy has been assessed for two challenging fluid-structure inter-action problems in which the Lagrangian-framework-based immersed boundary method is employed to simulate a moving solid body. The mixed parallelisation strategy outperforms the performance of pure MPI schemes in those cases in which the load from Lagrangian computations is considerably larger than the Eulerian ones, and here mainly the solution of the Poisson pressure equation requires most of the resources. It has been shown that the hybrid MPI/OpenMP scheme achieves a reduction of approx. 20% of the computational cost needed for the simulation of the pitching airfoil due to a larger computational load of the Lagrangian part than in the case of the turbine. This performance increase can have a notable impact in many CFD fields considering the huge expense of most LES or DNS applications allowing to carry out a certain simulation in a shorter time using less computational resources.
The hybrid MPI/OpenMP scheme was further analysed for different kernel functions used in the immersed boundary method, which showed that the relative performance of the mixed strategy improves when the number is larger. Good parallel efficiency values close to the unity are reported when 64 and 125 neighbours were adopted in the interpolations. The OpenMP parallelisation was further refined using Dynamic scheduling with chunk size of 50 which performed best with a speedup of 1.3-1.4 times the default Static directive.
The presented test-cases used the standard direct forcing equations without additional fluid-structure interaction algorithm, such as the ones use for deformable bodies, which require to solve a larger number of equations meaning they can benefit even more from the proposed hybrid strategy. Multiphase techniques, such as Lagrangian particle tracking, can also take advantage of the mixed MPI/OpenMP strategy which will be analysed in the future. threads. Results are presented in Fig. A.1 comparing the speedup values relative to the IB method S IB and total time step S total which are calculated based on the runtime obtained the Static directive. The greatest speedup is achieved with the Dynamic directive using a chunk size larger or equal to 50 with S IB = 1.6 and S total = 1.25 using 2 threads, and S IB = 2.25 and S total = 2.35 using 3 threads. Chunk sizes between 50 and 20 0 0 achieve a significant performance increase with any thread number, while a larger chunk size of 50 0 0 experiences a decrease in speedup, being this noticeable for 3 threads. The Guided directive provides better performance than Static but lower than that achieved with Dynamic.
The speedup obtained with the Dynamic directive and chunks equal or larger than 50 are remarkable, suggesting that the performance of multi-threading in the hybrid paralellisation scheme is somewhat compromised by the scheduling directives. This becomes more relevant when a large number of threads are used. Therefore, the improved speedup obtained in case 1.b using 114 MPI processes and 2 additional OpenMP threads over the pure MPI with 171 processes could be only achievable using the Dynamic directive with chunk sizes of 50-200.