FluTAS: A GPU-accelerated finite difference code for multiphase flows

We present the Fluid Transport Accelerated Solver, FluTAS, a scalable GPU code for multiphase flows with thermal effects. The code solves the incompressible Navier-Stokes equations for two-fluid systems, with a direct FFT-based Poisson solver for the pressure equation. The interface between the two fluids is represented with the Volume of Fluid (VoF) method, which is mass conserving and well suited for complex flows thanks to its ability to handle topological changes. The energy equation is solved explicitly and coupled with the momentum equation through the Boussinesq approximation. The code is conceived in a modular fashion so that different numerical methods can be used independently, existing routines can be modified, and new ones can be included in a straightforward and sustainable manner. FluTAS is written in modern Fortran, parallelized using hybrid MPI/OpenMP in the CPU-only version, and accelerated with OpenACC directives in the GPU implementation. We present different benchmarks to validate the code, together with two large-scale simulations of fundamental interest in turbulent multiphase flows: isothermal emulsions in homogeneous isotropic turbulence and two-layer Rayleigh-Bénard convection. FluTAS is distributed under an MIT license and arises from a collaborative effort of several scientists, aiming to become a flexible tool to study complex multiphase flows.


Introduction
Multiphase flows are ubiquitous in many contexts, ranging from environmental flows to industrial applications. The interaction between phases has a prominent role in the formation and evolution of clouds [2], in sediment transport [3,4], in oceanic sprays and bubble generation [5] and, more in general, it represents one of the grand challenges of environmental fluid mechanics [6]. These flows are also crucial in several industrial applications, such as pharmaceutical, transportation, food processing and power generation [7]. From a theoretical point of view, the main difficulty in analyzing multiphase flows lies in their multiscale nature: the length-scale of the interface is of the order of the mean free path, while in most applications the typical length-scale is several orders of magnitude larger (∼10^5-10^6). This huge separation of scales is magnified when dealing with turbulent multiphase flows, which further broaden the spectrum of length-scales, making it unfeasible to bridge all of them within a single framework. For this reason, all the tools developed so far, both experimental and numerical, have focused on only a portion of the scale spectrum, while the remaining part is modelled or neglected.
As regards multiphase turbulence, where most of our interests and applications lie, both experimental investigations and numerical simulations have been used extensively in the last thirty years and have led to important contributions in a variety of problems and configurations: to name a few, particle-laden flows and sediment transport discussed in [8,4], bubbly and droplet flows reviewed in [9,10,11], and oceanic sprays and bubbles as highlighted in [5]. Nevertheless, as already discussed in [10] and despite the recent progress in the instantaneous measurement of bubble/droplet shape [12,13,14], there is still a lack of experimental data for the instantaneous velocity fields of both the carrier and the dispersed phase, as well as for the turbulent kinetic energy and dissipation near the interface. These limitations disappear when dealing with numerical simulations and, therefore, in the last decades interface-resolved simulations of multiphase flows have become a central investigation tool. Despite these advantages, numerical simulations are still limited to simple configurations and moderate scale separation, and pose the challenge of choosing a proper method to fully resolve the two-phase interface. As discussed in [15], there is now consensus that numerical methods suitable to perform interface-resolved simulations of multiphase flow should have the following properties: i) enforce mass, momentum and kinetic energy conservation at the discrete level, ii) allow mismatches in the material properties, whose magnitude depends on the application, and iii) handle complex and possibly arbitrary topological changes.
Among the four main groups of numerical methods for multiphase flows (Front-Tracking (FT) [16], Volume-of-Fluid (VoF) [17], Phase Field (PFM) [18] and Level-Set (LS) [19]), at least one variant of each possesses the aforementioned numerical properties, giving researchers and scientists some freedom in the choice of their preferred numerical tool (see [17,20,21] for a review).
Nevertheless, it is becoming ever clearer that another desirable property of any numerical method is that it can be adapted in a straightforward manner to run massively parallel simulations, especially on accelerated architectures.
With the increase in computing power driven by Graphics Processing Units (GPUs) [22], several HPC centers are now shifting towards GPU-only and GPU-accelerated architectures. This trend is making the GPU parallelization of numerical codes for fluid mechanics a mandatory requirement rather than a simple advantage. This effort has already been undertaken for single-phase codes, where at least three open-source codes for incompressible and fully compressible simulations are able to run on accelerated architectures: AFiD [23], STREAmS [24] and the accelerated version of CaNS [25]. Conversely, on the multiphase side, despite the large availability of CPU-based open-source codes (PARIS Simulator [26], TBFsolver [27], FS3D [28], NGA2 [29], Basilisk [30] and MFC [31], to name a few), limited effort has so far been devoted to their adaptation to hybrid architectures.
In this work, we aim to fill this gap and present FluTAS (Fluid Transport Accelerated Solver), a code for massive Direct Numerical Simulations on multi-GPU and multi-CPU architectures, targeting incompressible multiphase flows, optionally with heat transfer. The numerical solution of these flows is typically performed using finite-difference methods in a staggered variable arrangement, and it involves the solution of a Poisson equation to enforce the constraint on the velocity divergence. In this context, FluTAS builds on the Navier-Stokes solver CaNS [32] and its GPU extension [25], whose key feature is a general implementation incorporating all the possible homogeneous pressure boundary conditions that can benefit from FFT-based elliptic solvers [33]. The single-phase Navier-Stokes solver is extended to a two-fluid code using the one-domain formulation [17,20] and coupled with the algebraic VoF method MTHINC [1] to capture the two-phase interface. This method combines the exact mass conservation properties of certain geometric VoF methods with the reduced number of local operations required for the interface reconstruction in algebraic VoF methods, making it a good candidate for properly exploiting hybrid and accelerated architectures. The version available in our group has been validated in [34] and extensively employed in a variety of multiphase configurations, both for laminar [35,36,37] and turbulent [38,39,40,41] flows. Note that it has also been extended to phase-changing flows [42] and to weakly compressible multiphase flows (low-Mach approximation) [43,44]. This paper is organized as follows. In §2, we introduce the governing equations for the incompressible two-fluid system. The discretization details of the VoF method, energy equation and Navier-Stokes solver are provided in §3, whereas the standard benchmarks for code validation are discussed in §4.
Next, the parallelization strategy and the GPU acceleration are presented in §5, together with the scaling tests in §6. The code capabilities are then demonstrated in two demanding simulations of multiphase turbulence: emulsions in homogeneous isotropic turbulence (HIT) and two-phase thermal convection (§7). Finally, the main conclusions and future perspectives are summarized in §8.

Governing equations
We consider a two-phase system of immiscible incompressible Newtonian fluids (e.g., a gas-liquid system). The two phases are bounded by an infinitesimally thin interface, through which momentum and energy can be transferred. To describe the system, we define a phase indicator function H distinguishing the two phases at position x and time t:

H(x, t) = 1 for x ∈ Ω_1,   H(x, t) = 0 for x ∈ Ω_2,   (1)

where Ω_1 and Ω_2 are the domains pertaining to phases 1 and 2. We can use H to define the thermophysical properties in the whole domain Ω = Ω_1 ∪ Ω_2 as follows:

ξ = ξ_1 H + ξ_2 (1 − H),   (2)

where ξ_i (i = 1, 2) can be the mass density ρ_i, the dynamic viscosity µ_i, the thermal conductivity k_i or the specific heat capacity at constant pressure c_p,i. Hereafter, unless otherwise stated, thermophysical quantities not specifically referring to one of the phases are defined from eq. (2). The evolution of the indicator function is governed by the following topological equation:

∂H/∂t + u_Γ · ∇H = 0,   (3)

where u_Γ is the interface velocity. In the absence of phase change, the one-fluid velocity u is continuous across the interface and can therefore be employed as the interface velocity in equation (3). The equations governing the momentum and energy transport for the liquid and gas phase are coupled through appropriate interfacial conditions [45], reported below in the so-called one-fluid or whole-domain formulation, where each transport equation is defined in Ω [20].
∇ · u = 0,   (4)

ρ (∂u/∂t + u · ∇u) = −∇p + ∇ · [µ(∇u + ∇u^T)] + σκδ_Γ n + ρ̂ g,   (5)

ρ c_p (∂T/∂t + u · ∇T) = ∇ · (k∇T).   (6)

Here, u is the fluid velocity, assumed continuous in Ω, p is the hydrodynamic pressure and T the temperature. In equation (5), σ is the surface tension, κ the local interfacial curvature, δ_Γ a Dirac delta function centered at the interface, n the interface normal, g the gravitational acceleration and ρ̂ the volumetric density field modified to account for the thermal effects in the gravity forces. Using the Oberbeck-Boussinesq approximation, ρ̂ reads:

ρ̂ = ρ_{1,r} [1 − β_1 (T − T_r)] H + ρ_{2,r} [1 − β_2 (T − T_r)] (1 − H),   (7)

where ρ_{i,r} (i = 1, 2) are the reference phase densities, T_r is a reference temperature and β_{1,2} are the liquid and gas thermal expansion coefficients.
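The one-fluid definitions above are straightforward to realize on a discrete field. The following NumPy sketch applies the property blending of eq. (2) and evaluates a per-phase Boussinesq density of the kind discussed in the text; the function names and the exact form of ρ̂ are our illustration, not taken verbatim from the code.

```python
import numpy as np

def blend(xi1, xi2, H):
    """One-fluid property field, eq. (2): xi = xi1*H + xi2*(1 - H)."""
    return xi1 * H + xi2 * (1.0 - H)

def rho_hat(H, T, rho1_r, rho2_r, beta1, beta2, T_r):
    """Volumetric density modified for buoyancy (Oberbeck-Boussinesq),
    written per phase as suggested by the symbols named in the text."""
    return (rho1_r * (1.0 - beta1 * (T - T_r)) * H
            + rho2_r * (1.0 - beta2 * (T - T_r)) * (1.0 - H))

# example: water-like/air-like densities on a tiny indicator field
H = np.array([1.0, 0.0, 0.5])
rho = blend(1000.0, 1.2, H)
```

At T = T_r the thermal correction vanishes and rho_hat reduces to the blended density of eq. (2).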

Numerical methodology
The numerical solution of the governing equations (3), (4), (5) and (6) presented in section 2 is carried out on a fixed regular Cartesian grid with uniform spacings ∆x, ∆y and ∆z along each direction. A marker-and-cell arrangement is employed for the velocity and pressure points [46], whereas all scalar fields are defined at the cell centers. At each time step, the governing equations are advanced in time by ∆t^{n+1} = t^{n+1} − t^n, with the previous time-step denoted ∆t^n = t^n − t^{n−1}. Hereafter, we present the numerical discretization of the governing equations, following the same order in which they are solved.

Volume of fluid: the MTHINC method
The first step of the time-marching algorithm consists in the interface reconstruction and its subsequent advection. As previously mentioned, these tasks are addressed within a fully Eulerian framework using a volume-of-fluid (VoF) method. From a numerical point of view, this consists first in defining the volume fraction φ in each cell of the computational domain as:

φ_{i,j,k} = (1/V_c) ∫_{V_c} H dV,   (8)

with V_c = ∆x∆y∆z. Next, equation (3) is written in terms of the volume fraction as:

∂φ/∂t + ∇ · (uφ) = φ ∇ · u.   (9)

The distinctive feature of each class of VoF method lies in the way H is approximated. In this work we employ the algebraic volume-of-fluid method based on the multi-dimensional tangent hyperbola reconstruction, MTHINC [1], whose central idea is to approximate H in each cell with a hyperbolic tangent:

H(x̃, ỹ, z̃) ≈ (1/2) [1 + tanh(β_th (T(x̃, ỹ, z̃) + d_th))],   (10)

where β_th and d_th are the sharpness and the normalization parameters, respectively, T is a reconstructing polynomial and (x̃, ỹ, z̃) is a cell-local coordinate system normalized so that (x̃, ỹ, z̃) ∈ [0, 1]^3 within each cell. Employing equation (10) has two distinct advantages with respect to the piecewise approximation commonly employed in geometric VoF methods. First, the phase indicator H can be approximated with a reconstructing polynomial T of arbitrary order in a straightforward manner. Next, once T is known, the resulting interface at the two-phase boundary has a smooth but controlled thickness (set by the parameter β_th), which also allows computing accurately the normal vector n and the curvature tensor K directly from φ. More details on the choice of T and the calculation of d_th, n and K are found in the original paper by Ii et al. [1]; for completeness, we report them in Appendix A together with the numerical implementation details. After the reconstruction step, the interface is advected using a directional splitting approach [47,48], which consists in evaluating the numerical fluxes sequentially in each direction using, for each split, the latest estimate of the VoF field.
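For a given β_th, the normalization parameter d_th is fixed by requiring that the cell average of the tanh profile in eq. (10) matches the cell's volume fraction. In one dimension (planar interface, unit cell) the cell average is available in closed form and d_th can be found by bisection; the sketch below is illustrative only, since the actual reconstruction works in 3D with a polynomial interface.

```python
import numpy as np

def cell_average(d, beta):
    """Average of 0.5*(1 + tanh(beta*(x + d))) over x in [0, 1] (analytic)."""
    return 0.5 + (np.log(np.cosh(beta * (1.0 + d)))
                  - np.log(np.cosh(beta * d))) / (2.0 * beta)

def find_d(phi, beta, lo=-10.0, hi=10.0, iters=80):
    """Bisection for d such that cell_average(d, beta) == phi.
    cell_average is monotonically increasing in d, so bisection is safe."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if cell_average(mid, beta) < phi:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For φ = 0.5 the profile must be centered in the cell, which corresponds to d_th = −0.5.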
Accordingly, three provisional fields φ^p_{i,j,k} (with p = [x, y, z]) are first computed sequentially:

φ^p_{i,j,k} = [φ^s_{i,j,k} − ∆t^{n+1} (f^p_+ − f^p_−)/∆l_p] / [1 − ∆t^{n+1} (u^p_+ − u^p_−)/∆l_p],   (11)

where s = [n, x, y] denotes the latest available field for each split, [∆l_x, ∆l_y, ∆l_z] = [∆x, ∆y, ∆z] and [u_x, u_y, u_z] = [u, v, w], with u^p_± the p-th velocity component evaluated at the cell faces. The numerical fluxes f_± in equation (11) are evaluated using the hyperbolic tangent approximation of H, as detailed in Appendix A. Next, a divergence correction step is applied in order to impose the volume conservation of both phases at the discrete level:

φ^{n+1}_{i,j,k} = φ^z_{i,j,k} − ∆t^{n+1} Σ_p φ^p_{i,j,k} (u^p_+ − u^p_−)/∆l_p.   (12)

With the above approach, mass conservation is ensured up to the accuracy with which the divergence-free condition (4) is satisfied. Accordingly, if direct methods are employed to solve the Poisson equation, the mass of each phase is conserved up to machine precision. Another approach with a similar property has been introduced in [49]. However, in that case the dilatation terms in the denominator of equation (11) are treated in an explicit manner, while here an implicit strategy is employed. This comes at the cost of the final correction step, given by equation (12), but with the advantage of not introducing additional time-step restrictions (apart from the convective one) in the advection of the color function.
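The interplay between the implicit dilatation terms in eq. (11) and the final correction of eq. (12) can be checked in a few lines. The 2D periodic sketch below uses a plain first-order upwind flux as a stand-in for the MTHINC fluxes (the conservation argument does not depend on the flux choice): substituting eq. (11) into eq. (12), the dilatation contributions cancel and the update reduces to a telescoping sum of face fluxes, so the total amount of φ is conserved to machine precision.

```python
import numpy as np

def sweep(p, uf, dt, dl, axis):
    """One split advection, eq. (11): uf[i] is the face velocity at i+1/2
    (periodic). Returns the provisional field and the cell dilatation."""
    p_up = np.roll(p, -1, axis)                    # phi in cell i+1
    F = np.where(uf > 0.0, uf * p, uf * p_up)      # upwind flux at face i+1/2
    dF = (F - np.roll(F, 1, axis)) / dl            # flux difference per cell
    div = (uf - np.roll(uf, 1, axis)) / dl         # cell dilatation
    return (p - dt * dF) / (1.0 - dt * div), div

def advect(phi, uf, vf, dt, dx, dy):
    """x sweep, y sweep, then the divergence correction of eq. (12)."""
    px, divx = sweep(phi, uf, dt, dx, 0)
    py, divy = sweep(px, vf, dt, dy, 1)
    return py - dt * (px * divx + py * divy)
```

With the correction, Σφ^{n+1} = Σφ^n for any face velocity field, while boundedness of φ additionally requires a discretely divergence-free velocity, as the text remarks.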

Thermal effects
The next step of the time-marching algorithm consists in advancing the temperature field using an explicit second-order Adams-Bashforth method:

T^{n+1} = T^n + ∆t^{n+1} (f_{t,1} M^n_T − f_{t,2} M^{n−1}_T),   (13)

where f_{t,1} = 1 + 0.5∆t^{n+1}/∆t^n and f_{t,2} = 0.5∆t^{n+1}/∆t^n are the coefficients of the Adams-Bashforth scheme. In equation (13), the operator M_T accounts for the advection and diffusion contributions and is provided below in semi-discrete form:

M_T = −∇ · (uT) + (1/(ρc_p)) ∇ · (k∇T).   (14)

All the spatial terms in equation (14) are discretized with second-order central schemes, except for the temperature advection term, whose discretization is based on the fifth-order WENO scheme (WENO5), as in reference [50].
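The variable-step Adams-Bashforth update of eq. (13) is compact enough to sketch directly; the coefficients reduce to the classical (3/2, 1/2) pair when the time step is constant, and their difference is always one (consistency). Function names are illustrative.

```python
def ab2_coeffs(dt_new, dt_old):
    """Adams-Bashforth coefficients for a variable time step, eq. (13)."""
    r = 0.5 * dt_new / dt_old
    return 1.0 + r, r                 # f_t1, f_t2

def ab2_step(T, M_n, M_nm1, dt_new, dt_old):
    """T^{n+1} = T^n + dt^{n+1} * (f_t1 * M_T^n - f_t2 * M_T^{n-1})."""
    f1, f2 = ab2_coeffs(dt_new, dt_old)
    return T + dt_new * (f1 * M_n - f2 * M_nm1)
```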

Pressure correction algorithm
Once the energy equation has been advanced, the momentum equation is solved with a second-order pressure-correction method [51], reported in semi-discrete form in equations (15)-(19). The operators M^n_u and M^{n−1}_u in equation (15) include the convective and diffusive terms computed at the current and previous time levels, while the surface tension and gravity forces are included as source terms. The spatial gradients in M_u are discretized with central schemes. The intermediate velocity u* is then updated with the contribution of the terms arising from the time-pressure splitting, as in equation (16). Note that ρ_0 is the minimum value of the density field in the computational domain and p̂ represents the time-extrapolated pressure between the current and the old time step, i.e. p̂ = (1 + ∆t^{n+1}/∆t^n) p^n − (∆t^{n+1}/∆t^n) p^{n−1}. Following [52], and contrary to [53,54], the terms arising from the pressure splitting are included in the prediction of the velocity field (see eq. (16)) before the imposition of the boundary conditions. This approach has two distinct advantages. First, it represents an incremental pressure projection, which allows achieving a pressure field that is almost second-order accurate in time [52]. Next, it ensures the consistency of the pressure field near a solid boundary (i.e., u^{n+1} = u* = 0), where the pressure-gradient component normal to the boundary vanishes (i.e., ∇_⊥ψ^{n+1} = 0) independently of the local density (see eq. (18)). The constant-coefficient Poisson equation (17) is solved with the eigenexpansion technique, which can be employed for different combinations of homogeneous pressure boundary conditions [33]. Finally, the velocity field is corrected as in equation (18) in order to impose the divergence constraint (i.e., a solenoidal velocity field), and the pressure is updated as in equation (19).
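The key feature of the splitting is that the Poisson problem of eq. (17) has constant coefficients (ρ_0 is a constant) even though the density field is not, so the fast solver described in the next section applies. The 2D periodic NumPy sketch below illustrates the projection steps corresponding to eqs. (17) and (18): the divergence of the predicted velocity forms the source term, the constant-coefficient Poisson problem is solved spectrally with the finite-difference eigenvalues, and the gradient correction returns a discretely divergence-free field. The periodic setting and names are our simplification.

```python
import numpy as np

def project(ustar, vstar, rho0, dt, dx):
    """Correct a staggered 2D periodic velocity (u at x-faces, v at y-faces)
    to be discretely divergence-free: solve lap(psi) = rho0/dt * div(u*),
    then u = u* - dt/rho0 * grad(psi)."""
    n = ustar.shape[0]
    div = ((ustar - np.roll(ustar, 1, 0))
           + (vstar - np.roll(vstar, 1, 1))) / dx
    k = np.arange(n)
    lam = (2.0 * np.cos(2.0 * np.pi * k / n) - 2.0) / dx**2  # FD eigenvalues
    L = lam[:, None] + lam[None, :]
    L[0, 0] = 1.0                       # pin the undetermined mean of psi
    psi_hat = np.fft.fft2(rho0 / dt * div) / L
    psi_hat[0, 0] = 0.0
    psi = np.real(np.fft.ifft2(psi_hat))
    u = ustar - dt / rho0 * (np.roll(psi, -1, 0) - psi) / dx
    v = vstar - dt / rho0 * (np.roll(psi, -1, 1) - psi) / dx
    return u, v
```

Because the face gradient and the cell divergence compose into the same three-point Laplacian whose eigenvalues are used in the solve, the corrected divergence vanishes to machine precision.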

Poisson solver
The code uses the FFT-based finite-difference direct solver developed and implemented in the DNS code CaNS; see [32,25]. The underlying numerical approach dates back to the late 1970s [55,33] and has regained popularity in recent years thanks to improvements in hardware and in software frameworks for collective data communication, provided by the MPI standard and by higher-level libraries like 2DECOMP&FFT. In a nutshell, the approach uses Fourier-based expansions along two domain directions, which reduce the system of equations resulting from the three-dimensional second-order finite-difference Laplace operator (seven non-zero diagonals) to a simple tridiagonal system. These Fourier-based expansions depend on the boundary conditions of the system, and can be computed using FFTs, some of them with pre-/post-processing of the FFT input/output (see, e.g., [56]). The FFT-based expansions are employed along directions x and y, and the resulting tridiagonal system along z is then solved using Gauss elimination. For calculations on CPUs, the method leverages the guru interface of the FFTW library [57], which allows performing all possible combinations of discrete transforms with the same syntax. On GPUs, the fast discrete cosine and sine transforms have been implemented using real-to-complex/complex-to-real FFTs from the cuFFT library, with pre- and post-processing of the input and output signals to calculate the desired series expansion [56,25]. We refer to Refs. [32,25] for more details on this method and its implementation. Concerning the parallelization of the method in a distributed-memory setting, the FFT-based transforms and the Gauss elimination step require the data along the corresponding direction to be local to each MPI task. The domain is decomposed using a 2D pencil decomposition, and collective all-to-all communications are required to transpose the orientation of the 2D data decomposition.
These transposes are performed using the 2DECOMP&FFT library [58], which was modified to allow for GPU-GPU communication in [23,25]. It is worth noting that, in line with the recent developments of CaNS, the present method uses a default decomposition (i.e., "outside" the Poisson solver) based on a partitioning along y and z, resulting in x-aligned pencils. This reduces the total number of data transposes to be performed during the solution of the Poisson equation from 6 to 4. The approach has been adopted for both CPUs and GPUs, and the operations required to solve the Poisson equation are summarized as follows:
1. perform forward FFT-based transforms along x;
2. transpose x-to-y;
3. perform forward FFT-based transforms along y;
4. transpose y-to-z;
5. solve the tridiagonal system along z using Gauss elimination;
6. transpose z-to-y;
7. perform backward FFT-based transforms along y;
8. transpose y-to-x;
9. perform backward FFT-based transforms along x.
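The nine steps above can be condensed into a serial sketch, with no transposes needed when all data fit in one memory space. Assuming periodicity in x and y and homogeneous Neumann conditions in z (our choice for the sketch; the actual solver supports more general homogeneous pressure boundary conditions), the FFTs along x and y turn the seven-diagonal Laplacian into one independent tridiagonal system per wavenumber pair, solved by Gauss elimination (the Thomas algorithm), with the singular mean mode pinned.

```python
import numpy as np

def thomas(a, b, c, d):
    """Gauss elimination for a tridiagonal system (a: sub-, b: main,
    c: super-diagonal); forward sweep then back-substitution."""
    b = b.astype(complex).copy(); d = d.astype(complex).copy()
    n = len(d)
    for i in range(1, n):
        w = a[i] / b[i - 1]
        b[i] -= w * c[i - 1]
        d[i] -= w * d[i - 1]
    x = np.empty(n, dtype=complex)
    x[-1] = d[-1] / b[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (d[i] - c[i] * x[i + 1]) / b[i]
    return x

def poisson(f, dx, dy, dz):
    """FFTs in x and y (steps 1-4), tridiagonal solve in z (step 5),
    inverse transforms (steps 6-9). Periodic in x, y; Neumann in z."""
    nx, ny, nz = f.shape
    fh = np.fft.fft2(f, axes=(0, 1))
    lx = (2.0 * np.cos(2.0 * np.pi * np.arange(nx) / nx) - 2.0) / dx**2
    ly = (2.0 * np.cos(2.0 * np.pi * np.arange(ny) / ny) - 2.0) / dy**2
    a = np.full(nz, 1.0 / dz**2); c = np.full(nz, 1.0 / dz**2)
    ph = np.empty_like(fh)
    for i in range(nx):
        for j in range(ny):
            b = np.full(nz, -2.0 / dz**2) + lx[i] + ly[j]
            b[0] += 1.0 / dz**2; b[-1] += 1.0 / dz**2  # Neumann ghost fold
            cc, d = c.copy(), fh[i, j, :].copy()
            if i == 0 and j == 0:                      # singular mean mode:
                b[0], cc[0], d[0] = 1.0, 0.0, 0.0      # pin psi = 0 there
            ph[i, j, :] = thomas(a, b, cc, d)
    return np.real(np.fft.ifft2(ph, axes=(0, 1)))
```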
Moreover, for the GPU implementation, the solver explicitly reduces the number of all-to-all operations when the domain is not decomposed along z (i.e., when an x-y slab decomposition is prescribed). This effectively decreases the number of collective operations from 4 to 2 (steps 2 and 8 above are skipped). This is the approach adopted in the GPU runs presented here: thanks to the higher memory bandwidth of GPUs, a slab decomposition suffices for distributed-memory calculations with sufficiently small wall-clock time per step. Explicitly skipping these two no-op transposes resulted in a substantial reduction in wall-clock time per step, and in an overall improvement in the parallel scalability of the solver.

Complete solution algorithm
For clarity, a step by step description of the overall solution procedure is presented in Algorithm 1.

Two-dimensional Zalesak's disk
The Zalesak problem represents a classical benchmark to assess the accuracy of the interface capturing/tracking algorithm. It consists in the solid-body rotation of a slotted disk immersed in an imposed two-dimensional velocity field.

Algorithm 1 Overall solution procedure
1: φ^0, T^0, u^0, p^0 are initialized;
2: ρ^0, µ^0, k^0 and c^0_p are calculated using equation (2) from φ^0;
3: n = 0 is set;
4: while (t < t_tot and n < N_tot) do
...
Periodic boundary conditions are prescribed in both directions. Simulations are conducted up to t = 2π (i.e., one complete revolution of the slotted disk) using a constant time-step ∆t = 2π/3200. Note that this value has been chosen to ensure a stable time integration for the highest-resolution case (i.e., 256 × 256) and is employed for the coarser cases as well. Figure 1 shows the final disk shape for different grid resolutions and for two sharpness parameters, β_th = 2 and β_th = 3. Note that the largest deviations from the initial shape occur at the corners, where the high-curvature regions are located. Moreover, the solution depends only weakly on the value of β_th, and deviations between the employed values of β_th are visible only for the coarser simulations. Finally, to assess the accuracy of the solution, we compute the L_1 error norm,

L_1 = (1/(N_x N_y)) Σ_{i,j} |φ_{i,j}(t = 2π) − φ_{i,j}(0)|,

and the order of convergence,

p = log_2 (L_{1,N} / L_{1,2N}),

where L_{1,N} is the L_1 error using N_x × N_y grid points and L_{1,2N} is the L_1 error evaluated with 2N_x × 2N_y grid points. Results are reported in figure 2, where an order of convergence between first and second order is achieved for φ, almost independently of the employed value of β_th.
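The two error measures used here are easily scripted; the sketch below (with illustrative names) computes a grid-averaged L_1 error and the observed order from a grid-doubling pair.

```python
import numpy as np

def l1_error(phi, phi_ref):
    """Grid-averaged L1 error between a field and a reference."""
    return np.mean(np.abs(phi - phi_ref))

def observed_order(err_N, err_2N):
    """p = log2(L1_N / L1_2N) for a grid-doubling pair."""
    return np.log2(err_N / err_2N)
```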

Three-dimensional rising bubble
The rising bubble test case is a well-established numerical benchmark for multiphase flows [59]. This test is presented here to showcase the ability of the numerical tool to accurately capture the topological changes of a moving interface. The flow is driven by the density difference between the two phases, and is influenced by the viscosity difference and the surface tension. The relevant dimensionless groups for this flow are the Reynolds number Re = ρ_g u_r l_r/µ_g, the Weber number We = ρ_g u_r^2 l_r/σ, the Froude number Fr = u_r/√(|g|d_0), the density ratio λ_ρ = ρ_l/ρ_g and the viscosity ratio λ_µ = µ_l/µ_g. In these definitions, l_r is a reference length and u_r the reference velocity. Moreover, σ is the surface tension coefficient, ρ_g and ρ_l the reference gas and liquid densities, and µ_g and µ_l the reference gas and liquid dynamic viscosities. Finally, g is the acceleration of gravity and d_0 is the initial diameter of the spherical bubble. Following the benchmark study [59], the values adopted for the dimensionless groups are Re = 35, We = 1, Fr = 1, λ_ρ = 10 and λ_µ = 10, setting l_r = d_0, u_r = √(|g|d_0) and the reference time t_r = √(d_0/|g|). The dimensions of the computational domain are l_x = l_y = 2d_0 and l_z = 4d_0. The acceleration of gravity acts along the z-direction. No-slip and no-penetration boundary conditions are prescribed at the horizontal top and bottom boundaries of the domain (z-normal), and periodic conditions are prescribed at the vertical boundaries (x- or y-normal). A uniform Cartesian grid of 128 × 128 × 256 cells is used. Initially, stagnant flow conditions are applied and the center of mass of the spherical bubble, denoted (x_c(t), y_c(t), z_c(t)), is located at (d_0, d_0, d_0). A constant time-step ∆t/t_r = 2.8 × 10^{-4} is used to advance the solution in time. Figure 3 shows the isosurfaces of φ = 0.5 at various time instances.
It is evident that, as the initially spherical bubble rises, its surface topology changes. To compare against the reference results from [59], figure 4 shows the evolution of the bubble rise velocity U_c in time. The bubble velocity is defined as

U_c = ∫_Ω φ w dV / ∫_Ω φ dV,

where w is the vertical velocity component and Ω the entire domain. After an initial period where the bubble accelerates, the rise velocity reaches a maximum and then stabilizes. Figure 4 demonstrates an excellent agreement between present and reference results. To further quantify this agreement, table 1 reports benchmark quantities (t/t_r, U_c/u_r and A_0/A) at specific time instances. Besides the bubble velocity, the table shows the bubble sphericity A_0/A, defined as the initial value of the bubble surface area over its value at a later time. The deviation between reference and present values is less than 1%.
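The rise velocity defined above is a phase-weighted average of the vertical velocity, which reduces to one line on a discrete grid; the helper below (illustrative, not the code's postprocessing routine) also returns the phase volume used for normalization.

```python
import numpy as np

def rise_velocity(phi, w, dV):
    """U_c = sum(phi * w) dV / sum(phi) dV: phase-averaged vertical velocity,
    with phi the volume fraction and w the vertical velocity per cell."""
    vol = np.sum(phi) * dV
    return np.sum(phi * w) * dV / vol, vol
```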

Differentially heated cavity
To demonstrate the accuracy of the code in the presence of thermal effects, this section considers the flow of air in a closed two-dimensional square cavity. The cavity is heated and cooled by the vertical side walls (y-normal), while the horizontal walls are adiabatic (z-normal). Within this configuration, a circulation is formed and maintained by the ascending hot fluid next to the heated wall and the descending cold fluid next to the cooled wall. The flow is therefore purely thermally driven and is characterized by the Rayleigh number Ra = |g|β∆T l_r^3/(να) and the Prandtl number Pr = ν/α. In these definitions, β is the fluid thermal expansion coefficient, ν is the fluid kinematic viscosity, α is the fluid thermal diffusivity and ∆T = (T_h − T_c) is the temperature difference between the heated (T_h) and cooled (T_c) walls. Typically, the height of the cavity is taken as the reference length (l_r = L_z), while the reference velocity and time are defined as u_r = α/l_r and t_r = l_r^2/α. The case simulated here follows the setup presented in several studies [61,62,60]. For the temperature field, constant-temperature boundary conditions are applied on the vertical walls and a zero temperature gradient along the normal direction is applied on the horizontal walls. The domain is discretized in space using a uniform Cartesian grid with 256 × 256 cells. Initially, the air in the cavity is stagnant and isothermal at a temperature T_0 = T_c. A constant time-step ∆t/t_r = 5.0 × 10^{-7} is used to advance the solution in time. Figure 5a) shows the contour of the temperature field at t/t_r = 0.5, at which point a steady state has been reached. The temperature field is characterized by thin, spatially developing thermal boundary layers next to the thermally active vertical walls, and by a stratified region in the central area of the cavity.
The heat transfer rate inside the cavity is expressed through the Nusselt number, defined as

Nu = h l_r / k = l_r |∇T_w · n_w| / ∆T,

where h is the heat transfer coefficient, k is the fluid thermal conductivity, ∇T_w is the temperature gradient on either of the thermally active vertical walls and n_w is the corresponding unit wall-normal vector. Figure 5b) shows the comparison of the temporal evolution of the wall-averaged Nusselt number Nu_z on the heated wall between the present results and the reference results from [60]. It is evident that the present results are in excellent agreement with the reference solution for the entire duration of the simulation. Furthermore, Table 2 presents the comparison of key benchmark quantities at steady state, confirming the agreement between present and reference results.
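On a cell-centered grid, the wall temperature gradient in the Nusselt definition can be approximated with a one-sided difference between the wall value and the first interior cell; in pure conduction (a linear temperature profile) the formula below returns exactly Nu = 1. The discretization and names are our illustration, not necessarily the code's postprocessing.

```python
import numpy as np

def wall_nusselt(T_first, T_h, T_c, dy, l_r):
    """Wall-averaged Nu on the heated wall at y = 0: the first cell centre
    sits at y = dy/2, so dT/dn is approximated one-sidedly over dy/2."""
    dTdn = (T_first - T_h) / (0.5 * dy)     # gradient into the fluid
    return -l_r * np.mean(dTdn) / (T_h - T_c)
```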

Domain decomposition
The code is designed to run on both multi-CPU and multi-GPU architectures. For the domain decomposition, both slab (1D) and pencil (2D) decompositions are supported [32] through the 2DECOMP library [58]. The type of decomposition can be implicitly set via the two-component array dims (e.g., [1, n] for slabs and [n, m] for pencils) in the input file dns.in. The pencil/slab orientation can be arbitrarily chosen, as in CaNS, via the preprocessor flags -D_DECOMP_X, -D_DECOMP_Y and -D_DECOMP_Z, which set the direction along which the domain is not decomposed. This flexibility allows improving the efficiency of both the CPU and the GPU implementations. On CPUs, using pencils allows increasing the number of processes per execution (i.e., up to N^2 for n_x = n_y = n_z = N), hence reducing the time to solution. In the GPU implementation, only the z-pencil and x-slab decompositions are available. It is recommended to use x-slabs on GPUs (i.e., compiling with -D_DECOMP_X and using dims = [1, n]), as this configuration reduces the number of all-to-all calls to the minimum, hence reducing GPU-GPU communication and improving performance on multi-node runs.

Code parallelization
The parallelization is performed using MPI. When GPU acceleration is enabled, one MPI rank is assigned to each GPU. The code assumes the chosen MPI library is "CUDA-aware", meaning that GPU data can be passed directly to MPI function calls and the MPI implementation takes care of moving the data in the most efficient way. If available, GPU-to-GPU communication can leverage NVIDIA NVLink, a physical GPU-to-GPU interconnect with substantially higher bandwidth (at least one order of magnitude) than InfiniBand. Throughout the code, all nested loops, i.e. iterations over all the domain points, are accelerated on GPUs using OpenACC [63], a portable, standard, directive-based programming model that can execute code on multicore CPUs as well as on accelerators like NVIDIA GPUs. This offload is not used for CPU-only compilation and execution. To execute FluTAS, the platform needs to support NVIDIA Unified Memory, which has two main advantages: • the ability to allocate and manage more GPU memory than is physically present on the device; • the ability to avoid explicitly handling Host-to-Device and Device-to-Host data movements, leaving this work to the runtime.
Both features are used in the code and proved crucial for an efficient GPU acceleration.

Code performance
We now present an analysis of the code performance on standard CPU-based and accelerated GPU-based architectures. Tests on GPUs were performed on MeluXina at LuxProvide (LXP, Luxembourg) [64] and on Berzelius at the National Supercomputer Centre (NSC, Sweden) [65], while tests on CPUs were performed on Tetralith, also managed by NSC.

Weak and strong scaling
We first discuss the weak-scaling tests for a Rayleigh-Bénard problem with the same set-up discussed in §7.2. For this test, we start with a "base" computational grid of N_x × N_y × N_z = 1024 × 512 × 256 grid points on 2 GPUs. Then, while keeping N_x and N_y fixed, we increase N_z proportionally to the number of GPUs, mimicking a spatial ensemble-averaging procedure (i.e., more structures are simulated to improve the convergence of the large-scale statistics). As discussed in §5, we adopt a slab parallelization along the z direction using the -D_DECOMP_X compilation option, which reduces the number of all-to-all operations to 2. It is worth noticing that, although both HPC machines are equipped with NVIDIA A100-40GB cards, Berzelius has 8 GPUs/node while MeluXina has 4 GPUs/node. Moreover, the interconnection between GPUs within a node is handled through NVLink, whereas node-to-node communication goes through InfiniBand (IB), known for having lower bandwidth and for operating on a different protocol. Hence, IB is less straightforward to handle, as it requires a more careful configuration from a hardware and software perspective. This implies choosing the right MPI configuration and selecting compatible communication libraries, resulting in an efficiency that may vary significantly among different HPC centers. For these reasons, to demonstrate the IB-independent scaling and maximize the GPU-to-GPU communication throughput, Berzelius was used to perform weak-scaling tests within a node, while MeluXina was used for multiple-node tests to assess the IB-dependent scaling.[1] Figure 6a) shows that the weak scaling is linear when bounded by NVLink communications (i.e., no IB communications), as clearly supported by the tests on Berzelius. When IB communications are required (i.e., node-to-node data transfer), the code performance decreases.
It is worth noticing that, while node-to-node communication over the IB network introduces an increasing communication overhead, an additional slowdown is caused by the slab parallelization: by increasing the number of elements along z, more data need to be transferred during the x-to-z transposes, further increasing the communication load. This is clearly shown in figure 6b), where the slowdown is found to increase proportionally to the number of GPUs. Results of the strong-scaling tests are reported in figure 7. Here we use two different grids, i.e. 1024 × 512 × 1024 (grid-1) and 1024 × 1024 × 1024 (grid-2), for the Rayleigh-Bénard problem discussed in §7.2. Tests are performed on MeluXina and Berzelius, as for the weak scaling. While keeping the problem sizes fixed, the number of GPUs is progressively increased up to a maximum of 128, starting from N_GPU = 16, which represents the minimum amount required to fit the two computational domains in the available GPU memory. Although a speed-up is always achieved, the code shows a progressive loss in performance, i.e. a reduction of the benefit derived from increasing the number of GPUs. Note, however, that a larger number of grid points (e.g., grid-2) leads to a smaller performance loss, as a higher GPU occupancy can be maintained. The decrease in performance observed in figure 7 is caused by two factors: the increase in communication among GPUs, and the reduction in local problem size, which no longer leverages the full compute capacity of each GPU.

[1] Berzelius uses an HGX 8-way platform (8 A100 GPUs connected all together), designed primarily for heavy AI workloads. MeluXina uses an HGX 4-way platform (4 A100 GPUs connected all together), which is better suited for scale-out HPC workloads. The majority of GPU-accelerated HPC clusters used primarily for simulation workloads in various scientific fields adopt the HGX 4-way configuration (including many European HPC systems funded by EuroHPC).
While both effects are present in a strong-scaling test, weak scaling allows us to isolate the effect of multi-GPU communication while keeping a high GPU saturation. Thus, we argue that weak scaling is the better tool to identify communication bottlenecks on multiple GPUs. Conversely, strong scaling is more useful to estimate how far a fixed domain can be partitioned while keeping an efficient use of the computational resources. Overall, the previous analysis suggests an important guideline for the user: in the presence of unbalanced compute-vs-network architectures (e.g., node-to-node network connections less efficient than the connections among GPUs within the same node), the number of GPUs employed should be chosen as close as possible to the minimum required to fit the computational domain in the available GPU memory. This was not always the case with older HPC architectures using previous generations of GPU hardware, where NVLink connections across GPUs inside a node were typically missing. For a fixed problem size, modern cards with high compute throughput exhaust the required computation faster, leaving the remainder of the run communication bound; on older GPU hardware, the acceleration is lower and communication becomes the dominant cost only at larger GPU counts.
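The metrics used to read figures 6 and 7 can be made explicit with a short sketch (the timings below are illustrative placeholders, not measured FluTAS data):

```python
# Sketch: weak- and strong-scaling metrics. Ideal weak scaling keeps the
# wall-clock time per step constant; ideal strong scaling halves it when the
# GPU count doubles. All numbers here are illustrative, not measurements.

def weak_efficiency(t_base, t_n):
    """Ratio of base to scaled wall-time; 1.0 means perfect weak scaling."""
    return t_base / t_n

def strong_speedup(t_base, t_n):
    """Speedup relative to the smallest run that fits in GPU memory."""
    return t_base / t_n

def strong_efficiency(t_base, n_base, t_n, n):
    """Fraction of the ideal (linear) speedup actually achieved."""
    return (t_base / t_n) / (n / n_base)

# Example: doubling GPUs from 16 to 32 but gaining only a 1.7x speedup
# corresponds to 85% parallel efficiency.
print(strong_efficiency(1.0, 16, 1.0 / 1.7, 32))
```

In this vocabulary, figure 6 reports `weak_efficiency` staying near 1 inside a node and dropping once IB transfers appear, while figure 7 reports `strong_speedup` flattening as the local problem size shrinks.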

CPU-GPU comparison
In conclusion, we perform a comparison between the code performance on a CPU and on a GPU architecture. It is worth mentioning that such a comparison is notoriously non-trivial. First, no exact and standard procedure to compare the two systems is established. Second, code performance may exhibit large variations among different architectures, and using the same hybrid CPU-GPU node to test both may be misleading: CPU-only nodes and CPU-GPU nodes are intrinsically different in terms of network configuration and GPU/CPU interconnection, hence an unbiased test cannot be performed directly on hybrid architectures (a CPU-GPU cluster would hardly be used to run CPU-only jobs). Therefore, the following analysis has to be taken as a first-approximation estimate. Here, we repeat the weak-scaling simulation with n_GPU = 8 GPUs on Berzelius and with n_CPU = 512 CPU cores on Tetralith, in both cases employing a slab parallelization along z. The test shows that for GPUs the average wall-clock time per time step is t_{8,GPU} = 0.191 s, while for CPUs t_{512,CPU} = 1.075 s. This results in an equivalent number of CPU cores per GPU of n_eq = (t_{512,CPU} n_CPU)/(t_{8,GPU} n_GPU) ≈ 359. Finally, a comparison in terms of computing-load percentage for each code section is displayed in figure 8. As previously anticipated, during GPU simulations the transposes (panel a) represent more than half of the computing load. The remaining parts, mainly composed of stencil operations enclosed in loops, largely benefit from GPU offloading, while on CPUs (panel b) they account for more than 70% of the total wall-clock time per time step.
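The equivalent-resource estimate above is a one-line computation; the sketch below reproduces it from the reported timings (rounding of the published values explains the small difference from the quoted ≈ 359):

```python
# Sketch reproducing the "equivalent number of CPU cores per GPU" estimate:
# n_eq = (t_CPU * n_CPU) / (t_GPU * n_GPU), using the timings quoted in the text.

t_gpu, n_gpu = 0.191, 8      # s per time step on 8 GPUs (Berzelius)
t_cpu, n_cpu = 1.075, 512    # s per time step on 512 CPU cores (Tetralith)

n_eq = (t_cpu * n_cpu) / (t_gpu * n_gpu)
# The paper quotes n_eq ~ 359; recomputing from the rounded timings gives ~360.
print(round(n_eq))
```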

Emulsions in HIT
In many multiphase flows, the dispersed phase interacts with the surrounding turbulence induced through large-scale stirring mechanisms. While turbulence introduces a vast range of scales, usually spanning several orders of magnitude, the presence of an interface introduces further complexity, offering alternative paths for energy transfer across scales and generating poly-dispersed droplet/bubble distributions. Resolving the interplay between all these mechanisms leads to an extremely complex scenario to simulate numerically. Even in the simplified conditions represented by Homogeneous and Isotropic Turbulence (HIT), the number of grid points can rapidly exceed N ≥ 1024^3, which is quite challenging for multiphase flows. Furthermore, variations of density and viscosity, as well as of the surface tension coefficient, may introduce smaller scales, since a lower viscosity in the dispersed phase may accelerate vortices.

Table 3: Physical dimensionless parameters of the present configuration.

Re_λ    We_L    V      L    ρ_1/ρ_2    µ_1/µ_2
137     42.6    0.06   π    1          1

Here Re_λ = u_rms λ/ν_2 and We_L = ρ_1 u_rms^2 L/σ are the Taylor-microscale Reynolds number and the large-scale Weber number, L is the large scale at which turbulence is forced, and V is the volume fraction (ratio between the volume occupied by the dispersed phase and the total volume). The simulation is performed at matching density and viscosity (i.e., ρ_1/ρ_2 = µ_1/µ_2 = 1).

[Figure 10 caption, panel (b): one-dimensional turbulent kinetic energy spectra, comparing the single-phase and multiphase (i.e. emulsion) cases; the −5/3 law for the inertial range holds for almost a decade.]

Fully developed turbulence is usually characterized through the Taylor-microscale Reynolds number Re_λ = u_rms λ/ν, where λ is the Taylor scale and u_rms is the root-mean-square velocity fluctuation. In multiphase flow simulations, high turbulence intensities are rarely reached, as typical values are Re_λ ≲ 100. In this example, we present a simulation of a turbulent emulsion at Re_λ ≈ 137, performed on a grid of 512^3 points in a cube of side length 2π, with the main physical parameters reported in table 3. Turbulence is sustained at the large scale using the Arnold-Beltrami-Childress forcing (see [66,67]) throughout the whole simulation. Figure 9 shows a rendering at statistically-stationary state. Turbulence is first simulated in single phase; when statistically-stationary conditions are reached, the dispersed phase is introduced through a random distribution of droplets and left to develop until convergence is reached. This is monitored in terms of the droplet-size distribution and the one-dimensional spectra, see figure 10. Despite its simplicity, HIT offers a relevant framework to study multiphase turbulence. To reach properly converged statistics, several large-eddy turnovers are required, corresponding to ∼ 10^7 time steps. Hence, GPU acceleration will be invaluable to reach fully-developed turbulent conditions in future studies.
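For reference, the classical Arnold-Beltrami-Childress (ABC) velocity field has the following form (a minimal sketch of the standard textbook variant; the precise forcing implementation in FluTAS follows [66,67]):

```python
# Sketch (assumed classical form, not FluTAS source): the ABC field used to
# sustain turbulence at the largest scale (wavenumber k = 1) of the 2*pi box.
import math

def abc_velocity(x, y, z, A=1.0, B=1.0, C=1.0):
    """Classical ABC field; a Beltrami flow, i.e. curl(u) = u for k = 1."""
    u = A * math.sin(z) + C * math.cos(y)
    v = B * math.sin(x) + A * math.cos(z)
    w = C * math.sin(y) + B * math.cos(x)
    return u, v, w

# Because curl(u) equals u itself, forcing with this field injects energy
# only at the largest scale, leaving the cascade to develop naturally.
print(abc_velocity(0.0, 0.0, 0.0))
```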

Two-layer Rayleigh-Bénard convection
Rayleigh-Bénard convection is the flow developed inside a fluid layer that is heated from below and cooled from above. It is driven by the density differences that arise from the temperature variation inside the fluid. Even though it is a seemingly simple configuration, it encapsulates rich physics encountered in a range of engineering applications and physical phenomena. Beyond the classical setting, the study of the two-layer Rayleigh-Bénard variant is crucial from both a fundamental and an applied standpoint. First, regardless of the application, there is always some dissolved gas in every liquid; therefore, it is inevitable that a gaseous phase will form in any realistic natural convection flow. Second, physical phenomena such as convection in the Earth's mantle [70] or engineering applications such as heat transfer inside magnetic confinement systems in fusion reactors [71] are more accurately modelled as two-layer convection, where the two fluid layers are dynamically coupled. Figure 11 shows a two-dimensional schematic representation of the domain used for the numerical simulations, as previously used in [72]. The bottom and top walls are modelled as solid isothermal surfaces at constant temperatures of 328 K and 318 K, respectively. The x- and y-directions are considered periodic, and the aspect ratio between the horizontal and vertical dimensions of the cavity is Γ = L_x/L_z = L_y/L_z = 2. The dimensionless parameters adopted are shown in Table 4. The property ratios between the two fluids are set equal to 1 (kinematic viscosity ν, thermal diffusivity α, specific heat c_p and thermal expansion coefficient β), except for the density ratio λ_ρ = ρ_2/ρ_1 = 0.1. This mismatch in densities is the reason behind the arrangement of the fluids in a two-layer configuration.
Preliminary simulations revealed that a grid of 1024 × 1024 × 512 cells and a CFL number of 0.50 were adequate to obtain grid- and time-step-independent solutions. Figure 12 shows instantaneous temperature fields in the x-y plane for the three Rayleigh numbers considered. With increasing Rayleigh number, the thermal structures in the cavity become finer, indicating increased turbulent activity. This increased thermal agitation is not enough to induce any significant interface movement, even at the highest Rayleigh number considered. Furthermore, in all three cases, the temperature drop across the bottom fluid layer is much smaller than across the top layer. This is explained by the fact that the two layers have the same thermal diffusivity and specific heat but different densities, leading to a top layer with a smaller thermal conductivity (k = α ρ c_p) than the bottom layer. Therefore, the top layer conducts heat less effectively than the bottom one, which explains the larger temperature gradients there. Focusing on the highest Rayleigh number, figure 13 shows instantaneous temperature contours for Ra = 10^8. As in classical Rayleigh-Bénard convection, hot and cold plumes are ejected from the bottom and top walls. In the presence of a single fluid layer, these plumes would typically organize into large-scale circulation structures extending from the bottom to the top wall. Here, the existence of two fluid layers changes the classical picture: the interface acts as a barrier, confining the thermal plumes within each fluid layer. The interface also acts as a thermal conductor, promoting the exchange of heat between the two layers. More specifically, the hot plumes ascending from the bottom wall are cooled when reaching the interface, forming colder plumes that travel downwards. The opposite occurs in the top layer, where the descending cold plumes are heated at the interface and hotter plumes emerge from it, travelling upwards.
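The conductivity argument above reduces to a two-line computation (the property values below are illustrative placeholders, not the simulation inputs; only the ratios matter):

```python
# Sketch of the conductivity argument: with matched alpha and c_p but a density
# ratio lambda_rho = 0.1, the top (lighter) layer conducts heat ten times worse,
# so the larger temperature drop concentrates there. Values are illustrative.

def conductivity(alpha, rho, cp):
    """Thermal conductivity k = alpha * rho * c_p."""
    return alpha * rho * cp

alpha, cp = 1.0e-7, 4.0e3            # shared by both fluids (illustrative units)
rho_bottom, rho_top = 1000.0, 100.0  # density ratio lambda_rho = 0.1

k_bottom = conductivity(alpha, rho_bottom, cp)
k_top = conductivity(alpha, rho_top, cp)
print(k_top / k_bottom)  # one tenth: the top layer sustains the larger gradient
```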
Figure 13 clearly illustrates this behavior, revealing the existence of regions dominated by ascending plumes and regions dominated by descending plumes, hinting towards the organization of the flow in three-dimensional convective cells.

Conclusions and further developments
We present the code FluTAS, a numerical framework tailored for direct numerical simulations of multiphase flows, with the option of heat transfer, able to run efficiently both on standard CPU-based architectures and on GPU-accelerated machines. The open-source version, released under the MIT license, includes a pressure-correction algorithm for two-phase flows extended with an algebraic Volume-of-Fluid method (MTHINC) for capturing the interface dynamics. We provide here a description of the employed numerical algorithm, with details on the solution of the governing equations and on the advection of the interface. After presenting different validation benchmarks in both single- and multiphase configurations, we discuss the code performance focusing on two aspects: i) its current limitation when communications among GPUs on different nodes are considerably less efficient than communications among GPUs within the same node; ii) its advantages compared to CPUs in terms of "time-to-solution". Finally, we report results from two configurations of fundamental interest in multiphase turbulence: emulsions in homogeneous isotropic turbulence and two-layer Rayleigh-Bénard convection. In the future, we aim to improve the code maintainability and portability (both on CPU and GPU) and to release additional modules currently under development, e.g. weak compressibility and phase change [43,44]. Further efforts will be devoted to enhancing the code performance on multiple GPU nodes, reducing the current communication bottlenecks. To this end, a promising strategy is the one proposed in [73], i.e., implementing the solution of the tridiagonal system along the third direction in a distributed-memory fashion. The main advantage of this approach is the elimination of all-to-all operations in the Poisson solver.
This improvement, combined with future enhancements in the software frameworks for collective data communications, will allow tackling several multiphase problems while keeping an efficient use of the computational resources.
where x̃_R = (ỹ, z̃, x̃), while the components of the flag vectors (equal to 0 or 1) c and c_R are discussed in Appendix A.1.2. Accordingly, to determine T, one just needs to compute the components of the vectors Q_{T,i=1,2} and L_T. This can be done by first imposing that the first-order and second-order gradients of T, evaluated for x̃ = x̃_c, are equal to the normal vector n and the curvature tensor K. Expanding equation (A.1) and equations (A.2) leads, after some manipulation, to a unique expression for the components of the vectors Q_{T,i=1,2} and L_T. Note that to recover a linear reconstruction, the curvature tensor is set identically to zero, and thus so are the components of the vectors Q_{T,i=1,2}. The interface normal is computed from m = ∇φ = (m_x, m_y, m_z). Following Youngs' method [74,75], the three components m_x, m_y and m_z (the partial derivatives of φ in each direction) are computed by first evaluating the derivatives at the cell corners; each corner value is then averaged to find ∂φ/∂x, ∂φ/∂y and ∂φ/∂z. Once n is known, the curvature tensor components are computed using (A.7), whereas the geometrical curvature is derived from the sum of the three diagonal components of K, i.e., κ = −(K_xx + K_yy + K_zz).
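The corner-averaging step of Youngs' method can be sketched as follows (a minimal NumPy illustration with a hypothetical `youngs_normal` helper, not FluTAS source):

```python
# Sketch of Youngs' gradient estimate: the derivative of the VoF field phi is
# evaluated at the eight corners of a cell (as differences of 2x2 face averages)
# and the corner values are then averaged back to the cell centre.
import numpy as np

def youngs_normal(phi, i, j, k, h=1.0):
    """Unit normal m = grad(phi)/|grad(phi)| at cell (i, j, k)."""
    grad = np.zeros(3)
    for di in (0, 1):            # loop over the 8 corners around the centre
        for dj in (0, 1):
            for dk in (0, 1):
                ii, jj, kk = i + di, j + dj, k + dk
                # corner derivative = difference of the 4-cell face averages
                gx = (phi[ii, jj-1:jj+1, kk-1:kk+1].mean()
                      - phi[ii-1, jj-1:jj+1, kk-1:kk+1].mean()) / h
                gy = (phi[ii-1:ii+1, jj, kk-1:kk+1].mean()
                      - phi[ii-1:ii+1, jj-1, kk-1:kk+1].mean()) / h
                gz = (phi[ii-1:ii+1, jj-1:jj+1, kk].mean()
                      - phi[ii-1:ii+1, jj-1:jj+1, kk-1].mean()) / h
                grad += np.array([gx, gy, gz]) / 8.0   # corner-to-centre average
    n = np.linalg.norm(grad)
    return grad / n if n > 0 else grad

# Sanity check: a planar "interface" phi = x recovers the normal (1, 0, 0).
phi_plane = np.fromfunction(lambda i, j, k: i, (5, 5, 5))
n_plane = youngs_normal(phi_plane, 2, 2, 2)
print(n_plane)
```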
Appendix A.1.2. Calculation of normalization parameter d_th
Once T is known, the last step to obtain H(x) is to compute the normalization parameter d_th. To this purpose, we impose volume conservation at the discrete level by requiring that the volume integral of Ĥ over the local grid cell equals the VoF field in that cell. Using the normalized Cartesian coordinates, this results in:

I ≡ ∫₀¹ ∫₀¹ ∫₀¹ Ĥ(x̃) dx̃ dỹ dz̃ = φ.    (A.8)

As remarked in [1], an exact integration of (A.8) over the whole cell is not possible, while it exists for a one-dimensional integration. Taking for example the x̃ direction for the exact integration, we get:

∫₀¹ Ĥ(x̃) dx̃ = (1/2) [ x̃ + log(cosh(β_th (T(x̃, ỹ, z̃) + d_th))) / (β_th ∂T/∂x̃) ]₀¹ .    (A.9)

Note that in the integration of (A.9) we assume that the derivative of T with respect to x̃ depends only on ỹ and z̃. This is achieved by setting c = [0, 1, 1] and c_R = [1, 1, 0]. In the other two directions a numerical integration must be performed, and in this work we employ a two-point Gaussian quadrature. Therefore, the integral I in equation (A.8) becomes:

I = (1/8) Σ_{p=1}^{4} [ x̃ + log(cosh(β_th (T(x̃, r_p(p), r_m(p)) + d_th))) / (β_th a_x) ]₀¹ ,

where r_p = 1/2 + (√3/6)[−1, +1, −1, +1] and r_m = 1/2 + (√3/6)[−1, −1, +1, +1] are the two-point Gauss quadrature nodes on the unit interval. In general, the direction along which the exact integration is performed cannot be decided a priori; the criterion for this choice is based on the magnitude of the normal vector components and, therefore, three cases are possible.
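The closed-form one-dimensional integral can be checked numerically for a profile with T linear in x̃, which is the situation the flag vectors enforce along the exact-integration direction (the coefficients below are illustrative, not taken from a simulation):

```python
# Numerical check of the exact 1D integral of the tanh indicator function
# H(x) = (1 + tanh(beta*(a*x + b + d)))/2, i.e. T linear along the direction
# of exact integration. Illustrative coefficients.
import math

beta, a, b, d = 2.0, 1.3, -0.4, 0.1

def H(x):
    return 0.5 * (1.0 + math.tanh(beta * (a * x + b + d)))

def exact_integral():
    """(1/2)*[x + log(cosh(beta*(a*x + b + d)))/(beta*a)] evaluated on [0, 1]."""
    F = lambda x: 0.5 * (x + math.log(math.cosh(beta * (a * x + b + d)))
                         / (beta * a))
    return F(1.0) - F(0.0)

def midpoint_rule(n=20000):
    """Brute-force midpoint quadrature for comparison."""
    h = 1.0 / n
    return sum(H((i + 0.5) * h) for i in range(n)) * h

print(abs(exact_integral() - midpoint_rule()))  # agreement up to quadrature error
```

The antiderivative works because d/dx [log(cosh(g(x)))] = g'(x) tanh(g(x)), which reproduces the tanh term of Ĥ exactly.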

Case 1
If |n_x| ≥ max(|n_y|, |n_z|), the exact integration is performed along x̃ and we set c = [0, 1, 1] and c_R = [1, 1, 0] in equation (A.1). Therefore, I becomes:

I = (1/8) Σ_{p=1}^{4} [ x̃ + log(cosh(β_th (T(x̃, r_p(p), r_m(p)) + d_th))) / (β_th a_x) ]₀¹ ,    (A.14)

where a_L^p is a_L^x, a_L^y or a_L^z according to the three cases above. Finally, equation (A.14) can be further recast into a more convenient quartic equation (A.15) in the unknown D = exp(2 β_th d_th), whose coefficients involve the constants A, B_±, C_± and Q, given by:

A = exp(2 β_th a_L^p), B_± = exp(2 β_th a_v · r_p,B), C_± = exp(2 β_th a_v · r_p,C), Q = exp(4 β_th a_L^p (2φ − 1)).
(A.16)
The coefficients a_L^p and the expressions for a_v, r_p,B and r_p,C are reported in tables A.5 and A.6, differentiated among the three cases previously described. Once the coefficients α_{i=1,...,4} of the quartic equation (A.15) are computed, a solution for D can be found. In the current work, we adopt the approach proposed in [1]: instead of computing all four complex roots, we look only for the real and positive solution fulfilling the constraint given by equation (A.8), obtained by rewriting equation (A.15) in factored form. Below, we report the overall VoF algorithm in pseudocode 2. As a final remark, note that the entire algorithm has been described assuming that the first directional split is always oriented along x (i.e., x → y → z). Nevertheless, this choice proves to be only first-order accurate in time. To improve the time accuracy of the solution, one possibility is to alternate the splitting direction, as suggested in [76].
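The alternation of the splitting direction can be sketched as follows (a hypothetical driver, not FluTAS source; the cyclic rotation shown is one simple choice among the alternation patterns of [76]):

```python
# Sketch (hypothetical driver): directionally split VoF advection where the
# sweep order rotates between time steps, avoiding the persistent bias of a
# fixed x -> y -> z ordering (which is only first-order accurate in time).
from itertools import cycle

SWEEP_ORDERS = cycle([("x", "y", "z"), ("y", "z", "x"), ("z", "x", "y")])

def advect_vof_step(phi, advect_1d, velocities):
    """One time step of split advection; returns the field and the order used."""
    order = next(SWEEP_ORDERS)
    for axis in order:
        phi = advect_1d(phi, velocities[axis], axis)  # 1D sweep along `axis`
    return phi, order

# Dummy 1D sweep used only to demonstrate the rotation of the ordering.
identity_sweep = lambda phi, u, axis: phi
vel = {"x": None, "y": None, "z": None}

_, o1 = advect_vof_step(0.5, identity_sweep, vel)
_, o2 = advect_vof_step(0.5, identity_sweep, vel)
print(o1, o2)  # ('x', 'y', 'z') ('y', 'z', 'x')
```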