Parallelization of a stochastic Euler-Lagrange model applied to large scale dense bubbly flows

A parallel and scalable stochastic Direct Simulation Monte Carlo (DSMC) method applied to large-scale dense bubbly flows is reported in this paper. The DSMC method is applied to speed up the bubble-bubble collision handling relative to the Discrete Bubble Model proposed by Darmana et al. (2006) [1]. The DSMC algorithm has been modified and extended to account for bubble-bubble interactions arising due to uncorrelated and correlated bubble velocities. The algorithm is fully coupled with an in-house CFD code and parallelized using the MPI framework. The model is verified and validated on multiple cores with different test cases, ranging from impinging particle streams to laboratory-scale bubble columns. The parallel performance is shown using two different large-scale systems: one with a uniform and one with a non-uniform distribution of bubbles. The hydrodynamics of a pilot-scale bubble column is analyzed and the effect of the column scale is reported via a comparison of bubble columns at three different scales.


Introduction
Bubbly flows are among the most complex multi-scale multiphase flow problems encountered across many fields of physics and engineering. Small-scale physical changes in the properties of the different phases can have a huge impact on the large-scale physics of these systems. Such flows are studied at the DNS (Direct Numerical Simulation) level for representative systems to derive closures for the interaction forces used in simulations executed at larger length and time scales. Methods such as the Discrete Bubble Model (DBM) and the stochastic Euler-Lagrange (E-L) models (among which the DSMC methods) are examples of classes of methods where these closures are applied. These models (DBM/DSMC) have been extensively used across many length and time scales for the simulation of multi-phase flows and have proven to provide good results on relatively coarse grids compared to fully resolved DNS, while remaining more accurate than the further coarse-grained Euler-Euler (E-E) models.
However, due to the large number of individually tracked objects in large-scale systems, the application of Lagrangian Discrete Methods (LDM) in multi-phase flows is complicated by the high computational effort and memory requirements. These limitations can be overcome with effective parallelization of the numerical code. (Table 1 lists recent publications on parallel implementations of Lagrangian methods.)
The main drawback of MPI-based codes is the need to maintain an optimal load balance between distributed processes, which is often hard or even impossible to achieve in a real case scenario. For instance, in the Euler-Lagrange (E-L) framework, a non-even distribution of the dispersed phase may lead to a reduction of the parallel performance or even idling of some processes during execution of the Lagrangian part. This problem is mainly caused by the static domain decomposition based on the Eulerian grid. The obvious solution is to use a separate dynamic decomposition for the Lagrangian part. The coupling between the two numerical domains can be achieved by means of a map, which must be reconstructed whenever the Lagrangian domain decomposition is updated. This approach was successfully applied by Pozzetti et al. [24], who reported an excellent parallel efficiency of the E-L framework up to 1400 cores. However, in a uniformly distributed system, this approach may also lead to extra communications and, therefore, to a decrease in parallel efficiency.
The computational and parallel efficiency of LDMs highly depends on the type of collision detection method employed in the algorithm. Particle-based simulations, which often utilize a soft-sphere or a force-based collision model, employ a single loop-based algorithm, thus leading to optimal parallel performance on any platform, as shown by Niethammer et al. [17], Pohl et al. [23] and Tian et al. [5], who performed simulations of systems containing 10^12, 10^9 and 10^8 particles, respectively. Most of these works are based on the decomposition of only the Lagrangian entities in the domain, which can easily incorporate load balancing with the latest state-of-the-art algorithms.
Historically, bubble-based simulations have often employed a hard-sphere model, which is an event-driven algorithm that requires calculation of the shortest collision time between all possible pairs of bubbles. The efficient parallelization of the DBM on distributed memory platforms is impossible due to the necessity of continuous synchronization between all processes, while shared memory platforms limit the number of bubbles that can be simulated. Several recent works have also reported the use of soft-sphere models for bubbly flow simulations. However, these soft-sphere models generally require the determination of a spring constant, which defines the amount of overlap that is allowed. This is challenging, since a high spring constant leads to very small time steps, while a too low spring constant leads to significant over-packing. To avoid spurious velocities in dense systems modelled with a soft-sphere model, very high spring constants and, therefore, extremely small time steps are required to resolve the collisions sufficiently accurately. Therefore, the number of works conducted on the parallelization of bubbly flows at this scale is very limited.
Darmana et al. [25] simulated O(10^5) bubbles using a mirror domain technique for the Discrete Bubble Model (DBM) combined with an MPI-based CFD code. These authors reported only 50% parallel efficiency on 32 cores. Ma et al. [13] simulated a bubble column with 6.5·10^4 bubbles using a DBM code parallelized with OpenMP and reported 99% parallel efficiency on 12 threads. Sungkorn et al. [26] have reported a 40 m^3 stirred bio-reactor with about 15.6 million bubbles using an LBM-based flow solver, with the bubble phase being solved on GPUs. This approach used a ghost-particle technique developed by Sommerfeld et al. [27]. However, the volume displacement of the bubbles is not accounted for in the flow solver via the volume fraction term and is only coupled through the volumetric source term in the momentum equations. Recently, Kamath et al. [28] have reported a DSMC-based stochastic Euler-Lagrange method which accounts for the volume displacement of the bubbles in the Eulerian phase and also has the potential to be parallelized for large-scale bubbly flow simulations. (Table 2 lists the closures for the different types of forces acting on bubbles.)

This paper presents an efficient, highly scalable parallel approach that allows one to simulate dense bubbly flows in a Lagrangian formulation using the MPI framework. The bubble collisions are resolved by means of Direct Simulation Monte Carlo (DSMC), proposed earlier by Kamath et al. [28]. To the authors' knowledge, this is the first time a DSMC-based method is used to simulate dense bubbly flows at this scale. The DSMC method operates with a collision probability that can be calculated locally; thus the method does not require global synchronizations and demonstrates high parallel efficiency. As a result, the fully coupled CFD and DSMC code can be applied to simulations of large-scale bubble columns with more than 10^6 bubbles.
The paper is structured as follows. Section 2 describes the DSMC method applied to bubbles and the governing equations for the liquid phase. In section 3, the numerical approach for both phases and their coupling is discussed. The parallelization strategy for the gas and liquid phases is described in section 4. In section 5, the verification and validation of the parallel code is reported. Section 6 is dedicated to the investigation of the parallel performance. In section 7.1 an example of a simulation of a large scale bubble column is shown. Finally, the conclusions are presented in section 8.

Discrete phase
The discrete phase or bubble phase is solved in a Lagrangian framework where the bubble motion is governed by Newton's equation of motion (equation (1)). The interaction forces included in this model are the gravitational force (F_G), the pressure force and the other interfacial forces appearing in equation (2). The relevant closures used for these forces can be found in Table 2.
The DSMC methods generally employ a parcel approach, where a group of discrete particles of the same size and velocity is represented by one simulated particle. The resolution of the DSMC method is also governed by the number of discrete phase entities per parcel. In the current work, the parcel size for all simulations is taken as 1 to obtain an accurate representation of the Lagrangian phase. The extension to larger parcel sizes requires closures determined with careful consideration of the different forces acting on the bubbles as a swarm, as well as mapping functions for the calculation of the porosity. This is caused by the non-linear trajectory of the bubbles due to the coupling with the liquid; the parcel growth cannot be assumed to be isotropic. Moreover, parcels of similar size should be able to break up and coalesce based on the fluid-parcel interactions and the induced turbulence in the liquid. This behaviour is complex due to the four-way coupling, and this aspect will be considered in our future work.
The main goal of the DSMC method is to predict the frequency and probability of collisions of particles. This stochastic approach is much more efficient than deterministic collision models. The method was initially developed for and applied to molecular systems that have a finite/large Knudsen number with no external force fields affecting the position and velocity of the parcels [32]. The modified Nanbu method is a DSMC-type method that was used, with a few alterations, by Pawar et al. [33] to simulate small droplets in spray modelling. Kamath et al. [28] made two modifications to the algorithm reported by Pawar et al. [33]. First, the radial distribution function at contact is added to account for the increase in the collision frequency and the probability of collisions for dense clusters. Second, a nearest neighbour collision check is added, accounting for discrete phase entities with correlated velocities, which are not accounted for by the previously mentioned algorithms.
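As an illustration of how such a stochastic collision step can operate, the sketch below tests one randomly selected candidate per bubble against a kinetic-theory collision probability scaled by the radial distribution function g_ij. All function names and the searching-scope normalization are hypothetical; the exact expressions used by the authors are given in Kamath et al. [28].

```python
import math
import random

def collision_probability(di, dj, vij, nj, gij, dt, vol_scope):
    """Collision probability of bubble i with a candidate j inside the
    searching scope over a collision time step dt (illustrative form:
    cross-section * relative speed * number density correction)."""
    sigma = math.pi / 4.0 * (di + dj) ** 2   # collision cross-section
    return nj * gij * sigma * abs(vij) * dt / vol_scope

def select_collision(i_diam, neighbours, dt, vol_scope, rng=random.random):
    """Modified-Nanbu style step: pick ONE random candidate from the
    searching scope and accept the collision with probability P_ij.
    `neighbours` is a list of (diameter, relative_speed, parcel_size, g_ij).
    Returns the index of the accepted partner, or None."""
    if not neighbours:
        return None
    j = int(rng() * len(neighbours))         # random candidate
    dj, vij, nj, gij = neighbours[j]
    p = collision_probability(i_diam, dj, vij, nj, gij, dt, vol_scope)
    return j if rng() < p else None
```

Because the acceptance test is local to the searching scope, no information about bubbles outside the scope (or outside the sub-domain) is needed, which is what makes the method attractive for distributed-memory parallelization.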

Liquid phase hydrodynamics
The liquid dynamics are resolved by means of the volume-averaged Navier-Stokes equations:

∂/∂t ∫_V ε_l ρ_l dV + ∮_A ε_l ρ_l u_l · n dA = 0    (3)

∂/∂t ∫_V ε_l ρ_l u_l dV + ∮_A ε_l ρ_l u_l u_l · n dA = −∫_V ε_l ∇p dV − ∮_A ε_l τ_l · n dA + ∫_V ε_l ρ_l g dV + ∫_V Φ dV    (4)

where V is the volume of the grid cell, A is the face area of the grid cell, ε_l is the local liquid volume fraction, Φ represents an external source term due to the presence of the gas phase and τ_l is the viscous stress tensor given by:

τ_l = −(μ_L,l + μ_T,l) ( ∇u_l + (∇u_l)^T − (2/3) I (∇ · u_l) )

where μ_L,l and μ_T,l are the dynamic molecular and eddy viscosity, respectively. To close the system, the eddy viscosity is obtained from the subgrid-scale model proposed by Vreman et al. [34]:

μ_T,l = 2.5 ρ_l C_s^2 √( B_β / (α_ij α_ij) )

where α_ij = ∂u_j/∂x_i, β_ij = Δ_m^2 α_mi α_mj, B_β = β_11 β_22 − β_12^2 + β_11 β_33 − β_13^2 + β_22 β_33 − β_23^2, and C_s is the Smagorinsky constant equal to 0.1.

Time marching
The numerical solution of the governing equations for our Euler-Lagrange model is obtained using a time-marching technique similar to the one reported in Darmana et al. [25], where the equations for the discrete and continuous phase are solved sequentially, while accounting for the coupling via the momentum source terms. Three time steps are defined in this model: the flow time step (δt_flow), the Lagrangian phase time step (δt_bub) and the collision time step (δt_coll), such that:

δt_coll ≤ δt_bub = δt_flow / N_i

where N_i is fixed for a given simulation such that the time integration of all interaction forces acting on a given bubble is carried out in an accurate manner. In this work N_i = 20, to maintain sufficient accuracy for both the force integration and the DSMC collisions. δt_coll is calculated on the fly by the DSMC algorithm from the estimated collision frequency for each discrete phase entity, as reported by Kamath et al. [28]. This is explained in more detail in the coming sections. In the DBM, δt_coll is calculated as the time to the next physical collision, with which the whole system is updated.
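The nested time marching described above can be sketched as follows. The callbacks `move_bubbles`, `estimate_dt_coll` and `do_collisions` are hypothetical stand-ins for the force integration and the DSMC collision handling; only the time-step hierarchy is taken from the text.

```python
def march_one_flow_step(dt_flow, n_i, move_bubbles, estimate_dt_coll, do_collisions):
    """One flow time step split into n_i bubble steps; each bubble step is
    covered by adaptive collision sub-steps whose size comes from the
    estimated collision frequency (sketch)."""
    dt_bub = dt_flow / n_i
    for _ in range(n_i):
        move_bubbles(dt_bub)                      # force integration over dt_bub
        t = 0.0
        while t < dt_bub:
            # collision sub-step, never overshooting the bubble step
            dt_coll = min(estimate_dt_coll(), dt_bub - t)
            do_collisions(dt_coll)                # DSMC collision handling
            t += dt_coll
    return dt_bub
```

With N_i = 20, as used in the paper, each flow step triggers 20 force-integration calls and a variable number of collision sub-steps.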

Coupling
DSMC algorithms have come a long way in describing hydrodynamics and heat transfer when implemented at the molecular level or in applications where particle-particle or droplet-droplet collisions are the main physical phenomena occurring in the system of interest [32,35,36]. In those applications, the continuous phase is either absent, weakly coupled or coupled only through momentum-based source terms, which results in a completely collision-driven system. This is not what we encounter in the current system of bubbly flows. Apart from the correct collision frequency, the phase fraction needs to be evaluated for the discretized volume-averaged Navier-Stokes equations (see equations (3) and (4)).
To account for the inter-phase coupling, the gas fraction is mapped between the two grids, as are the volumetric momentum exchange rates between the Eulerian and Lagrangian phases. This is implemented along the lines of Darmana et al. [37], using a polynomial filter function proposed by Deen et al. [38].
Naturally, one would expect an effect of changing the filter width on the predicted time-averaged axial velocity profiles. Fortunately, this was found to be negligible by Lau et al. [39] when a correction to the drag force based on the local void fraction is employed. The filter width is responsible for maintaining a smooth variation of ε_l and of the mapped source terms when a bubble crosses different node points in the Eulerian grid.
Two things have to be monitored and applied with caution in the coupling of the DSMC and the CFD: a) the correction factor of the drag force due to swarm effects, since these factors are well defined for low and intermediate void fractions but not for the dense limit; b) the phase fraction, particularly in the dense regions of the flow, since large overlaps will lead to a singularity in the volume-averaged Navier-Stokes equations. To counter the over-packing, additional measures have been taken in the collision algorithm.

Discrete phase dynamics and collision algorithm
The bubble phase equation of motion (equation (1)) is solved using explicit schemes. The force integration has been tested with a fourth-order explicit Runge-Kutta scheme and a first-order explicit Euler scheme. The first-order scheme is more computationally efficient when the time-step constraint is due to the collisions. Both schemes were checked on a single bubble rise problem (see section 5.2) over many time steps until the terminal rise velocity is established.
The bubbles are moved in time using the time-marching technique described in section 3.1. During each bubble time step, the forces (using equation (9)) are calculated at the discrete bubble locations using the quantities mapped from the Eulerian grid cells via the polynomial filter described in section 3.2. Due to Newton's third law of motion, a reaction force is collected to be mapped onto the momentum nodes in the Eulerian framework and applied at the end of the bubble time step. The collision sequence is then initiated according to Algorithm 3.1. This is repeated until the discrete phase has advanced over a full δt_flow. The forces collected in the volumetric source term are time-averaged over the flow time step δt_flow, since they are calculated multiple times within one flow time step. The weights for the averaging are based on the ratio δt_bub/δt_flow to achieve the correct momentum balance. With the new bubble positions, the volume fraction in each Eulerian cell is calculated using the same polynomial mapping function.
In any particle-based technique, the two major steps are the force calculation step and the collision detection step. In a hard-sphere collision model, such as the DSMC, the force calculation and the collision detection loops are separated. Typically, the most expensive step is the collision detection step, especially when done naively, leading to a time scaling of the order of O(N^2), where N is the total number of discrete phase entities in the system. The application of cell lists and neighbour lists reduces the number of comparisons considerably and reduces the time scaling to O(pN), where the factor p depends on the size of the searching scope or the cut-off distance (r_cut). Even then, populating these data structures takes considerable execution time compared to the remaining part of the algorithm. A Verlet list further reduces the number of times the above-mentioned data structures need to be populated or refreshed, depending on the shell distance R_shell. This has been implemented and is utilized in this work for evaluating the neighbours in the searching scopes of different bubbles.
The modified DSMC collision detection step is detailed in Algorithm 3.1. The main additions relative to the existing algorithm proposed by Pawar et al. [33] are: a) an explicit treatment of nearest neighbour collisions to account for correlated and uncorrelated discrete phase velocities, and b) the incorporation of the radial distribution function g_ij to account for the increase in collision frequency in dense or clustered arrangements of particles in 3D space [40]. A more detailed explanation can be found in the work of Kamath et al. [28]. A speed-up of the DSMC algorithm over the deterministic DBM of Darmana et al. [25] has been reported by Kamath et al. [28], with minimal loss of accuracy for different collision regimes. Moreover, the reason for the speed-up is stated clearly, and the serial computational times for a DEM time step for varying numbers of bubbles and void fractions have also been reported. In this work, an MPI-parallelized version of the same algorithm is presented.
The term g_ij is given by Ma and Ahmadi [41] for a mono-disperse system (see equation (10)) and was extended to poly-disperse systems by Santos et al. [42] (see equation (11)). In equation (11), d̄ represents the Sauter mean diameter of the particles within the searching scope and ε_p is the average solids/bubble volume fraction in the neighbourhood of particle i.

In Algorithm 3.1, a bubble/particle id i ∈ X is chosen, where X = {n : n is a positive integer and n ∈ N_tot} and N_tot represents the total number of particles in the domain/sub-domain. While t < δt_bub, the collision frequency f_i for particle i is calculated from the relative velocity v_ij between particles/bubbles i and j, the diameter d, the parcel size n_j, the searching scope size R_s,i for particle/bubble i and the radial distribution function at contact g_ij for discrete entity i with particle type j.

Fluid phase numerical scheme
All numerical simulations in the present work are performed using FoxBerry, an in-house developed framework for solving unified transport equations. The discretization of the governing equations (3) and (4) is done on a structured grid with a staggered arrangement of variables using the Finite Volume method. The SIMPLE algorithm is used for the pressure-velocity coupling, as described by Patankar [43]. All convective fluxes are discretized with the second-order Barton scheme [44] and all diffusive fluxes are discretized with the second-order central difference scheme. To keep the numerical stencil compact, a deferred correction method is applied to the convective fluxes, resulting in a 7-diagonal matrix in a 3D case. The time integration is carried out using the first-order backward Euler method. All discretized terms are treated implicitly, except the source terms originating from the discrete phase, which are treated explicitly.
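The deferred correction mentioned above can be illustrated for a 1D convective term. In this sketch a central flux stands in for the Barton scheme (which is not reproduced here): the implicit matrix keeps the compact first-order upwind stencil, and the difference between the higher-order and the upwind flux, evaluated with the previous iterate, is moved to the right-hand side.

```python
import numpy as np

def deferred_correction_rhs(phi, u, dx):
    """Explicit deferred-correction contribution for a 1D convective term
    with positive velocity u. `phi` is the previous iterate on cell centres;
    faces lie between consecutive cells. The returned array is added to the
    RHS while the matrix keeps the upwind stencil (illustrative sketch)."""
    n = len(phi)
    rhs = np.zeros(n)
    for f in range(1, n):                        # interior faces
        upwind = phi[f - 1]                      # first-order upwind face value
        central = 0.5 * (phi[f - 1] + phi[f])    # second-order face value
        corr = u * (central - upwind)            # explicit correction flux
        rhs[f - 1] += corr / dx                  # flux leaving cell f-1
        rhs[f] -= corr / dx                      # flux entering cell f
    return rhs
```

For a uniform field the correction vanishes, so the converged solution carries the accuracy of the higher-order flux while the matrix bandwidth stays that of the upwind scheme.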
All linear systems are solved by means of an in-house BiCGStab(2) method [45] and the solution of the Pressure Poisson Equation (PPE) is additionally accelerated with the multilevel ML solver [46] from the Trilinos library [47].

Domain decomposition
To obtain the optimal parallel performance, the developed code utilizes a full 3D decomposition instead of the often used pencil or stripe decompositions. This allows for reduction of the sub-domain interface area A proportionally to its volume V, which helps to decrease the communication message size between sub-domains and, thus, the communication time.
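The benefit of a full 3D decomposition over a stripe decomposition can be quantified by counting interface cells, a proxy for the communication message size. The helper below is illustrative and assumes a perfectly divisible grid with face-only (no edge/corner) exchanges.

```python
def halo_cells(nx, ny, nz, px, py, pz):
    """Number of interface cells per sub-domain for an nx*ny*nz grid split
    into px*py*pz blocks (sketch: two faces per decomposed direction)."""
    sx, sy, sz = nx // px, ny // py, nz // pz    # sub-domain extents
    faces = 0
    if px > 1: faces += 2 * sy * sz
    if py > 1: faces += 2 * sx * sz
    if pz > 1: faces += 2 * sx * sy
    return faces

# On a 128^3 grid with 64 processes, a 64x1x1 stripe decomposition exchanges
# 32768 cells per sub-domain, while a 4x4x4 decomposition exchanges 6144.
```

The 3D split keeps the interface-to-volume ratio low, which is exactly the effect described above.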

Parallelization strategy fluid solver
Due to the staggered arrangement of variables, two adjacent sub-domains, 1 and 2, share one layer of momentum cells (see Fig. 2), which results in the duplication of matrix rows during the assembly of the momentum matrix. To preserve the initial size of the distributed matrix and to limit the potential increase in the solution time and in the number of iterations of the linear solvers, the rule of the "positive owner" is applied: all shared momentum points on the positive (right) side of sub-domain 1 are considered as an extra layer of halo cells, while their physical representation belongs to sub-domain 2.
The application of the "positive owner" rule results in an extra communication step during the assembly of the Pressure Poisson Equation (PPE), caused by the direct dependence of coefficients in the PPE matrix on the central coefficients of the momentum matrices [43].
To simplify the access to this data, every sub-domain performs a search in its neighbouring sub-domains to determine the ownership of the required elements. This operation is performed once and its result is stored in the form of a hash table, which maps the locally missing and remotely detected elements with the information on corresponding remote process ID. The constructed hash table is used at every iteration of the SIMPLE algorithm to obtain the off-process data and to populate the local vectors of the diagonal elements from the momentum matrices.
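A minimal sketch of such a one-time ownership search is given below, assuming hypothetical containers for the locally missing element ids and for the sets of ids owned by each neighbouring rank; the stored table is then reused at every SIMPLE iteration.

```python
def build_ownership_map(missing_ids, neighbour_owned):
    """One-time search: for every locally missing matrix element, find which
    neighbouring process owns it and record `element id -> remote rank`.
    `neighbour_owned` maps a neighbour rank to the set of element ids it
    owns (illustrative data layout)."""
    table = {}
    for eid in missing_ids:
        for rank, owned in neighbour_owned.items():
            if eid in owned:
                table[eid] = rank
                break                      # first (unique) owner found
    return table
```

At every SIMPLE iteration the solver can then issue targeted requests per remote rank instead of repeating the search, which is the point of caching the result in a hash table.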

Parallelization strategy DSMC
The parallelization strategy for the DSMC part follows the same domain decomposition technique as the fluid solver mentioned above. Generally, particle-based simulations follow their own domain decomposition, as described in Abraham et al. [48]. Such a decomposition can be uniform or non-uniform, based on the dynamics of the problem itself. Darmana et al. [25] have reported a mirror domain technique to decompose the domain. The advantage of such a method lies in its load balancing, due to an even distribution of bubbles across all domains. The parallel efficiency of such a technique lasts until a maximum within 50 processors, as shown in the same work [25]. As the system becomes larger, the amount of communications for such a technique will grow due to timely synchronizations among processes, and as such the parallel efficiency drops quite quickly.

Algorithm 4.1 The parallel discrete phase algorithm executed in every bubble time step.
2: Set t_bub = 0.
3: Calculate interface forces and velocity of bubbles ∀ i ∈ N_tot
4: Bubble/particle removal from the sub-domains/domain
5: Bubble/particle transport
6: Bubble/particle introduction into the sub-domains/domain
7: if neighbour list rebuild = true then (see operation 19 in Algorithm 3.1)
8:     Refresh cell-linked lists and neighbour lists ∀ i ∈ N_tot
9: else
10:    Do not update cell-linked lists and neighbour lists
11: end if
12: Transfer real bubble data to neighbouring halo cells
13: Calculate the volume of the bubbles in the different cells and map them to the Eulerian sub-domain.
15: Apply data reduction from the halo cells of the current sub-domain to the real cells of the neighbouring sub-domain for the bubble volumes and the momentum source terms.
16: end for
Another important disadvantage of the mirror domain technique is the memory requirement. This becomes especially apparent when the problem size grows to 10^6 − 10^7 bubbles in the system or even more. Every process should hold enough memory to maintain at least 7 doubles for the position and velocity vectors and the bubble size, as well as memory for the complex data structures, which include the cell-linked lists and neighbour lists.
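A back-of-the-envelope estimate makes this concrete, using the 7 doubles per bubble mentioned above and ignoring the list data structures (the function name and unit choice are illustrative):

```python
def mirror_domain_memory_gb(n_bubbles, doubles_per_bubble=7, bytes_per_double=8):
    """Per-process memory (GB) for the bare bubble state under the
    mirror-domain technique, which replicates every bubble on every
    process. Rough estimate: excludes cell-linked and neighbour lists."""
    return n_bubbles * doubles_per_bubble * bytes_per_double / 1e9

# 10^7 bubbles -> 0.56 GB of bare state replicated on EVERY process,
# before any cell-linked list, neighbour list or halo storage is counted.
```

Under a domain decomposition, by contrast, each process stores only its own sub-domain's share of the bubbles plus a thin halo layer.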

Algorithm
The overall discrete phase algorithm is described by the combination of Algorithms 4.1 and 3.1. The DSMC algorithm uses the domain decomposition of the Eulerian part itself, to avoid extra overhead due to communication of the Eulerian data required for the force calculation. Moreover, the two communication steps mentioned in Algorithm 4.1 take place only between neighbouring domain processes. An additional performance gain comes from the fact that the local DSMC processes can run almost independently of each other in the sub-domains, unlike the mirror-domain technique employed for the discrete bubble model (see [25]), where the smallest collision time step across all domains has to be determined and broadcast to all processors.

Data structures and flagging
Cell linked lists and Verlet lists are very well known for collision detection in several particle based techniques [49]. They are always used to minimize the number of particle-pair comparisons to resolve interactions among the particles themselves. The cell linked list used in the current implementation is stored in 2 arrays: one is a list of head ids and the other is a list of linked ids. An adjacency list is used to store the Verlet list. Apart from these, a particle id → cell id map is also stored to avoid unnecessary cell id calculations. The cell ids are then further used to identify the location type (using the cell flags) or the neighbourhood of the particles in question.
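The two-array layout described above can be sketched as follows: `head[c]` holds the id of the most recently inserted particle of cell c (or -1 when empty) and `nxt[i]` links each particle to the previous one in the same cell.

```python
def build_cell_list(cell_of, n_cells):
    """Cell-linked list in two flat arrays. `cell_of[i]` is the cell id of
    particle i; returns (head, nxt) where head[c] starts the chain of cell c
    and nxt[i] continues it (-1 terminates)."""
    head = [-1] * n_cells
    nxt = [-1] * len(cell_of)
    for i, c in enumerate(cell_of):
        nxt[i] = head[c]     # chain to the previous occupant of cell c
        head[c] = i          # particle i becomes the new chain head
    return head, nxt

def particles_in_cell(c, head, nxt):
    """Walk the chain for cell c and collect the particle ids."""
    out, i = [], head[c]
    while i != -1:
        out.append(i)
        i = nxt[i]
    return out
```

Both arrays are flat and contiguous, which matches the cache-friendliness argument made in the next paragraph; insertion is O(1) per particle and a full rebuild is O(N).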
The memory allocations for these data structures are kept contiguous to avoid excessive refreshing of the CPU cache. Moreover, the ids in the neighbour list are stored in memory in the order of their distance from the particle under consideration. Quick access to these variables for the distance calculation (during the collision frequency calculation) results in as many cache hits as possible (see Fig. 3). This is important because the collision frequency calculation is the most expensive step in Algorithm 3.1. SIMD vectorization has also been shown to improve performance drastically for large numbers of particles [50]. This will be considered for further performance improvements in our future work.
A map of flags on the cells is used to identify the location type of the bubbles/particles and also for boundary detection (see Fig. 4). This map is necessary for the detection of boundaries (like walls) and halo cells, preventing extra checks for the bubbles/particles in the bulk. The flagging is done once during the initialization of the simulation, either based on the boundary conditions from the Eulerian part or based on explicit boundary conditions provided for the Lagrangian part.

Bubbles in the halo region
Since the DSMC process in every sub-domain is independent, it is important to check the consistency of the physics in the form of momentum conservation across sub-domains. Several stochastic methods in literature are known not to conserve momentum at the individual particle level (see [27]) but rather at an overall statistical level. Fig. 3 shows a 2D representation of a single layer particle decomposition across neighbouring sub-domains, including the halo regions in colour.
For the pure DSMC algorithm, the exchange of momentum between real and halo particles/bubbles in the current process need not be exactly the same as that between the particles/bubbles of the halo and real cells of the neighbouring process. This will only happen if the chosen collision pairs are exactly the same across the neighbouring sub-domains. Since the process of collision pair selection is random, the net amount of momentum loss or gain is also random. To minimize the momentum exchange errors, a nearest neighbour collision is used near the sub-domain boundaries.

Parallel verification and validation
In our previous work [28], we verified and validated the developed DSMC method for two limiting conditions: first for a pure DEM (with no background fluid phase), and second for a fully coupled gas-liquid system. This also included monodisperse and poly-disperse cases. The calculation of the total number of collisions and also the collision frequency was verified for varying solids and gas fractions. For validation, a comparison was made with the deterministic Discrete Particle Model (DPM) and Discrete Bubble Model (DBM), respectively.
In the current work, similar test cases have been set up to verify the parallel implementation of the DSMC algorithm. Naturally, a stochastic algorithm cannot be expected to execute the exact same sequence of collisions when run multiple times, even on a single core. This is possible in the deterministic cases when the initial particle distribution is identical every single time. However, for a stochastic algorithm, the statistical properties of the collisions and momentum exchange should not change upon parallelization. The following problems are set up for the verification:
• Impinging particle streams
• Single bubble rise across sub-domains
• Bubble columns

Impinging particle streams
The system consists of two injection nozzles oriented such that they partly face each other. The solid particles flow into the system through these nozzles. The system boundaries act as exits for the particles. The domain is decomposed based on the number of processors provided. The solid particles are given inlet velocities based on the overall mass flow rate. To obtain a velocity distribution, a fluctuating velocity based on a Gaussian distribution is added to the average velocity. The width of the distribution is varied to obtain different velocity profiles from the nozzles. Two test cases, I1 and I2, have been set up (see Table 3 for the simulation parameters) with the following goals:
1. To verify, with a varying number of cores, the consistency of the collision frequency and its independence from the domain decomposition. (I1)
2. To quantify the momentum errors that occur across neighbouring sub-domains because of a mismatch in collision pairs due to independently running DSMC loops. (I2)
A schematic of the problem setup for I1 with different decompositions is reported in Fig. 5. The comparison of the serial version of the method with the deterministic method (DPM) has been reported in Kamath et al. [28] for both mono-disperse and poly-disperse cases at varying solid fractions. The simulation has been executed for 5 decompositions: 12, 23, 24, 48 and 96 cores. The overall collision frequency, the instantaneous collision frequency and the total number of collisions occurring in the system are reported in Fig. 6. The results are almost independent of the number of cores used, with a maximum relative error of 0.2%. Note that this algorithm does not use the nearest neighbour collisions, as Stk_p → ∞ without any continuous fluid medium.
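The fluctuating inlet condition described above can be sketched as a Gaussian perturbation added to the mean nozzle velocity; the function name and seeding below are illustrative, not the authors' implementation.

```python
import random

def inlet_velocity(mean, sigma, rng=None):
    """Inlet velocity sample: mean nozzle velocity plus a zero-mean Gaussian
    fluctuation whose standard deviation sigma sets the width of the
    resulting velocity distribution (sketch)."""
    rng = rng or random.Random()
    return mean + rng.gauss(0.0, sigma)
```

Varying `sigma` reproduces the effect described in the text: wider distributions give broader velocity profiles at the nozzles, while `sigma = 0` recovers the uniform mean-velocity inlet.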
A smaller representative system of the same problem is simulated (I2) on 5 cores to evaluate the momentum conservation error across neighbouring sub-domains (see Table 3). This error is expected to reduce in time, relative to the absolute amount of momentum transferred across the sub-domains, due to the randomization of the particle ids and collision partners, and because the searching scope is isotropic.
A schematic representation of the problem along with the process id numbering is shown in Fig. 7. The relative errors for momentum conservation are calculated in each sub-domain with respect to all of its neighbours from the mismatch between p_a and p_b, where p_a represents the momentum stored in the part of the halo region of sub-process a that is shared with sub-process b, and p_b the corresponding momentum seen by sub-process b. These momentum errors (ε_mom) are reported in Table 4 and shown for process id 0 in Fig. 8. It can be seen that the maximum errors reported are for process ids 0 ↔ 3 and 1 ↔ 2. It should be noted that the overlap between these specific sub-domains is a single row of cells along the diagonal (see Fig. 7b). The net amount of momentum exchanged and the number of collisions occurring here are nearly two orders of magnitude lower than at the other process boundaries. The cumulative error with respect to the amount of momentum exchanged reduces as more collisions occur. Nevertheless, to reduce this error, a nearest neighbour collision is forced near the process boundaries in the case of bubbly flows. This reduces the momentum error significantly (to less than 0.5%), as the bubble positions are communicated across the process boundaries and thus the relative distances are the same for a given pair in each sub-process.
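A relative momentum mismatch of the kind tabulated in Table 4 could be computed along the following lines; this is a hedged sketch, and the exact normalization used by the authors is not reproduced here.

```python
def momentum_error(p_a, p_b):
    """Relative mismatch between the momentum exchanged in the shared halo
    region as seen by sub-process a (p_a) and by sub-process b (p_b).
    Illustrative normalization by the larger magnitude."""
    denom = max(abs(p_a), abs(p_b))
    return 0.0 if denom == 0.0 else abs(p_a - p_b) / denom
```

A value of zero means both sub-processes agree exactly on the exchanged momentum, which happens only when the randomly chosen collision pairs coincide on both sides of the boundary.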

Single bubble rise across sub-domains
For the second test, a single bubble is initiated in the middle of a domain with six no-slip walls. This test case differs from the previous one in that the Lagrangian part is coupled with the flow solver. Since there are no other bubbles to collide with, the purpose of this test is to check whether the domain decomposition has an influence on the momentum coupling between the Lagrangian and Eulerian phases. The simulation settings are specified in Tables 5 and 6. Two tests are carried out to check: 1. the temporal accuracy of the force integration scheme with respect to the chosen time-step; 2. the spatial stability of the force calculation with changing sub-domains.
The temporal accuracy of the first-order explicit Euler scheme is checked against a fourth-order explicit RK4 scheme (see Fig. 9a). The differences are negligible, which indicates that the division of the flow time-step into smaller bubble time-steps is sufficient. Fig. 9b clearly shows that the bubble passes through the sub-domains without any jump in velocity upon entry into or exit from a sub-domain, which indicates proper communication of the flow velocities and the pressure field across the sub-domains.
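The character of this check can be reproduced with a minimal sketch of a single bubble relaxing to its terminal rise velocity under buoyancy and linear drag, comparing explicit Euler against classical RK4. All coefficients are illustrative, not the paper's closures:

```python
def accel(v, g_eff=2.0, tau=0.05):
    # net buoyancy acceleration with linear drag relaxation time tau
    # (illustrative one-dimensional model, not the paper's force closure)
    return g_eff - v / tau

def euler_step(v, dt):
    return v + dt * accel(v)

def rk4_step(v, dt):
    k1 = accel(v)
    k2 = accel(v + 0.5 * dt * k1)
    k3 = accel(v + 0.5 * dt * k2)
    k4 = accel(v + dt * k3)
    return v + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0

dt, steps = 1e-3, 500       # bubble time-step much smaller than tau
v_e = v_r = 0.0
for _ in range(steps):
    v_e, v_r = euler_step(v_e, dt), rk4_step(v_r, dt)

v_terminal = 2.0 * 0.05     # analytic terminal velocity: g_eff * tau
```

With dt well below the relaxation time the two schemes coincide to within round-off of the physics, mirroring the negligible differences seen in Fig. 9a.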

Bubble columns
Finally, the four-way coupling of the Lagrangian and Eulerian phases is verified via simulation of two tall square cross-sectioned bubble columns (see Fig. 10). Apart from the dimensions, the nozzle distribution in the sparger differs between the two problems, ultimately inducing different kinds of flows in the two systems. Problem setup parameters are listed in Table 7.
Table 6. Physical properties of the gas and liquid phases and numerical properties of the simulations used in the verification tests (grid cell size Δx = 5 mm).

The results of the T1 case are compared with time-averaged experimental measurements of Deen et al. [51] and results from the serial code reported earlier [53]. Fig. 11 shows the distribution of the time-averaged axial liquid velocity along the x-axis at a fraction of the column height H of z = 0.56H and a fraction of the column depth D of y = 0.5D. The simulated velocities are in good agreement with the experimental measurements and with the reference data from the serial code. The minor deviations between the simulated curves from the present work can be attributed to the stochastic nature of the bubble collisions in the DSMC and thus slightly different behaviour of the plume for each realization. The maxima of the time-averaged axial liquid velocities at approximately x/L = 0.5 are slightly higher than those of the serial code. This can be attributed to the nearest-neighbour treatment of bubble-bubble collisions near the edges of the sub-domains, which maintains momentum conservation across the sub-domain borders but renders the collision treatment slightly more deterministic [25].

[…] and experimental measurements of Harteveld et al. [52]. Data is averaged over the time interval 10–20 s.

Table 8. Parallel performance of the FoxBerry code.
In the T2 case, results are compared with experimental measurements of Harteveld et al. [52] and numerical data, using the DBM, from Darmana et al. [25]. In this test case, the simulations are performed using 24 cores. Fig. 12 demonstrates the integral gas hold-up as a function of superficial gas velocity. The calculated values of the parallelized code show excellent agreement with the experimental data as well as with the reference simulations.

Parallel performance and comparison with previous approaches
In this section, we analyze the parallel performance of the implementation by considering different types of problems. All tests are performed on "Cartesius", the Dutch National Supercomputer. Every node consists of 2x12 cores of Intel Xeon E5-2690 v3 with a basic clock speed of 2.6 GHz and hyper-threading switched off. The numerical code is compiled using the Intel C++ compiler v16.0.3, Intel MPI v5.1.3 and Trilinos 12.10.

Single phase flow solver
The parallel efficiency of FoxBerry is measured based on the flow in the initial section of a square duct. The domain is discretized on 125 · 10^6 grid cells. The chosen time step is 10^-3 s. Each time step takes approximately 20 inner iterations of the SIMPLE algorithm.
The strong scaling of FoxBerry is shown in Table 8 and demonstrates 48.4% parallel efficiency on the 3072 tested cores (relative to the reference performance on 192 cores). The results in Table 8 also show a breakdown of the most computationally expensive parts of the code, indicating a significant contribution of the matrix assembly to the total time. This part can be further improved by implementing a more suitable algorithm for the pressure-velocity coupling for transient problems, e.g. the fractional step method. In addition, the solution of the PPE demonstrates poor scaling beyond 768 cores, which can be improved with a better preconditioning technique.

Table 9. Main properties of the P1 and P2 cases. N_cells and d_x are the total number of Eulerian cells and their size, respectively. <N_b> is the average number of bubbles in the fully developed system and v_b is the superficial gas velocity. Dimensions L, D and H correspond to the x, y and z axes, respectively.
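Strong-scaling efficiency relative to a reference core count follows the usual definition η = (t_ref · n_ref)/(t_n · n). A minimal sketch (the timings below are illustrative, not the measurements from Table 8):

```python
def strong_scaling_efficiency(t_ref, n_ref, t_n, n):
    """Parallel efficiency of a run on n cores taking t_n seconds,
    relative to a reference run on n_ref cores taking t_ref seconds."""
    return (t_ref * n_ref) / (t_n * n)

# illustrative: a run that is 7.744x faster on 16x the cores
# corresponds to 48.4% efficiency relative to the reference
eta = strong_scaling_efficiency(t_ref=100.0, n_ref=192,
                                t_n=100.0 / 7.744, n=3072)
print(round(eta * 100, 1))  # -> 48.4
```

Note that η is always reported relative to the 192-core baseline here, since a serial run of a 125 · 10^6 cell problem is impractical.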

Combined code
The parallel efficiency of the coupled Euler-Lagrange code directly depends on the uniformity of the distribution of Lagrangian entities in the Eulerian domain [4,24]. In bubble column reactors, this distribution is directly determined by the geometry and placement of the sparger. This section demonstrates the parallel efficiency of the developed code for the two most common scenarios: (P1) the sparger partially occupies the bottom plate, which leads to the appearance of a plume and an imbalanced workload for the parallel DSMC; (P2) the sparger fully occupies the bottom plate, which leads to uniform bubble rise and a well-balanced workload for the parallel DSMC.
The main properties of the aforementioned cases are shown in Table 9. All nozzles are arranged in squares and placed at the bottom of the domain. The governing equations are discretized on a uniform Cartesian grid. All the following results are obtained by averaging the required data over 10 time steps after the flow in the domain is fully developed. It is important to note that the Barton convection scheme used here demands 3 layers of halo cells to calculate the convective fluxes entering a sub-domain. This increases the communication message size and adversely affects the parallel performance when the decomposition is too fine.
The parallel efficiency and the elapsed time for the P1 case are shown in Fig. 13. The code scales well, demonstrating a 48% combined parallel efficiency on 384 cores; each time step takes 4.4 seconds, of which 1.6 seconds are taken by the DSMC algorithm. The decrease in the parallel performance of the CFD part is due to the low density of the Eulerian grid, causing MPI communications to dominate over the workload on each process. Additionally, the uneven distribution of the bubbles (see Fig. 14) results in a significant workload imbalance among the processes in the DSMC algorithm. This leads to only 39% parallel efficiency of the DSMC algorithm at 384 cores, preventing the coupled code from scaling further.
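The workload imbalance of the DSMC part can be quantified as the ratio of the busiest process's bubble count to the average count, since the slowest process sets the pace of each synchronized step. A minimal sketch (the per-process counts are illustrative, not taken from the simulations):

```python
def imbalance(counts):
    """Load-imbalance factor: maximum per-process work over the mean.
    1.0 means perfectly balanced; larger values mean most processes
    wait while the busiest one finishes its DSMC loop."""
    return max(counts) / (sum(counts) / len(counts))

# a central plume concentrates bubbles on a few processes,
# whereas uniform sparging spreads them evenly (illustrative counts)
plume   = [8000, 6000, 500, 500, 500, 500]
uniform = [2666, 2667, 2667, 2667, 2666, 2667]
print(imbalance(plume) > imbalance(uniform))  # -> True
```

This is the effect that caps the P1 case at 39% DSMC efficiency while the uniformly sparged P2 case keeps scaling.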
The problem of an unbalanced workload becomes more pronounced as the number of bubbles increases, e.g. due to a reduction of their sizes. This issue emphasizes the necessity of a dynamic domain decomposition for the Lagrangian part in order to improve the parallel performance of the coupled CFD-DSMC code in simulations similar to the P1 case. This requires careful communication of the Eulerian data to the dynamically determined sub-domains for the force calculation step, a communication step that is currently completely absent. Therefore, one also needs to consider whether this would actually pay off in terms of performance for general multiphase flow problems.
In the P2 case, the uniform distribution of bubbles in the domain (see Fig. 16) leads to a substantial increase of the parallel performance compared to the P1 case, as shown in Fig. 15. The DSMC algorithm scales up to 768 cores while maintaining excellent parallel efficiency, demonstrating the potential for further scaling. However, the grid density of the Eulerian solver limits its scaling to 43% parallel efficiency on 768 cores, which leads to an overall parallel performance of 51%. This corresponds to 4.5 seconds of elapsed time per time step, of which only 0.5 seconds are taken by the DSMC algorithm.

Large scale bubble column
In this section, a detailed analysis of the hydrodynamics in a large-scale bubble column is presented. The configuration of the considered column corresponds to the P1 case described in section 6.2. The simulations are performed for 95 seconds of physical time and the results are averaged over the time interval between 10 and 95 seconds. The lower boundary of the averaging period corresponds to the time when fully developed flow conditions are reached.

Fig. 17 shows the time-averaged velocity vector fields at different heights of the column. At the bottom of the column, the liquid velocity reaches approximately 1.3 m/s, while at the top it decreases to almost 0.2 m/s. The high liquid velocity at the bottom is directly related to the high superficial gas velocity and can be explained as follows. The bubbles already released into the domain are decelerated by the large mass of liquid above them. Because coalescence is absent in this simulation, the new bubbles entering the domain collide with the decelerated bubbles and form a dense swarm. This swarm is pushed by the next sets of bubbles entering the system and thus experiences an increase in axial momentum. At the center of the swarm the liquid fraction is low and the bubbles experience less drag. Therefore, they have higher velocities compared to the bubbles at the edges. This dynamics is preserved […]

The distribution of the time-averaged vorticity magnitude at different heights of the column is shown in Fig. 19. In the bottom half of the domain the flow is highly rotational around a very pronounced core in the center. The regions of high vorticity correlate with the swirling of the bottom part of the plume, which can be observed in Fig. 14. This swirling is caused by the helical trajectories of the bubbles, which is the most energy-efficient way for a high- […] (see Table 10).
Also, despite the lower magnitude of the vorticity at the top, the circulation in that region is comparable to the one at the bottom. The aforementioned dynamics can also be seen from the distribution of the turbulence kinetic energy (TKE), shown in Fig. 20. While at the bottom of the column the most energetic eddies are concentrated close to the core of the plume, the upper half of the column shows a wide region of high TKE. Nevertheless, at the top of the column, the TKE remains low in the vicinity of the corners and walls due to viscous effects. In addition, Fig. 20 indicates a region of low TKE at the center of the bubble plume. Similar to a submerged jet, the liquid in this region has a potential core, which can also be observed through the zero vorticity at the center of the bubble plume in Figs. 19b-c. However, the TKE increases significantly towards the edge of the bubble plume, where the shear stresses are high and the fluid flow becomes highly turbulent. Further away from the plume, the TKE reduces, indicating a decrease of the velocity fluctuations.
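The TKE used above follows the standard definition k = ½(⟨u′²⟩ + ⟨v′²⟩ + ⟨w′²⟩), computed from the velocity fluctuations about the time average at each point. A minimal sketch with an illustrative time series (not data from the simulations):

```python
def turbulence_kinetic_energy(samples):
    """k = 0.5 * (<u'^2> + <v'^2> + <w'^2>) from a time series of
    velocity vectors (u, v, w) at a single point."""
    n = len(samples)
    mean = [sum(s[i] for s in samples) / n for i in range(3)]
    var = [sum((s[i] - mean[i]) ** 2 for s in samples) / n
           for i in range(3)]
    return 0.5 * sum(var)

# a steady flow has zero TKE; fluctuations about the mean raise it
steady = [(1.0, 0.0, 0.0)] * 4
fluct  = [(1.1, 0.0, 0.0), (0.9, 0.0, 0.0),
          (1.1, 0.0, 0.0), (0.9, 0.0, 0.0)]
print(turbulence_kinetic_energy(steady))  # -> 0.0
```

In the simulations the averaging runs over the 10-95 s interval mentioned above, and the resulting field separates the quiescent potential core from the highly sheared plume edge.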

Effect of column scale
The results and conclusions obtained from a laboratory-scale bubble column may fail to predict the dynamics in pilot- or industrial-scale columns because of the influence of geometry and scale effects. To assess the significance of these effects, three bubble columns of different sizes are compared, see Table 11.
The T1 and P1 cases from Table 11 have already been discussed in section 5.3 and section 7.1, respectively. The P3 case represents a domain extended from the P1 case towards industrial scales. The numerical and physical parameters of the P3 case can be found in Table 6. As in all cases, the sparger in the P3 case is represented by nozzles arranged in a square and placed at the center of the bottom of the column. The P3 case is simulated for 27 seconds of physical time using 768 cores. Both the P1 and P3 cases have been simulated for 5 days, i.e. a wall time of 120 hours, on their respective numbers of physical cores.

Figs. 21, 22 and 23 show the instantaneous axial liquid velocity fields at the vertical mid-plane of the domain for the T1, P1 and P3 cases, respectively. The maximal axial liquid velocity in the T1 case varies only slightly along the height of the column, unlike in the P1 and P3 cases, where the axial liquid velocity is significantly higher in the bottom part of the domain than in the top part. This is because of the small distance between the opposite vertical walls in the T1 case. Large recirculation zones with a length comparable to the height of the column appear in the domain shortly after the bubbles reach the top. These recirculation zones then travel downwards, disturbing the core of the plume and leading to meandering of the plume from the very bottom. In contrast, in the P1 and P3 cases, the rise of the bubble plume is accompanied by multiple eddies of different scales from the very beginning. These eddies dissipate energy at different sections of the plume, leading to unstable large-scale structures that break into smaller ones. The core of the plume is relatively more stable at the bottom because these smaller eddies do not have the energy to disturb the bottom part of the plume completely.
The instantaneous distribution of the bubbles in the three domains is shown in Figs. 24, 25 and 26. The bubble plume in the T1 case is highly influenced by the aforementioned recirculation zones, which push the plume towards the different side walls (see Fig. 24, t = 15.0 and 27.0 seconds). Due to the dominance of a single large-scale eddy, the bubble plume may stay close to one of the walls for a long period of time, until this is changed by instabilities arising in the bulk. These dynamics are fundamentally different from the P1 and P3 cases, where the plume at the bottom of the domain shifts only slightly from the center and never reaches the walls. In addition, the bubble plume in the P1 and P3 cases breaks up shortly after the bubbles enter the domain. Based on the scales indicated in Fig. 26, the breakage of the bubble plume occurs at almost the same height in both cases. The overall instability of the bubble plume contributes to the almost uniform distribution of the bubbles in the domain above the point of breakage.
The comparison of the bubble distribution in the P1 and P3 cases demonstrates the influence of the height of the domain on the rising bubbles. The front of the rising bubbles in the P3 case consists of constantly appearing small groups of gas inclusions. These groups experience less drag compared to the massive bubble front, but they break up quickly, creating regions of lower pressure. The bubbles from the rising front are attracted by these regions, which leads to the appearance of new groups. These dynamics are not observed in the T1 case, where the bubbles rise in a "mushroom-like" manner [1], nor in the P1 case, where the rising bubbles form a dense column moving along a helical path.

Conclusion
In this paper a parallel Euler-Lagrange framework and its application to the simulation of dense gas-liquid flows are presented. The Lagrangian phase is resolved by means of the DSMC algorithm, which is parallelized with MPI, whereas the Eulerian phase is resolved using the volume-averaged Navier-Stokes equations incorporated in the in-house numerical framework FoxBerry. The coupled parallel code has been verified on several standard problems and demonstrates high accuracy, independent of the number of cores used.
The stochastic nature of the DSMC algorithm avoids the limitations of conventional deterministic (DBM) methods associated with parallelization for distributed-memory platforms. The DSMC method demonstrates linear scaling up to the 768 tested cores for 8.3 · 10^5 bubbles uniformly distributed across the domain. For a non-uniform distribution of bubbles, the DSMC algorithm allows the simulation of 2.3 · 10^5 gas entities using 384 cores while maintaining 39% parallel efficiency.
The developed algorithms make it possible, for the first time, to simulate and analyze a tall square cross-sectioned bubble column using full four-way coupling of the Eulerian and Lagrangian phases. The results demonstrate a pronounced split of the domain into two sections. In the bottom section, the bubble plume occupies only the central part of the domain and determines the overall dynamics of the Eulerian phase. Close to the center of the domain the liquid exhibits a strong rotational motion, which is related to the swirling behaviour of the bubble plume. In the top section the plume breaks up, which leads to a more uniform distribution of bubbles in the bulk and the appearance of a large number of unstable eddies. This splitting is not observed in a smaller-scale bubble column, where the bubble plume remains pronounced along the whole height of the domain and the dynamics of the plume is governed by large-scale turbulent structures.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.