A multi-level parallel solver for rarefied gas flows in porous media

A high-performance gas kinetic solver using multi-level parallelization is developed to enable pore-scale simulations of rarefied flows in porous media. The Boltzmann model equation is solved by the discrete velocity method with an iterative scheme. The multi-level MPI/OpenMP parallelization is implemented with the aim to efficiently utilise the computational resources to allow direct simulation of rarefied gas flows in porous media based on digital rock images for the first time. The multi-level parallel approach is analyzed in details confirming its better performance than the commonly-used MPI processing alone for an iterative scheme. With high communication efficiency and appropriate load balancing among CPU processes, parallel efficiency of 94% is achieved for 1536 cores in the 2D simulations, and 81% for 12288 cores in the 3D simulations. While decomposition in the spatial space does not affect the simulation results, one additional benefit of this approach is that the number of subdomains can be kept minimal to avoid deterioration of the convergence rate of the iteration process. This multi-level parallel approach can be readily extended to solve other Boltzmann model equations.


Introduction
Gas transport in ultra-tight porous media has recently received extensive attention due to its important role in extracting unconventional gas resources and developing new technologies such as fuel cells, filters and catalytic converters [1,2]. However, understanding of gas transport in ultratight porous materials remains a research challenge for theoretical, computational and experimental studies. As the pore space can be as small as a few nanometers, the conventional continuum flow theories such as the Darcy law and the Klinkernberg model become invalid. Meanwhile, the low permeability (at the nano-Darcy scale) is difficult for experimental measurements [3]. Therefore, direct pore-scale simulation of gas flow in ultra-tight porous material becomes even more important to unravel complex gas transport mechanisms, which can also provide reliable data for upscaling, e.g. the apparent permeability obtained at the representative elementary volume scale can be plugged into macroscopic upscaling equations for predicting gas production [4].
In ultra-tight porous media, gas rarefaction effect becomes important to gas transport as the molecular mean free path is comparable to the average pore size. The rarefaction level can be indicated by the Knudsen number Kn, defined as the ratio of the molecular mean free path λ to the characteristic length L of the flow, i.e.
where R,p and µ(T 0 ) are the specific gas constant, mean gas pressure and dynamic viscosity of gas at a reference temperatureT 0 , respectively. Gas flows can be roughly categorized into four regimes: continuum (Kn < 0.001), slip (0.001 < Kn < 0.1), transition (0.1 < Kn < 10) and free-molecular (Kn > 10) flow regimes. The Navier-Stokes equations are applicable for continuum flow regime and their validity might be extended to slip flow regime by introducing velocity-slip boundary conditions at the solid surfaces [5,6]. Further increasing rarefaction level into the transition and free-molecular flow regimes, the linear constitutive relations as assumed in the Navier-Stokes equations are no longer valid [7]. Consequently, gas kinetic equations such as the Boltzmann equation or its model equations should be used instead of the Navier-Stokes equation. The Boltzmann equation for rarefied gas flows, describes the temporalt and spatialx evolution of the distribution functionf (x,v,t) of gas molecules with the velocityv where the Boltzmann collision integral I(f ,f * ) represents the rate at which the distribution function varies before (f ) and after (f * ) collisions. Numerical solutions of this nonlinear integro-differential equation for realistic problems are tremendously expensive, in terms of computational cost and memory requirement, which is due to the complex five-fold integral I as well as multidimensional phase space (3D spatial domain Ω x , 3D velocity domain Ω v , and 1D temporal domain Ω t for unsteady flows). The full Boltzmann solvers can be classified into stochastic approach, such as the direct simulation Monte Carlo (DSMC) method, and deterministic approach, such as discrete velocity method (DVM) and fast spectral method (FSM) [8][9][10][11][12][13][14]. In order to obtain reasonable approximation of the Boltzmann equation, the full collision integral I is commonly simplified by a relaxation-time collision model, e.g. Bhatnagar-Gross-Krook (BGK) [15], ellipsoidal-statistical BGK (ES-BGK) [16] or Shakhov (S) [17] model. The Boltzmann model equations preserve the main physical properties, e.g. conservation of mass, momentum and energy, and their accuracy in modeling rarefied gas flows has been assessed. The models have been used for a wide range of applications from high-altitude aerodynamics to microfluidics. Among various numerical methods for the Boltzmann model equations, lattice Boltzmann model (LBM) [18] is well developed and widely used for modeling flows in porous media thanks to the ease implementation of boundary condition on complex surfaces [19,20]. However, the conventional LBM fails to capture rarefaction effects in early transition flow regime even for a simple Poiseuille flow due to finite number of velocity grid points N v [21,22]. It is demonstrated that high-order LBM is needed to capture the Knudsen paradox phenomena in a straight channel [23,24]. Therefore, the accuracy of conventional LBM for porous media flows in the transition and free-molecular regimes is still questionable and other gas kinetic methods are required to simulate rarefied gas flows in these regimes. As the DSMC method is impractical for often low-speed flows in porous media, the DVM is adopted here. Even with the most simple collision model, i.e. the BGK equation, multi-dimensional phase space still prohibits practical simulations from serial computing. For example, a serial calculation of a porous sample of 1000 3 -voxel image may need several terabytes of memory and years of the wall-clock time, even with a modest molecular velocity grid of N v = 10 3 . In the last decade, rapid advances in massively parallel computing technology have paved the way for gas kinetic solvers for direct simulations of flows in porous media [25][26][27][28][29][30][31][32][33]. For calculations on a single computing node, kinetic solvers are usually parallelized using Open Multi-Processing (OpenMP) or Graphical Processing Unit (GPU). The OpenMP of shared memory model is accomplished by multi-threaded parallelism with the advantages of simplicity, portability and extensibility. However, scalability of OpenMP is limited by the number of cores on a single computing node. For example, speedup of 46 times is reported for 3D Sod problem executed on a decent 64-core machine [26]. For applications that seek floating point performance, GPU is much faster than the traditional CPU (Central Processing Unit). The GPU-based BGK solver may provide impressive acceleration (about 600 and 340 times for the 1D shock wave and 2D driven cavity tests, respectively) thanks to hundreds or thousands of "light weight" stream processors of GPU [25]. Nonetheless, it is normally restricted to relatively small domain mainly due to limited on-board global memory, e.g. a total grid points of up to N x · N v = 25.6 × 10 3 · 8 × 10 3 is reported in the above study.
For large scale calculations, the computing nodes are interconnected by high-speed networks and the Message Passing Interface (MPI) library is usually utilized for communication between subdomains, which are decomposed in either the velocity domain Ω v or the spatial domain Ω x .
The Ω v -decomposition is simple to implement and easy to achieve load balancing, thus it is usually favoured. For instance, in calculation of 3D hypersonic flows around space vehicles, in which high resolution and a wide range of velocity space are desired, the Ω v -decomposition strategy enables high efficiency of 88% upto 20000-core computing (in comparison with the 500-core one) with the total grid points of 191 × 10 3 · 729 × 10 3 [31]. Nevertheless, large amount of communications in calculating macroscopic parameters (i.e. the moments of distribution functionf over the velocity domain Ω v ) at every N x spatial grid points reduce the performance significantly, especially for a large N x . Hence Ω x -decomposition is preferred (or inevitable) when the spatial domain is large compared to the velocity domain. For example, the simulations of 3D pressure-driven flow in a pipe, which uses nearly 350 × 10 3 · 4.1 × 10 3 grid points, have parallel efficiencies of 96% for Ω x -decomposition and 70% for Ω v -decomposition using 512 cores (in comparison with the 64 cores) [28]. Regardless of Ω x -decomposition or Ω v -decomposition, pure MPI parallelization has limited scalability for a fixed number of grid points N x · N v . The subdomain handled by each MPI core becomes smaller whereas the total amount of MPI communications increase with the number of core N c , leading to load unbalancing. One can observe significant drop of efficiency when the number of cores doubles to 1024 in the above example, i.e. from 96% to 64% and from 70% to 15% respectively. Performance comparison between the two strategies of domain decomposition can be found in Ref. [33].
The limitation on MPI scalability can be mitigated by introducing the second level parallelization. More precisely, two-level MPI/OpenMP parallel approach, in which the top-level MPI proccess is responsible for each Ω x -subdomain, and a group of OpenMP threads at the bottom-level accelerate computing within the Ω x -subdomain as well as communicating with other Ω x -subdomains. This two level approach will help to achieve load balancing and reduce memory consumption for duplicated data [34]. To the best of our knowledge, this multi-level parallel approach for a gas kinetic solver was first proposed by Baranger et al. [32]. However, the tested largest domain is relatively small with the total grid points of 50 × 10 3 · 3 × 10 3 and the parallel performance was not analyzed in that study. A typical modern micro-CT scanner can provide images with 2000 3 voxels and advanced synchrotron imaging may allow much larger images to be reconstructed [19]. As a result, the two-level parallel MPI/OpenMP approach with excellent scalability is required for the kinetic solver to be able to directly simulate gas flows in the porous media with flow passages provided by high-resolution images.
Apart from parallel scalability, convergence rate to steady state also contributes to the total simulation time. To solve the BGK equation, the most numerical schemes are designed for unsteady flows [35,25,26,30,33,[36][37][38][39]. A few BGK solvers recently employ implicit timemarching scheme to accelerate steady state solutions with a large Courant-Friedrichs-Lewy (CFL) number [27,32,40]. It is demonstrated in Refs. [41,28] that domain decomposition in the spatial space Ω x deteriorates, to some extent, the convergence rate of implicit time-marching schemes. Taking into account a smaller number of subdomains for the same number of cores, multi-level parallel approach may mitigate the deterioration in convergence rate for pure MPI Ω x -decomposition.
The present work is to develop a multi-level parallel BGK kinetic solver with enhanced scalability for steady rarefied gas flows in porous media. The solver is designed to efficiently solve flows in porous media with complex structures described by binary 2D slide-images (composed of pixels) or 3D volume-images (composed of voxels). Parallel scalability of two-level MPI/OpenMP approach is analyzed and compared with that of the pure MPI approach.

Governing equation and its boundary conditions
The BGK model equation can be employed to describe low-speed gas flows in ultra-tight porous media, where the flow resistance is high. The distribution function is linearized in the standard manner asf =f eq (1 + h) [42], and the Maxwellian distribution function iŝ 4 The global equilibrium number densityn eq is related to the mean gas pressure by the ideal gas laŵ n eq =p/mRT 0 , in which m is the molecular mass. The (dimensionless) perturbated distribution where time-dependent derivative in Eq.
(2) is omitted as only steady-state solution is of interest. The following dimensionless quantities (denoted by omitting "hat" in the notations of the corresponding dimensional ones) are used where v m = 2RT 0 is the most probable speed. The perturbed number density , velocity u and temperature τ are calculated as moments of perturbed distribution function h over the velocity space where the dimensionless equilibrium distribution function is Boundary conditions for the BGK equation (4) should be specified at the solid surfaces and the outer faces of a porous medium. Gas-surface interaction is modeled by the Maxwell diffusespecular reflection where n, s , α are the outer normal unit vector of the solid surface, the perturbed gas number density on the solid surface and the tangential momentum accommodation coefficient (TMAC), respectively. Value of TMAC represents the diffuse portion of the reflected molecules, i.e fully diffuse or fully specular reflections correspond to α = 1 or α = 0. This study uses the diffuse boundary condition α = 1 at the solid surfaces. The perturbed gas number density on the solid surface is computed from the non-penetration condition, i.e. zero-mass flux through the solid surface A porous medium can be constructed by the periodic replica of the representative elementary volume in the x 1 , x 2 , and x 3 directions, respectively. It is convenient to take the size of computational domain i.e. the representative elementary volume in thex 1 direction as characteristic flow length L =x max in the definition of Knudsen number Eq. (1). At the inlet (x 1 = x min 1 ) and outlet (x 1 = x max 1 ), periodic condition [43] representing the pressure gradient is applied for molecules entering the computational domain assuming that the pressure gradient only exists in the At the lateral faces of the porous medium, symmetric boundary conditions are implemented by the specular reflection, i.e.
The dimensionless apparent permeability k, which is normalized by L 2 , is calculated as where G p is the dimensionless mass flow rate per unit area normalized byp/v m G p = 4 In order to appropriately compare the pore-scale rarefaction effects in different porous samples of various sizes, the effective Knudsen number Kn * is defined as where the average pore size L * is defined as In this study, permeability obtained with the smallest examined Kn, which ensures that flow is in the continuum regime, is considered as intrinsic permeability k ∞ to determine L * . The porosity of a porous model in Eq. (15) is determined by the ratio of the number of fluid points to the total number of points, i.e. the percentage of voids in the digital images.

Discrete velocity method
DVM is one of the most common deterministic approaches to solve the Boltzmann equation and its simplified models [44,35], which projects the continuous molecular velocity space v into a set of fixed N v -discrete velocities v (k) (k = 1, 2, .., N v ). Consequently, the BGK equation (4) is replaced by a system of N v -independent equations. Generally speaking, choice of velocity grid, i.e. the number N v and value of discrete velocity points v (k) , is problem dependent. The range of v (k) must cover the scope of bulk velocity u and temperature T in the whole flow domain, e.g. a broad variety of u or T requires a wide range of v (k) . The number N v associates with accuracy order of numerical quadrature adopted for evaluating macroscopic parameters using Eq. (6). Highly rarefied (large Kn) flows usually demand a fine velocity grid to capture the discontinuity in the distribution function. The values of discrete velocity points are determined by the abscissas of the adopted quadrature rule. Detailed discussion on velocity grid can be found in Ref. [32]. In this study, the Cartesian velocity grid generated by half-range Gauss-Hermit quadrature [45] is employed.

Finite difference approximation of advection terms
As reconstructed digital images of porous media samples comprise of voxels, it is straightforward to construct uniform grids, where the fluid and solid points coincident with voids and matrix voxels, respectively. As a result, finite difference method (FDM) is convenient to approximate the advection terms on these uniform grids. The spatial derivatives in Eq. (4) are approximated by the upwind schemes, e.g. the gradient component of h at the fluid point x projected in the x j coordinate axis ( j = 1, 2, 3) is evaluated as follows where sgn, ∆x and i j are the sign (signum) function, the constant grid size and the unit vector of the x j coordinate axis, respectively. The constants (C 0 , C 1 , C 2 ) are equal to (1.5, −2, 0.5)/∆x for the second-order-accurate scheme, while they are (1, −1, 0)/∆x for the first-order-accurate scheme. The second-order-accurate scheme is used by default while the first-order-accurate scheme is automatically deployed if the upstream grid point in the v j -direction is located on the boundary surfaces. The upwind numerical stencils vary from 3 to 5 points for 2D flows and from 4 to 7 points for 3D problems. For instance, the full 5-point stencil and reduced 4-point stencil for a 2D flow problem are illustrated in Fig. 1 by the black dash frame around the central fluid point x. To avoid checking the conditional expression IF-THEN(-ELSE) at every iteration, the order of accuracy associated with each spatial grid point is set in pre-processing by storing the values of (C 0 , C 1 , C 2 ) in an array. Therefore, we obtain a set of discretized equations by applied DVM and FDM on the BGK equation (

Algorithmic procedure for the serial solver
The set of N v -independent equations, i.e. Eq. (17) are solved by an iterative scheme. Three major steps are to be completed successively in each iteration as follows: (i) Distribution function h in the bulk region is updated by solving Eq. (17).
To be more specific, each equation corresponding to a discrete velocity v (k) can be independently solved by forward substitution along a path of sweep, namely raster scan, on all the fluid points in the spacial domain. The path of sweep is subject to numerical stencil and thus depends on the signs of all the components of the discrete velocity v (k) j . Figure 1 demonstrates the path of sweep for the group of discrete velocity v (k) in that 3D porous sample, the volume path of sweep can be developed from the above plane path of sweep by continuing from the bottom-left fluid point of the next plane in the positive direction of v 3 , and so forth. Paths of sweep for the other three (seven) groups of discrete velocity in 2D (3D) simulations can be deducted in an analogous manner. Distribution function of a fluid point is computed from these of the neighbour points in the upwind stencil, which have been already calculated either in some early stages of the current sweep (if the neighbours are the fluid points) or in the step (ii) of the previous iteration (if the neighbours are boundary points). It should be noted that each solid corner point has either 2 or 3 outer normal unit vectors n j directing to its neighbour fluid points. If the numerical stencil includes a solid corner point that links to the central fluid point in the x j -direction, diffuse-specular reflection at that corner point must be calculated with the normal vector n j by Eq. (8). For example, considering the numerical stencil of the fluid point on the right of the corner point in Fig. 1, diffuse-specular reflection at that corner point must be calculated with the normal vector n 2 = (1, 0).
(ii) Applying the boundary conditions, i.e. Eq. (8) on solid surfaces, and updating the boundary conditions, i.e. Eqs. (10) and (11) on the outer faces of the porous sample.
(iii) The macroscopic flow fields obtained from Eq. (6) and the gas number density on the solid surface from Eq. (9) are updated by the quadrature rule associated with the adopted velocity grid.
Only the resulting data for step (ii) and step (iii) are needed for the next iteration, thus we do not need to store the solved distribution function h in the bulk region as required in the time-marching schemes. The two supplementary steps are performed at every 100 iterations: 8 (iv) Permeability k given by Eq. (12) is approximated by the trapezoidal rule applied on a transverse cross section.
(v) The values of permeability obtained at the current iteration k (l) and the previous 100 iterations k (l−100) are used to assess convergence, i.e.

Two-level parallel implementation using MPI and OpenMP
The two-level parallel approach as illustrated in the Algorithm 1 is a natural option when the discretizations in both the spatial and velocity spaces are required. At the upper level, the 3D/2D spatial domain Ω x is decomposed into multiple cuboid/rectangular subdomains, and each subdomain is assigned to one MPI process. At the lower level, i.e. within each subdomain, OpenMP is employed to parallelize the major steps described in Sec. 3.1. Moreover, additional steps are introduced to deal with data transfer between subdomains, e.g. buffer packing/unpaking are also parallelized with OpenMP. In the rest of this paper, multi-level MPI/OpenMP parallelization will be denoted by MPI c 1 ×c 2 ×c 3 OMP c 4 . Here c 1 , c 2 , c 3 are the number of subdomain divisions along the x 1 , x 2 , x 3 coordinate axes, respectively, while c 4 is the number of OpenMP threads allocated to each subdomain. Therefore, the total number of subdomains, i.e. the MPI processes, is c 1 × c 2 × c 3 and the total number of adopted cores is c 1 ×c 2 ×c 3 ×c 4 . Pure MPI or pure OpenMP parallelizations correspond to c 4 = 1 or c 1 = c 2 = c 3 = 1, respectively.
MPI communications occur only between the spatial subdomains, see Fig. 1. To facilitate the overlapping of computation and MPI communication, we use the classical ghost points approach and the non-blocking version of the MPI send/receive subroutines. To be more specific, two layers of additional grid points, called ghost points or external halo, are padded around each computational subdomain, resulting an extended subdomain. As a result, the fluid points near the communication boundaries, called internal halo, can be treated normally like the inner fluid points, although the numerical stencils around the internal halo may occupy points beyond the communication boundaries. Additional sending (receiving) buffers are allocated to store the distribution function of the out-going (in-coming) discrete velocities at each subdomain boundary. The non-blocking MPI send/receive subroutines and the MPI wait subroutine are called prior and posterior, respectively, to the processing of the bulk subdomains. Then, the out-going (in-coming) distribution functions at the internal halo (external halo) are packed (unpacked) to (from) the sending (receiving) buffers. By using the two-level MPI/OpenMP parallel approach, the total number of MPI communication function calls is effectively reduced compared with the pure MPI approach, so that the communication congestion is avoided and the communication time is reduced so as to realize MPI level optimization, in which the performance of the program is improved from two aspects, MPI level optimization and algorithmic level optimization.
As stated in Sec. 1, the resulting parallel solver is not equivalent to the original serial solver in terms of the convergence history, because the grid points near the communication boundaries use the last iteration value of the upwind grid points. In the following Section, we will investigate how and to what extend the convergence rate is affected by the spatial domain decomposition, and even more importantly, whether the final permeability converges to the prediction of serial solver.

Convergence properties of the kinetic solvers
Both the 2D and 3D MPI/OpenMP parallel implementations of the current algorithm are compared against the OpenMP parallelization (without domain decomposition) in terms of the convergence rate and converged permeability. It is noted that the OpenMP parallelization has been validated for 2D flow through square array of circular cylinders in our previous work [46]. The term associated with the pertubed temperature τ in Eq. (17) has no influence on permeability in  our test on isothermal flow thus it is neglected. The solvers are written in Fortran and compiled by the Cray Programming Environment (version 5.2.82), and run on the ARCHER super-computing system (118080 processing cores in total), the UK national academic HPC facility. Each compute node of the ARCHER contains two 12-core E5-2697 v2 (Ivy Bridge) series processors and 64 GB memory.

2D flows
We first consider a regular model porous medium, i.e. the Sierpinski carpet, which is a famous plane fractal and can be constructed through recursion. The construction begins with a square. The square is cut into 9 congruent sub-squares with a 3-by-3 grid, and the central sub-square is then removed. The same procedure is applied recursively to the remaining 8 sub-squares. We consider a three-level construction with a resolution of 540 × 540 pixels, see Fig. 2

(a). The Knudsen number
Kn is associated with the reference length L = 540 pixels taken from size of the sample in the  Table 1. In the bottom row of the table, we also present the converged permeabilities obtained by the pure OpenMP computations. It is noted that pure OpenMP option (MPI 1 × 1 OMP 12) predicts the identical results as the serial version (MPI 1 × 1 OMP 1) in terms of convergence history and converged permeabilities where no domain decomposition is applied. The permeabilities predicted by the multi-level parallel are not shown here, because they match those of the pure OpenMP option, with the maximum relative deviation being 0.000203%. The convergence history of the permeabilities at different Knudsen numbers and domain decompositions are shown in Fig. 3(a). Both Table 1   and Figure 3(a) show that the deterioration in convergence rate generally becomes worse with increasing spatial subdomains and Knudsen number. The number of iterations has risen by 32% and 67% for Kn = 0.001 and Kn = 1, respectively, when 16 subdomains are adopted. The slow convergence rate near the continuum regime for the classical DVM, without domain decomposition, has been confirmed by the Fourier stability analysis, where the spectral radius is found to be equal to unity [47]. Now we considered a more complex 2D porous media composed by circular cylinders with random locations and sizes, as shown in Fig. 2(b), where the porosity is 0.6. The spatial grid size is 3000 × 1500, and Kn = 0.001, 0.01, 0.1, 1. The velocity grids N v used for the smaller Kn (0.001 and 0.01) and larger Kn (0.1 and 1) cases are 8 × 8 and 24 × 24, respectively. The Knudsen number Kn is defined with the reference length L = 3000 pixels, while the effective Knudsen number Kn * = 64.52Kn is estimated by Eq. (15) with k ∞ = 1.201 × 10 −5 . Similar to the previous Sierpinski carpet case, we show that the number of iterations to reach the converged permeabilities as presented in Table 2 and the convergence histories of permeability as shown in Fig. 3(b). The deterioration in convergence rate due to the spatial domain decomposition is  Fig. 2 found to be mitigated by increasing complexity in porous structure, especially at small Knudsen numbers. The increase of the total iterations is only 2% for Kn = 0.001 and 32% for Kn = 1, when 16 subdomains are used. The maximum deviation of the permeabilities predicted by the parallel solver from the corresponding serial ones is 0.024%. Comparing Fig. 3(b) with Fig. 3(a), we can find the convergence history of the random circular cylinders case is much smoother then the regular Sierpinski carpet case. This can be explained by the more pronounced mixing effect caused by the complex flow passages and gas-surface interactions.

3D flows
Similar verifications are also conducted for our 3D parallel implementation of the algorithm on the flows through a simple cubic array of spheres and the randomly packed spheres, which are often used as model porous media in analytical, computational and experimental studies [48,49].
In the first case, the unit cell is a cubic box with a sphere located at its centre and repeated itself in the 3D space. Due to symmetry and periodicity of the configuration, we simulate only a quadrant of the unit cell as shown in Fig. 4(a) Table 3, and the convergence histories are also presented in Fig. 4(c). The permeabilities predicted by the parallel solver deviate from the serial solver's results up to 0.11%. Similar to the 2D porous models, the degeneration of convergence rate also increases with number of spatial subdomains and Knudsen number. The number of iterations is increased by 35% and 460% for Kn = 0.01 and Kn = 10, respectively, when 64 subdomains are applied. We can also observe from Fig. 4(c) that the parallel solver's convergence history is not as smooth as the serial solver's at the high-Kn cases. One possible reason is that at a high Kn, some particles are altered their distribution function accidentally due to one-interation-lag in information transfer (at the boundaries of subdomain) instead of moving freely between the inlet and the outlet (as in the serial solver). These alternations then influence back to the inlet or outlet through periodic boundary conditions.
The second case is a cube filled with randomly packed spheres as illustrated in Fig. 4(b). Note that we extend four fluid layers at the inlet and outlet of the original geometry to make periodic boundary conditions applicable. The spatial grid size is 308 × 300 × 300 including the extended fluid layers. The Knudsen number Kn is associated with the reference length L = 308 voxels,  Table 4. From Fig. 4(d), it can be seen that the convergence history of the irregular sphere packing case is smooth even at high Kn, unlike the simple cubic array of spheres case. Table 4 shows that the slowing down of convergence rate in the parallel solver is not as significant as the simple cubic array of spheres case. The number of iterations increases by only 28% and 113% for Kn = 0.01 and Kn = 1, respectively, when the number of subdomains increases from 8 to 512. These observations in 3D porous media confirm our finding in the 2D cases that degeneration in convergence rate due to spatial domain decomposition is alleviated by complex porous media.

Parallel performance
Measuring the scaling performance of the OpenMP and MPI parallelisation individually allows us to better evaluate the overall efficiency of the multi-level parallel strategy. In this section, we demonstrate the scaling performance of each parallel level of the solver by excluding deterioration of convergence rate, i.e. comparing the wall clock time for an interval of 100 iterations. We first present the results of fine-grained (the bottom-level OpenMP) scaling performance using only one MPI process, followed by the weak and strong scaling performance with multiple MPI processes (the top level of parallelization), each with multiple OpenMP threads. Finally, speedup of two-level MPI/OpenMP parallelization is compared with that of pure MPI parallelization.

Fine-grained (the bottom-level OpenMP) scaling performance
To evaluate the OpenMP speedup against the serial processing, the OpenMP threads are mapped to the cores of a single processor (the maximum number of threads is 12) to avoid the performance drop caused by the non-uniform memory access (NUMA) across the two processors on the compute node.
The speedup of OpenMP parallelization for the 2D porous model as illustrated in Fig. 2(b) is plotted in Fig. 6(a). Almost perfect linear speedup is observed for all the examined velocity grids until 4 OpenMP threads. It can be seen that speedup slightly increases with the velocity grid refinement. In the case of 12 OpenMP threads, the speedup varies from 8.32 to 9.30 with the velocity grids of N v = 8 2 and N v = 24 2 , respectively. Figure 6(b) plots the speedup of OpenMP parallelization for the 3D porous model as shown in Fig. 5(a), which is a portion extracted from irregular sphere packing as shown in Fig. 4(b). Similar to the 2D case, speedup is almost linear until 4 OpenMP threads. In the case of 12 OpenMP threads, the speedup increases slightly from 7.80 to 8.76 with the velocity grid refinement from

Coarse-grained (the top-level MPI) scaling performance
In order to measure the MPI scalability, the number of OpenMP threads for each MPI process is fixed at 12, and each compute node is assigned to 2 MPI processes, running on the two processors. All MPI scalabilities are assessed against 1 compute node apart from the case of 3D strong scaling, where 4 compute nodes are used as the baseline due to memory requirement.  Weak scaling examines parallel performance by assuming that the spatial domain increases at the same rate as the number of cores, i.e. the workload per subdomain is equally fixed. Figure 7(a) shows the weak scaling efficiency of the 2D solver, in which each MPI process manages a spatial subdomain as illustrated in Fig. 2(a) with a coarse resolution of N x = 270 × 270 pixels. The weak scaling efficiency only slightly decreases with increasing number of MPI processes. Very good parallel efficiency of about 94% is observed with a total of 128 MPI processes (1536 cores). In Fig. 7(b), similar observation is found for the 3D solver, in which each MPI process handles a spatial subdomain as shown in Fig. 5(b). The weak scaling efficiency gradually declines to 86% for the total of 128 MPI processes (1536 cores) and 81% for the 1024 MPI processes (12288 cores). It is worth noting that speedup of 420 is obtained with 512 compute nodes for the grid size of 1.1 × 10 9 · 4.1 × 10 3 as examined on this weak scaling. High efficiency of the weak scaling indicates the solvers' capability for a large number of voxels of homogeneous porous media.
On the other hand, strong scaling examines the parallel performance when the computational domain remains unchanged as the number of cores increases. Figs. 8(a) and 8(b) demonstrate the strong scaling speedup of the 2D and 3D solvers on the spatial domains illustrated in Figs. 2(b) and 4(b), respectively. It can be seen that the 2D strong scaling speedup is close to ideal linear one until 64 MPI processes (768 cores), where the efficiency is around 86%, and the efficiency falls to 64% at 128 MPI processes (1536 cores). In the 3D case, strong scaling speedup keeps increasing even beyond 12288 cores, but the efficiency slumps from 81% to 67% as the number of MPI processes increases from 64 to 128 (768 to 1536 cores). The deterioration in strong scaling performances can be explained by load imbalance among the MPI processes, which is directly related to imhomogeneity of the porous models. Consider the cases of 128 MPI processes, where the spatial domains are decomposed into 16 × 8 rectangular subdomains for the 2D random cylinders as shown in Fig. 2(b) or 4 × 4 × 8 cuboid subdomains for the 3D irregular sphere packing as shown in Fig. 4(b). The porosity of each subdomain varies from 0.313 to 0.909 for the 2D     Fig. 2(b); and (b) the 3D porous model as shown in Fig. 4(b). Velocity grid of N v = 8 2 , 12 2 , 16 2 , 24 2 are considered in (a), and N v = 8 3 in (b). case ( = 0.60 on average) or from 0.090 to 0.706 for the 3D case ( = 0.38 on average), leading to strong load imbalance among the MPI processes. Improving load balance for heterogeneous porous media is certainly worth further studies, e.g. using the standard graph partitioners such as METIS [50], PT-SCOTCH [51] or developing suitable algorithms.

Scaling performance of multi-level MPI/OpenMP versus pure MPI parallelism
As mentioned in Sec. 1, pure MPI parallelization commonly employs either velocity or spatial domain decomposition. But the velocity domain decomposition is not suitable for a relatively large spatial domain. It is interesting to compare the parallel performance of multi-level MPI/OpenMP approach adopted in this study with that of pure MPI approach using spatial domain decomposition. To enable pure MPI mode from the adopted multi-level parallel solvers, we disable OpenMP recognition by adding the compiler flag -hnoomp in the Cray Programming Environment. The wall clock time of both 100 iterations and whole simulations obtained by the multi-level and pure MPI modes for the case of 3D irregular sphere packing as shown in Fig. 4(b) with Kn = 1 is recorded. The wall clock time of the multi-level mode on 4 compute nodes is used as the baseline time for calculating speedup. Figure 9(a), in which the speedup on the interval of 100 iterations is reported, shows a relatively good scaling of both parallel modes until the maximum used resource, i.e. 12288 cores. With the same resource, the multi-level mode is between 30% and 60% faster than the pure MPI mode until 768 cores and becomes slightly faster beyond that number of cores. From Fig. 9(b), when the deterioration of convergence rate is taken into account, it can be observed that even at 512 compute nodes (12288 cores) the overall speedup of both the multi-level and pure MPI modes have not been saturated. However, with the same number of CPU cores, the overall speedup of multi-level mode is roughly 1.5 to 3 times better than that of the pure MPI mode. One main reason is that the multi-level mode has less number of spatial subdomains, i.e. MPI processes, than the pure MPI mode, resulting in faster convergence rate for the iterative scheme.

Conclusions and remarks
In this work, we have developed a high-performance DVM solver for steady 2D and 3D rarefied gas flows in porous media.
While spatial domain decomposition is inevitable for practical large-scale pore-scale computations, the number of subdomains should be kept minimal to avoid deterioration of the convergence rate of the iterative scheme. Therefore, two-level MPI/OpenMP parallelization is proposed to improve parallel performance, where an additional parallel level (OpenMP) allows further speedup with the fixed number of subdomains or mitigates deterioration in convergence rate with the fixed CPU resource. The parallel scaling shows that the two-level parallel approach has significantly better performance than the commonly-used MPI approach for the same number of CPU cores. The deterioration of the convergence rate is found to become worse with increasing Knudsen number, but it can be mitigated by complex porous media.
The developed solver can enable 3D pore-scale simulations to predict the flow properties of porous media. Further investigations on optimization of grid partition is needed to improve scalability for heterogeneous porous media. This multi-level parallel approach, which is demonstrated with the BGK equation here, can be easily extended to solve other kinetic model equations.