Performance Optimization of 3D Lattice Boltzmann Flow Solver on a GPU

Lattice Boltzmann Method (LBM) is a powerful numerical simulation method of the fluid flow. With its data parallel nature, it is a promising candidate for a parallel implementation on a GPU. The LBM, however, is heavily data intensive and memory bound. In particular, moving the data to the adjacent cells in the streaming computation phase incurs a lot of uncoalesced accesses on theGPU which affects the overall performance. Furthermore, the main computation kernels of the LBM use a large number of registers per thread which limits the thread parallelism available at the run time due to the fixed number of registers on the GPU. In this paper, we develop high performance parallelization of the LBM on a GPU by minimizing the overheads associated with the uncoalesced memory accesses while improving the cache locality using the tiling optimization with the data layout change. Furthermore, we aggressively reduce the register uses for the LBM kernels in order to increase the run-time thread parallelism. Experimental results on the Nvidia Tesla K20 GPU show that our approach delivers impressive throughput performance: 1210.63Million Lattice Updates Per Second (MLUPS).


Introduction
Lattice Boltzmann Method (LBM) is a powerful numerical simulation method of the fluid flow, originating from the lattice gas automata methods [1]. LBM models the fluid flow consisting of particles moving with random motions. Such particles exchange the momentum and the energy through the streaming and the collision processes over the discrete lattice grid in the discrete time steps. At each time step, the particles move into adjacent cells which cause collisions with the existing particles in the cells. The intrinsic data parallel nature of LBM makes this class of applications a promising candidate for parallel implementation on various High Performance Computing (HPC) architectures including manycore accelerators such as the Graphic Processing Unit (GPU) [2], Intel Xeon Phi [3], and the IBM Cell BE [4].
Recently, the GPU is becoming increasingly popular for the HPC server market and in the Top 500 list, in particular. The architecture of the GPU has gone through a number of innovative design changes in the last decade. It is integrated with a large number of cores and multiple threads per core, levels of the cache hierarchies, and the large amount (>5 GB) of the on-board memory. The peak floatingpoint throughput performance (flops) of the latest GPU has drastically increased to surpass 1 Tflops for the double precision arithmetic [5]. In addition to the architectural innovations, user friendly programming environments have been recently developed such as CUDA [5] from Nvidia, OpenCL [6] from Khronos Group, and OpenACC [7] from a subgroup of OpenMP Architecture Review Board (ARB). The advanced GPU architecture and the flexible programming environments have made possible innovative performance improvements in many application areas.
In this paper, we develop high performance parallelization of the LBM on a GPU. The LBM is heavily data intensive and memory bound. In particular, moving the data to the adjacent cells in the streaming phase of the LBM incurs a lot of uncoalesced accesses on the GPU and affects the overall performance. Previous research focused on utilizing the shared memory of the GPU to deal with the problem [1,8,9]. In this paper, we use the tiling algorithm along with the data layout change in order to minimize the overheads 2 Scientific Programming of the uncoalesced accesses and improve the cache locality as well. The computation kernels of the LBM involve a large number of floating-point variables, thus using a large number of registers per thread. This limits the available thread parallelism generated at the run time as the total number of the registers on the GPU is fixed. We developed techniques to aggressively reduce the register uses for the kernels in order to increase the available thread parallelism and the occupancy on the GPU. Furthermore, we developed techniques to remove the branch divergence. Our parallel implementation using CUDA shows impressive performance results. It delivers up to 1210.63 Million Lattice Update Per Second (MLUPS) throughput performance and 136-time speedup on the Nvidia Tesla K20 GPU compared with a serial implementation.
The rest of the paper is organized as follows: Section 2 introduces the LBM algorithm. Section 3 describes the architecture of the latest GPU and its programming model. Section 4 explains our techniques for minimizing the uncoalesced accesses, improving the cache locality and the thread parallelism along with the register usage reduction and the branch divergence removal. Section 5 shows the experimental results on the Nvidia Tesla K20 GPU. Section 6 explains the previous research on paralleling the LBM. Section 7 wraps up the paper with conclusions.

Lattice Boltzmann Method
Lattice Boltzmann Method (LBM) is a powerful numerical simulation of the fluid flow. It is derived as a special case of the lattice gas cellular automata (LGCA) to simulate the fluid motion. The fundamental idea is that the fluids can be regarded as consisting of a large number of small particles moving with random motions. These particles exchange the momentum and the energy through the particle streaming and the particle collision. The physical space of the LBM is discretized into a set of uniformly spaced nodes (lattice). At each node, a discrete set of velocities is defined for the propagation of the fluid molecules. The velocities are referred to as microscopic velocities which are denoted by → . The LBM model which has dimensions and velocity vectors at each lattice point is represented as DnQq. Figure 1 shows a typical lattice node of the most common model in 2D (D2Q9) which has two-dimensional 9 velocity vectors. In this paper, however, we consider the D3Q19 model which has three-dimensional 19 velocity vectors. Figure 2  (1) (i) Each particle on the lattice is associated with a discrete distribution function, called as particle distribution function (pdf), ( ⃗ , ), = 0, . . . , 18. The LB equation is discretized as follows: where is the lattice speed and is the relaxation parameter.
(ii) The macroscopic quantities are the density and the velocity ⃗ ( ⃗ , ). They are defined as (iii) The equilibrium function (eq) ( ( ⃗ , ), ⃗ ( ⃗ , )) is defined as (eq) ( ( ⃗ , ) , ⃗ ( ⃗ , )) = + ( ⃗ ( ⃗ , )) where and the weighting factor has the following values: Algorithm 1 summarizes the algorithm of the LBM. The LBM algorithm executes a loop over a number of time steps. At each iteration, two computation steps are applied: (i) Streaming (or propagation) phase: the particles move according to the pdf into the adjacent cells.
Scientific Programming 3 (1) Step 1: Initialize macroscopic quantities, density , velocity ⃗ , the distribution function , and the equilibrium function (eq) (2) Step 2: Streaming phase: move → * in the direction of → (3) Step 3: Calculate density and velocity ⃗ from * using Equations (3) and (4) (4) Step 4: Calculate the equilibrium function (eq) using Equation (5) Step 5: Collision phase: calculate the updated distribution function   Depending on whether the streaming phase precedes or follows the collision phase, we have the pull or the push scheme in the update process [10]. The pull scheme ( Figure 3) pulls the post-collision values from the previous time step from lattice A and then performs the collision on these to produce the new pdfs which are stored in lattice B. In the push scheme ( Figure 4); on the other hand, the pdfs of one node (square with black arrows) are read from lattice A; then collision step performs first. The post-collision values are propagated to the neighbor nodes in the streaming step to lattice B (red arrows).     (4) double * temp grid; (5) for (i = 0; i < timeSteps; i++) collide(source grid, temp grid, grid size); (8) stream(temp grid, dest grid, grid size); (9) swap grid(source grid, dest grid); (10) } (11) } Algorithm 2: Basic skeleton of LBM algorithm. and are swapped to interchange values between the two grids.

Latest GPU Architecture
Recently, the many-core accelerator chips are becoming increasingly popular for the HPC applications. The GPU chips from Nvidia and AMD are representative ones along with the Intel Xeon Phi. The latest GPU architecture is characterized by a large number of uniform fine-grain programmable cores or thread processors which have replaced separate processing units for shader, vertex, and pixel in the earlier GPUs. Also, the clock rate of the latest GPU has ramped up significantly. These have drastically improved the floating-point performance of the GPUs, far exceeding that of the latest CPUs. The fine-grain cores (or thread processors) are distributed in multiple streaming multiprocessors (SMX) (or thread blocks) (see Figure 5). Software threads are divided into a number of thread groups (called WARPs) each of which consists of 32 threads. Threads in the same WARP are scheduled and executed together on the thread processors in the same SMX in the SIMD (Single Instruction Multiple Data) mode. Each thread executes the same instruction directed by the common Instruction Unit on its own data streaming from the device memory to the on-chip cache memories and registers. When a running WARP encounters a cache miss, for example, the context is switched to a new WARP while the cache miss is serviced for the next few hundred cycles, the GPU executes in a multithreaded fashion as well.
The GPU is built around a sophisticated memory hierarchy as shown in Figure 5. There are registers and local memories belonging to each thread processor or core. The local memory is an area in the off-chip device memory. Shared memory, level-1 (L-1) cache, and read-only data cache are integrated in a thread block of the GPU. The shared memory is a fast (as fast as registers) programmer-managed memory. Level-2 (L-2) cache is integrated on-chip and used among all the thread blocks. Global memory is an area in the off-chip device memory accessed from all the thread blocks, through which the GPU can communicate with the host CPU. Data in the global memory get cached directly in the shared memory by the programmer or they can be cached through the L-2 and L-1 caches automatically as they get accessed. There are constant memory and texture memory regions in the device memory also. Data in these regions is read-only. They can be cached in the L-2 cache and the read-only data cache. On Nvidia Tesla K20, the read-only data from the global memory can be loaded through the same cache used by the texture pipeline via a standard pointer without the need to bind to a texture beforehand. This read-only cache is used automatically by the compiler as long as certain conditions are met. restrict qualifier should be used when a variable is declared to help the compiler detect the conditions [5].
In order to efficiently utilize the latest advanced GPU architectures, programming environments such as CUDA [5] from Nvidia, OpenCL [6] from Khronos Group, and OpenACC [7] from a subgroup of OpenMP Architecture Review Board (ARB) have been developed. Using these environments, users can have a more direct control over the Scientific Programming collision kernel<<<GRID, BLOCK>>> (source grid, temp grid, xdim, ydim, zdim, cell size, grid size); (5) cudaThreadSynchronize(); (6) streaming kernel<<<GRID, BLOCK>>>(temp grid, dest grid, xdim, ydim, zdim, cell size, grid size); (7) cudaThreadSynchronize(); (8) swap grid(source grid, dest grid); (9) } Algorithm 3: Two separate CUDA kernels for different phases of LBM. large number of GPU cores and its sophisticated memory hierarchy. The flexible architecture and the programming environments have led to a number of innovative performance improvements in many application areas and many more are still to come.

Optimizing Cache Locality and Thread Parallelism
In this section, we first introduce some preliminary steps we employed in our parallelization and optimization of the LBM algorithm in Section 4.1. They are mostly borrowed from the previous research such as combining the collision phase and the streaming phase, a GPU architecture friendly data organization scheme (SoA scheme), an efficient data placement in the GPU memory hierarchy, and using the pull scheme for avoiding and minimizing the uncoalesced memory accesses. Then, we describe our key optimization techniques for improving the cache locality and the thread parallelism such as the tiling with the data layout change and the aggressive reduction of the register uses per thread in Sections 4.2 and 4.3. Optimization techniques for removing the branch divergence are presented in Section 4.4. Our key optimization techniques presented in this section have been improved from our earlier work in [13].

Combination of Collision Phase and Streaming Phase.
As shown in the description of the LBM algorithm in Algorithm 2, the LBM consists of the two main computing kernels: for the collision phase and for the streaming phase. In the , threads load the particle distribution function from the source grid ( ) and then calculate the velocity, the density, and the collision product. The post-collision values are stored to the temporary grid ( ). In , the post-collision values from are loaded and updated to appropriate neighbor grid cells in the destination grid ( ). At the end of each time step, and are swapped for the next time step. This implementation (see (1) dest grid, xdim, ydim, zdim, cell size, grid size); (5) cudaThreadSyncronize(); (6) swap grid(source grid, dest grid); (7) } Algorithm 4: Single CUDA kernel after combining two phases of LBM.
Algorithm 3) needs extra loads/stores from/to which is stored in the global memory and affects the global memory bandwidth [2]. In addition, some extra cost is incurred with the global synchronization between the two kernels ( ℎ ℎ ) which affects the overall performance. In order to reduce these overheads, we can combine and a into one kernel , where the collision product is streamed to the neighbor grid cells directly after calculation (see Algorithm 4). Compared with Algorithm 3, storing to and loading from are removed and the global synchronization cost is reduced.

Data Organization.
In order to represent the 3dimensional grid of cells, we use the 1-dimensional array which has × × × elements, where , , are width, height, and depth of the grid and is the number of directions of each cell [2,14]. For example, if the model is 3 19 with = 16, = 16, and = 16, we have the 1D array of 16 × 16 × 16 × 19 = 77824 elements. We use 2 separate arrays for storing the source grid and the destination grid.
There are two common data organization schemes for storing the arrays: (i) Array of structures (AoS): grid cells are arranged in 1D array. 19 distributions of each cell occupy 19 consecutive elements of the 1D array ( Figure 6).

Data Placement.
In order to efficiently utilize the memory hierarchy of the GPU, the placement of the major data structures of the LBM is crucial [5]. In our implementation, we use the following arrays: , r , , , and . We use the following data placements for these arrays: and are used to store the input grid and the result grid. They are swapped at the end of each time step by exchanging their pointers instead of explicit storing to and loading from the memory through a temporary array. Since and are very large size arrays with a lot of data stores and loads, we place them in the global memory.
(ii) In array, the types of the grid cells are stored. We use the Lid Driven Cavity (LDC) as the test case in this paper. The LDC consists of a cube filled with the fluid. One side of the cube serves as the acceleration plane by sliding constantly. The acceleration is implemented by assigning the cells in the acceleration area at a constant velocity. This change requires three types of cells: regular fluid, acceleration cells, or boundary. Thus, we also need 1D array, , in order to store the types of each cell in the grid. The size of this array is × × elements. For example, if the model is D3Q19 with = 16, = 16, and = 16, the size of the array is 16 × 16 × 16 = 4,096 elements. Thus, is a large array also and contains constant values. Thus, they are not modified throughout the execution of the program. For these reasons, the texture memory is the right place for this array.
(iii) and are used to store the base indices for accesses to 19 directions of the current cell and the neighbor cells, respectively. There are 19 indices corresponding to 19 directions of D3Q19 model. These indices are calculated at the start of the program and used till the end of the program execution. Thus, we use the constant memory to store them. As standing at any cell, we use the following formula to define the position in the 1D array of any direction out of 19 cell directions:

Using Pull Scheme to Reduce Costs for Uncoalesced
Accesses. Coalescing the global memory accesses can significantly reduce the memory overheads on the GPU. Multiple global memory loads whose addresses fall within the 128byte range are combined into one request and sent to the memory. This saves the memory bandwidth a lot and improves the performance. In order to reduce the costs for the uncoalesced accesses, we use the pull scheme [12]. Choosing the pull scheme comes from the observation that the cost of the uncoalesced reading is smaller than the cost of the uncoalesced writing.
Algorithm 5 shows the LBM algorithm using the push scheme. At the first step, the pdfs are copied directly from the current cell. These pdfs are used to calculate the pdfs at the new time step (collision phase). The new pdfs are then streamed to the adjacent cells (streaming phase). At  If we use the pull scheme, on the other hand, the order of the collision phase and the streaming phase in the LBM kernel is reversed (see Algorithm 6). At the first step of the pull scheme, the pdfs from adjacent cells are gathered to the current cell (streaming phase) (Lines 3-5). Next, these pdfs are used to calculate the pdfs at the new time step and these new pdfs are then stored to the current cell directly (collision phase). Thus, in the pull scheme, the uncoalesced accesses occur when the data is read from the device memory whereas they occur when the data is written in the push scheme. As a result, the cost of the uncoalesced accesses is smaller with the pull scheme.

Tiling Optimization with Data Layout
Change. In the D3Q19 model of the LBM, as computations for the streaming and the collision phases are conducted for a certain cell, 19 distribution values which belong to 19 surrounding cells are accessed. Figure 8 shows the data accesses to the 19 cells when  Here, the accesses of the same group touch the consecutive memory locations and accesses of the different groups are separated apart in the memory which lead to the uncoalesced accesses also. Accesses to the data elements in the different planes ( 0 , 1 , and 2 ) are further separated apart and also lead to the uncoalesced accesses when is sufficiently large. As the computations proceed, three rows in thedimension of 0 , 1 , 2 planes will be accessed sequentially for = 0 ∼ − 1, = 0, 1, 2, followed by = 0 ∼ − 1, = 1, 2, 3, . . ., = 0 ∼ − 1, = − 3, − 2, − 1. When the complete 0 , 1 , 2 planes are swept, then similar data accesses will continue for 1 , 2 , and 3 planes, and so on. Therefore, there are a lot of data reuses in -, -, anddimensions. As explained in Section 4.1.2, the 3D lattice grid is stored in the 1D array. The 19 cells for the computations belonging to the same plane are stored ±1 or ± + ±1 cells away. The cells in different planes are stored ± × + ± + ±1 cells away. The data reuse distance along thedimension is short: +1 or +2 loop iterations apart. The data reuse distance along the -and -dimensions is ± + ±1 or ± × + ± + ±1 iterations apart. If we can make the data reuse occur faster by reducing the reuse distances, for example, using the tiling optimization, it can greatly improve the cache hit ratio. Furthermore, it can reduce the overheads with the uncoalesced accesses because lots of global memory accesses can be removed by the cache hits. Therefore, we tile the 3D lattice grid into smaller 3D blocks. We also change the data layout in accordance with the data access patterns of the tiled code in order to store the data elements in different groups closer in the memory. Thus we can remove a lot of uncoalesced memory accesses, because they can be stored within 128-byte boundary. In Sections 4.2.1 and 4.2.2, we describe our tiling and data layout change optimizations.

Tiling. Let us assume the following:
(i) , , and are sizes of the grid in -, -, anddimension.
(iii) -plane is a subplane which is composed of ( × ) cells.
We tile the grid into small 3D blocks with the tile sizes of , , and (yellow block in Figure 9 We let each CUDA thread block process one 3D tiled block. Thus -planes need to be loaded for each thread block. In each -plane, each thread of the thread block executes the computations for one grid cell. Thus each thread deals with a column containing cells (the red column in Figure 9(b)). If = 1, each thread processes cells and if = , each thread processes only one cell. The tile size can be adjusted by changing the constants , , and . These constants need to be selected carefully to optimize the performance. Using the tiling, the number of created threads is reduced by -times.

Data Layout Change.
In order to further improve benefits of the tiling and reduce the overheads associated with the uncoalesced accesses, we propose to change the data layout. Figure 10 shows one -plane of the grid with and without the layout change. With the original layout ( Figure 10(a)), the data is stored in the row major fashion. Thus the entire first row is stored, followed by the second    row, and so on. In the proposed new layout, the cells in the tiled first row in Block 0 are stored first. Then the second tiled row of Block 0 is stored instead of the first row of Block 1 (Figure 10(b)). With the layout change, the data cells accessed in the consecutive iterations of the tiled code are placed sequentially. This places the data elements of the different groups closer. Thus, it increases the possibility for these memory accesses to the different groups coalesced if the tiling factor and the memory layout factor are adjusted appropriately. This can further improve the performance beyond the tiling. The data layout can be transformed using the following formula: where id and id are cell indexes in -and -dimension on the plane of grid and is the value in the range of 0 to −1.
In our implementation, we use the changed input data layout stored offline before the program starts. (The original input is changed to the new layout and stored to the input file.) Then, the input file is used while conducting the experiments.

Reduction of Register Uses per
Thread. The D3Q19 model is more precise than the models with smaller distributions such as D2Q9 or D3Q13, thus using more variables. This leads to more register uses for the main computation kernels. In GPU, the register use of the threads is one of the factors limiting the number of active WARPs on a streaming multiprocessor (SMX). Higher register uses can lead to the lower parallelism and occupancy (see Figure 11 for an example) which results in the overall performance degradation. The Nvidia compiler provides a flag to limit the register uses to a certain limit such as − or launch bounds () qualifier [5]. The − switch sets a maximum on the number of registers used for each thread. These can help increase the occupancy by reducing the register uses per thread. However, our experiments show that the overall performance goes down, because they lead to a lot of register spills/refills to/from the local memory. The increased  memory traffic to/from the local memory and the increased instruction count for accessing the local memory hurt the performance.
In order to reduce the register uses per thread while avoiding the register spill/refill to/from the local memory, we used the following techniques: In order to reduce the register uses, we attempt to reuse variables whose life-time ended earlier in the former code. This may lower the instruction-level parallelism of the kernel. However, it helps increase the thread-level parallelism as more threads can be active at the same time with the reduced register uses per thread.
(vii) In the original LBM code, there are some complicated floating-point (FP) intensive computations used in a number of nearby statements. We aggressively extract these computations as the common subexpressions. It frees the registers involved in the common subexpressions, thus reducing the register uses. It also reduces the number of dynamic instruction counts.
Applying the above techniques in our implementation, the number of registers in each kernel is greatly reduced from 70 registers to 40 registers. It leads to the higher occupancy for the SMXs and the significant performance improvements.

Removing Branch Divergence.
Flow control instructions on the GPU cause the threads of the same WARP to diverge. Thus, the resulting different execution paths get serialized. This can significantly affect the performance of the application program on the GPU. Thus, the branch divergence should be avoided as much as possible. In the LBM code, there are two main problems which can cause the branch divergence:   In order to avoid using -while streaming at the boundary position, a "ghost layer" is attached in -and -dimension. If , , and are the width, height, and depth of the original grid, = , = + 1, and = + 1 are the new width, height, and depth of the grid with the ghost layer ( Figure 12). With the ghost layer, we can regard the computations at the boundary position as the normal ones without worrying about running out of the index bound.
As explained in Section 4.1.2, cells of the grid belong to three types such as the regular fluid, the acceleration cells, or the boundary. The LBM kernel contains conditions to define actions for each type of the cell. The boundary cell type can be covered in the above-mentioned way using the ghost layer. This leads to the existence of the other two different conditions in the same half WARP of the GPU. Thus, in order to remove -we combine conditions into computational statements. Using this technique, thein Algorithm 7 is rewritten as in Algorithm 8.

Experimental Results
In this section, we first describe the experimental setup. Then we show the performance results with analyses.

Experimental Setup.
We implemented the LBM in the following five ways: (i) Serial implementation using single CPU core (serial), using the source code from the SPEC CPU 2006 470.lbm [15] to make sure it is reasonably optimized (ii) Parallel implementation on a GPU using the AoS data scheme (AoS) (iii) Parallel implementation using the SoA data scheme and the push scheme (SoA Push Only) (iv) Parallel implementation using the SoA data scheme and the pull scheme (SoA Pull Only) (v) SoA using pull scheme with our various optimizations including the tiling with the data layout change (SoA Pull * ) We summarize our implementations in Table 2. We used the D3Q19 model for the LBM algorithm. Domain grid sizes are scaled in the range of 64 3 , 128 3 , 192 3 , and 256 3 . The numbers of time steps are 1000, 5000, and 10000.
In order to measure the performance of the LBM, the Million Lattice Updates Per Second (MLUPS) unit is used which is calculated as follows: where , , and are domain sizes in the -, -, anddimension, ts is the number of time steps used, and is the run time of the simulation.
Our experiments were conducted on a system incorporating the Intel multicore processor (6-core 2.0 Ghz Intel Xeon E5-2650) with 20 MB level-3 cache and Nvidia Tesla K20 GPU based on the Kepler architecture with 5 GB device memory. The OS is CentOS 5.5. In order to validate the effectiveness of our approach over the previous approaches, we have also conducted further experiments on another GPU, Nvidia GTX285 GPU.

Results Using Previous
Approaches. The average performances of the serial and the AoS are shown in Figure 13.  observed shows that the total transactions (loads and stores) of the pull and push schemes are quite equivalent. However, the number of store transactions of the pull scheme is 56.2% smaller than the push scheme. This leads to the performance improvement of the pull scheme compared with the push scheme.

Results Using Our Optimization Techniques.
In this subsection, we show the performance improvements of our optimization techniques compared with the previous approach based on the SoA Pull implementation: (i) Figure 16 compares the average performance of the SoA with and without removing the branch divergences explained in Section 4.4 in the kernel code.
(ii) Reducing the register uses described in Section 4.3 improves the performance by 12.07%, 12.44%, 11.98%, and 12.58% as Figure 17 shows. (iv) Figure 19 shows the performance comparison of the SoA Pull Full and SoA Pull Full Tiling. The SoA Pull Full Tiling performance is better than the SoA Pull Full from 11.78% to 13.6%. The domain size 128 3 gives the best performance improvement of 13.6%, while the domain size 256 3 gives the lowest improvement of 11.78%. The experimental results show that the tiling size for the best performance is = 32, = 16, and = ÷ 4. (v) Figure 20  (vii) Table 4 compares the performance of our work with the previous work conducted by Mawson and Revell [12]. Both implementations were conducted on the same K20 GPU. Our approach performs better than [12] from 14% to 19%. Our approach incorporates   more optimization techniques such as the tiling optimization with the data layout change, the branch divergence removal, among others compared with [12]. (viii) In order to validate the effectiveness of our approach, we conducted more experiments on the other GPU, Nvidia GTX285. Table 5 and Figure 21 show the average performance of our implementations with domain sizes 64 3 , 128 3 , and 160 3 . (The grid sizes larger than 160 3 cannot be accommodated in the device memory of the GTX 285.) As shown, our optimization technique, SoA Pull Full Tiling, is better than the previous SoA Pull Only up to 22.85%. Also we obtained 46-time speedup compared with the serial implementation. The level of the performance improvement and the speedup are, however, lower on the GTX 285 compared with the K20.

Previous Research
Previous parallelization approaches for the LBM algorithm focused on two main issues: how to efficiently organize the data and how to avoid the misalignment in the streaming (propagation) phase of the LBM. In the data organization, the AoS and the SoA are two most commonly used schemes. While AoS scheme is suitable for the CPU architecture, SoA is a better scheme for the GPU architecture when the global memory access coalition technique is incorporated. Thus, most implementations of the LBM on the GPU use the SoA as the main data organization. In order to avoid the misalignment in the streaming phase of the LBM, there are two main approaches. The first proposed approach uses the shared memory. Tölke in [9] used the approach and implemented the D2Q9 model. Habich et al. [8] followed the same approach for the D3Q19 model. Bailey et al. in [16] also used the shared memory to achieve 100% coalescence in the propagation phase for the D3Q13 model. In the second approach, the pull scheme was used instead of the push scheme without using the shared memory.
As observed, the main aim of using the shared memory is to avoid the misaligned accesses caused by the distribution values moving to the east and west directions [8,9,17]. However, the shared memory implementation needs extra synchronizations and intermediate registers. This lowers the achieved bandwidth. In addition, using the shared memory limits the maximum number of threads per thread block because of the limited size of the shared memory [17] which reduces the number of active WARPs (occupancy) of the kernels, thereby hurting the performance. Using the pull scheme, instead, there is no extra synchronization cost incurred and no intermediate registers are needed. In addition, the better utilization of the registers in the pull scheme leads to generating a larger number of threads as the total number of registers is fixed. This leads to better utilization of the GPU's multithreading capability and higher performance. Latest results in [12] confirm the higher performance of the pull scheme compared with using the shared memory.
Besides the above approaches, in [12], the new feature of the Tesla K20 GPU, shuffle instruction, was applied to avoid the misalignment in the streaming phase. However, the obtained results were worse. In [18], Obrecht et al. focused on choosing careful data transfer schemes in the global memory instead of using the shared memory in order to solve the misaligned memory access problem.
There were some approaches to maximize the GPU multiprocessor occupancy by reducing the register uses per thread. Bailey et al. in [16] showed 20% improvement in maximum performance compared with the D3Q19 model in [8]. They set the number of registers used by the kernel below a certain limit using the Nvidia compiler flag. However, this approach may spill the register data to the local memory. Habich et al. [8] suggested a method to reduce the number of registers by using the base index, which forces the compiler to reuse the same register again.
A few different implementations of the LBM were attempted. Astorino et al. [19] built a GPU implementation framework for the LBM valid for the two-and threedimensional problems. The framework is organized in a modular fashion and allows for easy modification. They used the SoA scheme and the semidirect approach as the addressing scheme. They also adopted the swapping technique to save the memory required for the LBM implementation. Rinaldi et al. [17] suggested an approach based on the single-step algorithm with a reversed collision-propagation scheme. They used the shared memory as the main computational memory instead of the global memory. In our implementation, we adopted these approaches for the SoA Pull Only implementation shown in Section 5.

Conclusion
In this paper, we developed high performance parallelization of the LBM algorithm with the D3Q19 model on the GPU. In order to improve the cache locality and minimize the overheads associated with the uncoalesced accesses in moving the data to the adjacent cells in the streaming phase of the LBM, we used the tiling optimization with the data layout change. For reducing the high register pressure for the LBM kernels and improving the available thread parallelism generated at the run time, we developed techniques for aggressively reducing the register uses for the kernels. We also developed optimization techniques for removing the branch divergence. Other already-known techniques were also adopted in our parallel implementation such as combining the streaming phase and the collision phase into one phase to reduce the memory overhead, a GPU friendly data organization scheme so-called the SoA scheme, efficient data placement of the major data structures in the GPU memory hierarchy, and adopting a data update scheme (pull scheme) to further reduce the overheads of the uncoalesced accesses. Experimental results on the 6-core 2.2 Ghz Intel Xeon processor and the Nvidia Tesla K20 GPU using CUDA show that our approach leads to impressive performance results. It delivers up to 1210.63 MLUPS throughput performance and achieves up to 136-time speedup compared with a serial implementation running on single CPU core.