Hybrid MPI and CUDA Parallelization for CFD Applications on Multi-GPU HPC Clusters

Graphics processing units (GPUs) offer strong floating-point capability and high memory bandwidth for data-parallel workloads and have been widely used in high-performance computing (HPC). The compute unified device architecture (CUDA) serves as a parallel computing platform and programming model that reduces the complexity of programming GPUs. Programmable GPUs are becoming popular in computational fluid dynamics (CFD) applications. In this work, we propose a hybrid parallel algorithm of the message passing interface (MPI) and CUDA for CFD applications on multi-GPU HPC clusters. The AUSM+UP upwind scheme and the three-step Runge-Kutta method are used for spatial discretization and time discretization, respectively. The turbulent solution is solved by the K − ω SST two-equation model. The CPU only manages the execution of the GPU and communication, and the GPU is responsible for data processing. Parallel execution and memory access optimizations are used to optimize the GPU-based CFD codes. We propose a nonblocking communication method that fully overlaps GPU computing, CPU_CPU communication, and CPU_GPU data transfer by creating two CUDA streams. Furthermore, a one-dimensional domain decomposition method is used to balance the workload among GPUs. Finally, we evaluate the hybrid parallel algorithm with the compressible turbulent flow over a flat plate. The performance of a single GPU implementation and the scalability of multi-GPU clusters are discussed. Performance measurements show that multi-GPU parallelization can achieve a speedup of more than 36 times with respect to CPU-based parallel computing, and the parallel algorithm has good scalability.


Introduction
The developments of computer technology and numerical schemes over the past few decades have made computational fluid dynamics (CFD) an important tool in the optimal design of aircraft and the analysis of complex flow mechanisms [1,2]. A large number of CFD applications can reduce development costs and provide technical support for research on aircraft.
The scope and complexity of flow problems in CFD simulation are constantly expanding, and the grid size required for simulation is increasing. The rapid increase in grid size raises the challenge of processing these huge data sets in engineering activities and scientific research. Traditionally, multi-CPU parallelization has been used to accelerate computation, but its low degree of parallelism and power inefficiency may limit the parallel performance of a cluster. Furthermore, the computing time depends largely on CPU upgrades, and in recent years the development of the CPU has become a bottleneck due to limitations in power consumption and heat dissipation [3,4].
In CFD applications, a large amount of computing resources is required for complex flow problems, such as turbulent flow, reactive flow, and multiphase flow. High-performance computing (HPC) platforms, such as the graphics processing unit (GPU), the many integrated core (MIC) architecture, and the field programmable gate array (FPGA), exhibit more efficient parallel data processing than the CPU [5][6][7][8]. Faster and better numerical solutions can be obtained by executing CFD codes on these heterogeneous accelerators. In this paper, we discuss GPU parallelization in CFD applications.
The GPU has strong floating-point capability and high memory bandwidth for data parallelism. The latest Volta architecture Tesla V100 GPU delivers peak single-precision and double-precision floating-point throughput of 14 and 7 TFLOP/s, respectively, far higher than the computing performance of the CPU. In 2006, NVIDIA introduced the compute unified device architecture (CUDA), which reduces the complexity of GPU programming [9]. The programmable GPU has evolved into a highly parallel, multithreaded, many-core processor. Therefore, GPU acceleration is becoming popular in general-purpose computing areas such as molecular dynamics (MD), direct simulation Monte Carlo (DSMC), CFD, artificial intelligence (AI), and deep learning (DL) [10][11][12][13].
In this work, we focus on the design and optimizations of a hybrid parallel algorithm of the message passing interface (MPI) and CUDA for CFD applications on multi-GPU HPC clusters.
The compressible Navier-Stokes equations are discretized with the cell-centered finite volume method. The AUSM+UP upwind scheme and the three-step Runge-Kutta method are used for spatial discretization and time discretization, respectively. Moreover, the turbulent solution is solved by the K − ω shear stress transport (SST) two-equation model. In some previous work, the CPU was designed to perform data processing together with the GPU [14,15]. However, for the latest Pascal or Volta architecture GPUs, the computing ability of the GPU far exceeds that of the CPU. In our design of the hybrid parallel algorithm, the CPU only manages the execution of the GPU and communication, and the GPU is responsible for data processing. Performance optimization involves three basic strategies: maximizing parallel execution to achieve maximum utilization, optimizing memory usage to achieve maximum memory throughput, and optimizing instruction usage to achieve maximum instruction throughput. In this study, parallel execution and memory access optimizations are investigated.
In a multi-GPU HPC cluster, ghost and singularity data are exchanged between GPUs. Some scholars use CUDA-aware MPI technology to accelerate data exchange [13,[16][17][18]. This technology not only makes it easier to work with a CUDA + MPI application but also allows acceleration technologies such as GPUDirect to be utilized by the MPI library. However, the hardware and software configurations used in this paper do not support this technology. Moreover, the use of the K − ω SST two-equation turbulence model increases the complexity of multi-GPU parallel programming. In our design of the hybrid parallel algorithm, we need to stage GPU buffers through host memory. Two kinds of communication approaches are considered: the blocking and nonblocking communication methods. The first approach uses the blocking functions MPI_Bsend and MPI_Recv without overlapping communication and computation. The second approach uses the nonblocking functions MPI_Isend and MPI_Irecv and fully overlaps GPU computation, CPU_CPU communication, and CPU_GPU data transfer. The nonblocking communication method can improve computational efficiency to some extent.
Multi-GPU parallelization can achieve the maximum performance by balancing the workload among GPUs based on domain decomposition [3,19]. The one-dimensional (1D), two-dimensional (2D), or three-dimensional (3D) domain decomposition method is commonly used for GPU implementation. In this study, we design a 1D domain decomposition algorithm based on the idea of dichotomy to load each GPU with approximately the same grid scale.
Though the 1D method needs to transfer more data, the 2D and 3D methods cannot achieve coalesced memory access in the global memory, which results in considerable performance loss when performing CFD applications on multi-GPU HPC clusters.
In this paper, the design and optimizations of the parallel algorithm are closely tied to the hardware configuration, numerical schemes, and computational grids to obtain the optimal parallel performance on multi-GPU HPC clusters. The main contributions of this work are summarized as follows:
(i) A hybrid parallel algorithm of MPI and CUDA for CFD applications on multi-GPU HPC clusters is proposed, and optimization methods are adopted to improve computational efficiency
(ii) Considering the CFD numerical schemes, a nonblocking communication mode is proposed to fully overlap GPU computing, CPU_CPU communication, and CPU_GPU data transfer by creating two CUDA streams
(iii) A 1D domain decomposition method based on the idea of dichotomy is used to distribute the problem among GPUs and balance the workload
(iv) The proposed algorithm is evaluated with the flat plate flow application, and its parallel performance is analyzed in detail
The remainder of this paper is organized as follows. Section 2 discusses related work on GPU-based parallelization and optimizations. Section 3 introduces the governing equations and numerical schemes. Section 4 describes the hybrid parallel algorithm and its optimizations in detail. Section 5 presents the performance evaluation results with the compressible turbulent flow over a flat plate. Section 6 concludes this work and outlines future work.

Related Work
In the field of CFD, GPU parallelization for CFD applications has achieved numerous remarkable results. Brandvik and Pullan [20,21] developed 2D and 3D GPU solvers for the compressible, inviscid Euler equations; this was the first CFD application to use CUDA for 2D and 3D solutions. Ren et al. [22] and Tran et al. [23] proposed GPU-accelerated solvers for turbulent flow based on the lattice Boltzmann method that achieve good acceleration performance. Khajeh-Saeed and Blair Perot [24] and Salvadore et al. [25] accomplished direct numerical simulation (DNS) of turbulence using GPU-accelerated supercomputers, demonstrating that scientific problems can benefit significantly from advanced hardware. Ma et al. [26] and Zhang et al. [27] performed GPU computing of compressible flow problems with a meshless method. Xu et al. [14] and Cao et al. [28] described hybrid OpenMP + MPI + CUDA parallel computing of CFD codes; their results showed that the GPU-accelerated algorithm sustainably improved efficiency and scalability. Liu et al. [29] proposed a hybrid solution method for CFD applications on CPU + GPU platforms, where a domain decomposition method based on a functional performance model was used to guarantee a balanced workload. Optimization techniques are used to enhance the performance of GPU acceleration, among which memory access and communication are the most critical. A review of optimization techniques and the specific improvement factors for each technique is given in [4].
In a multi-GPU HPC cluster, GPUs cannot communicate directly for data exchange. Meanwhile, the blocking communication mode is simple to implement but has low efficiency. Several researchers have designed nonblocking communication modes for GPU parallelization to overlap computation and communication. Mininni et al. [30] used the nonblocking communication method to overlap GPU computation and CPU_CPU communication. Thibault and Senocak [31], Jacobsen et al. [32], Castonguay et al. [33], and Ma et al. [34] implemented the nonblocking communication method with CUDA streams to fully overlap GPU computation, CPU_CPU communication, and CPU_GPU data transfer. Their results showed that the full overlap between computation and communication is the most efficient.
Domain decomposition is a commonly used method in parallel computing of CFD simulations to balance the workload in each processor. Jacobsen et al. [32] used 1D domain decomposition to decompose 3D structured meshes into a 1D layer. Wang et al. [35] studied the HPC of atmospheric general circulation models (ACGMS) in Earth science research on multi-CPU cores.
They indicated that an ACGMS model with 1D domain decomposition can only run on dozens of CPU cores. Therefore, they proposed a 2D domain decomposition parallel algorithm for this large-scale problem. Baghapour et al. [36] executed CFD codes on heterogeneous platforms with 16 Tesla C2075 GPUs, where the solver works up to 190 times faster than a single core of a Xeon E5645 processor. They pointed out that 3D domain decomposition performs best in bandwidth-bound communication but not in latency-bound communication, for which 1D domain decomposition is preferred. Given that the GPU can execute many threads simultaneously and that communication between the CPU and the GPU becomes a source of high latency with highly noncontiguous data transfer, the 1D domain decomposition method is the most suitable for balancing the workload on GPUs.

Governing Equations and Numerical Schemes

The simulation of a compressible turbulent flow is considered. All volume sources due to body forces and volumetric heating are ignored, and the integral form of the 3D Navier-Stokes equations for a compressible, viscous, heat-conducting gas can be expressed as follows [37]:

∂/∂t ∫_Ω W dΩ + ∮_{∂Ω} (F_c − F_v) dS = 0, (1)

where W is the vector of conservative variables, F_c is the vector of convective fluxes, and F_v is the vector of viscous fluxes:

W = [ρ, ρu, ρv, ρw, ρE]^T,
F_c = [ρV, ρuV + n_x p, ρvV + n_y p, ρwV + n_z p, ρHV]^T, (2)
F_v = [0, n_x τ_xx + n_y τ_xy + n_z τ_xz, n_x τ_yx + n_y τ_yy + n_z τ_yz, n_x τ_zx + n_y τ_zy + n_z τ_zz, n_x Θ_x + n_y Θ_y + n_z Θ_z]^T,

where ρ is the density, (u, v, w) are the local Cartesian velocity components, p is the static pressure, E is the total energy, H is the total enthalpy, (n_x, n_y, n_z) are the components of the unit normal vector of the cell surface, V is the velocity normal to the surface element dS, and Θ_i stands for the work of the viscous stresses and the heat conduction.

The K − ω shear stress transport (SST) turbulence model [38] merges Wilcox's K − ω model [39] with a high Reynolds number K − ε model [40]. The K − ω SST two-equation turbulence model can be written in integral form as follows:

∂/∂t ∫_Ω W_T dΩ + ∮_{∂Ω} (F_{c,T} − F_{v,T}) dS = ∫_Ω Q_T dΩ, (3)

where W_T = [ρK, ρω]^T is the vector of conservative variables, F_{c,T} and F_{v,T} represent the convective and viscous fluxes, respectively, and Q_T is the source term, with K the turbulent kinetic energy and ω the specific dissipation rate. The components of the source term and the model constants are introduced in [37].

The spatial discretization of equation (1) on structured meshes is based on the cell-centered finite volume method. The upwind AUSM+UP scheme [41], which has high resolution and computational efficiency at all speeds, is used to compute the convective fluxes. Second-order accuracy is achieved through the monotone upstream-centered schemes for conservation laws (MUSCL) [42] with the Van Albada limiter function [43]. The viscous fluxes are solved by the central scheme, and turbulence is modeled with the K − ω SST two-equation model.
The solution of equation (1) employs a separate discretization in space and in time, so that space and time integration can be treated separately. For the control volume Ω_{I,J,K}, equation (1) is written in the semi-discrete form

d W_{I,J,K} / dt = −(1/Ω_{I,J,K}) R_{I,J,K}. (5)

The three-step Runge-Kutta method [44] with third-order accuracy has good data parallelism and low storage overhead and is used for the temporal discretization of equation (5):

W^(0) = W^n,
W^(m) = W^(0) − α_m (Δt_{I,J,K}/Ω_{I,J,K}) R^(m−1), m = 1, 2, 3, (6)
W^{n+1} = W^(3),

where Δt_{I,J,K} is the time step of control volume Ω_{I,J,K}, R_{I,J,K} stands for the residual, and the stage coefficients α_m are given in [44].

Algorithm Description.
The GPU implementation uses CUDA, a general-purpose parallel computing platform and programming model for GPUs. CUDA provides a programming environment for high-level languages such as C/C++, Fortran, and Python. For NVIDIA GPUs, CUDA has wider universality than other general programming models, such as OpenCL and OpenACC [9,[45][46][47]. In this study, we choose CUDA as the heterogeneous model to design GPU-accelerated parallel CFD codes for the GTX 1070 and Tesla V100 GPUs. A complete GPU code consists of seven parts: getting the device, allocating memory, data transfer from the host to the device, kernel execution, data transfer from the device to the host, freeing memory space, and resetting the device. In the CUDA programming framework, the execution of a GPU code can be divided into host code and device code, which are executed on the CPU and the GPU, respectively. The device-side code calls kernel functions that execute on the GPU.
A kernel corresponds to a thread grid, which consists of several thread blocks. One thread block contains multiple threads, and a thread is the smallest execution unit of a kernel. Threads within a block can cooperate by sharing data through shared memory. Although the same instructions are executed on the threads, the processed data are different; this mode of execution is called single instruction multiple thread (SIMT). The GPU is rich in computing power but poor in memory capacity, whereas the CPU is the opposite. To run CFD applications on a GPU, we must fully utilize both the CPU and the GPU. Thus, the GPU is responsible for executing kernel functions, and the CPU only manages the execution of the GPU and communication. In a GPU parallel computing program, a thread is the smallest unit of kernel execution.
Threads are executed concurrently in the streaming multiprocessor (SM) as warps, and a warp contains 32 threads. Therefore, the optimal number of threads in each thread block is an integer multiple of 32. For current devices, the maximum number of threads in a thread block is 1024. The greater the number of threads in each thread block, the smaller the number of thread blocks an SM can schedule. This condition diminishes the advantage of the GPU in using multithreading to hide the latency of memory accesses and instruction execution, owing to thread synchronization. Too few threads in a thread block will result in idle threads and thereby insufficient device utilization. In this study, a thread block size of 256 is used. The number of thread blocks is determined by the scale of the workload to ensure that each thread is loaded with the computation of one grid cell.
MPI and OpenMP are two application programming interfaces widely used for running parallel codes on a multi-GPU platform. OpenMP can be utilized within a node for fine-grained parallelization using shared memory, whereas MPI works on both shared and distributed memory and is widely used for massive parallel computations. In general, MPI exhibits some performance loss compared with OpenMP but is relatively easy to run on different types of hardware and has good scalability [48,49]. In this work, MPI-based communication for shared or distributed memory is hybridized with CUDA to implement large-scale computation on multi-GPU HPC clusters. The parallel algorithm for CFD on multi-GPU clusters based on MPI and CUDA is shown in Algorithm 1. Note that "Block_size" and "Thread_size" stand for the number of thread blocks and the thread block size, respectively. At the beginning of the calculation, the MPI_Init function is called to enter the MPI environment.
The Device_Query function is called to select the GPU. Then, data are transferred from the CPU to the GPU using the cudaMemcpyHostToDevice function, and a series of kernel functions, including boundary processing, local time step computation, calculation of the gradients of the primitive variables, flux calculation, and the update of the primitive variables, are executed on the GPU. Among these kernel functions, the calculation of gradients and fluxes consumes the largest amount of time. For multi-GPU parallel computing, data exchange is required between GPUs, and the Primitive_Variables_Exchange and Grad_Primitive_Variables_Exchange functions are used to exchange the primitive variables and their gradients, respectively. After the kernel iteration ends, the cudaMemcpyDeviceToHost function is called to transfer the data from the GPU to the CPU for postprocessing. Finally, the MPI_Finalize function is called to exit the MPI environment. Data transfer is essential in multi-GPU parallel computing. In this paper, the exchange of data is done by the CPUs controlling the GPUs. The data transfer process between GPUs is shown in Figure 1. First, we call the cudaMemcpyDeviceToHost function to transfer data from the device to the relevant host through the PCI-e bus. Then, the data are transferred between CPUs with MPI. Finally, we call the cudaMemcpyHostToDevice function to transfer the data from the host to the target device through the PCI-e bus.

Algorithm Optimizations.
The performance of the GPU parallel program can be optimized using two main methods: parallel execution optimization for the highest degree of parallelism and memory access optimization for maximum memory throughput.

Parallel Execution Optimization.

CUDA programs have two execution modes: synchronous and asynchronous. In synchronous mode, control does not return to the host until the current kernel function has finished executing. In asynchronous mode, control returns to the host immediately once the kernel function is launched; therefore, the host can start new kernel functions and perform data exchange simultaneously. Streams are sequences of commands managed by the CUDA program to control device-level parallelism. Commands in one stream are executed in order, but commands from different streams can be executed in parallel. Thus, the concurrent execution of multiple kernel functions, called asynchronous concurrent execution, can be implemented via streams. For the massive parallel computing of CFD on a GPU, a series of kernel functions needs to be executed, and their concurrent execution can be implemented via streams. CUDA creates and destroys streams through the cudaStreamCreate and cudaStreamDestroy functions, and synchronization can be achieved through the cudaDeviceSynchronize function.
The implementation of the asynchronous concurrent execution algorithm of CFD on the GPU is shown in Algorithm 2. The initialization of the gradients and residuals, the boundary condition processing, and the time step calculation are concurrently executed by creating streams. In addition, the calculation of the inviscid fluxes can be executed concurrently with the calculation of the gradients of the primitive variables.

Memory Access Optimization.
The data transfer between the host and the device is realized via the PCI-e bus, whose transmission speed is considerably lower than the GPU memory bandwidth. Therefore, the application should reduce data exchange between the host and the device as much as possible. In this work, the kernel iteration is performed entirely on the GPU, and data transmission only occurs at the beginning and end of the kernel iteration. Intermediate variables can be created in device memory and released after the calculation is completed. However, for multi-GPU parallel computing, data transfer between the host and the device is inevitable due to the nature of communication between GPUs.
The GPU provides different levels of memory: global memory, texture memory, shared memory, and registers. This memory hierarchy enables the GPU to reduce accesses to the slow global memory. The global memory resides in the video memory and has a large access latency; therefore, the maximum bandwidth can only be obtained by employing coalesced memory access. Texture memory is also part of the video memory; compared with the global memory, the texture memory can use its cache to improve data access speed and obtain a high bandwidth without strictly observing the conditions of coalesced memory access.
The shared memory has a bandwidth much higher than that of the global and texture memories. Data sharing among threads in the SM can be realized by storing frequently used data in the shared memory. The register is the storage type exclusive to a thread, which stores the variables declared in the kernel to accelerate data processing. GPU computing efficiency can be improved by properly using the texture memory, the shared memory, and the registers and by reducing the number of accesses to the global memory.

Nonblocking Communication Mode.
When performing the parallel computing of CFD on multi-GPU HPC clusters, the kernel iteration process needs to exchange data on the boundary, including the primitive variables and their gradients. The primitive variables U = [ρ, u, v, w, p, K, ω, T] are chosen so that as little data as possible are exchanged. In this process, CPU_CPU communication and CPU_GPU data transfer exist, and optimizing the communication between GPUs significantly affects the performance of multi-GPU parallel systems. The traditional method is the blocking communication mode, as shown in Figure 2. In the nonblocking mode, communication can be overlapped with the computation of the inviscid fluxes because these computations do not depend on the values being transferred among GPUs. The exchange of the gradients to the host can start as soon as the data are packaged, using stream 0. Then, the data transmission between CPUs is implemented with MPI. At the same time, the Convective_Flux_GPU function is executed using stream 1. Finally, the target device receives the data from the host with stream 0. Therefore, the overlap among GPU computing, CPU_CPU communication, and CPU_GPU data transfer can be realized. In this work, the nonblocking communication mode based on multistream computing is used to optimize the communication of GPU parallel programs. The nonblocking communication mode calls the MPI_Isend and MPI_Irecv functions for data transmission and reception, respectively, and the MPI_Waitall function to await communication completion and query the completion status. The cudaMemcpyAsync function is used for asynchronous data transmission.

Domain Decomposition and Load Balancing.
In multi-GPU parallel computing, the computational grid needs to be partitioned. Considering the load balancing problem, this work uses the 1D domain decomposition method to load each GPU with approximately the same number of grid cells. The 1D domain decomposition is shown in Figure 4. This method adopts the concept of bisection, which is simple to implement and facilitates load balancing. The dichotomy algorithm is shown in Algorithm 5.
Meanwhile, coalesced memory access is easy to implement because the boundary data are aligned, which effectively improves the efficiency of boundary data communication. When the nonblocking communication mode is used, the data in the GPU are divided into three parts (top, middle, and bottom). The top and bottom parts need to be exchanged with other devices, and their transfer can occur simultaneously with the computation of the middle portion.

Test Case: Flat Plate Flow.
The supersonic flow over a flat plate is a well-known benchmark problem for the compressible turbulent flow of CFD applications [50,51]. This test case has been studied by many researchers and is widely used to verify and validate CFD codes. The free stream Mach number is 4.5, the Reynolds number based on the unit length is 6.43 × 10^6, the static temperature is 61.1 K, and the angle of attack is 0°. A no-slip boundary condition is applied at the stationary flat plate surface, which is also assumed to be adiabatic. The supersonic flat plate boundary layer problem is solved on various meshes, namely, mesh 1 (0.72 million), mesh 2 (1.44 million), mesh 3 (2.88 million), mesh 4 (5.76 million), mesh 5 (11.52 million), mesh 6 (23.04 million), mesh 7 (46.08 million), and mesh 8 (92.16 million).

Hardware Environment and Benchmarking.
In this work, two types of devices, namely, the GTX 1070 GPU and the Tesla V100 GPU, both produced by NVIDIA Corporation, are used. Table 1 shows the main performance parameters of the GPUs. For these devices, single-precision floating-point performance far exceeds double-precision performance; therefore, single-precision data are used for GPU parallelization. The performance of the Volta architecture Tesla V100 GPU is greatly improved compared with GPUs of previous architectures.
In this study, we use CUDA version 10.1 and Visual Studio 2017 for the C code and MPICH2 1.4.1 for MPI communication.

Performance Analysis.

Speedup (SP) and parallel efficiency (PE) are important parameters for measuring the performance of a hybrid parallel algorithm. Speedup is defined as follows [9,52]:

SP = t_CPU / t_GPU,

where t_CPU is the runtime of one iteration step on one CPU with eight cores and t_GPU is the runtime of one iteration step on the GPUs. Let W_1 and W_2 be the problem sizes, N_1 and N_2 the numbers of GPUs, and T_1 and T_2 the computation times. If W_2 = W_1, the problem size remains constant when the number of GPUs is increased, and the parallel efficiency of strong scalability is PE = N_1 T_1 / (N_2 T_2). If W_2/N_2 = W_1/N_1, the problem size grows proportionally with the number of GPUs, so the problem size per GPU remains constant, and the parallel efficiency of weak scalability is PE = T_1 / T_2.
Here, subscript 1 denotes a single GPU and subscript 2 stands for the multi-GPU HPC cluster. The runtime of one iteration step is obtained by averaging the execution time over ten thousand time steps.

Single GPU Implementation.
For a single GPU implementation, the time required for the calculation of one iteration step is provided in Table 2 (time is given in milliseconds). Figure 5 shows that the speedup of a single GPU increases with grid size. For the Tesla V100 GPU, the speedup reaches 135.39 for mesh 1 and 182.11 for mesh 8. Thus, the Tesla V100 GPU has a considerably greater speedup than the GTX 1070 GPU. GPU parallelization can greatly improve computational efficiency compared with CPU-based parallel computing. For meshes 7 and 8, the GTX 1070 GPU cannot afford such a large amount of calculation because of the limitation of device memory; for our GPU codes, 1 GB of device memory can load approximately 3 million grid cells. As the grid size increases, the speedup of the GPU code increases gradually, because the proportion of time spent in kernel execution grows relative to that of data arrangement and communication. Meanwhile, the growth rate of the speedup gradually decreases because of the limited number of SMs and CUDA cores.

Scalability.
In this section, the blocking communication mode is used to study the scalability of GPU codes. Here, the performance of GTX 1070 and Tesla V100 multi-GPU HPC clusters is discussed.
Strong scaling tests are performed for meshes 1 to 8 on multi-GPU HPC clusters. Tables 3 and 4 provide the time required for the calculation of one iteration step for the multi-GPU implementation (time is given in milliseconds). Figures 6 and 7 show the strong speedup and parallel efficiency, respectively. In Figure 6, the strong speedup is shown for different grid sizes. Evidently, a large grid size can reach a high speedup. For the GTX 1070 and Tesla V100 multi-GPU clusters, the speedups of four GPUs reach 172.59 and 576.49, respectively. These values are far greater than the speedups achieved by a single GPU. Thus, a high degree of strong scaling performance is maintained. Multi-GPU parallelization can considerably improve computational efficiency as the grid size increases. However, multi-GPU parallel computing does not show obvious advantages when the grid size is small. This can be explained by the fact that the relative weight of data exchange is inversely proportional to the grid size; the GPU is specialized for compute-intensive, highly parallel computation. In Figure 7, the strong parallel efficiency is shown for different grid sizes. This result is consistent with the trend of the speedup; that is, the parallel efficiency increases with grid size, and the strong parallel efficiency of the GTX 1070 multi-GPU clusters is slightly better than that of the Tesla V100 GPUs. For mesh 8, the strong parallel efficiency of four Tesla V100 GPUs is close to 80%. In addition, a larger number of GPUs results in a lower parallel efficiency because of the increase in the amount of data transfer. Figure 8 shows the amount of memory communication for parallel computing with four GPUs. As the grid size increases, the amount of memory communication increases proportionally. The weak scaling tests for parallel efficiency are shown in Figure 9, where the grid size loaded on each GPU remains constant.
As expected, the parallel efficiency decreases as the number of GPUs increases because the amount of data exchange grows with the number of GPUs. For the grid size of mesh 6 on each GTX 1070 GPU and Tesla V100 GPU, the weak parallel efficiency of four GPUs remains at a high level. (Figure 5: the speedup of a single GPU for different numbers of grid cells.)

Performance of Nonblocking Mode between GPUs.
In this section, the nonblocking communication mode is used to study the performance of the GTX 1070 and Tesla V100 multi-GPU HPC clusters. As an example, mesh 6 is investigated on one node with four GPUs to compare the performance of the two communication modes. Figures 10 and 11 show the strong scalability performance of the nonblocking communication mode compared with the blocking communication mode. Figure 12 shows the weak parallel efficiency of the two methods. The results show that the nonblocking communication mode can hide the communication time behind the computing time by overlapping communication and computation. For four GPUs, the strong speedups of the GTX 1070 and Tesla V100 multi-GPUs increase by 15.15% and 19.63%, respectively. In addition, the weak parallel efficiency remains above 87% with the nonblocking communication mode. Moreover, the GPU-based parallelization with Tesla V100 GPUs exhibits a better performance improvement than the GTX 1070 multi-GPU HPC clusters, because the relative weight of the communication time of the Tesla V100 GPUs is higher.

Conclusions
In this work, a hybrid parallel algorithm of MPI and CUDA for CFD applications on multi-GPU HPC clusters has been proposed. Optimization methods have been adopted to achieve the highest degree of parallelism and maximum memory throughput. A thread block size of 256 is used, and the number of thread blocks is determined by the scale of the workload so that each thread is loaded with the computation of one grid cell. For a single GPU implementation, two types of devices have been discussed. We obtain an acceleration ratio of more than 36 times, which indicates that GPU parallelization can greatly improve computational efficiency compared with CPU-based parallel computing. Meanwhile, a large grid size can reach a high speedup due to the increase in the proportion of kernel execution relative to data arrangement and communication. The speedups of four GPUs reach 172.59 and 576.49 for the GTX 1070 and Tesla V100 multi-GPU HPC clusters, respectively. The strong and weak parallel efficiency are maintained at a high level when the grid size is large; thus, the parallel algorithm has good strong and weak scalability. A nonblocking communication mode has been proposed to fully overlap GPU computing, CPU_CPU communication, and CPU_GPU data transfer. For four GPUs, the strong speedups of the GTX 1070 and Tesla V100 multi-GPUs increase by 15.15% and 19.63%, respectively. In addition, the weak parallel efficiency remains above 87% with the nonblocking communication mode.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.