An implementation of the direct-forcing immersed boundary method using GPU power

A graphics processing unit (GPU) is utilized to apply the direct-forcing immersed boundary method. The GPU code is generated with the help of the Compute Unified Device Architecture (CUDA). The first and second spatial derivatives of the incompressible Navier-Stokes equations are discretized by sixth-order central compact finite-difference schemes. Two flow fields are simulated. The first test case is the flow around a square cylinder, for which the simulation yields good estimates of the wake-region mechanics and the vortex shedding. The second test case is the flow around a circular cylinder; it is included to expose, by contrast, the effects of sharp corners on the force coefficients. The estimation of the force coefficients presented no difficulties in the circular-cylinder case. Additionally, the computation times for the solution of the Poisson equation are compared with values for other CPUs and GPUs from the literature. Speedups of approximately 3× and 20× are achieved in comparison with a GPU implementation using the CUSP library and a CPU, respectively. CUSP is an open-source library for sparse linear algebra and graph computations on CUDA.


Introduction
The immersed boundary method (IBM) is a numerical method that is currently the focus of many researchers. The IBM is a tool applied to the solution of flow problems with arbitrary boundaries using a Cartesian-style computational grid. That is, the computational grid does not need to conform to the boundaries; the desired boundary conditions are imposed by applying a numerical algorithm in the vicinity of the immersed boundary (Mittal & Iaccarino, 2005). Therefore, this approach effectively avoids the difficulties faced in the grid generation of complex boundaries. Moreover, the computational cost per grid node is generally much lower than that of general-purpose, unstructured-grid solvers. As a result of the simplification offered in grid generation and other aspects, the CFD community has recently taken an interest in this technique, which was originally aimed at biological flows. With the IBM, the equations may be solved on a Cartesian grid alone, and there is no need to reconstruct the grid in the case of moving or deforming bodies. Although the immersed boundary method was first proposed and implemented by Peskin (1972) to simulate blood flow interacting with the heart, it is now used for the solution of various problems, such as moving rigid boundaries (Liao, Chang, Lin, & McDonough, 2010), flapping airfoils (Yang, He, & Zhang, 2010), heat transfer (Luo, Zhuang, Fan, & Haugen, 2016), viscoelastic fluid (Lee & Wolgemuth, 2016), acoustic wave scattering (Seo & Mittal, 2011), natural convection (Dash & Lee, 2014), and separated turbulent flow (Bernardini, Modesti, & Pirozzoli, 2016). In his method, Peskin used Lagrangian points connected by springs to simulate the elastic boundary. Later, the method was adapted to simulate rigid boundaries by increasing the spring rigidity (Lai & Peskin, 2000) or using feedback control (Saiki & Biringen, 1996).
However, these methods suffered from severe time-step restrictions due to the stiffness of the system. To avoid this restriction, Mohd-Yusof (1997) first introduced the direct-forcing approach and imposed the velocity boundary condition directly. Afterwards, this method was implemented by many researchers (Balaras, 2004; Iaccarino & Verzicco, 2003; Kim & Choi, 2006; Tseng & Ferziger, 2003). In direct-forcing IBM, a solid body is immersed in the grid by adding a forcing term to the governing equation. The velocity boundary condition on the immersed body is satisfied with the help of the forcing term, which is calculated from the algebraic equations of the discretized problem. An application of a direct-forcing IBM running on a GPU for the solution of the incompressible Navier-Stokes equations is presented in this study. The spectra of length and time scales of turbulent or highly convective flows are quite wide. Therefore, numerical methods used in the simulation of such flows should be capable of resolving most of the relevant scales. Grid resolution is the decisive factor for the representation of the length scales, and the numerical scheme used defines the accuracy level of the computation. Using finer grids with standard second-order methods can meet these requirements. However, the efficiency of such a scheme may be limited by massive memory demand or numerical error. Low-order numerical methods have difficulties in simulating vortex-dominated flow fields, the main reasons being the deformation and dissipation of the vortical flow structures as a result of numerical diffusion in the solution algorithm. The application of high-order methods may reduce the grid size and enhance the accuracy. In this respect, compact finite-difference methods (Lele, 1992) have become quite popular in computational fluid dynamics (CFD) applications.
Compact finite-difference schemes are generally better at resolving a broad spectrum of length scales than standard finite-difference schemes of the same order. Thus, compact schemes require a relatively smaller stencil and provide better resolution, especially at higher wave numbers. Therefore, they combine the robustness of finite-difference schemes and the accuracy of a spectral-like solution in an efficient way. Comprehensive studies focusing on the resolution characteristics of higher-order compact schemes on a uniform grid have been performed. Compact schemes have been applied in the simulation of various incompressible and compressible flow problems (Boersma, 2011; Hejranfar & Ezzatneshan, 2014; Kalita & Dass, 2011; Meinke, Schroder, Krause, & Rister, 2002; Mishra & Sanyasiraju, 2012; Nagarajan, Lele, & Ferziger, 2003; Wang, Zhang, Ma, Qiu, & Liang, 2016). In compact finite-difference schemes, derivatives are computed implicitly with the utilization not only of function values but also of derivative values at the neighboring nodes. In the present study, sixth-order central compact finite-difference schemes are used for the solution of the incompressible flow equations.
The increase in the computational power of computer systems is one of the most prominent factors in the successful application of large-scale CFD problems. The emergence of graphics processing units (GPUs) as a scientific computing platform in recent years is a good example of such advancements. Initially, GPUs served the gaming market, delivering continuously increasing performance for graphics content. The continuous increase in the computational power of GPUs has widened their use from gaming to applications with heavy computational burdens in the field of computational sciences. GPUs, with their many-core structures and outstanding processing performance, offer a considerable alternative for scientific computation. It is well known that GPUs were originally designed to achieve high performance at processing big graphics data sets. Since GPUs inherently have the ability to compute in parallel, they can provide fast and low-cost solutions, especially for large-scale computational problems. Numerous applications from various engineering fields that exploit the parallel performance of GPUs can be found in the literature (Owens et al., 2007). In the field of CFD, some promising implementations have also been realized with various flow-field simulations. One example is Hagen, Lie, and Natvig (2006), which represents one of the early applications in compressible fluid flows. Pullan (2007, 2008), using GPU computing, performed 2D and 3D simulations of the Euler equations and made performance comparisons between GPUs and CPUs. A hypersonic-flow simulation on a GPU platform may be found in the research article of Elsen, Legresley, and Darve (2008). They used an NVIDIA 8800GTX GPU and simulated a hypersonic vehicle in cruise at Mach 5 to demonstrate the capabilities of GPU code. They achieved an acceleration of 20× for the Euler equations.
Additionally, several recent CFD applications can be seen in Wang, Abel, and Kaehler (2010); Xian and Takayuki (2011); Karantasis, Polychronopoulos, and Ekaterinaris (2014); Ma, Wang, and Pu (2014); Liu, Zhong, and Xu (2016); and Jude and Baeder (2016).
In the present study, simulations of two flow fields, around a square and a circular cylinder, are realized using the hardware and numerical tools mentioned. The flow past a square cylinder is chosen as a test case in order to observe the behavior of the high-order compact scheme and the immersed boundary method under the influence of sharp corners. As expected, the high-order compact schemes bring about spurious oscillations in the neighborhood of the immersed boundary. This issue is alleviated by using mirror-like and decaying velocity conditions on the solid side of the immersed body to create an internal flow field that ensures the no-slip condition on the immersed body. From the point of view of the computational performance of the GPU, there are two important complications. The first one is the coalescing issue regarding the memory of the GPU. To overcome this issue, an approach making use of the shared memory, recommended in Tutkun and Edis (2012), is applied. Second, since the Poisson equation solver is the most time-consuming part of the code, the performance of this solver is another issue affecting the overall performance of the solution procedure. Hence, a special sparse-matrix storage format (ELLPACK) is incorporated into the code, and the Poisson solver developed thereby achieves higher performance by means of this storage format.
The second test case, the flow past a circular cylinder, is chosen to isolate the effects of sharp corners on the accuracy of the simulation. The flow around a circular cylinder is simulated with the same conditions and processes as in the square-cylinder case in order to observe the behavioral differences of the numerical method for sharp-cornered and smooth-contoured bodies.
These simulations show the accuracy limits of the direct-forcing IBM and the performance improvements provided by the GPU computing environment. The GPU used in the computations is a Tesla C1060 produced by NVIDIA for general-purpose computing.

Governing equations
In the context of this study, the continuity and Navier-Stokes equations for unsteady, incompressible flow are considered:

∇ · V = 0    (1)

∂V/∂t + (V · ∇)V = −(1/ρ)∇p + ϑ∇²V + F    (2)

where V is the fluid velocity, ρ is the density, p is the pressure, ϑ is the kinematic viscosity and F is the forcing term obtained from the immersed boundary method. The equations are spatially discretized using sixth-order compact finite-difference schemes. In addition, the second-order Adams-Bashforth method with a fractional-step approach is applied for the time integration. The details of these discretizations are given in the subsequent sections.

Spatial discretization
A high-order, central, compact finite-difference scheme is applied for spatial discretization on a non-staggered grid system. Compact finite-difference schemes provide more accuracy than standard finite-difference schemes. To achieve this, compact schemes mimic the behavior of spectral methods, which couple derivative terms over all grid nodes. In that way, the compact scheme couples the derivatives of the neighboring nodes to reach an improved accuracy. Lele (1992) introduced the compact finite-difference schemes and investigated their accuracy features. Starting from the Hermitian formula (Hoffmann & Chiang, 2000) and using Taylor series expansions for the terms, a general, five-point formulation for the approximation of the first derivative can be obtained, in which f and f′ denote a flow variable and its derivative, respectively, and h is the step size in the corresponding direction. The coefficients α and β define the character of the generated system, which may be either tri-diagonal or penta-diagonal. If periodic boundary conditions are implemented, then the scheme produces a cyclic tri-diagonal or a cyclic penta-diagonal system. Provided that the coefficients are chosen appropriately, the sixth-order compact formulation is obtained. Compact schemes for the second derivative can be formed in a similar way, again starting from an expression including the second derivatives and the function values of the neighboring points; this relation has a form similar to that for the first derivatives, and the corresponding sixth-order scheme follows from the proper choice of coefficients. For non-periodic boundaries, one-sided schemes are applied on the boundary because the stencil of the inner compact scheme extends beyond the boundary. If high-order compact boundary and near-boundary schemes are used together with high-order compact inner schemes, the appearance of numerical instabilities is highly probable (Uzun, Blaisdell, & Lyrintzis, 2004).
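The displayed formulas are not reproduced in this text; the following is a reconstruction of the standard forms from Lele (1992) that the description above matches — the general five-point relation for the first derivative, the sixth-order tri-diagonal choice of coefficients, and the analogous sixth-order scheme for the second derivative:

```latex
% General five-point compact relation for the first derivative (Lele, 1992):
\beta f'_{i-2} + \alpha f'_{i-1} + f'_i + \alpha f'_{i+1} + \beta f'_{i+2}
  = c\,\frac{f_{i+3}-f_{i-3}}{6h} + b\,\frac{f_{i+2}-f_{i-2}}{4h}
  + a\,\frac{f_{i+1}-f_{i-1}}{2h}

% Sixth-order tri-diagonal scheme: \beta = c = 0, \alpha = 1/3, a = 14/9, b = 1/9:
\frac{1}{3} f'_{i-1} + f'_i + \frac{1}{3} f'_{i+1}
  = \frac{14}{9}\,\frac{f_{i+1}-f_{i-1}}{2h} + \frac{1}{9}\,\frac{f_{i+2}-f_{i-2}}{4h}

% Analogous sixth-order scheme for the second derivative
% (\alpha = 2/11, a = 12/11, b = 3/11):
\frac{2}{11} f''_{i-1} + f''_i + \frac{2}{11} f''_{i+1}
  = \frac{12}{11}\,\frac{f_{i+1}-2f_i+f_{i-1}}{h^2}
  + \frac{3}{11}\,\frac{f_{i+2}-2f_i+f_{i-2}}{4h^2}
```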
Therefore, the orders of the boundary and near-boundary schemes are lower than that of the interior scheme. In addition, the tridiagonal form of the interior scheme is preserved by utilizing these schemes. As a result, a third-order, one-sided, compact scheme and a fourth-order, central, compact scheme are applied for the boundary point i = 1 and the near-boundary point i = 2, respectively, for both the first and the second derivatives. Similar formulations are implemented for the right boundary at i = N and the near-boundary point at i = N − 1.

Temporal discretization and the immersed boundary method
In this study, the direct-forcing IBM is utilized in the solution of the incompressible Navier-Stokes equations. The essence of direct forcing is the imposition of the boundary velocity V_b on the velocity solution at the solid boundary. This imposition is applied via a force term added to the governing equations. In the x-component of the incompressible Navier-Stokes equations, F_x is the forcing term for the IBM and is inserted into the equation to ensure the proper boundary velocity value on the surface of the solid body. Discretizing this equation simply in time, the right-hand side (RHS) includes the convection, diffusion and pressure terms, and the forcing value ensuring the velocity boundary condition on the wall follows directly from this relation. In the resulting expression, ε = 1 in the solid-body region and ε = 0 everywhere else (Laizet & Lamballais, 2009). Therefore, with the use of the forcing term F_x, the velocity boundary condition can be imposed directly on the solid surface. As a result, the boundary conditions on the wall are satisfied at every time step. This is the general overview of the IBM used in this study. The computational structure of the GPU performs much more efficiently with an explicit time-integration method. The first and most popular alternative is the family of Runge-Kutta methods. However, Runge-Kutta methods necessitate the solution of the Poisson equation more than once per time step, which severely degrades the performance of the solver. Therefore, the time integration is realized with a second-order Adams-Bashforth method within a fractional-step approach. We can combine the time integration with the governing equations to explain the IBM as used in this study.
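The discrete relations referred to above are not reproduced in this text; the following is a reconstruction of the standard direct-forcing form (consistent with Mohd-Yusof, 1997), where RHS denotes the convection, diffusion and pressure terms and u_b is the desired boundary velocity:

```latex
% Time-discretized x-momentum equation with the IBM forcing term:
\frac{u^{n+1} - u^n}{\Delta t} = \mathrm{RHS} + F_x

% Forcing chosen so that u^{n+1} = u_b wherever \varepsilon = 1:
F_x = \varepsilon \left( \frac{u_b - u^n}{\Delta t} - \mathrm{RHS} \right)
```

Substituting the second relation into the first shows that wherever ε = 1 the update collapses to u^{n+1} = u_b, while in the fluid region (ε = 0) the unforced equation is recovered.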
The F in (2) is the direct-forcing vector described before. With the use of this forcing term, the required boundary conditions on the solid surfaces are directly imposed. If both the convection and the diffusion terms are represented by A, the time advancement of the momentum equation in (2) can be written with the second-order Adams-Bashforth method, and the direct-forcing term related to the IBM follows from it. As mentioned before, ε is defined as 1 in the solid-body region and 0 everywhere else. With the insertion of this term into the predictor step, the desired velocity boundary condition in the region prescribed by ε is satisfied in the first step of the fractional-step method. Normally, in a standard fractional-step method, the incompressibility condition ∇ · V^{n+1} = 0 is satisfied by the solution of a Poisson equation; when the IBM is incorporated into the computational method, the right-hand side of this equation is multiplied by (1 − ε). When ε = 0, the standard Poisson equation is obtained, whereas in the solid-body region, where ε = 1, it reduces to the Laplace equation. For the discretization of this Poisson equation, a second-order, centered, finite-difference scheme is utilized. The solution of the discretized equation is realized using a classical conjugate-gradient algorithm on the GPU.
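Written out, a reconstruction of the scheme described above (the displayed equations are not reproduced in this text): a second-order Adams-Bashforth predictor with direct forcing, followed by the pressure-Poisson correction and the projection step:

```latex
% Predictor with AB2 for the convection-diffusion terms A and the IBM forcing F:
\frac{V^* - V^n}{\Delta t} = \frac{3}{2} A^n - \frac{1}{2} A^{n-1} + F

% Direct forcing imposing V_b wherever \varepsilon = 1:
F = \varepsilon \left( \frac{V_b - V^n}{\Delta t}
    - \frac{3}{2} A^n + \frac{1}{2} A^{n-1} \right)

% Pressure-Poisson equation with the IBM modification
% (Laplace equation inside the body, where \varepsilon = 1):
\nabla^2 p^{n+1} = (1 - \varepsilon)\, \frac{\rho}{\Delta t}\, \nabla \cdot V^*

% Projection step enforcing \nabla \cdot V^{n+1} = 0 in the fluid region:
V^{n+1} = V^* - \frac{\Delta t}{\rho} \nabla p^{n+1}
```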
The use of the high-order, centered scheme with the direct-forcing approach may amplify the spurious oscillations occurring in the vicinity of the wall on which the boundary condition is imposed by the use of forcing terms. Therefore, an internal flow field supporting a no-slip boundary imposition on the immersed boundary is created to mitigate this issue. By using this approach, first proposed by Parnaudeau, Lamballais, Heitz, and Silvestrini (2004), we can create an internal flow similar to that of the mirror condition.
Therefore, the target velocity values of the nodes inside the immersed boundary are specified with the help of the external flow field. This treatment may be seen in Figure 1 for the square and circular cylinder cases. Here, m is a function used to cancel out a possible singularity at the centers of the cylinders. Other forms of the function m may certainly be used, provided that the following three conditions are satisfied: (a) m(1/2) = 1, the reverse condition in the vicinity of the body surface; (b) m(0) = 0, the singularity cancellation; and (c) 0 ≤ m(r) ≤ 1 with moderate first and second derivatives for 0 ≤ r ≤ 1/2. In the square-cylinder case, because the immersed boundaries and the grid lines coincide, the internal flow is specified along each grid line. In the circular-cylinder case, however, a polar coordinate system is used to determine the internal flow field. In addition, we apply a zero-gradient condition for the pressure. This is done by conditioning the first two grid nodes inside the immersed boundary with the pressure values coming from the corresponding nodes outside the body.
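The particular choice of m is not reproduced here; purely as an illustration (not necessarily the function used in the paper), a hypothetical candidate satisfying all three conditions is m(r) = sin²(πr):

```python
import math

def m(r):
    """A hypothetical blending function satisfying the three conditions
    stated in the text: m(0) = 0, m(1/2) = 1, and 0 <= m(r) <= 1 with
    moderate first and second derivatives on [0, 1/2]."""
    return math.sin(math.pi * r) ** 2

# Condition (b): singularity cancellation at the cylinder center.
assert abs(m(0.0)) < 1e-12
# Condition (a): full reverse (mirror) condition at the body surface.
assert abs(m(0.5) - 1.0) < 1e-12
# Condition (c): bounded between 0 and 1 over the whole interval.
assert all(0.0 <= m(i / 100.0) <= 1.0 for i in range(51))
```

Its first derivative, π sin(2πr), is bounded by π on the interval, so the "moderate derivatives" requirement is also met.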
To assess the success of this approach, a seventh-order polynomial function over the domain x = [0, 1] is used to measure the error that the compact scheme (Equation 4), together with a mirror-like boundary condition, generates. Two ghost nodes are identified at each boundary of the domain in order to be able to apply Equation (4) over the whole interval; this mirror-like condition mimics the treatment of the immersed boundary in the vicinity of the solid wall. The relations are stated for the left boundary (i = 1); the corresponding relations for the right boundary follow in the same way. The maximum error occurs at the boundaries, as expected, because of the applied ghost-node conditions. Therefore, the reduction of the error value (f_exact − f_computed) at the left boundary point is presented in Figure 2 (left). The degrading effect of the implemented boundary conditions reduces the sixth-order accuracy of the compact scheme to nearly second-order accuracy. Although we face this degradation in accuracy near the boundary, sixth-order accuracy is recovered away from the boundary vicinity, as shown in Figure 2 (right).

GPU and code design
Computations in this study are done on an NVIDIA Tesla C1060 GPU. This device has 30 multiprocessors, each comprising eight scalar processors clocked at 1.3 GHz. Our host system accommodates an AMD Phenom X4 9850 CPU and 8 GB of memory. The code for the simulations of the flow problems is generated using the CUDA toolkit, developed by NVIDIA to make programming GPUs straightforward. The GPU, with its capability of performing large-scale computations in parallel, is considered a coprocessor designed to supplement the abilities of the main CPU. In this approach, compute-intensive parts of the code running on the CPU are directed to the GPU by employing a kernel (which may be viewed as a function) executed on the GPU by its many-core computational structure. In the present study, the structure of the threads generated on the GPU is similar to that applied in Tutkun and Edis (2012). One thread is assigned to each finite-difference node in the computational domain; as a result, all calculations required for a node are done by the same thread. The CUBLAS library for CUDA is utilized for the solutions of the cyclic-tridiagonal and tridiagonal systems resulting from the high-order compact schemes. In order to solve the Poisson equation, a conjugate-gradient algorithm is implemented through a custom kernel in which the ELLPACK format (Bell & Garland, 2009) is also used to store sparse matrices. The ELLPACK format provides an increase in performance, as will be shown in the corresponding section. Double precision is applied in the conjugate-gradient algorithm to ensure convergence. Moreover, the shared memory, with its on-chip nature and increased speed compared to the global memory, is exploited to alleviate certain non-coalescing issues.

Applications and performance comparisons
Fluid flow around bluff bodies is one of the most commonly encountered flow problems in various engineering fields, and it has many practical applications in terms of CFD. One of the main characteristics of this configuration is the flow separation at the immersed body surface. This, in turn, leads to a remarkable wake region characterized mainly by the vortices shedding from the upper and lower sides of the body. Depending on the flow system, this wake region may present certain benefits, such as enhanced heat and mass transfer, but at the same time, it may cause complicating issues, such as unwanted vibrations. Therefore, predicting the wake region accurately is quite important. In this study, the flows around two important bluff bodies, a square and a circular cylinder, are simulated using the immersed boundary method.

Flow around a square cylinder
The simulation of the flow around a square cylinder is performed to evaluate the immersed boundary method on the GPU for Re = 100. Sixth-order, compact, central finite-difference schemes are utilized for both the first and the second spatial derivatives in the discretization of the incompressible Navier-Stokes equations. The flow domain of this test case and the applied boundary conditions are given in Figure 3. The square cylinder is placed between x = 6, x = 7 and y = 5.85, y = 6.85. The side length of the cylinder is D = 1. The dimensions in the streamwise and stream-normal directions are X/D = 25.5 and Y/D = 12.7, respectively. At the inlet (i = 1), the applied boundary conditions are u = U∞, v = 0 and ∂p/∂x = 0. At the outlet (i = N), the boundary conditions are ∂u/∂x = ∂v/∂x = 0 and p = 0. Moreover, periodic boundary conditions are imposed on the remaining boundaries. Uniform grids are used for the computations of this test case. A grid convergence test is also performed, as shown in Table 1 for the Re = 100 case. The Strouhal number (St) in Table 1 is an important non-dimensional parameter giving information about the vortex-shedding frequency of the flow; it is defined as St = f D/U∞, where f, D and U∞ stand for the shedding frequency, the cylinder dimension and the free-stream velocity, respectively. The Reynolds number is defined as Re = U∞D/ϑ. As a result of this study, the total node numbers along the Cartesian coordinates (x, y) are selected as 384 × 192, respectively. The near-wall part of the computational grid created for the square-cylinder test case may be seen in Figure 4. In vortex-dominated flow fields, it is very important to accurately calculate not only the region near the body but also the wake region. This is another factor favoring the high-order compact scheme, which has high accuracy and low numerical dissipation in the region far from the immersed boundary.
In this respect, using the high-order compact scheme also provides an opportunity to apply a uniform grid throughout the flow domain.
In Table 2, the distances of the first computational node from the solid body are indicated for various studies in the literature and for the present study. The table also lists the numerical methods used in the studies mentioned. As shown in the table, the distance of the first node from the body is at least six times larger in this study than in the others.
Table 2. Distance of the first node to the solid body and the numerical methods of various studies.

Study | Distance | Numerical method
Sohankar, Norberg, and Davidson (1998) | 0.004 | Finite volume method, 2nd-3rd order in space
Sahu, Chhabra, and Eswaran (2009) | 0.01 | Finite volume method, 2nd order in space
Singh, De, Carpenter, Eswaran, and Muralidhar (2009) | 0.007 | Finite volume method, 2nd order in space
Sen, Mittal, and Biswas (2011) | 0.001 | Finite element method
Present | 0.066 |

In the context of this study, Figure 5 shows the instantaneous streamlines of the flow domain for Re = 100. The form of the antisymmetric wake flow, called the von Karman vortex street, is presented. As shown in Figure 5, with the purpose of describing the characteristics of the immersed boundary method, this illustration does not include the square cylinder as an object in the flow domain, yet the flow behaves as if it were there. The flow around a square cylinder shows characteristics similar to those of the flow around a circular cylinder if only the instabilities are considered. However, the separation process and the dependence of the shedding frequency on the Reynolds number differ greatly. Unlike the circular cylinder, the separation points of the square cylinder are fixed at its leading or trailing edges. Moreover, the wake width just behind the cylinder is at least one diameter, whereas for the circular cylinder it is less than half a diameter. As a result, the vortex formation region is much longer and broader for the square cylinder than for the circular cylinder. A fully attached flow is observed over the square cylinder at Re = 1. However, at around Re ≈ 2-3, the flow starts separating at the trailing edge of the cylinder. Consequently, a confined recirculating region consisting of two symmetric vortices develops behind the cylinder. The size of the recirculating region increases with the Reynolds number, and the flow shows steady characteristics. However, at a critical onset Reynolds number, the two-vortex structure begins to exhibit an unstable behavior and the wake starts oscillating periodically.
This periodic phenomenon is known as vortex shedding. For square-cylinder flow, the experimental value reported in Sohankar, Norberg, and Davidson (1997) for the onset of vortex shedding is Re = 47 ± 2. Moreover, Figure 6 (a)-(h) depicts the instantaneous streamlines in the vicinity of the square cylinder over a full vortex-shedding cycle. In Figure 6, the corresponding images below each phase of the cycle show the results of another computational study (Sharma & Eswaran, 2004). With its eight consecutive frames, Figure 6 covers one vortex-shedding period. The centers and saddles related to the vortex shedding are also indicated in the figure. In addition, the process of the instantaneous 'alleyways' mentioned in Perry, Chong, and Lim (1982) is demonstrated in the figure; in this process, the fluid is drawn into the recirculation region along instantaneous paths between the forming vortices.
As an example, two of these alleyways are shown by the red arrows in Figure 6 (b) and (e). For instance, looking at the figure, it can be seen that the period begins with a vortex maturing and shedding from the lower side of the cylinder. At the same time, during this interval, the fluid is constantly pulled into the recirculation region from the upper side. After the vortex sheds, the same process starts again but with the opposite sides. This time, the vortex matures and sheds from the upper side of the cylinder, and the fluid is pulled into the recirculation region from the lower side.
In addition, Table 3 shows a comparison of the recirculation length, defined as the streamwise distance from the base of the square cylinder to the re-attachment point along the wake centerline. The recirculation length in Table 3 is based on the time-averaged flow field, since the flow is unsteady. The table shows that the obtained recirculation length is in good agreement with the results of other computational studies.
The shedding frequency may be obtained by applying a fast Fourier transform (FFT) to the lift-coefficient or the drag-coefficient signals. Figure 7, as a result of an FFT implementation, shows the power spectrum of C l and C d for the Re = 100 case. As the figure indicates, whereas the lift coefficient oscillates at the shedding frequency, the drag coefficient oscillates with twice the shedding frequency.
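This frequency relationship can be checked on synthetic signals; the sketch below (illustrative only, with made-up amplitudes and a made-up shedding frequency) builds lift- and drag-like signals and recovers their dominant frequencies from the FFT power spectrum:

```python
import numpy as np

# Synthetic force-coefficient signals: the lift coefficient oscillates at the
# shedding frequency f_s, the drag coefficient at twice that frequency.
f_s = 0.15          # assumed shedding frequency (illustrative value)
dt = 0.05           # sampling interval
t = np.arange(0, 400, dt)
cl = 0.2 * np.sin(2 * np.pi * f_s * t)
cd = 1.4 + 0.01 * np.sin(2 * np.pi * 2 * f_s * t)

def dominant_frequency(signal, dt):
    """Return the frequency of the largest peak in the one-sided power
    spectrum, ignoring the mean (zero-frequency) component."""
    sig = signal - np.mean(signal)
    power = np.abs(np.fft.rfft(sig)) ** 2
    freqs = np.fft.rfftfreq(len(sig), d=dt)
    return freqs[np.argmax(power)]

f_cl = dominant_frequency(cl, dt)
f_cd = dominant_frequency(cd, dt)
print(f_cl, f_cd)  # f_cd is twice f_cl
```

Subtracting the mean before the transform matters for the drag signal, whose large steady component would otherwise dominate the spectrum at zero frequency.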
In this context, Table 4 shows a comparison made between various previous studies and the present one.

Table 4. Comparison of the Strouhal numbers for the square cylinder.

Study | St
Sohankar et al. (1998) | 0.146
Sahu et al. (2009) | 0.148, 0.159
Singh et al. (2009) | 0.147, 0.156
Sen et al. (2011) | 0.145, 0.160
Robichaux et al. (1999) | 0.164
Franke, Rodi, and Schonung (1990) | -

Furthermore, because the no-slip boundary condition is not enforced strictly on the body during the solution process, this condition is satisfied only approximately. One reason for this is the use of the high-order, compact, central finite-difference scheme with the direct-forcing approach. Although it is highly accurate in space, it may not be inherently as good at resolving discontinuities because of its quasi-spectral behavior. The internal treatment of the immersed boundary applied here may smooth the discontinuity within limits. However, in the case of the square cylinder, the sharp corners of the cylinder have a disturbing effect on this treatment. Consequently, the penetration of certain streamlines into the immersed boundary may occur, as shown in Figure 6. As a result, although the wake region and the vortex shedding are simulated accurately, this issue somewhat deteriorates the force calculations. The RMS of the lift coefficient and the time-averaged drag coefficient may be seen in Table 5, along with the results from other studies. In the present study, the forces on the immersed boundary are calculated by a surface-force integration approach. The immersed boundary is discretized into elements whose sizes match the grid spacing. Then, using these nodal values, we calculate the velocity derivatives at the surface through one-sided differencing, so that the surface force distribution is evaluated directly. Since this calculation is done after the whole flow field has been solved, it is actually a post-processing step.
It is obvious that the results of this approach depend on the resolution of the surface discretization and on the positions of the elements and nodes. The dimensionless lift and drag coefficients are defined by C_l = F_l/(0.5 ρ U∞² D) and C_d = F_d/(0.5 ρ U∞² D), where F_l and F_d are the lift and the drag forces, respectively. The table also shows that the C_l,rms value deviates much more than the C_d,avg value.

Flow around a circular cylinder
As mentioned in the previous section, the force calculations for the square-cylinder case show discrepancies with respect to the values from other studies in the literature. Therefore, to observe the possible effects of the sharp corners of the square cylinder, in this section we simulate the flow around a circular cylinder and obtain the Strouhal numbers and force coefficients. The flow domain and the applied boundary conditions are the same as in the square-cylinder case. The near-wall part of the computational grid created for the circular-cylinder test case may be seen in Figure 8. Figure 9 shows the instantaneous streamlines and the u-velocity contours of the flow domain for Re = 100. The form of the antisymmetric wake flow, called the von Karman vortex street, can be seen. The internal treatment approach used for the nodes in the vicinity of the immersed boundary provides a better representation of the boundary in the circular-cylinder case; correspondingly, the calculated Strouhal numbers and force coefficients show good agreement with other studies. The related comparisons may be seen in Tables 6 and 7 for the Strouhal numbers and the force coefficients, respectively. We can say that the smooth contour of the circular cylinder is more suitable for the direct-forcing immersed boundary method in which high-order, central, compact finite-difference schemes are implemented for the spatial discretization.

Computational breakdown and performance
Generally, the solution of the Poisson equation takes the vast majority of the running time of one time step. Therefore, the performance of this solution process directly affects the overall performance of the time-dependent incompressible Navier-Stokes solver. Table 8 shows the timing breakdown of one time step in the solution of the incompressible Navier-Stokes equations. It can be seen that the Poisson equation solution is the most decisive part in terms of overall performance, which points out the importance of using efficient linear-algebra algorithms in the solution process. [Tables 6 and 7 compare the present Strouhal numbers and force coefficients with those of Liu, Zheng, & Sung (1998); Ren, Wu, Shu, & Yang (2012); Braza, Chassaing, & Minh (1986); Lai & Peskin (2000); Zhang, Fey, Noack, König, & Eckelmann (1995); Zhang & Zheng (2007); Chiu, Lin, & Sheu (2010); Silva, Silveira-Neto, & Damasceno (2003); Su, Lai, & Lin (2007).] The discretized Poisson equation yields a sparse linear system, generally known as a bandwidth-limited problem. For instance, the CPU in our system, an AMD Phenom X4 9850, has a maximum memory bandwidth of 17.1 GB/s, and the GPU, an NVIDIA Tesla C1060, provides a maximum bandwidth of 102.4 GB/s. From these figures, we can expect an acceleration of about 6x for a purely bandwidth-limited problem. Thus, several comparisons can be made between the CPU and GPU performance for the solution of the Poisson equation to justify the extra effort of using the GPU. Further analysis of the timing breakdown of the relevant parts of the code yields Table 9. This table, similar to the previous one, shows the timing breakdown of the part of the code in which the conjugate-gradient iterative solver runs.
It may also be noted from the table that sparse matrix-vector multiplication is the most time-consuming section of the iterative process (≈74%). The ELLPACK sparse matrix storage format is applied in the custom kernel written for sparse matrix-vector multiplication, since the ELLPACK format is more efficient than other commonly used formats (COO, CSR, etc.) for GPU computations. In this format, an important factor is the maximum number of nonzero elements in a row (say, K). The nonzero elements of an M × N sparse matrix are stored in an M × K data array, and rows that have fewer than K nonzero elements are padded with zeroes. Likewise, the column indices are held in another array with the same padding approach. In addition, if the data array is stored in column-major order, nearly full memory coalescing is achieved on the GPU, which in turn produces higher performance. Moreover, another vector containing the exact number of nonzero elements in each row is also generated; incorporating this additional vector into the algorithm prevents unnecessary operations on the padded zeroes. This format is represented in Figure 10, which shows the arrays of data, column indices, and nonzero elements per row for a general sparse matrix A.
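The ELLPACK layout and the per-row nonzero-count vector described above can be illustrated with a small NumPy sketch. This is a CPU mock-up of the GPU kernel's logic, not the paper's CUDA code; the function and array names are ours.

```python
import numpy as np

def to_ellpack(A):
    """Build ELLPACK arrays from a dense matrix: an M x K value array,
    an M x K column-index array, and the per-row nonzero counts that
    let the kernel skip padded entries."""
    M, N = A.shape
    nnz_per_row = (A != 0).sum(axis=1)   # the extra vector from the text
    K = int(nnz_per_row.max())           # max nonzeros in any row
    data = np.zeros((M, K))
    cols = np.zeros((M, K), dtype=int)
    for i in range(M):
        idx = np.nonzero(A[i])[0]
        data[i, :len(idx)] = A[i, idx]
        cols[i, :len(idx)] = idx
    return data, cols, nnz_per_row

def ellpack_spmv(data, cols, nnz_per_row, x):
    """y = A @ x using only the stored nonzeros; on the GPU, each row
    would be handled by one thread, reading the arrays coalesced."""
    M = data.shape[0]
    y = np.zeros(M)
    for i in range(M):
        for j in range(nnz_per_row[i]):  # padded zeroes are never touched
            y[i] += data[i, j] * x[cols[i, j]]
    return y

# Example: a small tridiagonal matrix, as arises from a 1-D Poisson stencil.
A = np.array([[4., -1., 0.], [-1., 4., -1.], [0., -1., 4.]])
x = np.array([1., 2., 3.])
data, cols, nnz = to_ellpack(A)
y = ellpack_spmv(data, cols, nnz, x)     # matches A @ x
```

On the GPU, storing `data` and `cols` in column-major order makes consecutive threads read consecutive memory addresses, which is what produces the near-full coalescing mentioned above.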
The results of the conjugate-gradient iterative solver built on the ELLPACK format are compared with the performance values of another CPU and GPU application from the literature (Layton, Krishnan, & Barba, 2011). In that application, the conjugate-gradient solver of the CUSP library is used for the solution of sparse linear systems, and a performance comparison between a GPU (CUSP) and a CPU implementation is presented. CUSP is an open-source library for sparse linear algebra and graph computations on CUDA. The CPU and GPU used in that study are an Intel Xeon X5650 (32 GB/s bandwidth) and a Tesla C2050 (144 GB/s bandwidth), respectively. Moreover, the basic optimization flag '-O' is used as a compilation option. Figure 11 shows this comparison for the solution of sparse linear systems. Since a sparse linear system is a good example of a bandwidth-limited problem, we can expect a speedup roughly equal to the ratio of the GPU and CPU bandwidths. In this respect, considering the bandwidths of the GPU used in the present study (102 GB/s) and the CPU (32 GB/s), the potential speedup is approximately 3.2x for a purely bandwidth-limited problem. The systems in this comparison are formed using a general five-point Poisson stencil. As shown in the figure, the present implementation of the conjugate-gradient solver yields approximately three times better performance than the CUSP-library implementation on the GPU.
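For reference, the conjugate-gradient iteration itself, whose per-iteration cost is dominated by a single matrix-vector product, can be sketched on a small five-point Poisson system. This NumPy version is an illustrative CPU analogue under our own naming, not the paper's CUDA solver, and it uses a dense matrix purely for brevity.

```python
import numpy as np

def poisson_5pt(n):
    """Dense 2-D Laplacian on an n x n interior grid via the 5-point stencil."""
    N = n * n
    A = np.zeros((N, N))
    for i in range(n):
        for j in range(n):
            k = i * n + j
            A[k, k] = 4.0
            if i > 0:     A[k, k - n] = -1.0
            if i < n - 1: A[k, k + n] = -1.0
            if j > 0:     A[k, k - 1] = -1.0
            if j < n - 1: A[k, k + 1] = -1.0
    return A

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Textbook CG for a symmetric positive-definite system; the A @ p
    product is the bandwidth-limited step that the ELLPACK kernel handles
    on the GPU."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p                      # one SpMV per iteration
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

A = poisson_5pt(8)
b = np.ones(A.shape[0])
x = conjugate_gradient(A, b)            # residual of A @ x - b is near zero
```

The remaining operations per iteration are vector updates and dot products, which is why the SpMV share of the runtime (≈74% in Table 9) sets the overall pace.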

Conclusions
An implementation of an immersed boundary method on a GPU is performed within the scope of this study. The incompressible Navier-Stokes equations are solved on Cartesian meshes to simulate the flows around a square cylinder and a circular cylinder. A direct-forcing approach is utilized for the immersed boundary method: a forcing term is inserted into the momentum equation to enforce the proper boundary velocity on the surface of the immersed body. Moreover, sixth-order, compact, central schemes are used for the discretization of the first and second spatial derivatives. Time advancement is achieved with a second-order Adams-Bashforth scheme.
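The direct-forcing idea summarized above can be sketched in a few lines. This is a schematic one-dimensional illustration under our own simplifications (a plain binary mask for the forced nodes and our own array names), not the paper's interpolation-based treatment near the boundary.

```python
import numpy as np

def direct_forcing(u_star, boundary_mask, u_boundary, dt):
    """Forcing term f such that adding dt * f to the provisional velocity
    u_star drives the flagged immersed-boundary nodes to u_boundary,
    i.e. f = (u_boundary - u_star) / dt on those nodes and zero elsewhere."""
    f = np.zeros_like(u_star)
    f[boundary_mask] = (u_boundary - u_star[boundary_mask]) / dt
    return f

# Example: enforce a stationary (zero-velocity) body at one flagged node.
dt = 0.01
u_star = np.array([0.9, 1.1, 0.3, 1.0])          # provisional velocities
mask = np.array([False, False, True, False])     # node on/inside the body
f = direct_forcing(u_star, mask, 0.0, dt)
u_new = u_star + dt * f                          # boundary node hits the target
```

Because the forcing is computed from the provisional velocity at each step, no spring constants or feedback gains need to be tuned, which is the usual advantage cited for the direct-forcing variant.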
Observing the solutions of the flow around a square cylinder, it is seen that the direct-forcing immersed boundary method with the compact schemes successfully simulates the wake region and the vortex shedding cycle. However, it is less successful at estimating the force coefficients. One of the reasons is the spurious oscillations resulting from the quasi-spectral behavior of the compact schemes. Another factor is the sharp corners, which partially defeat the internal flow treatment applied to smooth the discontinuities on the immersed boundary. Therefore, the flow around a circular cylinder is also simulated to see the effects of the sharp corners on the force calculations. As expected, the lift and drag coefficients are estimated with greater accuracy. Consequently, it may be said that fluid penetration, a known limitation of the direct-forcing immersed boundary method, is not handled well enough in the case with sharp corners (the square cylinder). As future work to alleviate this limitation, one option is to use a finer mesh resolution around the immersed boundary; however, the high-order compact scheme would have to be modified to work with different mesh resolutions in the domain. Another approach may be the addition of an auxiliary forcing term to satisfy mass conservation in the vicinity of the immersed boundary.
Moreover, a conjugate-gradient iterative method is implemented on the GPU for the solution of the Poisson equation. A custom kernel is used for sparse matrix-vector multiplication, which is the decisive part of the algorithm. The ELLPACK sparse matrix storage format is utilized and found to be more efficient than the other formats. The main reason for this performance increase is that the ELLPACK format allows nearly full memory coalescing and, with the incorporation of an additional per-row nonzero-count vector into the algorithm, avoids unnecessary operations on the padded zeroes.
A comparison is performed between the present implementation and other GPU and CPU applications for sparse linear system solutions. As a consequence, approximately 3x and 20x accelerations are achieved in comparison with the GPU (CUSP) and the CPU, respectively. It should be noted that this speedup is attained with respect to a single CPU core. Considering that CPUs are primarily designed for distributed-memory computing, a future work may be to compare the performance of CPU clusters with that of multi-GPU applications. Generally, GPUs present considerable potential as parallel computing devices. With appropriate algorithms and advancements in numerical tools that can effectively exploit the strengths of GPUs, they may become the most influential assistants of researchers dealing with large, complex fluid-flow problems.

Disclosure statement
No potential conflict of interest was reported by the authors.