High-Performance Parallel Simulation of Airflow for Complex Terrain Surface

It is important to develop a reliable and high-throughput simulation method for predicting airflows in the installation planning phase of windmill power plants. /is study proposes a two-stage mesh generation approach to reduce the meshing cost and introduces a hybrid parallelization scheme for atmospheric fluid simulations. /e meshing approach splits mesh generation into two stages: in the first stage, the meshing parameters that uniquely determine the mesh distribution are extracted, and in the second stage, a mesh system is generated in parallel via an in situ approach using the parameters obtained in the initialization phase of the simulation./e proposed two-stage approach is flexible since an arbitrary number of processes can be selected at run time. An efficient OpenMP-MPI hybrid parallelization scheme using a middleware that provides a framework of parallel codes based on the domain decomposition method is also developed. /e preliminary results of the meshing and computing performance show excellent scalability in the strong scaling test.


Introduction
Power generation using natural energy sources has gradually replaced that from traditional thermal and atomic power sources due to concerns regarding the environment and resource sustainability.In particular, wind and solar have potential to provide low-emission power generation.Recently, large windmill power plants have achieved high efficiency in power generation.e selection of installation sites for windmill power plants directly impacts the entire life cycle and costs, including those of design, installation, operation, and removal.
It is thus important to develop a method for predicting detailed wind conditions utilizing data from observations and simulations.Many factors affect prediction results, and thus, numerous calculations with high-resolution meshes are usually required to obtain reliable results of wind flow over complex terrains.Reliable prediction of wind power generation requires taking into account various factors, including wind direction, turbulence, land features, atmospheric stratification, and periodicity.A simulation of wind conditions requires at least a 10-min time integration, which is very computationally intensive.To reduce wind flow simulation time, the speeding up and parallelization of the simulation algorithms is essential.In addition, the cost of mesh generation is one of the main issues in computational fluid dynamics (CFD) because meshing is extremely demanding and time-consuming for operators and sometimes requires user skill.
e present study proposes a two-stage mesh generation approach that greatly reduces the meshing cost and introduces a hybrid parallelization scheme for atmospheric fluid simulation.e meshing approach splits mesh generation into two stages: in the first stage, the parameters that uniquely determine the mesh distribution are extracted, and in the second stage, a mesh system is generated in parallel via an in situ approach based on the parameters obtained in the initialization phase of the simulation.To facilitate the extraction of meshing parameters, an easy-to-use graphical application that allows the user to interactively explore the optimal parameters is developed.
e proposed twostage approach is flexible since an arbitrary number of processes can be selected at run time.An efficient OpenMP-MPI hybrid parallelization scheme using middleware that provides a framework of parallelism based on the domain decomposition method is also developed.e preliminary results of the meshing and computing performance show excellent scalability.

Grid Generation
2.1.Grid System.Mesh generation for atmospheric fluid simulation with a complex terrain shape must reflect the complex configuration of the ground in the mesh system and appropriately resolve the boundary layer.Figure 1 shows an example of the mesh system used in the present research.A uniform Cartesian mesh is used for the XY-plane (horizontal direction), and a nonuniform distribution is used in the Z (vertical) direction, similar to the sigma coordinate system [1].Equation (1) expresses the mapping relation between the physical and computational coordinate spaces: x � x(ξ), 2.2.Two-Stage Approach.Mesh generation for complex configurations is typically conducted using commercial software suites, e.g., [2], which have practical functions and user-friendly graphical user interfaces (GUIs).However, when such applications are run on commodity desktop PCs, limited memory becomes a critical constraint for generating large-scale mesh systems.In addition, the file size of the generated mesh is large, making the transfer of mesh data from a PC to computing servers time-consuming.For fast, large-scale mesh generation, the present study proposes a scalable, in situ two-stage meshing approach that can avoid time-consuming file access [3]. is approach splits mesh generation into two stages: in the first stage, the parameters that uniquely determine the mesh distribution are extracted using a GUI application, and in the second stage, a mesh system is generated in parallel using the parameters obtained in the initialization phase of the simulation.An overview of this procedure is illustrated in Figure 2.

Graphical Application.
In the first stage, the parameters that uniquely determine the desired mesh system are found.Generally, terrain data are provided from satellite or geographic information system (GIS) services [4] and are usually in DEM or STL format, as shown in Figure 3.In this example, the STL format is used.A graphical application called FXgen is used to help determine the optimal parameters, which are used in the second stage.e essential parameters that determine the mesh distribution are the size of the computational region, the number of points, and the mesh spacing for the two ends of the line segments in the Z direction.ese parameters can be interactively explored via a user-friendly GUI application, as illustrated in Figure 4.Although this process is heuristic, it is rather quick since the interface is efficient.Users can find the optimal parameters using the interface shown in Figure 4(b).For instance, users can specify the number of divisions in the X and Y (horizontal) directions, and the mesh distribution results will immediately be displayed from the top view, as shown in Figure 5(a).To determine the mesh distribution in the Z direction, the commonly used Vinokur's stretching function is used [5], where the number of divisions in the Z direction and the mesh spacing for the two ends of the line segments are specified.
e result is a nonuniform stretched mesh in the Z direction, as depicted in Figure 5(b).Figure 5(c) shows the mesh distribution around the surface.Although the mesh distribution around a steep cliff is notoriously difficult to correctly obtain in mesh generation, the mesh obtained using the proposed method was well generated since the region just above the surface is filled by high-resolution fine meshes.Finally, the obtained parameters are output in JSON format, as shown in Figure 6.

Handling of Geometry
Data.In a practical simulation, complex geometry shapes, given in file formats such as STL, OBJ, or iGES, are often encountered.In order to handle such geometry files in parallel, several libraries can be used.For instance, OpenFOAM provides the distributedTriSurfaceMesh class to load and redistribute mesh data [6], and Ray et al. presented a parallel mesh framework called MOAB (Mesh-Oriented dAtaBase) for building parallel mesh generators [7].One of the present authors developed a polygon handling library called Polylib to manage polygonal elements in parallel [8], which provides functions such as load, save, redistribution, manipulation, grouping, and migration.

On-the-Fly Meshing.
e obtained parameters are used to generate the same mesh generated in the first stage.In the second stage, the entire computational domain is divided by the number of processes in the control file or using command line arguments during the invocation of the parallel simulation program.
e domain decomposition process is described using application programming interfaces (APIs) provided by the CBrick library [9], which are designed to facilitate the construction of message passing Modelling and Simulation in Engineering interface (MPI) applications and the optimization of their performance.A domain decomposition pattern is automatically calculated by the algorithm.By applying the same algorithm for mesh generation in both stages, the desired meshes can be generated using the same parameters.e proposed two-stage approach is flexible since an arbitrary number of processes can be selected at run time, thanks to on-the-fly meshing.A more detailed description can be found elsewhere [3].

Parallel Application
In this section, the performance enhancement for the wind simulator RIAM-COMPACT [10] is described.

Governing Equation and Solution Method.
RIAM-COMPACT can be used for large-scale simulations of turbulence to predict the local wind flow over complex  terrains.It uses collocated grids in the boundary-fitted coordinate system.Equations ( 2) and ( 3) show the nondimensional governing equations of the flow, the filtered continuity equation for incompressible fluid, and the filtered Navier-Stokes equations, respectively.
where u i is the velocity component, and p, v, v e , and S ij denote the pressure, fluid viscosity, eddy viscosity of the Smagorinsky model, and strain tensor, respectively.Note that the prime symbol indicates dimensional values.

Parallel Mesh Generation at Initialization.
RIAM-COMPACT is parallelized using OpenMP and MPI. Figure 7 presents pseudocode of the initialization phase of the simulation, including domain decomposition, polygonal data loading/distribution, and mesh generation.Object D is an instance of the CBrick class, which provides APIs for domain management.Note that the argument "XY" of the method findOptimalDivision() sets the division of the domain to be only on the XY-plane.is is done because the stretching function is applied to the Z direction, and thus domain decomposition must be avoided in the Z direction.
Object PL shows the use of polygon management class MPIPolylib to load the STL file at rank 0 and to distribute the partitioned polygon data to other ranks.Inside the for loop, object sf calculates the coordinate values at a given line segment in the Z direction and stores the values in the work array zx. e calculated coordinate values in array zx are copied to array Z. e object's method sf-> distribution(zx) means that the mesh distribution of a line segment between pos_z and max_z is calculated using Vinokur's stretching function, where the spacings of the two ends are ds_st and ds_ed, respectively, and the number of divisions is NK.For performance, the outer for loop is parallelized using OpenMP and the function Zcopy() is vectorized.

Parallelization. RIAM-COMPACT is written in
For-tran90 and C/C++.Fortran90 is used for the main algorithms, and C/C++ is used for the main function, allocating memory, utilities, and bridging other C++ class libraries.For MPI-based parallelization, the APIs provided by the MPI library are used to describe the parallel code.For frequently used communication patterns, such as Allreduce and neighbor (point-to-point) communications, the CBrick library provides convenient APIs.OpenMP is used for threading do/for loops.

Performance Monitoring.
It is very useful to assess the performance of applications in the production run and the tuning phase, which include various computing platforms.e performance monitoring library called PMlib [11] was used here to measure the parallel performance of RIAM-COMPACT at run time.PMlib provides simple APIs for instrumenting code and creating useful reports.Figure 8 shows pseudocode of instrumentation using PMlib.In this example, the user function mykernel() is being measured by calling the PM.start() and PM.stop() methods just before and after the mykernel() call.Figure 9 shows a report in simple format.A detailed report includes the performance of each process.

Related Work
In order to construct a high-throughput simulation method for atmospheric fluid flow, an efficient meshing method and scalable parallelization for kernel codes are essential.e literature on mesh generation was reviewed.Various mesh implementations have been proposed, such as a partitioning-based unstructured mesh [12], an octree or hierarchical Cartesian mesh with a space-filling curve [13], and a structured mesh with simple domain decomposition [14].ese methods are categorized as in situ meshing approaches, i.e., the mesh is automatically generated in the initialization phase of the CFD simulation.Among them, the Cartesian-based approach is probably the fastest and most powerful for meshing complex configurations.However, a naive implementation of this approach can have some difficulty creating smooth meshes that appropriately resolve the boundary layer near the surface just above the ground level.To overcome this problem, Yamazaki et al. used a Cartesian mesh with block-structured mesh refinement to concentrate the mesh around the surface, and introduced the cut-cell technique to generate a terrain-following mesh.eir method captures the boundary layer, but its scalability was only confirmed up to 16 threads [15].
Another issue of meshing is the automatic generation of a mesh system with complex terrain geometry for atmospheric simulation.Gargallo-Peiró et al. proposed an automatic procedure for generating hybrid meshes to simulate turbulent flows for wind farms [16].ey developed a mesh system that captures the topographic features of the ground and turbine area.eir mesh generation process is separated in two steps: in the first step, a background mesh is generated; in the second step, the mesh system around the turbine is inserted.ey generated a mesh system for a wind farm with an element count on the order of 10 million in about 300 seconds using a meshing application on a PC.
Evetts reported that the Glyph script helps to automate specific meshing processes for wind farms [17].e Glyph Modelling and Simulation in Engineering script has the following steps: import terrain, refine the area of interest, create a surface mesh, create a volume mesh, smooth and analyze the volume mesh, and export the file to a CFD solver.e script can make a grid system in a few minutes.is procedure is similar to our approach but it generates a grid file.Modelling and Simulation in Engineering

Evaluation Environment.
e supercomputer ITO at Kyushu University [18] was used to evaluate the performance of the mesh generation process and computing process.e specifications of ITO subsystem A are shown in Table 1.e evaluation was carried out for Flat-MPI mode and hybrid parallel (OpenMP + MPI) mode using up to 256 nodes, i.e., 8,192 processes for Flat-MPI mode and 512 processes for hybrid mode.
e computation time was measured for 20 steps of time integration from a given initial condition, and the performance was measured using the function provided by PMlib.

Performance of Flat-MPI Mode.
Figure 10 shows the performance results for Flat-MPI mode.e generated mesh size was 2001 × 721 × 721 (approximately 1 billion points), and the required memory for meshing was 66 GB.In this   Modelling and Simulation in Engineering case, the number of processes ranged from 2 to 8192.e obtained meshing performance was excellent with 512 processes but became worse with over 1024 processes.e main reason for this performance degradation is load imbalance, as the given mesh size is not an aliquot part, especially with over 512 processes.e performance of the fluid calculation exhibits the same behaviour.As shown in Figure 10, the cost of meshing is two orders of magnitude lower than that of the fluid calculation.

Performance of Hybrid Mode.
Figure 11 shows the performance results for the hybrid parallel mode.e generated mesh size was 4001 × 721 × 721 (approximately 2 billion points), and the required memory for meshing was 132 GB.
e performance was measured for up to 512 processes, with 18 threads used per process.e performance of the fluid calculation was excellent compared to that for the Flat-MPI mode because the degree of freedom to divide the domain in the X direction was higher and thus load balancing improved.Although meshing performance dropped for the case with over 128 processes, the ratio of meshing time to flow calculation time was under 0.01; meshing was thus a negligible part of the entire computation time.Note that since the time integration was only 20 steps for this test case, the meshing time in a practical case that requires many time steps for integration can be ignored.

Computed Flow Field.
Figure 12 shows that the fine mesh surrounding the surface was sufficient to correctly capture the boundary layer.is confirms that the structure of the flow separation at the ridge is well reproduced.

Conclusions
A two-stage in situ mesh generation approach was proposed for reducing the computational cost of the meshing process for large-scale parallel atmospheric fluid simulation with complex terrain.e proposed mesh generation process has two steps: in the first step, the parameters that uniquely determine the mesh distribution are extracted using a heuristic approach; in the second step, a mesh is automatically generated by a parallel meshing algorithm based on the obtained parameters in the initialization phase of the simulation.
e proposed two-stage approach is flexible since an arbitrary number of processes can be selected at run time, thanks to the in situ meshing.
An efficient implementation of RIAM-COMPACT using OpenMP-MPI hybrid parallelization was also developed.RIAM-COMPACT was constructed using middleware that provides a framework of parallel codes based on the domain decomposition method.
e preliminary performance results for the meshing and computing

Figure 1 :Figure 2 :Figure 3 :Figure 4 :
Figure1: Cross section of the mesh system for an isolated peak configuration.A uniform Cartesian mesh is used in the horizontal direction, and nonuniform stretching is used in the vertical direction.

Figure 5 :
Figure 5: Meshing process in FXgen.e GUI application allows the desired parameters to be explored with confirmation of the mesh distribution.(a) Uniform mesh in the XY direction.(b) Stretched mesh in the Z direction.(c) Closeup view near the surface.

Figure 6 :
Figure 6: Extracted mesh parameters.GlobalOrigin and GlobalRegion give information on the computational region, GlobalDivision is the number of divisions in the X, Y, and Z directions, and StartPitch and EndPitch represent the spacings of the two ends of line segments, respectively.

Figure 7 :
Figure 7: Pseudocode of the initialization phase of RIAM-COMPACT.

Figure 10 :
Figure 10: Performance of grid generation and fluid calculation for Flat-MPI mode.

Figure 11 :
Figure 11: Performance of grid generation and fluid calculation for hybrid mode.

Table 1 :
Specifications of ITO subsystem A.