Massively Parallel CFD Simulation Software: CCFD Development and Optimization Based on Sunway TaihuLight

. A parallel framework software, CCFD, based on the structure grid, and suitable for parallel computing of super-large-scale structure blocks, is designed and implemented. An overdecomposition method, in which the load balancing strategy is based on the domain decomposition method, is designed for the graph subdivision algorithm. This method takes computation and communication as the limiting condition and realizes the load balance between blocks by dividing the weighted graph. The fast convergence technique of a high-eﬃciency parallel geometric multigrid greatly improves the parallel eﬃciency and convergence speed of CCFD software. This paper introduces the software structure, process invocations, and calculation method of CCFD and introduces a hybrid parallel acceleration technology based on the Sunway TaihuLight heterogeneous architecture. The results calculated by Onera-M6 and DLR-F6 standard model show that the software structure and method in this paper are feasible and can meet the requirements of a large-scale parallel solution.


Introduction
Computational fluid dynamics (CFD) is a technique for numerical simulation and analysis of fluid mechanics problems. CFD technology is increasingly used in aerospace, meteorological prediction, and other fields [1][2][3][4]. Some parallel computing frameworks have been used with parallel CFD [5,6], e.g., OpenFOAM [7] and SU 2 [8,9]. Parallel computing methods are increasingly used to solve largescale computationally intensive problems. Parallel programming environments such as OpenMP [10,11] and MPI [12,13] have appeared on parallel machines, networked workstations, and supercomputers. e development of CFD in numerical methods, turbulent models, and mesh generation technology has led to its making strides in the simulation accuracy and capacity of complex geometric shapes. Meanwhile, with the increasing complexity of engineering problems and the rapid advancement of numerical simulation calculation technology, the requirements for simulation accuracy are getting increasingly stringent, and the amount of calculation required has increased geometrically. Parallel CFD technology has become the primary method for solving complex simulation calculations [14,15].
Current CFD methods use a computational grid to simulate the flows in complex geometries and form different cells through the discreteness of the grid, which is then followed by the numerical calculation [15][16][17]. In order to simulate the shape structure more realistically, a multiblock approach is generally used to simulate different spatial regions. A commonly used method for solving large-scale CFD problems is applying domain decomposition technology in order to decompose more subblocks and allocating these decomposed calculation regions to different processes or threads for performing parallel calculations [18][19][20]. e methods of parallel computing are very different under different parallel programming environments.
e MPI method allocates one or more computing areas to a processor. As the different computing areas are completely isolated, the data exchange at the boundary needs to be completed through communication. e OpenMP method uses the shared memory in an environment in which the computing area is actually on one processor, and the grid data are completely shared, rendering data communication unnecessary. erefore, designing different parallel frameworks is important when using different programming environments and grid types [21,22].
Supercomputer architectures are generally either homogeneous [23] or heterogeneous [24]. Each core architecture of a homogeneous machine is the same, and their status is equal. ey can share the same code or execute different codes on each core [25]. Homogeneous processors can be interconnected using shared storage or through cache [26][27][28]. Heterogeneous architectures represent a new hardware platform that allows different types of processors to work together efficiently in shared memory [24]. e heterogeneous architecture includes not only the traditional CPU but also other accelerator units such as GPUs and MICs [24,29]. ese different computing units have different instruction set architectures and memory spaces [30]. e CPU and the accelerators have unified access to the system memory through the Memory Management Unit (MMU) [31,32]. With the evolution of supercomputer technology, the traditional homogeneous architecture has been unable to meet the increasing requirements for computing power and storage. For this reason, the heterogeneous architecture has become the most important technology for the development of supercomputers. Heterogeneous architecture machines account for more than 30% of the current TOP500 list of supercomputers [33]. Of the top five parallel computers, four have heterogeneous architectures. e main purpose of the research content in this paper is to develop an open-source large-scale parallel solver based on a multiblock structure grid. First, multicore parallel computing based on the homogeneous architecture was implemented. Second, based on the Sunway TaihuLight heterogeneous system [34], the calculation of data on the CPU and accelerator communication were achieved using the MPI + OpenACC/Athread hybrid programming model [35] and direct memory access (DMA) technology [36].
ird, through the use of stencil calculation, the boundary information is exchanged through register communication, and the single instruction, multiple data (SIMD), and assembly instructions are used to optimize the computationally intensive area and further improve calculation performance. Finally, the parallel simulation of the Onera M6 wing model [37] realizes a parallel calculation of 500,000 cores of 1 billion grids and achieves ideal parallel acceleration efficiency.

CCFD Software Design and Implementation
2.1. CCFD Software Architecture. CCFD is designed by object-oriented programming [38], and its architecture is shown in Figure 1. e entire software platform consists of an input/output control module, a computational geometry module, a solver module, and a parallel algorithm module.
e system architecture of CCFD needs to take into account the characteristics of various calculation methods and physical models and design a simulation software with flexibility, robustness, accuracy, and data security. In addition, the parallel framework design needs to meet the requirements for high parallel scalability, high data transmission efficiency, and fast response time. To increase maintainability and scalability, the CCFD architecture is organized with blocks as the basic data structure units. Based on this, the use and loading of data by each module in the platform greatly simplifies the program interface and reduces the difficulty of development.
e CCFD software platform system is built on block units. e structure and basic unit calling of CCFD are the blocks and boundaries. e internal points and boundary points of the blocks are divided into two data structures: BLOCKs and BCs. e main information such as geometric variables and calculation variables are stored in the block, while the boundary type and starting information are stored in the boundary. In accordance with the advantages and disadvantages of an array of structures (AoS) and a structure of arrays (SoA), the variables of CCFD on the block are organized in the form of SoA. e advantages of this method can be stated as follows. On the one hand, the centralized use and loading of grid variables improves the hit rate and data utilization of cache. On the other hand, it facilitates the packaging and collection operation of parallel communication, which is beneficial to the parallel optimization of the large-scale computation of CCFD.
CCFD integrates various computing models and methods effectively. In order to facilitate the software module specifications, code organization development, and extended maintenance, we organize the entire software system with the application layered model, which makes the code modules independent between layers and interconnects the calling relationships. e solver can be divided into five layers. e first layer is grid input, which controls the file input and the load balancing of block calculation and communication on the processor. In the second layer, the calculation is divided into a multiblock method and conventional calculation methods according to the calculation demand. e third layer is divided into Euler, N-S, and turbulent model based on the flow equation. e fourth layer is a method of selecting spatial and temporal terms for different equations. e fifth layer is a basic algorithm layer that includes a sparse matrix solving method, modular parallel communications, and error information output.

CCFD Software Process.
Software process design needs to fully take into account the actual calculation methods used. For example, CCFD uses a multiblock structure grid based on a second-order precision finite volume method to solve partial differential equations and uses multigrid methods to accelerate convergence techniques. In order to allow large-scale structure blocks to be calculated on parallel machines, the domain decomposition technique is used to divide a large-scale block into smaller-scale subblocks. e adjacency relationships between the subblocks are stored in the information of the block, and the information is shared and updated after each iteration of calculation to ensure the compatibility between the results of the partial differential equations and the unpartitioned block. e area of the block is assigned to different processors for calculation. is mapping and allocation process involves the calculation and communication load balancing of the block on the processor. erefore, the entire software system can be thought of as being roughly divided into five parts: preprocessing, area decomposition, load balancing, flow field calculation, and postprocessing output. e software process design considers the calculation methods used in practice and meets the communication needs between the multiblocks. Regardless of partition parallel computing or multiblock computing, it is essential to calculate the partial differential equations on each block. e calculation time of subblocks of different sizes differs depending on the process. e larger the scale, the larger the calculation time. If the subblock sizes are not uniform, there will be multiple processes waiting for individual processes. Based on the above characteristics of CCFD software, we designed a software parallel process, as shown in the flowchart in Figure 2.
e unit data based on the block, which is the functional module structure of CCFD, are shown in Figure 3. e structure is divided into eight modules. e calling and connection between each module are based on the block unit.
e main body of the calling is the parallel solver. rough the parallel solver, the block geometry module, parallel module, source term solution, right term solution, turbulent viscous, and output control are used to complete the entire solution process. e connection lines in Figure 3 indicate the link relationship between the modules. e entire process largely determines the parallel expansion efficiency of the software system. e allocation strategy of blocks to processors also affects how much the parallel efficiency can be improved. For example, an imbalance in the calculation and the communication load between blocks on the process will cause the process to wait idle for the overall calculation.

Load Balancing Strategy.
One goal of domain decomposition technology is to subdivide a large size of block into smaller subblocks. e subblocks' computing area is allocated to different processes, or threads, for parallel computing processing [19,20]. As shown in Figure 4, a three-dimensional structure block is subdivided into multiple smaller subblocks, each having multiple adjacency surfaces, which represent the communication relationship between the subblocks. When the subblocks are calculated on different processors, the adjacent surfaces need to carry out data transmission and communication.
Once the domain is decomposed, a larger number of smaller, more uniform blocks are formed. ese subblock computing areas are allocated to different processes, or threads, for concurrent execution. In this particular process, we need to ensure that the calculation and communication Scientific Programming allocated to each processor are relatively balanced, and there is no idle waiting of the processor. is process of block and course mapping allocation is called load balancing. Currently, there is much research taking place on load balancing methods, and several algorithms have been proposed, including greedy algorithm, spectral method, genetic algorithm, and ant colony algorithm [39]. ese algorithms are applicable to unstructured grids but cannot be directly applied to structured grids. Traditional structure grid load balancing methods generally use a cyclic allocation method or a uniform distribution method, neither of which consider the load balance of the calculation amount and communication [20].
Aiming at the problem of large-scale multiblock structure grid computing and communication load balancing, CCFD scientists proposed an overdecomposition method. CCFD divides the blocks and processor mapping and distribution process into two layers: coarse and fine. At the coarse layer, domain decomposition methods such as recursive boundary are used to subdivide large and uneven blocks into smaller and more uniform subblocks. e algorithm is a method of approximate linear time complexity, and the decomposition time is related to the number of blocks and the maximum dimension of the block. e purpose of the subdivision operation in the coarse layer is to lay a foundation for further balancing the calculation and communication loads. For example, uneven blocks make it difficult to balance the calculation and communication allocation, which will result in a series of processor-independent block sequences. Before using the fine-layer load balancing algorithm, the subblocks of the coarse layer need to be renumbered. e number of grids in each subblock was used as the statistical standard for the calculation amount, and the adjacent faces of each subblock and other subblocks are the statistical standard for communication volume. is information is redrawn into a multidirectional weighted  graph based on the subblock topology, and the communication volume and calculation volume are the two weights of the weighted graph [40]. e fine-layer graph partitioning algorithm performs graph segmentation based on the weighted graph formed by the upper layer. e total amount of calculations of the subdomains formed as far as possible is balanced, and the total amount of communication in the subdomains is balanced with the smallest amount of communication between subblocks [41]. Figure 5 shows a multidirectional weighted graph G � (V, E ) drawn after preprocessing the CCFD blocks, and finding a partitioning of the vertices of G into n sets in such a way that the sums of the vertices weights in each set are as equal as possible, and the sum of the edges weights crossing between sets is minimized. erefore, this partitioning of the vertices guarantees the computation in each processor is equal, and the communication between processors is minimized. e method based on k-way graph partitioning has moderate computational complexity, its complexity for computing is O(|V|log k), and log k is levels of recursive bisection for the algorithm.
In this paper, we discuss how we found that the calculation scale and the number of subblocks of the calculation example have a direct relationship with the amount of calculation and communication load balancing. For example, the number of subblocks and the size of the subblocks directly affect the distribution of the amount of calculation in the process. e greater the number of subblocks and the balance of the size, the more balanced the distribution of the calculations. However, the greater the number of partitions, the larger the adjacent area of the block.
is condition directly leads to an increase in communication and affects the realization of communication load balancing among the blocks. As an example, the block size in the M6 calculation ranges from 11,000 to 13,000, and the area of cut face is between 450 and 625. e block size and the adjacent surface after domain decomposition are relatively balanced, which is conducive to load balancing. Figure 6 shows the actual calculation load distribution of each process in the M6 calculation example, using 256 or 512 threads.
Testing the M6 model, we found that, with the increase in parallel scale, the actual distribution of the computing load will become worse, along with an increase in the  number of threads. In this case, 256 threads are the critical point of the calculation load balancing. When the number of threads is less than 256, the overall computational load balance is ideal. However, the actual distribution of computational load will be significantly different when the number of threads is greater than 256. When the subblock size and the total number of blocks in the partition are stable, the number of blocks allocated to each process decreases, with the number of threads increasing. en, the influence on the distribution of the calculation amount caused by the difference in the number of blocks on each thread and the actual size of the block increases, resulting in a decrease in the degree of interprocess calculation load balancing. After the domain decomposition, the total calculation amount of the block remains stable, and the absolute amount of communication between the blocks increases with the increase of the number of parallel cores. In terms of communication load balancing between blocks, the degree of communication load balancing decreases with the increase in parallel scale. Figure 7 shows the actual communication load distribution of each process in the M6 calculation example, using 256 and 512 threads.  Processor#2 w(g i 2 , g i 3 )  All data can be divided into global, local, and exchange data. Global data refer to the data that all processors will save and not modify during calculation, such as control parameters. Local data refer to the data that are only stored locally and are not shared with other processors, such as flow field data in blocks' interior points. Exchange data refer to the data that need to be exchanged with other processes, such as boundary data of adjacent faces.
When large-scale parallel computing is performed in CFD, huge data reads and exchanges offer a great test to the data structure and data flow. e reasonable data structure and data flow design directly affect the reliability and efficiency of the platform operation. In order to ensure the efficiency and security of data, the design of CCFD parallel software is based on the block data structure. e grid data, boundary data, and flow field data are stored in each block unit. As shown in Figure 8, a block stores information such as grid coordinates, boundary information, and primitive variables [40,41].

Parallel Communication Optimization.
e proportion of blocks computing and communicating is one of the main factors in the parallel expansion performance of software. Ideally, a large amount of local computing is a computational priority problem, and a small amount of communication is good for parallel expansion. If the amount of local computing is small, the large amount of communication is a communication priority problem, which leads to poor parallel scalability. e computing model and method used by parallel software determine the amount of local computing, and the communication mode of parallel software affects the amount of communication. For example, in actual calculations, there are some challenges: many small and frequent data communications, a large amount of data and a fixed number of a block boundary updates, one-to-many broadcast data, and many-to-one master-slave communication. In terms of these challenges, using a different and flexible communication mode facilitates the software's parallel expansion.
CCFD uses the mode of traversing and allocating blocks to communicate. Each process traverses all block numbers. Only the local blocks are calculated, while the nonlocal blocks are used for asynchronous communication according to the actual communication needs. For example, when process 1 communicates with process 2, each process packs the mesh belonging to the adjacency surface relationship with nonblocking transmissions and then directly performs subsequent calculations. It does not detect whether the communication has been completed until the communication data are used locally. If the communication is completed, it can be used directly; otherwise, wait for the communication to complete. In Figure 9, B4 and B6 are the respective senders and receivers of data. Under nonblocking communication, B4 and B6 only need to send data to perform subsequent calculations and do not need to wait for the completion of data reception.
In the actual calculation of CCFD, there is a large amount of small and frequent communication, such as the detection of the local subblock, the updating of the local subblock, the updating of the flow field data, the calculation of the turbulent model, and the multigrid up-down interpolation process, which occurs after the block is read. ese small and high-frequency communications are very timeconsuming and affect the parallel efficiency of the processor. To solve these problems, we generally use methods such as collective communication, data prefetching, and computational alternative communication. Collective communication packs the number of high-frequency communication times in order to reduce the number of communication requests, which in turn reduces the communication time.
e aim of prefetching technology is to obtain data in advance in each calculation iteration and hide the communication time. e alternative method of calculation refers to using calculations to complete the local results, which can  When there are many uneven blocks in the local process, the program is prone to idle waiting in the blocking or sequential traversal execution mode. For example, the small block quickly ends the calculation and enters the communication. Large blocks have not yet entered the communication, and the data are not ready, which will cause a long wait. e overlap of calculation and communication is shown in Figure 10. Each communication request ends after the data are sent, and the detection of the completed communication is not performed until the data to be received are used in the calculation process. For example, processor P1 and processor P2 send data to each other, but these data are not necessarily used immediately once the communication ends. In this case, we can send the data to the background for continued processing, and the processor can perform subsequent calculations until the data are used and whether they have been received is detected. is method allows data communication and calculation to be performed at the same time. Because the communication time is hidden under the calculation time, it is called the overlapping technology of calculation and communication.
e design and selection of communication modes largely determine the computing efficiency and parallel scalability of parallel software. Different computing modes and methods and data structures require the design of a reasonable and flexible communication mode according to the actual situation. CCFD parallel software adopts flexible communication modes and carries out different parallel optimization designs for different calculation models and methods. Moreover, it also presents the modular design for the general data communication methods, which greatly simplifies the process of program programming and maintenance and achieves ideal parallel efficiency and data utilization.

Optimization on Sunway TaihuLight Heterogeneous Platforms
Sunway TaihuLight is the first 100P heterogeneous system supercomputer in China. It was developed by China's National Research Center of Parallel Computer Engineering & Technology (NRCPC) and is installed at the National Supercomputing Center in Wuxi province. It is powered exclusively by Sunway's SW26010 processors, with an HPL mark of 93.0 petaflops, and kept its number three spot in the TOP500 list of supercomputers [33]. e SW26010 multicore processor is composed of four computing core groups (CGs), and the four CGs are interconnected at high speed through the on-chip network, as shown in Figure 11. Each CG consists of a management processing element (MPE) and a multicore array containing 64 computing processing elements (CPEs). MPE can directly access memory, while CPE can directly access main memory and supports DMA batch data transfer [34]. e SW26010 processor has slow memory access and large memory access instruction delay, a memory access problem that has seriously affected the performance of the CFD solver. e CCFD solver has been transplanted and optimized in Sunway TaihuLight, and the program memory access problem has been solved in three ways: using highspeed storage instead of main memory access; eliminating or reducing memory access operations; and hiding memory access time. e optimization work in this paper is divided into the following aspects. We use MPI to achieve parallel communication between the core groups and use Athread to achieve the master-slave core parallelism in the core group. e data of the slave core are stored in the core group, the subarea is partitioned, and DMA batch transfer eliminates main memory fetch operations. e matrix data continuity access is optimized. e fast Carmack algorithm is used. e vectorized calculation and manual assembly instruction are rearranged. e use of a double cache method to cover fetch time significantly improves the parallel efficiency.

Data Partitioning and DMA Parameter Tuning.
CCFD is optimized based on the memory size of SW26010 and the performance of DMA to achieve domain decomposition and DMA parameter tuning. e block division of the many-core group calculation grid used by CCFD is shown in Figure 12. Using a 32 * 32 * 32 block as an example, in order to efficiently use the storage space of CPE, 64 slave core blocks divide the grid data along the X-Y cross section. en, the divided block data are allocated to 64 slave cores, and the 64 slave cores execute the same calculation process synchronously to complete a flow field calculation of the entire block.

Scientific Programming
In actual programming, the size partition and direction selection of computational subdomains need to be further tuned, combined with the calculation process and the performance of the DMA. e local data memory (LDM) of the core processing unit is 64 kB, and in the actual flow field calculation, the size of the storage space required for different calculation processes ranges from 12 kB to more than 100 kB. Based on this, the temporarily stored LDM data need to strictly control the calculation scale, flexibly utilize DMA batch transmission, and eliminate the fetch operation to obtain data from the memory, thereby reducing the fetch time. In Figure 11, DMA is called by slave core. X is the lowest dimension of the data block, Y is the secondary dimension, and Z is the highest dimension. e data along the highest dimension are allocated to the same slave processor, and they are calculated along the Z-axis.  transmitted to local are calculated cyclically by the slave core processor to complete the iterative solution of the flow field of a single block.

Register Communication Optimization.
After the block subarea data are evenly distributed to the slave core arrays, the slave cores that are physically located adjacent to each other on the on-chip bus have the adjacent global grid data. e SW26010 processor supports the same columns of slave core array registers performing point-to-point discrete communication.
e aggregate bandwidth of the register communication theory is much larger than the theoretical aggregate bandwidth of the DMA. is paper uses register point-to-point communication to obtain the boundary data stored on the adjacent slave cores on the X-Y plane in the slave core array group. Moreover, the register communication and the asynchronous mechanism of arithmetic operation are used to fill the instructions that have nothing to do with the communication data between the RLC_GET/ RLC_PUT instructions, to cover the communication time between the register communications.
As shown in Figure 13, using the XY plane as an example, the green dots indicate the grid data stored on the slave core LDM. e flow field calculation of the local cell is performed on each slave core, and the update result is returned. However, the local cells are calculated using the data of cells in four directions which contain part of halo cells in Figure 13 and assigned to adjacent slave calculation local cell to complete the calculation of the internal point. For the three-dimensional 13-point model discussed in this paper, after reading one layer of calculation subregion data blocks from the core, it is necessary to read two layers of halo data blocks from the surrounding four directions to complete the calculation. e SW26010 supports register communication between peers and the same column. On the accelerator array, the registers of accelerators in the same row or the same column realize interconnected communication and can transmit 128 bits of data each time. As shown in Tables 1 and 2, the peer processors place the data in the register and specify the target number of the peer processor and then send the data to the destination. e target processor will collect the data over the Internet and save it to the register of the processor, realizing the communication operation of the register.

SIMD Calculation and Assembly Instruction
Optimization.
e SW26010 supports the 256-bit SIMD instruction set, which can perform four double-precision or eight single-precision data operations simultaneously. To ensure the use of SIMD, the calculation process and the address alignment of the variable array need to be alignment-adjusted. At the same time, SIMD_LOAD/ SIMD_STORE and the calculation instruction are manually adjusted under the premise of ensuring that the calculation is correct, so that the LOAD/STORE and the calculation time are partially hidden. Data access and storage processing and arithmetic of the SW26010 are operated on two pipelines. To achieve the overlap of the data fetch and arithmetic operations, this paper analyzes the order of instructions generated by the compiler, through manually adjusting the order of instructions, and implements the overlap of data fetch load/store operations and data-independent operations.
is method can eliminate data dependencies and improve code efficiency.
To further improve the calculation performance, the calculation code is disassembled. As shown in Table 3, by observing the order of the instruction execution sequence after disassembly, we found the memory access and calculation instructions are completely separated, and nearly six access instructions are issued on the pipeline before one calculation instruction is launched. After ensuring that the data are not relevant, we use inline assembly functions and handwritten assembly codes instead of function interface codes. As shown in Table 4, after manually adjusting the assembly code, the overlap of memory access and calculation instruction time is realized.

CCFD Application and Results
To verify the architecture and parallel performance of the CCFD software, this paper selects the DLR-F6 wing-body assembly, the CT-1 standard model, and the ultra-large-scale M6 wing model for testing. e test platforms are the Yuan supercomputer at the Computer Network Information Center in Beijing and the Sunway TaihuLight supercomputer at the National Supercomputing Center in Wuxi.

Optimization Results on Sunway
TaihuLight. Using a block of size 32 * 32 * 32 as the optimization case, the block data were divided into the 8 * 8 slave cluster on CGs, the main calculation function and hot spots of CCFD were optimized, the flow field calculation was accelerated, and the entire CFD simulation process was completed. Figure 14 shows the SIMD optimization provided by Sunway's compiler in the Roe scheme calculation part, and the comparison of the average cycles number before and after inline assembly instruction optimization. e optimization effect of SIMD and assembly instructions is significant, in that implementing the overlap of fetching and manual instruction adjustment further optimizes the calculation.
Now that CCFD has completed the acceleration optimization of the main functional modules, including flux calculation, turbulent model, and time advancement method, and the slave core acceleration ratio has finally reached more than 30 to 1. Figure 15 shows the comparison of calculation time before and after the optimization of the main calculation function module of CCFD.

DLR-F6 Wing Body Assembly.
e German Aerospace Center's DLR-F6 wing-body assembly is a typical example of a transonic transport aircraft, which verifies the CFD solver's ability to calculate complex shapes [42]. e calculation conditions are Mach number Mach � 0.7510, Reynolds number Re � 0.0212465 × 106, angle of attack α � 1.003°, and side slip angle β � 0°. e calculation method uses Roe scheme, LU_ADI time advancement method, three-layer mesh sequencing strategy, and the Spalart-Allmaras Scientific Programming turbulent model. Figure 16 shows the pressure nephogram of the entire assembly. Figure 17 shows the comparison of the experimental results and calculated pressure coefficients of the different positions of the DLR-F6 wing. e CCFD calculation results are consistent with the experimental results, which verifies that CCFD can simulate the flow field calculation of a complex-shaped aircraft.

CT-1 Standard Model.
e CT-1 is a model of highangle static aerodynamic characteristics released by the China Aerodynamics Research and Development Center in 2005 [43]. We focus on the numerical simulation capabilities of the CFD solution software at high angles of static aerodynamic characteristics. Figure 18 shows the pressure nephogram at different angles of attack. Figure 19 shows a comparison of the experimental results and the drag coefficients for each angle of attack. By comparing the experimental results with the calculation results of the static aerodynamic characteristics of the CT-1 standard model at high angles of attack, the CCFD calculation results are consistent with the experimental results, which verifies the numerical simulation ability of CCFD to solve the static aerodynamic characteristics at high angle of attack.
e calculation conditions are Mach number Mach � 0.840, Reynolds number Re � 21.70 × 106, angle of attack α � 3.06°, and side slip angle β � 0°. e calculation method uses Roe scheme, LU_ADI time advancement method, and the Spalart-Allmaras turbulent model.   CCFD is based on the Sunway TaihuLight heterogeneous platform, with its core groups communicating with each other using MPI, and 64 slave cores in the core group accelerated in parallel using Athread. Figure 20 shows that the parallel efficiency of CCFD reached 60% under the Sunway TaihuLight heterogeneous platform with 13,000 cores and 500,500 cores. e parallel result test indicates that the super-large-scale parallel computing scalability of CCFD has met the design requirements.

Conclusions and Future Work
Using the multiblock structure grid parallel technology of domain decomposition, we design the parallel software framework, software process, parallel data structure, and communication mode of parallel CFD solver software. e designed CCFD can be used for large-scale mega-core parallel computing tasks and has good parallel expansion ability and parallel efficiency. e overdecomposition load balancing strategy for CCFD guarantees the load balancing performance of computing and communication. e use of a mesh sequencing method based on multigrid technology can accelerate the iterative convergence speed. We perform the optimizations on DMA, SIMD, assembly instruction rearrangement, and double buffering in the Sunway TaihuLight heterogeneous architecture. e super-large computational scale test with a maximum of 505,000 cores in parallel across the core group achieved a parallel efficiency of 60% based on 13,000 cores. In the future, we will use direct numerical simulation of turbulence models to achieve more realistic geometric flow simulations, increase the scale of cases and parallelism, and achieve more efficient operation of parallel CFD software in heterogeneous systems.

Data Availability
e grid data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
e authors declare that they have no conflicts of interest regarding the publication of this paper.