On the Efficiency of OpenACC-aided GPU-Based FDTD Approach: Application to Lightning Electromagnetic Fields

An Open Accelerators (OpenACC)-aided graphics processing unit (GPU)-based finite difference time domain (FDTD) method is presented for the first time for the 3D evaluation of lightning radiated electromagnetic fields along a complex terrain with arbitrary topography. The OpenACC directive-based programming model is used to enhance the computational performance, and the results are compared with those obtained by using a CPU-based model. It is shown that the OpenACC GPU implementation provides very accurate results while being more than 20 times faster than the CPU. The presented results support the use of OpenACC not only in relation to lightning electromagnetics problems, but also to large-scale realistic electromagnetic compatibility (EMC) applications in which computation time efficiency is a critical factor.


Introduction
Rigorous modeling and solution of large-scale electromagnetic compatibility (EMC) problems often require prohibitive computational resources. Fast algorithms and techniques, as well as hardware platforms with high pipelining capability, are usually used to circumvent this problem (e.g., [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15]). Graphics processing units (GPUs) are characterized by a high parallelism capability. Figure 1 presents schematically the CPU and GPU architectures. As can be seen in Figure 1, the number of threads in GPUs is dramatically higher than that in CPUs. A CPU consists of a few powerful arithmetic logic unit (ALU) cores supported by a large cache memory and a complex control module capable of managing a few threads at a time. On the other hand, a GPU includes many small ALUs, small control modules, and a small cache. Furthermore, it can execute thousands of threads concurrently since it is optimized for parallel operations [16]. One particular example of a large-scale EMC problem is the evaluation of lightning electromagnetic fields and their coupling to structures. It is a large-scale problem because of (i) the scale of the solution domain, which can be on the order of several tens of kilometers, and (ii) the complexity of the propagation media, when considering the inhomogeneity and roughness of the soil (e.g., [17][18][19][20][21][22][23][24]). In most of the studies focused on the evaluation of lightning radiated electromagnetic fields and their induced disturbances on nearby structures, such as overhead transmission lines and buried cables (e.g., [25][26][27][28][29][30][31]), computations have been carried out by making use of CPU-based systems. However, there exist several attempts in which different hardware platforms have been used for the same study (e.g., [32][33][34][35][36]).
One example is the use of GPU-based compute unified device architecture (CUDA) programming for the finite difference time domain (FDTD) evaluation of lightning electromagnetic fields. Due to its high-speed calculation, this approach facilitates three-dimensional modeling of a problem, taking into account parameter uncertainties such as an irregular lightning channel and surface roughness [34]. This approach has been reported to be roughly 100 times faster than serial processing on a CPU [33]. Although the GPU-based CUDA programming approach is highly efficient and provides the programmer with the flexibility to utilize various memories such as cache memory, it requires the involvement of the programmer in the determination of many programming details [37]. The OpenACC programming model, proposed by NVIDIA, CAPS, Cray, and the Portland Group, is a general user-driven directive-based parallel programming model developed for engineers and scientists. The programmer incorporates compiler directives and library routines into FORTRAN, C, or C++ source code to mark the regions of the code to be accelerated in parallel on the GPU. In fact, the programmer is not preoccupied with parallelism details and, in turn, leaves these tasks to the compiler. This programming model efficiently and intelligently launches the kernels and offloads the parallelized code onto the GPU. OpenACC was first proposed in 2011 as a high-level programming approach offering high performance and portability across almost all types of processors. This programming model allows creating and executing code on current and future graphics hardware [38], since OpenACC can transfer calculations and data from the host to the accelerator device. The host and the accelerator device may have different architectures, and their corresponding memories may be either shared or separate.
An accelerated computing model of OpenACC is presented in Figure 2. According to Figure 2, the OpenACC compiler executes the code and manages the data transferred between the host and the accelerator device. Furthermore, for the sake of portability, OpenACC relies on top-level compiler directives to parallelize different sections of the code and uses parallel, optimizing compilers to build and execute the corresponding code. Several compilation directives or programs can be defined by OpenACC for the parallel execution of code fragments. Recently, graphics processing using OpenACC has caught the attention of many researchers due to its relative simplicity and high performance (e.g., [39,40]). One particular example of using OpenACC on graphics processors is the calculation of the vector potential using finite differences generated by the time-domain Green's functions of layered media, where the computational speed was 45.97 times the CPU computational speed in MATLAB [40].
In this paper, OpenACC-based GPU processing is used to tackle the problem of long computation times in the FDTD method for the evaluation of electromagnetic fields generated by a lightning channel. It is worth noting that OpenACC benefits from relatively simple programming while dramatically increasing computational speed. The rest of the paper is organized as follows. Section 2 presents the required steps for executing the computational FDTD code on the GPU using OpenACC. Section 3 describes the adopted models and computational methods, as well as the FDTD parameters. Section 4 presents the processing speed of the proposed method, with the results compared against the CPU serial processing speed. Concluding remarks are provided in Section 5.

FDTD Algorithm
The flowchart of the FDTD algorithm used to calculate lightning electromagnetic fields consists of five main modules [41,42], as shown in Figure 3. First, all variables and parameters are set and defined in the initialization step, followed by the allocation of the memory required to store each component of the electromagnetic field. In each time step, the lightning channel source model excites the solution space. Then, the components of the electric (Ex, Ey, Ez) and magnetic (Hx, Hy, Hz) fields are computed and updated. Finally, boundary conditions are applied to avoid unwanted reflections of the electromagnetic field. These steps are repeated until the simulation covers the desired time period. The output of the FDTD algorithm is the electric and magnetic fields at each observation point within the specified time period.


Hardware and Software Used
For the GPU implementation of the adopted FDTD approach, the GeForce GTX 1050 Ti hardware and PGI Community Edition 17.10 software were used. The PGI compiler was developed by the Portland Group and NVIDIA [43], and the PGI Community Edition includes a free license for the PGI C++ compilers and tools for multicore CPUs and NVIDIA Tesla GPUs. The Community Edition enables the development of performance-portable high-performance computing (HPC) applications with a uniform source code across the most widely used parallel processors and systems [44]. Details of the GPU are presented in Table 1. For comparison purposes, we also repeated the calculations on a CPU, an Intel® Core™ i7-6800K @ 3.40 GHz.

OpenACC Programming
This section aims to provide more elaborate details about OpenACC programming, as well as compiler specifications for building and executing codes on the GPU.


OpenACC Data Clause
The data required for the process execution are transferred to the GPU memory prior to the parallel region in the code, and the appropriate OpenACC directives are then inserted into the main code for each loop. Since it is burdensome for a compiler to determine whether data will be needed in the future, data are copied conservatively in case of future demand. Data locality helps to solve this problem: the data remain in the local memory of the host or the device for as long as needed. Data clauses enable the programmer to control how and when data are created on, and copied to or from, the device. The data directives are briefly described as follows:
• Copy: Allocates space for the listed variables on the device, initializes them by copying data to the device at the start of the region, copies the generated results back to the host at the end of the region, and finally releases the space on the device [38,45];
• Copyin: Allocates space for the listed variables on the device and initializes them by copying data to the device at the start of the region, but frees the space on the device without copying the data back to the host [38,45].
In the FDTD calculations, the start and end points of the data-locality region were positioned before the time loop (iteration) and after all modules and functions, respectively. As shown in Figure 4, the starting point of data locality in the FDTD algorithm was placed before the time loop so that the data needed for the process execution were moved to the GPU memory, and the ending point was placed after the execution of all modules and functions so that the generated output data were returned to the CPU memory via the copy clause (input-only arrays can use copyin, which skips the copy back to the host). This placement avoids extra, inefficient data transfers during the calculation. Therefore, the data transfer time between the GPU and CPU is reduced by using the data clause, remarkably reducing the computation time.
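The data-locality placement described above can be sketched as follows: a single `acc data` region wraps the whole time loop, so the field arrays cross the PCIe bus only twice. The array names, grid size, and update stencil are illustrative, not the paper's code; without an OpenACC compiler the pragmas are simply ignored and the code runs serially.

```c
#define N 32   /* illustrative grid size */

static float ex[N][N][N], hy[N][N][N];

void fdtd_time_loop(int nt) {
    /* initialize an illustrative magnetic-field distribution */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                hy[i][j][k] = (float)k;

    /* one data region wraps the whole time loop: hy moves host->device
     * once (copyin); ex moves in at the start and back at the end (copy) */
    #pragma acc data copy(ex) copyin(hy)
    for (int n = 0; n < nt; n++) {
        #pragma acc kernels loop
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 1; k < N; k++)
                    ex[i][j][k] += 0.5f * (hy[i][j][k] - hy[i][j][k - 1]);
    }
}

/* accessor for checking a field sample on the host after the region ends */
float sample_ex(void) { return ex[0][0][1]; }
```

Placing the data region inside the time loop instead would force a full round trip of every array on every iteration, which is exactly the overhead the text warns against.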


OpenACC Parallelize Loops
Parallel and kernels regions are the two approaches provided by OpenACC for incorporating parallelism into the code:
• Kernels construct: The kernels construct marks a code region that may contain parallelism. The compiler analyzes the region, relying on its automatic parallelization capabilities to identify the loops that are safe to parallelize and to accelerate them. The compiler is free to select the best way of mapping the parallelism in the loops onto the hardware [38,45];
• Parallel construct: When this directive is placed on a loop, the programmer asserts that the affected loop is safe to parallelize and enables the compiler to choose how to schedule the loop iterations on the target accelerator. In this case, the programmer determines the availability of parallelism, whereas the decision on mapping that parallelism onto the accelerator depends on the compiler's knowledge of the device [38,45].
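The contrast between the two constructs can be seen on a simple field-update loop. Both variants below compute the same result; with `kernels` the compiler decides what is safe, while with `parallel loop` the programmer takes responsibility for independence. The function and array names are illustrative, and the pragmas are ignored by a plain C compiler.

```c
#define M 1000   /* illustrative array length */

/* kernels: the compiler analyzes the region and parallelizes what it
 * can prove safe */
void update_kernels(float *e, const float *h) {
    #pragma acc kernels loop
    for (int i = 1; i < M; i++)
        e[i] += 0.5f * (h[i] - h[i - 1]);
}

/* parallel loop: the programmer asserts the iterations are independent */
void update_parallel(float *e, const float *h) {
    #pragma acc parallel loop
    for (int i = 1; i < M; i++)
        e[i] += 0.5f * (h[i] - h[i - 1]);
}
```

Note that the loop reads `h` and writes `e`, so iterations are genuinely independent; `parallel loop` on a loop carrying a true dependence would silently produce wrong results, which is why `kernels` is the safer default.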
The OpenACC parallelizing directives were applied to different parts of the code to implement parallel computing. For example, as can be seen in Figure 5, the #pragma acc kernels loop directive is set at the beginning of each of the three loops that calculate and update the electromagnetic field components. Table 2 presents the computation time associated with these two directives. According to Table 2, the C-on-CPU runtime (without optimization) is 2231.59 s, which serves as the benchmark time. The speed was increased by a factor of 2.43 by optimizing the C code on the CPU. Moreover, as can be observed in the table, the computing speed on the GPU increased by factors of 10.49 and 11.14 using the #pragma acc parallel loop and #pragma acc kernels loop directives, respectively.

Save Variable
As shown in Figure 1, a GPU processor is organized as a set of multiprocessors, each with its own stream processors and memory. The stream processors are fully capable of executing integer and single-precision floating point arithmetic, with additional cores dedicated to double-precision operations [16]. The results in Table 3 reveal that when the variables are of the float type, the calculation time of the GPU process is reduced. Specifically, by storing variables in single precision, GPU processing can be up to 20.02 times faster than serial CPU processing with double-precision variables.

Optimization by Tuning Number of Vectors
The OpenACC performance model includes three sections: Gang (Thread block), Worker (Warp), and Vector (Threads in a Warp) [38].
Gangs, workers, and vectors can be set up automatically by the OpenACC compiler or tuned by the developer to yield the optimum configuration. The number of vectors is modified in the OpenACC directives to gain maximum occupancy and speed-up on the GPU. Trying different vector lengths via the vector_length clause can help reach the optimal runtime; this approach was used in this study. Figure 6 shows the runtime obtained by varying the vector length, relative to the compiler-selected value. It is worth noting that the best performance was obtained for a vector length of 1024.
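The tuning described above amounts to adding a single clause to the parallel directive; the sketch below fixes the vector length at 1024, the value that gave the best runtime in this study. The loop body and names are illustrative, and the pragma is ignored by a plain C compiler.

```c
#define NPTS 4096   /* illustrative array length */

/* Field update with an explicit vector length overriding the
 * compiler-selected value. */
void update_tuned(float *e, const float *h) {
    #pragma acc parallel loop vector_length(1024)
    for (int i = 1; i < NPTS; i++)
        e[i] += 0.5f * (h[i] - h[i - 1]);
}
```

Sweeping this clause over, e.g., 128, 256, 512, and 1024 and timing each build is how the curve in Figure 6 can be reproduced.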



Application: Models and Hypotheses Considered for Analysis
The geometry of the problem is shown in Figure 7. The aim is to evaluate lightning electromagnetic fields over a mountainous terrain by making use of the FDTD technique. We assume a pyramidal mountain with a height Hm, which is varied from 0 (flat ground case) to 1000 m, while its cross-section is assumed to have an area of A = 40,000 m². The ground and mountain are assumed to be a perfect electric conductor (PEC), and the ground depth is hm = 100 m. The vertical electric field and the radial magnetic field components are calculated for three observation points A, B, and C, located 100, 400, and 500 m from the lightning channel base, respectively, all on the ground surface. In the FDTD calculations, a vertical array of current sources was used for the modeling of the lightning channel [46]. The modified transmission line model with exponential decay (MTLE) [47,48] was used for modeling the return stroke channel. For subsequent stroke current waveforms, a sum of two Heidler functions was used [49], given in Equation (1):

i(0, t) = (I01/η1) · [(t/τ11)^n1 / (1 + (t/τ11)^n1)] · e^(−t/τ21) + (I02/η2) · [(t/τ12)^n2 / (1 + (t/τ12)^n2)] · e^(−t/τ22)    (1)

where I01 = 10.7 kA, τ11 = 0.25 µs, τ21 = 2.5 µs, n1 = 2, η1 = 0.639, I02 = 6.5 kA, τ12 = 0.25 µs, τ22 = 230 µs, n2 = 2, and η2 = 0.867, respectively [49]. The current decay constant and the return stroke speed are λ = 2000 m and v = 1.5 × 10⁸ m/s, respectively. The time increment was set to 9.5 ns. The dimensions of the computational domain are 1500 × 1500 × 1500 m, with a spatial mesh grid composed of cubic cells of 5 × 5 × 5 m. For the electromagnetic response analysis of a structure in an unbounded space, absorbing boundary conditions that suppress unwanted reflections must be applied at the planes truncating the open space, limiting the working volume. In order to avoid reflections at the domain boundaries, Liao's condition is adopted as the absorbing boundary condition [50].
The horizontal magnetic field (H y ) components at the observation points A, B, and C are calculated assuming different heights for the mountain. The results are presented in Figure 8.
According to Figure 8, the time delay, peak value, waveshape, and amplitude of the horizontal magnetic field (Hy) are affected by the presence of the mountain. Although the effect of the mountain on the time delay (peak field and onset time) was negligible for point A, its effect on points B and C (right side of the hill) was significant [51]. The results presented in Figure 8 were validated against the simulation results obtained for canonical structures in [51]. The only differences are that a GPU was used and the domain dimensions were 1500 × 1500 × 1500 m in this study, whereas a CPU processor was used in [51]. Figure 9 presents, for validation purposes, the obtained results compared with those obtained using a 2D-FDTD method.

Performance Analysis
This section compares the graphics processing time using OpenACC for the 3D-FDTD problem considered in this study (volume dimensions of 1500 × 1500 × 1500 m) with the execution time of the same algorithm using serial C on the CPU. Table 4 presents the performance of OpenACC compared to CPU serial C. As can be seen in Table 4, the OpenACC execution, in its most efficient configuration, was 21.22 times faster than serial processing on the CPU without compromising precision. As can be seen in Table 3, the calculation time of the GPU process was drastically reduced by storing the variables as the float type (single precision). Although float variables led to a decrease in the accuracy of the computation, this reduction was negligible. Figure 10 depicts the difference (computing error) between CPU and GPU processing for point A with the mountain height set to zero (Hm = 0). The error is smaller than 0.008% and can be considered insignificant. In addition, Figure 11 indicates the gain in computation speed as a function of the size of the workspace. It can be seen that the GPU performance advantage increases with larger simulation domains.


Conclusions
An OpenACC-aided graphics processing unit (GPU)-based finite difference time domain (FDTD) method was presented for the first time for the 3D evaluation of lightning radiated electromagnetic fields along a complex terrain with arbitrary topography. The OpenACC directive-based programming model was used to enhance the computational performance, and the results were compared with those obtained by using a CPU-based model. It was shown that the OpenACC GPU implementation provides very accurate results while being more than 20 times faster than the CPU. The presented results support the use of OpenACC not only for lightning electromagnetics problems, but also for large-scale realistic EMC applications in which computation time efficiency is a critical factor.

Funding: This work was supported by the Bakhtar Regional Electricity Company, Arak, Iran.

Conflicts of Interest:
The authors declare no conflict of interest.