A Large-Batch Orthorectification Generation Method Based on Adaptive GPU Thread Parameters and Parallel Calculation

Orthorectification reflects a large amount of real and objective information, such as the characteristics of images and the geometric accuracy of maps. Conducting a large batch of orthorectification is a process with high time cost owing to the pixelwise correction each image. A common approach is to use graphics processing unit (GPU) parallel computing to accelerate orthorectification processing. However, most of the existing GPU acceleration studies have adopted experimental testing methods to determine thread parameters, which are inapplicable to different GPUs and affect the GPU acceleration performance. We put forward an adaptive calculation method for GPU thread parameters based on the performance parameters of different GPUs and by simultaneously blocking the image automatically according to the GPU memory space. We used 112 ZY-3 images to test the adaptive GPU and compare it to a general GPU. The experimental results show the following: first, for a single ZY-3 image, the GPU acceleration by the adaptive calculation method presented in this article is 43.22% higher than that by the general GPU, and the correction time is 34.41 times faster than that of the central processing unit. The result of the automatic image blocking was the same as that of the artificial blocking. Second, the experimental performance on four different GPUs indicated that all GPUs exhibited a significant speed boost. Third, for large-batch images, the GPU acceleration by the adaptive GPU was 32.6% higher than that by the general GPU, which provides an adaptive optimization strategy for large-batch image orthorectification.


A Large-Batch Orthorectification Generation Method
Based on Adaptive GPU Thread Parameters and Parallel Calculation data with different repeat cycles and resolutions. Orthorectification is an image processing method that can obtain the correct geographical coordinates of RS images by eliminating the geometric distortions due to terrain and sensor errors [1]. The generated orthoimagery after correction contains substantial real and objective information, not only including the geometric accuracy of maps but also the characteristics of the images. Therefore, the rapid generation of large-scale digital orthophoto maps has garnered considerable attention to meet the requirements of different application scenarios, such as postdisaster monitoring and map drawing [2]. There are two methods of orthorectification: direct and indirect. Direct correction starts from the original satellite images and directly calculates the geodetic reference space coordinates corresponding to each pixel using the algorithm models. Indirect correction is based on simulative orthoimagery; the first affine transforms the image coordinates to ground coordinates. Subsequently, its coordinates on the original image are calculated using an algorithmic model, and finally, grey value resampling is conducted to generate the orthoimagery. Orthorectification algorithms include the physical sensor and general empirical models [3]. With the collinearity conditions, the former expounds a calculation formula of the original image coordinates and actual object geographic coordinates with strict physical parameters [4]. The ratio of one or two polynomials consists of a general empirical model, which is a mathematical fit of physical sensor models. The rational function model (RFM) is used as most general empirical model, and is expressed as transformation relation by the cubic polynomial ratio [5], [6], [7]. Tenuous level and smooth splines modeling technology is used to abate the deviation of the RFM polynomial coefficients for correction [8]. Oh and Jung [9] presented an efficient method that integrates time and cost for automatic rational polynomial coefficients (RPCs) offset compensation correction of high-resolution satellite images. A lookup table can reduce workload of polynomial coefficients calculation and accelerate the correction of images [10]. Wang et al. [11] computed the image orientation parameters based on the plane block adjustment method before correcting every satellite image by the RFM. However, orthorectification is a pixelwise process, which leads to a large number of calculations. Consequently, many existing research work is based on small-batch images, but the orthoimagery generation of largebatch satellite images is relatively less. High-performance computing (HPC) is used in various applications with the advantages of the acceleration aspects of large data volume computing [12], [13], [14]. The HPC model includes heterogeneous computer network, cloud, field programmable gate array, and cluster-based multi-central processing unit (CPU) and graphics processing unit (GPU) systems [15], [16], [17]. The high-efficiency processing algorithms are explored in remote sensing image with HPC, such as hyperspectral imaging and endmember extraction [18], [19], [20], [21]. They concluded that HPC technology exhibits a significant improvement in efficiency compared to traditional computing methods. Owing to the large amount of data from high-resolution satellite images, a CPU-based single processor system cannot satisfy the requirements of fast and real-time data processing, and the multi-CPU system and FPGA are limited by high cost and complex programming; thus, GPU parallel computing is a better choice to accelerate the orthorectification processing. A new inexact Newton beam adjustment algorithm based on a multicore CPU and a single GPU, and the results of that showed that the efficiency in speed of a single GPU increased by twoto three-fold compared to that of a multicore CPU [22]. GPU parallel computing is applied to accelerate the orthorectification process of hyperspectral and unmanned aerial vehicle (UAV) images [23], [24]. Dai and Yang [25] implemented delicate parallel resampling to achieve fast correction. The images can be corrected fast based on the flow model and inverse sensor algorithm in a GPU-accelerated cluster environment [26], and the efficiency of input-output (I/O) operation in GPU correction by maximizing the device occupancy, which enhances the memory access efficiency [27]. Multiple GPUs are applied to efficient processing of modulation transfer function compensating and geo-rectification, and further improved the GPU performance in three aspects: memory throughput, instruction control, and multistream processing [28]. This method artificially divides an image into multiple blocks, which is unsuitable for large high-resolution satellite image processing operations by different GPUs. Li et al. [29] used GPU to accelerate the correction processing of UAV image distortions by configuring the best grid and multilevel memory. However, the Compute Unified Device Architecture (CUDA) thread parameters are determined by experimental testing, which implies that this method may be unsuitable for applications on different GPUs.
According to the aforementioned analysis, some problems still exist in the current research: first, in most studies implementing the existing GPU accelerating correction, the thread block parameters are determined experimentally, which is inapplicable to different GPUs and affects the acceleration efficiency; Second, in the current methods, images are corrected and need to be artificially partitioned, which affects the automation of correction. Therefore, the purpose of this study is to propose an automatic and adaptive GPU thread parameter calculation method for the generation of large-scale orthoimagery to improve the calculation efficiency of orthorectification. Two critical problems are explored in this regard: the adaptive calculation of thread parameters in different GPUs; and the correction of the automatic blocking of the image. The contributions of this article are as follows. 1) We utilize GPU parallel computing to accelerate the indirect orthorectification process; the proposed method is applied to large-area and large-batch high-resolution satellite images, which solves the low time efficiency problem of conventional orthorectification. 2) We propose an adaptive GPU thread parameter calculation method, which solves the problem of optimized calculation to determine the thread parameters of GPUs, and significantly decreases the time requirements. 3) We propose an automatic image blocking method to solve the problem of GPU-accelerated correction of large images that needs to be partitioned according to the size of the GPU memory. The main organizational framework of this article is as follows: the adaptive GPU thread parameters, automatic blocking of the image method, and the specific process of orthorectification are explained in Section II. Section III provides the results and analysis of the orthorectification experiment. A discussion of the influence of thread parameters is presented in Section IV. Finally, Section V concludes this article.

II. METHOD
The process comprises three main steps, which are illustrated in Fig. 1. First, the satellite image is preprocessed, and the connection points are extracted for adjustment, and the result is used to compensate for the RPC. Second, we calculate the geographical longitude and latitude of satellite image four corner points to establish a simulative orthoimagery, and adaptively partition the image and calculate the thread parameters using the GPU. Finally, we use the GPU to calculate the original image coordinates for each pixel along with the grey value resampling to generate the orthoimagery. Fig. 1 shows that the first step in this study is adjustment preprocessing, which eliminates the relative positioning error of the satellite images and the existing error in the initial RPC. First, The SIFT can detect points with the same characteristics and corresponding physical information in the image was used to extract the connection points from the images [30], and the leastsquares method was used to reduce some useless connection points. The block adjustment model was constructed according to the connection points, and the common affine transformation model was used to construct the difference equation for the RFM.

A. Preprocessing
The object parameters and other information can be calculated using the forward intersection calculation method and iteratively solved to obtain the adjustment parameters [31], [32], [33]. Finally, the adjustment results are used to compensate for the RPC parameters and update the RPC files, which are convenient for subsequent calculations using the RFM.

B. Adaptive Calculation of the GPU Thread Parameters and
Image Blocking 1) Automatic Image Blocking: When a GPU is used to accelerate the orthorectification of images, it is necessary to cut the images into several blocks when dealing with large satellite image data owing to the limited storage space of the GPU. The large image mentioned here is relative to the GPU memory space, and when the data size of the processed image exceeds the GPU memory space, it is considered as the large image. Generally, images are artificially divided into small blocks, which are unsuitable for different GPUs. Therefore, we propose an automatic image-blocking method based on the size of the remaining available video memory space and processing data size of the GPU. The data used are as follows: input data are the original image and the digital elevation model (DEM), whose sizes are R img and R DEM , respectively; Output data are the orthoimagery, simulative orthoimagery are considered as orthoimagery after grayscale assignment, the size of the image is the ratio of maximum and minimum latitude and longitude differences to pixel width and height, the product of the data type of the image and the image size is the size of the memory occupied by the orthoimagery, which is R DOM ; the aforementioned data are put into the GPU; and R Gfree is the free space of the GPU memory. As the amount of data in DEM is small compared to that in the original images, it is unnecessary to partition the DEM. Thus, we only divide the input original image and the output orthoimagery, and the calculation scheme is as follows.
Case 1: The total size of the input and output data is less than the available space of the GPU memory, and no division is necessary.
Case 2: The sum of the input and output data is greater than the available space of the GPU memory; the original image and orthoimagery need to be partitioned, and the number of blocks is defined as Block n . The range of the orthoimagery block reflected in the original image is wider, and the excessive part size is k * R img ; k = (latitude ul − latitude ur )/ (latitude max − latitude min ), where latitude ul and latitude ur are the latitudes of the upper left and upper right corner points of the original image, respectively; and latitude max and latitude min are the maximum and minimum latitudes of the original image, respectively. Image block size was ( R img Block n + k * R img ), and the orthoimagery block size was R dom Block n after blocking the images. The formula for the calculation of the image blocks is defined as follows, and ceil () is rounded up: (1) 2) Adaptive Calculation of GPU Thread Parameters: When using the CUDA framework, the pixelwise correction processing part with high parallelism is mapped into the kernel function of CUDA, and the kernel function that starts the correction must set the size of the thread blocks and grids. Conventional GPU acceleration sets the thread block size to 256, which is a two-dimensional size of 16 × 16. In previous studies, the size of the thread block was tested from 64 to 1024, and the optimal number of thread blocks was selected by testing the kernel function performance. The selected number of threads is not always the best for different GPUs, and unsuitable thread parameters will affect the running time of the kernel function. Therefore, we propose an adaptive GPU thread parameter calculation method to determine the thread parameters in CUDA according to the different performance parameters of the GPU to ensure that different GPUs can accelerate the correction process. Fig. 2 shows that this method first obtains the physical performance parameters from the GPU; for example, the available registers and the available share memory (SM) size in a streaming multiprocessor. These parameters are then used to calculate the thread blocks per SM that can be limited by threads, registers, and SM, and calculate the number of warps. The minimum value of the three warp numbers was used to calculate the warp occupation rate and filter out the thread block size with the largest kernel occupation rate. If multiple thread blocks have the same occupation rate, the minimum value is considered as the final thread block size. The thread grid size and boundary qualifier parameter can then be determined according to the thread block size.
From the parameters of the GPU, we can obtain the number of threaded warps used by each SM, denoted as Num warp_SM . A thread warp contains Num thread_warp threads; the thread warp's allocation granularity is Num warp_alloc ; register allocation unit size is Num reg_alloc ; An SM has Num reg_SM registers; and a thread has Num reg_thread registers. Therefore, we can calculate the number of threaded warps in an SM limited by the registers (2)  Owing to the restriction imposed by the number of registers used in the kernel function, the upper limit of a thread block size is determined by the following formula: where Num reg_SM is the amount of registers in a GPU SM and Num reg_thread is the amount of registers used in a thread. And, we assume that the thread block size is Num thread_block = 64, 128 . . . , Maxnum thread_block , and the actual thread warp used by each thread block is The size of SM used by each thread in a kernel function is Size Smemory_block ; the allocation unit size of SM is Size Smemory_alloc ; and the SM size of an SM is Size Smemory_SM . Therefore, the amount of thread blocks that the streaming multiprocessor can allocate is restricted by threads, registers, and SM ⎧ ⎪ ⎨ We can calculate the corresponding thread warp occupancy Max(Occupancy n ) by traversing the set number of threads and obtain the minimum threads when the occupancy is at its maximum, and Num thread_block is the thread block number. CUDA sets the thread block and thread grid parameters through <<<>>> when the orthorectification kernel function begins, and the thread block size is determined using the aforementioned calculation. Selecting the two-dimensional thread indices Block x and Block y should be as average as possible and calculated using the following formula: ⎧ ⎨ ⎩ Block x = 2 m < ceiling Num thread_block , 2 (m = 1, 2, . . . , n) The corresponding two-dimensional thread grids is: Grid x = ceil( Img cols Block x ) and Grid y = ceil( Img rows BLock y ), We can use the boundary qualifier "__launch_bounds_" to specify the thread and thread warp to be used in the compiler [34]. By specifying different parameters to increase the kernel occupation, a suitable kernel can achieve better performance optimization. In this study, the size of the thread block calculated above is threads perB as the first parameter, and (8) is used to calculate the thread warp W arp launch as the second parameter. Simultaneously, the L1 cache is opened in the GPU to further improve efficiency.

3) Orthorectification Based on GPU Parallel Computation:
This study uses the indirect correction method and accelerates the processing using GPU parallel computation. The indirect correction method begins from the orthorectification side and corrects each pixel. It is relatively independent and does not compete with the direct correction calculation; thus, it is more suitable for GPU parallel calculation.
The GPU parallel computing algorithm for orthorectification is presented in Table I. First, the RFM model is constructed according to the RPC parameters and the original image coordinates according to (9) Second, the average elevation of the DEM is used to iteratively calculate the longitude and latitude (B, L) of the four corners and determine the four solstices of the image. The image range can be determined to establish a simulative orthoimagery. Simultaneously, affine transformation parameters can be established according to the spatial resolution. Subsequently, the RPC is put into the constant memory, and the original image, DEM, and simulative orthoimagery are stored in the global memory of GPU, and adaptively calculate the GPU thread parameters simultaneously. Subsequently, the pixelwise (x, y) correction is conducted from the simulative orthoimagery, which uses GPU acceleration to correct each pixel: the corresponding geographical coordinates (B, L) are calculated by the established affine transformation parameters; the corresponding row and columns in the DEM are transformed using (B, L) and the DEM affine parameters; resampling interpolation is conducted to obtain the elevation value H of the longitude and latitude (B, L); the coordinates (X, Y ) of the original image side corresponding are calculated according to (B, L, H) through the positive solution formula of the RFM; the grey value calculated by resampling interpolation on this coordinate is given to the orthoimagery pixel (x, y). Finally, the correction data are transmitted to the CPU, and the orthoimagery is generated after a grey value is assigned to the simulative orthoimagery.

A. Data and Hardware Environment
The data used in this study were obtained from ZY-3 satellite images. China ZY-3 satellite is the earliest high-precision civil stereoscopic optical satellite, which acquires three-dimensional images of a linear array [36]. In this study, 112 images of Taihu Basin, China were used as the original images for orthorectification. The pixel space size is 3.5 m of the forward/backward images and the nadir image resolution is 2.1 m [37]. The details of the original images are presented in Table II. In addition, public 30-m DEM data were used to obtain the elevation (http: //www.gscloud.cn/sources/accessdata/310?pid=302).
The experiment was conducted using Visual Studio 2017. The CPU was Intel i7-10510U, and the GPU was NVIDIA MX 350 2G with Pascal architecture, which matches the computing power of CUDA of 6.1. The GPU parameters used to test the adaptive thread parameters on the different GPUs are presented in Table III.

B. Result of GPU Acceleration Efficiency of a Single Image
The calculation results of the adaptive GPU thread parameters and image blocks for the ZY-3 nadir and forward/backward images are compared in MX 350. As presented in Table IV, the thread block size is 64 (8×8), the thread grid size of the forward/backward image is 2039 × 2048, and that of the nadir image is 3065 × 3072. The value of Warp launch is 32, which is the number of warps actually used by the qualifier. As the forward/backward image data are small, the GPU does not require partitioning, and the number of adaptive blocks of the  II  DETAILED DESCRIPTION OF ZY-3 SATELLITE IMAGES   TABLE III  GPU PARAMETERS USED IN THE EXPERIMENT   TABLE IV  CALCULATION RESULTS OF ADAPTIVE GPU THREAD PARAMETERS AND IMAGE  BLOCKS nadir image is two, which is consistent with the number of artificial blocks. Subsequently, we compared the performance of the general and adaptive GPU calculations, and the time recorded in this section is focused on the pixelwise correction, and does not include other overheads, such as reading and writing images and preprocessing. The nvprof performance analysis tool provided by NVIDIA is used to record the correction time when using CUDA for GPU acceleration. Based on the proposed calculation results of adaptive GPU thread parameters and image blocks, we used bicubic interpolation in the resampling method to conduct a time comparison experiment of orthorectification. The clock () function is used in the CPU to record the quadrupole coordinates require and pixelwise correction time. The calculation of the quadrupole coordinates requires 0.02 s, which only accounts for a very small part of the total running time; this article mainly records the pixelwise corrections, as presented in Table V. For the forward/backward images, the pixelwise correction time varies greatly, the time of pixelwise correction of the CPU single thread is 326.08 s, and the correction time of the general GPU is 16.21 s. The correction time of the proposed adaptive GPU thread parameter method is only 9.43 s, which is 53.17% higher than that of the general GPU, and is 34.58 times faster than that of CPU correction. For the nadir image with more data, the time of CPU correction is 729.83 s and that of general GPU correction is 37.36 s. Adaptive GPU calculation has an acceleration ratio of 34.41 for the CPU, which is 43.22% higher than that of the general GPU. The comparison results demonstrate the adaptive GPU calculation is preferable to those of the CPU and general GPU methods. The blocks set manually are more suitable for the GPU than a conventionally set thread block size. The qualifier parameters calculated according to the determined thread blocks can effectively improve the correction efficiency of the GPU.
In addition, in order to comprehensively evaluate the performance of the proposed method, five rounds independent experiments are conducted and the average time is calculated, as shown in Fig. 3

C. Result of the GPU Acceleration Efficiency of Large-Batch Images
Simultaneously, we individually tested the improvement of batch images correction time and corrected 112 ZY-3 images with the proposed adaptive calculation method. The recording time includes the input, correction, output, and dividing block times of the image. First, the multithreaded correction of a nadir image is tested in the CPU matched with MX 350, as shown in Fig. 4. When the number of threads reaches eight CPU cores, the increase in multithreading time gradually decreases and converges. To prevent CPU overload when running many threads, eight threads are selected to process batch images, and the acceleration times of the GPU are compared. Therefore, the experimental results of orthorectification of 112 ZY-3 images are shown in Fig. 5, and the block-based Wallis color correction is used. As presented in Table VII,    the adaptive calculation in this article, the optimized GPU correction time reached 2084.043 s, which improved the efficiency by 32.6% compared to that of the conventional GPU. Therefore, when mass production of orthoimagery, after optimization using the calculation method of the adaptive GPU thread parameters in this study, the efficiency improvement was better compared to that of the conventional GPU acceleration.

D. Orthorectification Accuracy Based on GPU Parallel Computing
After experimental image preprocessing, we used the bicubic interpolation method to resample the ZY-3 nadir image when the orthorectification was based on GPU parallel computing. The correction accuracy evaluated using the RMSE formula for plane deviation with the same points is presented in Table VIII. We converted the deviation values of longitude and latitude into meters, and the deviation between this correction and the local positioning system (LPS) correction results in the longitudinal and latitudinal directions are 2.2656 and 1.2075 m, respectively, and the overall deviation is 2.5673 m. In this study, no control point was used for orthorectification, and the deviation was approximately one pixel. This effect is good and meets the needs of orthoimagery generation.

IV. DISCUSSION
First, we test the influence of different thread number on the running time of the kernel functions. The maximum threads Maxthread block can be determined by (3), which is 640. Fig. 6 shows that the height of the two-dimensional thread block is 1, and the width is increased by 64 threads each time, from a minimum of 64 to a maximum of 640 correction programs supported by MX 350. The experimental results show that when the line block size is 64, the kernel function runs the least, which is consistent with the results of the adaptive computing thread parameters proposed herein. The increase or decrease in time in Fig. 6 is consistent with the change in threaded warp utilization at values of 64 and 320; in this case, the kernel thread bundle utilization is the highest, thereby enabling the higher performance of the GPU, and the thread parameters are calculated using the thread warp in this study.
The running time of the correction kernel function in the default L2 cache is 31.957 s, whereas that after enabling the L1 cache increased by 3.26% to 30.9152 s, which improved the efficiency. The CUDA kernel function uses a two-dimensional thread index to process images. The number of the two-dimensional thread block is the arithmetic square root of the thread block size or is close to it, with a great effect. The running time results when the calculation thread block size is 64 are shown in Fig. 7. The two-dimensional thread block was 8 × 8, and the kernel function running time was 31.3118 s with the best performance.
Subsequently, we analyzed the impact of the qualifier on the kernel function time when using the different thread warps. The thread number of the qualifier during conventional implementation was 256, and that determined by the adaptive calculation  in this study was 64. The second parameter of the qualifier is the number of threaded warps working in the SM, which can change the warp utilization. Fig. 8 shows that as the utilization rate increases, the kernel running time gradually decreases. The highest performance was obtained when the utilization rate was 100%. When implementing the correction method in this study, we observed that the higher the kernel occupancy, better the performance.

V. CONCLUSION
Orthorectification of images is a time-consuming task. The size of the thread block in most existing parallel computing methods is determined by testing in existing GPU optimization methods, which affects the orthorectification performance on different GPUs. We propose a parallel computing method for orthorectification with adaptive GPU thread parameters, thus solving the problems of adaptively calculating GPU thread parameters in different GPUs and automatically blocking the image to be corrected. After images preprocessing, automatically blocked according to the memory size of the GPU, and the GPU thread parameters are calculated by the physical parameters of the GPU; Finally, orthorectification is achieved based on GPU parallel computing. To verify the efficiency of adaptive computing, experiments were conducted on 112 ZY-3 images, and the results were compared to those using CPU and general GPU methods. The following conclusions were drawn. 1) For a single ZY-3 image, the GPU acceleration by the adaptive calculation method presented in this article is 43.22% higher than that of the general GPU, and 34.41 times faster than the correction time of the CPU. The result of the automatic image blocking was the same as that of the artificial blocking.
2) The method was tested in four different GPUs, and their performances were compared. The results indicate that all the GPUs exhibited a great speed boost. The correction time of the best GPU was only 2.21 s for a single image. 3) For large-batch images, the GPU acceleration by the adaptive calculation method presented in this article is 32.6% higher than that of a general GPU, which provides an adaptive optimization strategy for large-batch image orthorectification. Notably, the main idea of the proposed method can apply in large batches of image processing tasks; for example, image mosaicing, block adjustment, and dodging uniform colors. In the future, we will attempt to further optimize the utilization of streaming multiprocessors in the GPU to achieve better efficient image processing.