Design of GPU Network-on-Chip for Real-Time Video Super-Resolution Reconstruction

Deep learning delivers better output quality than traditional algorithms for video super-resolution (SR), but the network models require large resources and have poor real-time performance. This paper focuses on the speed problem of SR and achieves real-time SR through the collaborative design of a deep learning video SR algorithm and GPU parallel acceleration. An algorithm combining deep learning networks with a lookup table (LUT) is proposed for video SR, which ensures both the SR effect and ease of GPU parallel acceleration. The computational efficiency of the GPU network-on-chip algorithm is improved to ensure real-time performance by three major GPU optimization strategies: storage access optimization, conditional branching function optimization, and threading optimization. Finally, the network-on-chip was implemented on an RTX 3090 GPU, and the validity of the algorithm was demonstrated through ablation experiments. In addition, SR performance is compared with existing classical algorithms on standard datasets. The new algorithm was found to be more efficient than the SR-LUT algorithm. The average PSNR was 0.61 dB higher than that of the SR-LUT-V algorithm and 0.24 dB higher than that of the SR-LUT-S algorithm. The speed of SR on real video was also tested. For a real video with a resolution of 540 × 540, the proposed GPU network-on-chip achieved a speed of 42 FPS. The new method is 9.1 times faster than the original SR-LUT-S fast method imported directly into the GPU for processing.


Introduction
The demand for high-resolution (HR) images is growing with the advancement of information technology and modern industrial civilization, but image clarity is limited by the capabilities of sensors and processors. The goal of super-resolution (SR) is to convert low-resolution (LR) images into HR images, which has important research value. The deep learning SR method is a hot research topic because it can achieve impressive results. In 2014, Dong et al. [1] first proposed the SRCNN method to achieve image SR reconstruction; many scholars then carried out further research based on SRCNN, such as FSRCNN [2], LapSRN [3], IMDN [4][5][6][7], etc. These methods achieved good results in terms of the peak signal-to-noise ratio (PSNR), but usually cannot be run in real time because of their large computational burden. Although specialized computing engines can enhance efficiency, the fixed "black box" network models make it difficult to accelerate the calculation. The high hardware requirements and large computational costs hinder the wide application of deep learning SR algorithms.
In recent years, significant progress has been made in the hardware deployment of SR technology. Researchers have gradually applied SR technology to various hardware platforms by optimizing network structures, network quantization, and hardware acceleration methods. Lightweight network structures have notable advantages in hardware deployment. For example, Lim et al. [8] proposed EDSR, which is based on deep residual networks.
The size of the LUT increases exponentially with the RF size. In practice, if the RF size is 3 or greater, the complete LUT can become very large. To limit the size of the LUT, a uniform sampling method can be employed. In this paper, a lightweight neural network, SR-Net, with an RF size of 4 is designed. SR-Net is a convolutional neural network (CNN) with 6 convolutional layers, in which the first layer has a kernel size of 2 × 2 and the rest have a kernel size of 1 × 1. The number of output channels for the first 5 convolutional layers is set to 64, and the number of output channels for the last layer is set to the square of the image upscaling factor.
To enhance the extraction of local structural features at different angles and increase the RF size, this paper adopts the method proposed by Jo et al. [17]. The image is rotated by 0, 90, 180, and 270 degrees, and the four outputs of the rotated images are fused to obtain the SR result. In previous deep SR works, the rotational strategy has been used only at test time to maximize accuracy [8,18]. In the original fusion method, the four outputs are averaged. Formally, the final output, $\hat{y}_i$, can be expressed as Formula (1):

$$\hat{y}_i = \sum_{j=0}^{3} A_j \, R_j^{-1}\big(f(R_j(x_i))\big) \qquad (1)$$

in which $\hat{y}_i$ is the final output of SR reconstruction, $x_i$ is the LR input patch, $A_j$ is the coefficient for angle $j$ (the averaging of the original method corresponds to $A_j = 1/4$), $f$ is the deep SR network, $R_j$ is the image rotation by $j \times 90$ degrees, and $R_j^{-1}$ is the reverse rotation. In this paper, considering that the local texture structure used in SR is directional, the weights for different directions should be different. Therefore, the coefficient value $A_j$ for each pixel at each angle is obtained through a lightweight CNN called ROT-Net, whose network structure is shown in Figure 2. ROT-Net consists of three convolutional layers and two fully connected layers, in which the ReLU activation function and max pooling operation are used for feature extraction. To adapt to images with different scales, adaptive average pooling is introduced to fix the feature size. Finally, the four results from the LUT are fused according to the weights from ROT-Net, and an SR result with more detailed information is obtained.
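The fusion in Formula (1) can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names `rot90` and `fuse_rotations` are hypothetical, images are plain nested lists, and the weights $A_j$ are simplified to one scalar per angle, whereas ROT-Net predicts a weight per pixel.

```python
from typing import Callable, List, Sequence

Image = List[List[float]]


def rot90(img: Image, k: int) -> Image:
    """Rotate a 2D list counter-clockwise by k * 90 degrees."""
    for _ in range(k % 4):
        # Counter-clockwise quarter turn: transpose, then reverse the row order.
        img = [list(row) for row in zip(*img)][::-1]
    return img


def fuse_rotations(x: Image,
                   f: Callable[[Image], Image],
                   weights: Sequence[float]) -> Image:
    """Weighted sum over j of A_j * R_j^{-1}(f(R_j(x))).

    Assumption: `weights` holds one scalar A_j per angle (0, 90, 180, 270).
    """
    out = None
    for j, a_j in enumerate(weights):
        # Rotate the input, run the SR mapping, then rotate the result back.
        y_j = rot90(f(rot90(x, j)), -j)
        if out is None:
            out = [[0.0] * len(y_j[0]) for _ in y_j]
        for r, row in enumerate(y_j):
            for c, value in enumerate(row):
                out[r][c] += a_j * value
    return out
```

With `weights = [0.25, 0.25, 0.25, 0.25]` the sketch reduces to the plain average of the original fusion method; unequal weights favour particular directions.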
The training process of the network comprises two stages. In the first stage, a lightweight SR-Net network is trained using the DIV2K dataset to obtain complete LUT parameters. In the second stage, we fixed the parameters of the SR-Net and trained the ROT-Net. The four results with different rotation angles from the LUT were fused according to the adaptive weights from ROT-Net to generate the final SR image.

LUT and Interpolation Design
After training the SR-Net network, the LUT is built for the SR-Net network with an RF size of 4. The mapping method of the LUT is shown in Figure 3. The output values of the network are indexed by the four corresponding input values and saved in the LUT. As illustrated in Figure 3, the LUT is termed a 4D-LUT, which consists of four input values and two output values.

Once the LUT is built, SR is performed solely with the LUT. However, storing the complete mapping of the network in the LUT consumes a significant amount of storage, and the storage requirement grows exponentially with the dimension of the LUT. The size of the LUT can be reduced by uniformly sampling the original input space of $2^8$ bins (0-255 for an 8-bit input image). For example, each dimension of a 4D-LUT can be divided into $2^4 + 1$ bins by using a sampling interval of $2^4$, which reduces the overall size from 64 GB for the full LUT by several orders of magnitude [17]. After uniform sampling, the entire LUT process is divided into two specific steps: calculating the original quantized coordinates and performing interpolation operations. The interpolation coordinates are calculated from the original quantized coordinates and weights.
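The storage saving from uniform sampling can be checked with a short calculation. The sketch below assumes an upscaling factor of 4 and one byte per stored output value; the function name `lut_size_bytes` is hypothetical.

```python
def lut_size_bytes(bins_per_dim: int, dims: int = 4, scale: int = 4,
                   bytes_per_value: int = 1) -> int:
    """Size of a LUT with `bins_per_dim` entries along each of `dims` axes.

    Assumption: every entry stores scale**2 output values of
    `bytes_per_value` bytes each.
    """
    return bins_per_dim ** dims * scale ** 2 * bytes_per_value


full = lut_size_bytes(2 ** 8)         # all 256 input levels per axis
sampled = lut_size_bytes(2 ** 4 + 1)  # 17 sampled levels per axis

print(f"full LUT:    {full / 2 ** 30:.0f} GiB")
print(f"sampled LUT: {sampled / 2 ** 20:.3f} MiB")
print(f"reduction:   {full / sampled:.0f}x")
```

Under these assumptions the full 4D-LUT occupies 64 GiB, which agrees with the figure quoted in the text, while the sampled LUT needs 17^4 × 16 = 1,336,336 bytes.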
After obtaining the original quantized coordinates, an interpolation calculation is performed using the tetrahedral interpolation method [19]. A tetrahedron is formed by four non-coplanar three-dimensional coordinate points, and interpolation is performed on the four triangles formed by the tetrahedron. Based on the weight of the interpolation point relative to the four vertices of the tetrahedron, the final interpolation result is obtained by weighting the interpolation results of the individual triangles. The 2D equivalent, triangular interpolation on a sampled 2D-LUT with a sampling interval of $2^4$, is shown in Figure 4. For the query input $I_0 = 36$ and $I_1 = 60$, the nearest points, $P_{00}$, $P_{01}$, and $P_{11}$, and the corresponding weights, $w_0$, $w_1$, and $w_2$, are determined. The output value is calculated as the weighted sum. The same principle applies to 3D and 4D LUTs.
In Figure 4 (the 2D equivalent, triangular interpolation), the input values $I_0 = 36$ ($00100100_2$) and $I_1 = 60$ ($00111100_2$) are first split into the 4 most significant bits (MSBs) and the 4 least significant bits (LSBs). The MSBs of $I_0$ and $I_1$ are 2 and 3, respectively, and are used to determine the nearest sampling point. The LSBs of $I_0$ and $I_1$ are $L_x = 4$ and $L_y = 12$, respectively, and are used to determine the bounding triangle and the weights of its vertices. Two bounding vertices are fixed at $P_{00} = LUT[2][3]$ and $P_{11} = LUT[2 + 1][3 + 1]$, and the third vertex is determined by comparing $L_x$ and $L_y$. Here, since $L_x < L_y$, we choose $P_{01} = LUT[2][3 + 1]$. The weight of each vertex is proportional to the area of the corresponding triangle and is calculated as $w_0 = 2^4 - L_y$, $w_1 = L_y - L_x$, and $w_2 = L_x$.
The predicted output value of this 2D-LUT example is:

$$\hat{V} = \frac{w_0 P_{00} + w_1 P_{01} + w_2 P_{11}}{2^4}$$

The same principle applies to 4D LUTs. Based on the LSB values ($L_x$, $L_y$, $L_z$, and $L_t$ for $I_0$, $I_1$, $I_2$, and $I_3$, respectively), one of the 24 possible orderings of $L_x$, $L_y$, $L_z$, and $L_t$ is selected by a judgement mechanism to determine the predicted output value. Based on the coordinate information and using tetrahedral interpolation, the predicted output value $\hat{V}$ of the 4D-LUT is calculated using Formula (2):

$$\hat{V} = \frac{1}{W} \sum_{i} w_i P_i \qquad (2)$$

where $W$ is the sampling interval, $w_i$ is the weight assigned to the interpolated point with respect to the sampling coordinate points, and $P_i$ is the set of all interpolation coordinate points.
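The worked 2D example and its 4D generalization can be reproduced with the following sketch. It swaps the explicit 24-way case analysis for a standard sorting-based formulation of simplex interpolation, which yields the same vertices and weights for the 2D example above; the function name `simplex_interpolate` and the callable `lut` argument are assumptions made for illustration.

```python
from typing import Callable, Sequence, Tuple


def simplex_interpolate(lut: Callable[[Tuple[int, ...]], float],
                        inputs: Sequence[int], bits: int = 4) -> float:
    """Interpolate a sampled LUT at `inputs` (8-bit values, one per axis).

    `lut` maps a tuple of sampled indices (the MSBs) to a stored value.
    """
    W = 1 << bits                             # sampling interval, 2**4 = 16
    msb = [v >> bits for v in inputs]         # nearest lower sampling point
    lsb = [v & (W - 1) for v in inputs]       # offset inside the cell
    # Visit the axes in order of decreasing LSB (replaces the branch table).
    order = sorted(range(len(inputs)), key=lambda a: -lsb[a])
    bounds = [W] + [lsb[a] for a in order] + [0]
    weights = [bounds[i] - bounds[i + 1] for i in range(len(bounds) - 1)]

    vertex = list(msb)
    total = weights[0] * lut(tuple(vertex))   # base vertex, e.g. P00
    for axis, w in zip(order, weights[1:]):
        vertex[axis] += 1                     # step to the next simplex vertex
        total += w * lut(tuple(vertex))
    return total / W


# Hypothetical LUT sampled from the affine function f(x, y) = x + 2 * y.
lut_2d = lambda idx: 16 * idx[0] + 2 * 16 * idx[1]
print(simplex_interpolate(lut_2d, (36, 60)))
```

For the query (36, 60) the sketch uses the weights 4, 8, and 4 on $P_{00}$, $P_{01}$, and $P_{11}$, matching $w_0 = 2^4 - L_y$, $w_1 = L_y - L_x$, and $w_2 = L_x$. Because simplex interpolation is exact for affine functions, the affine test LUT makes the result easy to verify by hand.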
As the input LR image size increases, the computation time for SR using LUTs grows with the number of pixels to be processed. Therefore, optimizing the LUT structure and interpolation algorithm is extremely important.

GPU On-Chip Optimization Implementation of the Algorithm
A GPU is a high-performance computing device. In 2007, NVIDIA introduced the Compute Unified Device Architecture (CUDA). To deploy the algorithm on a GPU and implement it as a GPU NoC, we adopted the CUDA programming framework, which supports GPU computing. CUDA provides several memory spaces [20][21][22], including global memory, shared memory, and constant memory, among others. These memory spaces are used at different locations in the algorithm, based on their storage capacity and access speed. The effective allocation of data storage space plays a crucial role in enhancing the overall performance of the algorithm. In this study, the quantized coordinate information of each pixel and the final interpolation result are obtained through a LUT, which requires traversing all pixels in the image. If executed serially, the algorithm cannot meet the real-time requirements. Therefore, the computation of the original quantized coordinates and the final SR process are parallelized on the GPU. In addition, three optimization strategies are used to optimize the GPU parallel algorithm for video SR scenarios.

Implementation of the Parallel SR Algorithm Based on CUDA
In the new SR algorithm, based on the LUT, the tetrahedral interpolation method is used according to the original quantized coordinates and weights. The interpolation coordinates can be directly obtained for sampling points, but the nearest neighbor points need to be calculated for non-sampling points. For example, the original quantized coordinates P (the neighbor points) include: in 2D, only P 00 , P 01 , P 10 , and P 11 ; in 3D, 8 neighboring points from P 000 to P 111 ; and in 4D, 16 neighboring points from P 0000 to P 1111 . Thus, if the dimension of the LUT increases, the number of calculated neighboring points will also increase. In the paper, two kernel functions, PCoordinatesKernel and SRLinearKernel, are designed. The PCoordinatesKernel obtains the coordinates P by performing a lookup in the LUT using the input image data. The SRLinearKernel performs image SR by interpolating the original image using coordinate data.
For the 4D-LUT, the image data is flattened into a one-dimensional array to facilitate data manipulation in CUDA. A one-dimensional coordinate array $p_i$ is obtained by querying the LUT, as given by Formula (3), where $b$ is the set of index values corresponding to the upper four bits of the image pixel values in the LUT, $r$ represents the magnification factor, and $i$ indexes the pixels produced by SR of a single pixel (taking integer values between 0 and $r^2$). The final original quantized coordinates $P$ are given by Formula (4), where $o$ is the number of coordinates and $R$ is the dimension of the LUT.
The coordinates of the current pixel are determined by the index value of the CUDA thread. We set block_size = k (0 < k ≤ 1024) and grid_size = (total number of pixels + k − 1)/k, where block_size is the number of threads in a thread block and grid_size is the number of thread blocks in the grid launched for the kernel function. Using the built-in one-dimensional indexing variables blockDim.x, blockIdx.x, and threadIdx.x in CUDA, image pixel traversal can be accomplished: blockDim.x is the number of threads contained in a thread block, blockIdx.x is the index of the thread block within the grid, and threadIdx.x is the index of the thread within its thread block [23]. The precise scheduling of GPU threads can be achieved by combining these three built-in variables. The thread indexes tidx and tidy are calculated as Formula (5):

$$\mathrm{tidx} = \mathrm{blockIdx.x} \times \mathrm{blockDim.x} + \mathrm{threadIdx.x}, \qquad \mathrm{tidy} = \mathrm{blockIdx.y} \times \mathrm{blockDim.y} + \mathrm{threadIdx.y} \qquad (5)$$

where tidx and tidy denote the indexes mapped to the x and y directions of the thread block, respectively. The pre-processed image data and LUT data are passed to the PCoordinatesKernel GPU kernel function to compute the coordinates, $P$, which are then used for the subsequent calculation of interpolated coordinates. The flow chart for SR with LUTs on the CPU/GPU is shown in Figure 5.
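The one-dimensional launch arithmetic described above can be simulated on the host. The following Python sketch is only an emulation of the CUDA indexing (no GPU is involved); `launch_config` and `covered_pixels` are hypothetical helper names, and the bounds guard mirrors the check a real kernel would need because the last block may be only partially used.

```python
from typing import List, Tuple


def launch_config(num_pixels: int, block_size: int = 256) -> Tuple[int, int]:
    """Return (grid_size, block_size) using ceiling division."""
    if not 0 < block_size <= 1024:
        raise ValueError("block_size must satisfy 0 < k <= 1024")
    grid_size = (num_pixels + block_size - 1) // block_size
    return grid_size, block_size


def covered_pixels(num_pixels: int, block_size: int = 256) -> List[int]:
    """Count how many emulated threads touch each pixel."""
    grid_size, block_dim = launch_config(num_pixels, block_size)
    hits = [0] * num_pixels
    for block_idx in range(grid_size):          # plays the role of blockIdx.x
        for thread_idx in range(block_dim):     # plays the role of threadIdx.x
            tid = block_idx * block_dim + thread_idx
            if tid < num_pixels:                # guard for the last block
                hits[tid] += 1
    return hits


print(launch_config(540 * 540, 1024))
```

For a 540 × 540 frame and 1024 threads per block, the ceiling division gives 285 blocks, and every pixel is visited exactly once.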
The interpolation process requires flattening the three-channel RGB image into a one-dimensional array and sending it to the SRLinearKernel kernel function. The calculation requires mapping from a three-dimensional matrix (Formula (2)) to a one-dimensional array. Specifically, the mapping to the one-dimensional coordinate space is calculated for each of the RGB channels, followed by interpolation. The final predicted output of the flattened image data is obtained in CUDA according to Formula (6), where $R$, $G$, and $B$ represent the interpolation of the three channels and $\hat{V}_{r,g,b}$ is the one-dimensional data of the final output image. The base index is $\mathrm{index}_r = (x \times h + y) \times 4^2$, where $x$ and $y$ are the coordinates on the x-axis and y-axis, respectively, and $h$ and $w$ represent the height and width of the image. $M = h \times w \times r^2$ is the size of one channel of the output image. $H$ is the two-dimensional coordinate matrix $H[i, j] = i \times r + j$ ($i, j$ between 0 and $r - 1$) in which the interpolated values are located. The specific coordinate $P$ to be used is selected by comparing the values of the LSBs.
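The index quantities named in this paragraph can be evaluated directly. The sketch below only instantiates the three expressions given in the text for $r = 4$; it does not reproduce Formula (6), and the function names are hypothetical.

```python
from typing import List


def base_index(x: int, y: int, h: int, r: int = 4) -> int:
    """index_r = (x * h + y) * r**2, the start of the r*r outputs of a pixel."""
    return (x * h + y) * r ** 2


def offset_table(r: int = 4) -> List[List[int]]:
    """H[i][j] = i * r + j for i, j in 0..r-1."""
    return [[i * r + j for j in range(r)] for i in range(r)]


def channel_size(h: int, w: int, r: int = 4) -> int:
    """M = h * w * r**2, the number of output values in one channel."""
    return h * w * r ** 2


print(base_index(2, 3, h=540))
print(offset_table()[1][2])
print(channel_size(540, 540))
```

For a 540 × 540 input and $r = 4$, one channel holds 4,665,600 values, which equals 2160 × 2160, the output resolution reported in the experiments.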
In practical applications, when traversing an input image, the current pixel position is determined by a two-dimensional coordinate. Therefore, the two-dimensional coordinates must be mapped to the indexes of GPU threads. According to Formula (5), tidx and tidy denote the indexes of the image pixel mapped to a thread in the x and y directions, respectively. For GPUs based on the Turing architecture, there are restrictions on the grid size and thread block size. The maximum values allowed for the grid size in the x, y, and z directions are $2^{31} - 1$, 65,535, and 65,535, respectively. The maximum values allowed for the thread block size in the x, y, and z directions are 1024, 1024, and 64, respectively, and the product of the sizes in the three directions cannot exceed 1024; that is, however its shape is defined, a thread block can have at most 1024 threads [24].
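The launch limits listed above can be turned into a simple host-side check. This is a minimal sketch using the limits quoted in the text; the constant and function names are hypothetical, and a real program would query the device properties instead of hard-coding them.

```python
from typing import Tuple

Dim3 = Tuple[int, int, int]

MAX_GRID: Dim3 = (2 ** 31 - 1, 65535, 65535)   # per-axis grid limits
MAX_BLOCK: Dim3 = (1024, 1024, 64)             # per-axis block limits
MAX_THREADS_PER_BLOCK = 1024                   # limit on the product x * y * z


def is_valid_launch(grid: Dim3, block: Dim3) -> bool:
    """Check a (grid, block) configuration against the quoted limits."""
    if any(g < 1 for g in grid) or any(b < 1 for b in block):
        return False
    if any(g > m for g, m in zip(grid, MAX_GRID)):
        return False
    if any(b > m for b, m in zip(block, MAX_BLOCK)):
        return False
    return block[0] * block[1] * block[2] <= MAX_THREADS_PER_BLOCK


print(is_valid_launch((17, 17, 1), (32, 32, 1)))
print(is_valid_launch((1, 1, 1), (32, 32, 2)))
```

The second call fails the check because 32 × 32 × 2 = 2048 threads exceeds the per-block limit, even though each axis is within its individual bound.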

Optimization Strategy for Algorithms on CUDA
As the dimension of the LUT increases, the number of calculated coordinates and the number of conditional branches in the interpolation calculation increase, requiring more resources and computing power. Optimizing the GPU parallel strategy is therefore crucial. It is important to minimize the data transfer between CPU and GPU, to increase the arithmetic intensity of the kernel functions, and to increase the parallel scale of the kernel functions [24]. In this paper, three specific optimization strategies are used to achieve these goals, as follows:

(1) Storage access optimization
During the computation of the original quantized coordinates, memory for the LUT, the image data, and the required quantized coordinates must be allocated on the GPU at the same time. In video SR, the memory should not be created and destroyed for every frame, because part of it can be reused for the SR calculation of all frames. For example, the LUT is copied to the GPU only once and accessed by all frames, which saves the time of copying the LUT for each frame. Global memory in GPUs has a large capacity, typically greater than 2 GB, and is accessible to all threads. However, because transferring data between CPU and GPU takes a significant amount of time, data transfers must be organized according to the characteristics of the data.
Once the image size of a video stream is known, the sizes of the original quantized coordinates, P, and of the output result can be calculated. The storage access optimization adopted in this paper mainly includes the following. In the CUDA program, repeated memory allocation and release are avoided by overwriting the data in buffers that have already been allocated. The buffers for the LUT, the quantized coordinates, P, and the final output result are allocated on the GPU once, with fixed sizes calculated from the input image. These buffers are not released before the thread exits, which saves time for memory initialization and for data transfer between the host and device.
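The allocate-once pattern described above can be illustrated with a host-side analogy. The sketch below uses Python `bytearray` objects as stand-ins for device buffers; the class `FrameBuffers`, its fields, and the buffer sizes are all hypothetical. In a CUDA program the same idea corresponds to allocating device memory once outside the frame loop and freeing it once at shutdown.

```python
from typing import Optional, Tuple


class FrameBuffers:
    """Stand-in for device buffers that persist across video frames."""

    def __init__(self) -> None:
        self.allocations = 0
        self._shape: Optional[Tuple[int, int, int, int]] = None
        self.coords: Optional[bytearray] = None
        self.output: Optional[bytearray] = None

    def ensure(self, height: int, width: int, scale: int = 4,
               lut_dims: int = 4) -> Tuple[bytearray, bytearray]:
        """Allocate only when the frame geometry changes, otherwise reuse."""
        shape = (height, width, scale, lut_dims)
        if shape != self._shape:
            # Hypothetical sizes: one coordinate byte per LUT axis per pixel,
            # and scale * scale output bytes per pixel.
            self.coords = bytearray(height * width * lut_dims)
            self.output = bytearray(height * width * scale * scale)
            self._shape = shape
            self.allocations += 1
        return self.coords, self.output


buffers = FrameBuffers()
for _ in range(3):                 # three frames of the same video stream
    coords, output = buffers.ensure(540, 540)
print(buffers.allocations)
```

Because the three frames share one geometry, only the first call allocates; later frames overwrite the existing buffers.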
Accessing global memory in CUDA is comparatively slow, and computing coordinates and interpolation requires multiple accesses to the image data and interpolation coordinate data stored in GPU memory. To solve this problem, the built-in CUDA function __ldg() can be used to read data from global memory through the read-only data cache, so that the GPU can fetch the data from the faster texture cache. Using __ldg() can reduce latency and improve the efficiency of global memory access.
(2) Conditional branching function optimization
In the interpolation process, the LSBs of the different pixel values must be compared to determine the bounding simplex and the weights of its vertices. However, the frequently used if branching statements in the code make different threads execute different instruction paths, which costs additional time and resources for thread synchronization and cooperation. This paper improves the execution efficiency of the CUDA program by using fast bitwise and logical operations instead of branching statements. For example, if the LSBs of two pixel values are fa and fb, their difference is calculated and right-shifted by 31 bits. If fa is greater than or equal to fb, the highest bit of the difference is 0; otherwise, it is 1. The resulting flag is combined by a logical AND with the operation that needs to be executed: if the flag is 1, the operation takes effect; otherwise, it does not.
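The sign-bit trick described above can be emulated in Python by masking the difference to 32 bits. This is a sketch of the idea rather than the paper's kernel code; `less_flag` and `select` are hypothetical names, and the emulation assumes that the difference fits in a signed 32-bit integer, which holds for 4-bit LSB values.

```python
def less_flag(fa: int, fb: int) -> int:
    """Return 1 if fa < fb, else 0, using the sign bit of the 32-bit difference."""
    return ((fa - fb) & 0xFFFFFFFF) >> 31


def select(fa: int, fb: int, if_less: int, otherwise: int) -> int:
    """Branch-free equivalent of `if_less if fa < fb else otherwise`."""
    mask = -less_flag(fa, fb)      # 0 when fa >= fb, -1 (all ones) when fa < fb
    return (if_less & mask) | (otherwise & ~mask)


# Third-vertex choice of the 2D example: Lx = 4, Ly = 12.
P01, P10 = 101, 110                # hypothetical LUT values
print(less_flag(4, 12), select(4, 12, P01, P10))
```

For $L_x = 4$ and $L_y = 12$ the flag is 1 and the selection returns the value standing for $P_{01}$, the same vertex chosen in the 2D example, without any divergent branch.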

(3) Threading optimization
On the RTX 3090 GPU (compute capability 8.6), each Streaming Multiprocessor (SM) can hold up to 1536 resident threads in at most 16 resident thread blocks, and each thread block can contain up to 1024 threads. However, the number of resident thread blocks is also limited by the amount of shared memory and registers available. If a fixed thread block size and grid size are used, resources may be wasted or the thread scale may be insufficient. The thread-dispatch experiment in Figure 6 tested the relationship between the number of threads and the number of pixels on the GPU. Based on its results, the number of threads in CUDA should be allocated adaptively. The adaptive calculation is given by Formula (7), where $G_x$ and $G_y$ represent the grid sizes in the x and y directions, respectively, and $n$ represents the number of pixels processed by a single thread. The maximum thread block size is 1024 (32 × 32). For the stability of the program and to avoid wasting resources, a grid size of ($G_x$, $G_y$) and a thread block size of (32, 32) are allocated adaptively within these constraints. This adaptive allocation can achieve good performance for any image size. On low-performance hardware, however, the grid size and thread block size must be reduced to match the smaller number of available CUDA cores while still achieving real-time performance. Figure 7 shows the minimum number of CUDA cores required to reach a frame rate of over 20 FPS for various image sizes.
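Since Formula (7) is not reproduced here, the following sketch shows one plausible way to size the grid adaptively. It is an assumption, not the paper's formula: each thread is taken to process `pixels_per_thread` consecutive pixels along the x direction, and the grid is rounded up so that the whole image is covered by (32, 32) thread blocks.

```python
from typing import Tuple


def adaptive_grid(width: int, height: int, pixels_per_thread: int = 1,
                  block: Tuple[int, int] = (32, 32)) -> Tuple[int, int]:
    """Return a hypothetical (G_x, G_y) covering a width x height image."""
    span_x = block[0] * pixels_per_thread      # pixels covered by a block in x
    g_x = (width + span_x - 1) // span_x       # ceiling division
    g_y = (height + block[1] - 1) // block[1]
    return g_x, g_y


print(adaptive_grid(540, 540, pixels_per_thread=1))
print(adaptive_grid(540, 540, pixels_per_thread=2))
```

Under this assumption, letting each thread handle 2 pixels roughly halves the grid size in the x direction for a 540 × 540 frame, which is the kind of reduction the multi-pixel strategy aims for.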
Since the LUT SR process requires rotating the image four times (0, 90, 180, and 270 degrees), calling the kernel function four times for image interpolation would take a considerable amount of time. Using CUDA Stream to asynchronously execute kernel functions allows multiple kernel functions to be executed simultaneously. By using the CUDA Stream asynchronous mechanism, device waiting time for the host can be reduced, thereby improving overall performance.

Experimental Results and Analysis
In this study, the CPU used in the experimental environment is an Intel(R) Core(TM) i9-10900K, and the GPU is an RTX 3090. The experimental platform is Ubuntu 18.04. The experiments include algorithm ablation experiments, comparison tests with classical SR algorithms, and tests on real video images.

Thread Allocation Ablation Experiment
Before optimization, obtaining the SR output using the LUT takes 22.8 ms for an input image resolution of 540 × 540; the resolution of the output image is 2160 × 2160 (×4). We set the number of threads in a block to 1024 and used the thread multi-element revisit strategy [25] to allow each GPU thread to process multiple pixels. Figure 6 shows the performance improvement ratio as the number of pixels processed per thread varies. The performance improvement ratio $L$ is calculated by Formula (8), where $T_{bef}$ and $T_{aft}$ represent the execution time before and after a given optimization takes effect. Figure 6 shows that the method achieves a performance improvement ratio of 3.8% when a single GPU thread processes 2 pixels. This experiment shows that a single thread can handle multiple pixels, which reduces the grid size. In the experiment, the overall GPU memory consumption is 2622 MiB (a memory usage rate of 10.7%). The results suggest that the new method can be implemented on embedded hardware with less memory. This paper achieves high GPU acceleration performance with a small grid size. A smaller grid size means reduced GPU load and scheduling overhead, thereby improving the parallel computing efficiency of the algorithm. Moreover, a smaller grid size indicates good scalability of the algorithm, making it easier to implement on GPUs with lower computing power.
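The two percentages in this paragraph can be related to their inputs with a short calculation. Formula (8) is not reproduced here, so the sketch assumes the common definition of a relative improvement, $(T_{bef} - T_{aft}) / T_{bef}$; the function names are hypothetical, and the total memory of 24 GiB (24,576 MiB) is the capacity of the RTX 3090.

```python
def improvement_ratio(t_before_ms: float, t_after_ms: float) -> float:
    """Relative time saving in percent (assumed form of the ratio L)."""
    return (t_before_ms - t_after_ms) / t_before_ms * 100.0


def memory_usage_percent(used_mib: float, total_mib: float) -> float:
    """Share of device memory in use, in percent."""
    return used_mib / total_mib * 100.0


# Illustrative timings only, not measurements from the paper.
print(f"{improvement_ratio(100.0, 96.2):.1f} %")
print(f"{memory_usage_percent(2622, 24 * 1024):.1f} %")
```

With 2622 MiB in use out of 24,576 MiB, the usage rate rounds to 10.7%, consistent with the value reported above.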

Image Ablation Experiment with Different Resolution
To test the performance, we selected low-resolution image sizes commonly seen in real-world scenarios, with pixel counts of approximately 100,000, 300,000, 500,000, and 1,000,000, corresponding to resolutions from 360 × 360 to 1080 × 1080. As the resolution increases, the number of computing units executed by the kernel also increases. The minimum number of CUDA cores required for the algorithm to run in real time at different video resolutions is shown in Figure 7. The 360 × 360 resolution requires a minimum of 36 cores and achieves 39 FPS; the 540 × 540 resolution requires 81 cores and achieves 25 FPS; and the 720 × 720 resolution requires 256 cores and achieves 20 FPS.
As the input image size increases, the algorithm's running time also increases. Speed tests were performed on images with different resolutions, and the results are shown in Figure 8, which displays the time consumed (in milliseconds) and the FPS for each resolution. The bar chart represents the average FPS of the displayed output after ×4 SR of the input image, and the line chart represents the stable single-frame time consumption. The maximum frame rate was 80 FPS for the 360 × 360 image, with an average elapsed time of 12.5 ms per frame.
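The two quantities plotted in Figure 8 are related by simple reciprocal arithmetic, which the following Python sketch makes explicit. The helper names are hypothetical, and the pixel-rate function assumes that every frame is upscaled by the same factor in both dimensions.

```python
def frame_time_ms(fps):
    """Average time budget per frame, in milliseconds."""
    return 1000.0 / fps


def output_pixel_rate(fps, in_width, in_height, scale=4):
    """HR pixels produced per second when each frame is upscaled by `scale`."""
    return fps * (in_width * scale) * (in_height * scale)


# 360 x 360 input at the reported maximum of 80 FPS.
t_360 = frame_time_ms(80)
rate_360 = output_pixel_rate(80, 360, 360)
```

For the 360 × 360 case, 80 FPS corresponds to 12.5 ms per frame, matching the reported figure, and to roughly 166 million output pixels per second at ×4.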
As the resolution increases, the GPU memory usage also increases. The GPU memory usage of video streams with different resolutions is shown in Figure 9. The minimum memory usage is 2010 MiB at a resolution of 360 × 360. Memory usage rises steadily with resolution and reaches 5810 MiB at a resolution of 1080 × 1080.
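The reported memory figures can be related to the capacity of the test GPU with a short Python sketch. It assumes a total of 24 × 1024 MiB for the RTX 3090 (its nominal 24 GiB; the total reported by a driver may differ slightly), and the helper name is hypothetical.

```python
RTX_3090_TOTAL_MIB = 24 * 1024  # assumed: nominal 24 GiB of device memory


def memory_usage_percent(used_mib, total_mib=RTX_3090_TOTAL_MIB):
    """Share of device memory in use, in percent."""
    return used_mib / total_mib * 100.0


# Memory consumption values reported in the experiments (MiB).
reported_mib = {"360x360": 2010, "540x540": 2622, "1080x1080": 5810}
usage = {res: round(memory_usage_percent(mib), 1)
         for res, mib in reported_mib.items()}
```

Under this assumption, the 2622 MiB measured for the 540 × 540 input works out to 10.7%, consistent with the usage rate reported for the thread allocation experiment, and the 1080 × 1080 case stays below a quarter of the device memory.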

Network-on-Chip Performance Comparison with Classical SR Algorithms
Four public datasets that have been widely used in SR task evaluation, Set5, Set14, BSDS100 [26], and Urban100 [27], are used to compare the proposed method with several classical SR algorithms. For quantitative evaluation, we use PSNR and the structural similarity index (SSIM) [28], which are standard metrics for image quality assessment. The runtime test used a 320 × 180 LR image as input to generate a 1280 × 720 HR RGB image, and the reported time is the average of 10 tests. Because the algorithm has been optimized and deployed on the GPU, the SR computation runs on the GPU, while the remaining work, such as the calling program and data transmission, runs on the CPU.
As shown in Table 1, the proposed algorithm achieves higher PSNR and SSIM values than SR-LUT-S on all four datasets. The largest PSNR improvement is obtained on the Set14 dataset, with an increase of 0.51 dB. At the same time, our algorithm runs 9.1 times faster than the SR-LUT-S fast algorithm, and is even faster than the SR-LUT-V algorithm, with a stable output time of 10 ms. The other neural network models were also run on the GPU for inference. Our algorithm is faster than mainstream lightweight neural networks: it is 5.5 times faster than IMDN, the quickest of the lightweight networks tested.

The SR comparison results are shown in Figure 10. Based on visual inspection, the proposed method produces clearly better restoration than bicubic interpolation and recovers some details more sharply than SR-LUT-V/S.
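For readers unfamiliar with the metric, the following Python sketch shows the standard PSNR definition, $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$, on flat sequences of 8-bit pixel values. It is a generic illustration rather than the evaluation code behind Table 1: details such as color-space conversion and border cropping are not specified in the text and are omitted here, and the pixel values in the example are made up.

```python
import math


def psnr(reference, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Images are given as flat sequences of pixel intensities.
    """
    if len(reference) != len(reconstructed):
        raise ValueError("images must have the same number of pixels")
    squared_errors = ((r - x) ** 2 for r, x in zip(reference, reconstructed))
    mse = sum(squared_errors) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 2 x 2 example (made-up pixel values, not taken from any dataset).
hr = [52, 55, 61, 59]
sr = [50, 55, 63, 60]
toy_psnr = psnr(hr, sr)
```

Because PSNR is logarithmic in the mean squared error, a gain of a few tenths of a dB corresponds to a reduction of several percent in the mean squared error.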

Real Video Testing
For practical testing, we used a ZWO ASI462 industrial camera to capture images in real time. By sampling a 540 × 540 video area for SR, the algorithm generates a 2160 × 2160 result for display, achieving 42 FPS with a memory consumption of 2622 MiB. Representative results are shown in Figure 11. The results show that SR reconstruction with the proposed method noticeably improves the rendering of lines and contours and achieves real-time video SR reconstruction.

Figure 11. SR effect of real video.

Conclusions
This paper explored the application of a LUT combined with deep learning for video SR, and optimized the SR algorithm through GPU implementation. First, the design and training strategy of the SR network were described, which effectively exploit the influence of different rotation angles on the LUT. The weights of features from different angles are adaptively adjusted to improve the quality of the reconstructed SR image. The network parameters are uniformly sampled and stored in a carefully designed multidimensional LUT. Real-time video SR reconstruction is achieved by fully exploiting the data characteristics of the LUT and by using a CUDA kernel function to execute a large number of threads in parallel. Three GPU optimization strategies are proposed: storage access optimization, conditional branching function optimization, and threading optimization. The test results show that the new SR method is 9.1 times faster than SR-LUT-S. In a quantitative comparison on public datasets, the new method outperforms SR-LUT in terms of PSNR and SSIM. Additionally, the GPU-accelerated implementation of the new algorithm requires only 81 CUDA cores and 2622 MiB of GPU memory. Real-time ×4 SR of 540 × 540 video streams is achieved at 42 FPS. In principle, our method can also be applied to embedded hardware environments with lower computing power.