Pixel-Level Kernel Estimation for Blind Super-Resolution

Over the past several years, deep learning-based models have achieved great success in super-resolution (SR). The majority of these works assume that low-resolution (LR) images are uniformly degraded from their corresponding high-resolution (HR) images using predefined blur kernels, i.e., all regions of an image undergo an identical degradation process. Based on this assumption, there have been attempts to estimate the blur kernel of a given LR image, since a correct kernel prior is known to be helpful in super-resolution. Although blur kernels of real images are known to be non-uniform (spatially varying), current kernel estimation algorithms mostly operate at image-level, estimating one kernel per image, and inevitably become sub-optimal when an image is degraded non-uniformly. A divide-and-conquer approach, dividing an image into several patches for individual kernel estimation and SR, may seem a simple solution, but it fails in practice. In this paper, we address this issue with pixel-level kernel estimation. The three main components for training an SR framework based on pixel-level kernel estimation are: Kernel Collage, a method for synthesizing non-uniformly degraded LR images, designed so that kernels are coherent at neighboring regions while changing abruptly at times; the indirect loss, a novel loss for training the kernel estimator based on the reconstruction loss; and an additional optimization scheme that robustifies the SR network against minor errors in kernel estimation. Extensive experiments show the superiority of pixel-level kernel estimation in blind SR, surpassing state-of-the-art methods both quantitatively and qualitatively.


I. INTRODUCTION
Single image super-resolution (SISR) aims to recover a high-resolution (HR) image from a given low-resolution (LR) image. Deep learning-based SISR methods have recently accomplished remarkable results, and a large portion of these methods are trained using LR-HR image pairs. Since it is extremely expensive to obtain real LR-HR pairs, the majority of SISR methods use synthesized pairs for training: LR images are typically assumed to be degraded from their HR versions using predefined blur kernels, and LR-HR pairs are synthesized based on this assumption.
The associate editor coordinating the review of this manuscript and approving it for publication was Charalambos Poullis.
It has been shown that blur kernels vary from image to image [1], [2] and that an accurate blur kernel prior can greatly improve SISR performance [3]. Accordingly, there have been several attempts to estimate blur kernels [1], [2], [4]. Current kernel estimation methods mostly estimate one kernel per image, based on the assumption that LR images are degraded uniformly, i.e., the same kernel is applied to all regions within an image. However, blur kernels of real images are not always uniform [5]–[7].
Although the problem of kernel estimation on non-uniformly degraded images has been discussed in the field of deblurring [8]–[10], it has not received sufficient attention in SR. In deblurring, kernel estimation under non-uniform degradation is mostly done in a divide-and-conquer manner: a whole image is divided into several patches and kernel estimation algorithms are applied to those patches individually. This patch-wise strategy could be utilized in SR, but it has the inherent limitation that kernels can still vary across pixels within a patch.
VOLUME 9, 2021 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In this paper, we focus on pixel-level kernel estimation. To enable it, we introduce a non-uniform degradation method, Kernel Collage. In Kernel Collage, diverse kernels are used within a single image in a spatially coherent manner, while changing abruptly at times. Since pixel-level kernel estimation cannot obtain information from other regions of the image, we propose a novel loss that learns kernel estimation via the reconstruction error between images super-resolved with the ground-truth (GT) kernels and with the predicted kernels. Once the SR network and the Predictor are fully trained, the SR network is re-trained with the pre-trained Predictor's weights fixed, so that it becomes robust to incorrect kernel estimates.
Our main contributions are summarized as follows:
• We suggest Kernel Collage, a simple yet effective degradation method to train an SR framework based on pixel-level kernel estimation.
• We introduce a novel indirect loss for pixel-level kernel estimation, which relieves the issues of the conventional loss used in image-level kernel estimation.
• Extensive evaluations show that employing pixel-level kernel estimation surpasses state-of-the-art techniques in blind SR, both on synthetic and real images.

II. RELATED WORK

A. SUPER-RESOLUTION

1) CNN-BASED SUPER-RESOLUTION
Ever since Dong et al. [11] introduced the first Convolutional Neural Network (CNN) based super-resolution model, numerous CNN-based super-resolution models have been proposed [12]–[15]. Conventionally, the networks are trained with a pixel-wise L1 or L2 loss on synthesized LR-HR image pairs. Furthermore, the Spatial Feature Transform (SFT) layer was introduced to take categorical conditions into account [16], and an internal learning-based approach has proven effective in recovering recurrent contents throughout an image [17].

2) BLIND SUPER-RESOLUTION
Blind SR aims to super-resolve an image whose degradation parameters, such as the blur kernel, are unknown. Since an accurate kernel prior is known to be very helpful in blind SR [3], most blind SR methods operate as a sequential combination of kernel estimation and super-resolution.
Michaeli and Irani [1] estimate the blur kernel based on the recurrence of small patches within a single image and exploit it in the SR process. Gu et al. [4] address the inevitable problem of kernel mismatch by iteratively correcting the kernel predictions. Recently, KernelGAN [2] was proposed to estimate the degradation kernel of a given image through internal learning. All of these works estimate a kernel at image-level, one kernel per image, based on the assumption that the degradation process is uniform across the entire image [1], [2], [4], which is not always the case in practical scenarios [5]–[7].

B. NON-UNIFORM DEGRADATION
Non-uniform degradation refers to a process in which an image is degraded in a spatially variant manner within a single image. Prior works [5]–[7] have shown that blur kernels are non-uniform in real scenarios.
In the field of deblurring, non-uniform kernel estimation has been studied [8]–[10]. These approaches divide the given image into patches and conduct kernel estimation on each patch individually. In contrast, non-uniform degradation and kernel estimation have not been sufficiently discussed in super-resolution. SRMD [18] and UDVD [19] proposed SR networks that can handle spatially varying degradation, but these are for non-blind SR, where the degradation parameters must be given.
To our knowledge, the only work considering non-uniform degradation in blind SR is [20]. This work uses a kernel discriminator to correct kernel estimates and can handle spatially variant degradation. However, it differs from our work in several aspects. For one, the entire SR framework of [20] is trained with uniformly degraded LR images only; it never encounters a non-uniformly degraded LR image during training and is therefore sub-optimal for super-resolving one. Moreover, although their method can in theory make pixel-level kernel estimates, kernel estimation and super-resolution are conducted at patch-level in practice, and the necessity of pixel-level kernel estimation is not addressed.
Current kernel estimation-based image restoration techniques for non-uniform degradation mostly operate at patch-level and aggregate the results afterward. In this paper, we propose a method to train an SR framework based on pixel-level kernel estimation that handles non-uniformly degraded LR images without dividing the input image into multiple patches.

III. MOTIVATION
Current state-of-the-art kernel estimation methods typically estimate one kernel from an image or conduct kernel estimation in a patch-wise manner. In this section, we aim to demonstrate the weakness of image-level kernel estimation and emphasize the necessity of pixel-level estimation.
We conducted a simple experiment using IKC [4] and DAN [21] to support our argument. Both algorithms estimate one kernel per image and super-resolve the image based on it. As Figure 1 shows, IKC and DAN produce different SR quality in different regions: certain areas are over-sharpened and contain artifacts, while other regions are left blurry. Since blurry or over-sharpening artifacts appear when the variance of the estimated kernel is smaller or greater than that of the actual kernel, respectively [4], this example strongly indicates that blur kernels differ region-by-region in real images and that estimating one kernel per image is sub-optimal in practice. Hence, we train an SR framework based on pixel-level kernel estimation to handle the non-uniformity of blur kernels.
One might ask whether the non-uniformity of kernels could simply be handled in a patch-wise manner, as in the deblurring approaches. Kernel estimation-based SR algorithms can indeed be applied this way, but there are limits to this form of approach. Most importantly, patch-level kernel estimation is problematic in that it assumes each patch to have been degraded uniformly; it becomes sub-optimal when the patches themselves are degraded non-uniformly, which is often the case in real images. Continuously dividing patches into smaller patches may seem an option, but since image contents are essential for estimating kernels, patches cannot be divided beyond a certain point. Therefore, applying kernel estimation algorithms patch-wise cannot fully handle the non-uniformity of kernels, and kernels should be estimated at pixel-level, one kernel per pixel, directly from a single image.

IV. PROPOSED METHOD
We propose a method to train a pixel-level kernel estimation model capable of handling non-uniformly degraded LR images. We first discuss the non-uniform degradation strategy used in the training phase and then present our approach for pixel-level kernel estimation and SR.

A. NON-UNIFORM DEGRADATION
In SISR, it has been widely assumed [2], [4], [18], [22], [23] that an LR image $I^{LR}$ is degraded from an HR image $I^{HR}$ by uniform blurring and sub-sampling, followed by an additive noise term $n$:

$$I^{LR} = (I^{HR} \otimes k)\downarrow_s + n, \qquad (1)$$

where $k$ denotes the blur kernel, $s$ the scale factor, $\downarrow_s$ sub-sampling by factor $s$, and $\otimes$ the convolution operation. In this work, we disregard the noise term (i.e., $n = 0$), as many prior works did.
On the other hand, to expose non-uniformly degraded LR images to our framework, we use a kernel map $K$ instead of a single kernel $k$ in the degradation process. A kernel map indicates which kernel to apply to the HR image when generating each pixel of the LR image. The non-uniform degradation process can be expressed as:

$$I^{LR}_{p,q} = K_{p,q} \cdot I^{HR}_{P,Q}, \qquad (2)$$

where $p$ and $q$ respectively denote the vertical and horizontal coordinates of an LR pixel and $\cdot$ denotes the dot product. Each pixel of the LR image ($I^{LR}_{p,q}$) is degraded with a kernel of size $h \times h$ taken from the same coordinates of the kernel map ($K_{p,q}$), and $I^{HR}_{P,Q}$ represents a patch of size $h \times h$ from the HR image centered at $(p \times s, q \times s)$.
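The per-pixel degradation of Equation 2 can be sketched in NumPy as follows; the loop-based implementation, the edge padding at borders, and the function name are our own illustrative choices, not the paper's implementation:

```python
import numpy as np

def degrade_nonuniform(hr, kmap, s):
    """Non-uniform degradation (Eq. 2): each LR pixel (p, q) is the dot
    product of an h-by-h HR patch centered at (p*s, q*s) with its own
    kernel kmap[p, q]. kmap has shape (H//s, W//s, h, h)."""
    h = kmap.shape[2]                      # kernel size (h x h)
    H, W = hr.shape
    lr = np.zeros((H // s, W // s))
    pad = h // 2
    hr_pad = np.pad(hr, pad, mode="edge")  # pad so border patches exist
    for p in range(H // s):
        for q in range(W // s):
            patch = hr_pad[p * s : p * s + h, q * s : q * s + h]
            lr[p, q] = np.sum(patch * kmap[p, q])   # dot product
    return lr
```

In a real training pipeline this per-pixel loop would be vectorized, but the loop form mirrors Equation 2 term by term.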

1) KERNEL MAP GENERATION
The main point of non-uniform degradation is to degrade different regions of an image with differing kernels, and according to the non-uniform degradation process of Equation 2, a kernel map is essential for this. Here, we present a strategy to generate a kernel map.
Before synthesizing kernel maps, we must discuss the characteristics of kernel maps in the wild. We consider three characteristics in the kernel map generation process: randomness, coherency, and incoherency.
Randomness. Kernels can be applied to random regions, regardless of the image contents: a specific content is not constrained to certain kernels, and any kind of kernel may be applied anywhere.
Coherency. Although kernels may be applied randomly to different regions, neighboring regions are very likely to have similar kernels applied; a kernel map should be spatially coherent.
Incoherency. While the kernels of neighboring pixels tend to be similar, this is not always the case: kernels can still change abruptly between neighboring pixels.

a: KERNEL COLLAGE
In order to ensure that the kernels are coherent at neighboring pixels but change abruptly at times, we propose a degradation method, Kernel Collage. First, we generate N kernel maps, each filled with a different kernel. One of these kernel maps is selected to serve as the base. The other N−1 kernel maps are shuffled, cropped, and pasted onto the base kernel map, like a collage (see Figure 2). Precisely, for each kernel map we randomly choose two values: the size $(p^i_x, p^i_y)$ of the area to cover and the anchor coordinate, i.e., the top-left coordinate $(x^i, y^i)$ of the area to start covering from.
In descending order of area ($p^i_x \times p^i_y$), each kernel map overwrites its corresponding area. This order is important to prevent larger patches from overwriting smaller ones, which would oversimplify the kernel maps.
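A minimal sketch of this collage procedure, under our own assumptions about the sampling ranges for patch sizes and anchors (the paper does not specify them):

```python
import numpy as np

def kernel_collage(kernels, H, W, rng=None):
    """Sketch of Kernel Collage: start from one base kernel map and paste
    randomly sized/placed rectangles of the other kernels on top, pasting
    in descending order of area so smaller patches survive on top."""
    rng = rng if rng is not None else np.random.default_rng()
    h = kernels[0].shape[0]
    # base map: every pixel carries the first kernel
    kmap = np.broadcast_to(kernels[0], (H, W, h, h)).copy()
    # random rectangle (size + top-left anchor) for each remaining kernel
    patches = []
    for k in kernels[1:]:
        ph, pw = rng.integers(1, H + 1), rng.integers(1, W + 1)
        y, x = rng.integers(0, H - ph + 1), rng.integers(0, W - pw + 1)
        patches.append((ph * pw, y, x, ph, pw, k))
    # paste largest first, so smaller patches are not overwritten
    for _, y, x, ph, pw, k in sorted(patches, key=lambda t: -t[0]):
        kmap[y : y + ph, x : x + pw] = k
    return kmap
```

Every location of the resulting map holds exactly one of the input kernels, with rectangular regions of coherency and abrupt changes at the pasted borders.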
Intuitively, the inconsistency of kernels is likely to appear at the boundaries of instances, and it may seem better to introduce incoherency accordingly. However, the same instance or content does not guarantee identical or similar kernels; kernels may change abruptly even within a single instance. Furthermore, different instances may share identical or similar kernels. Consequently, we considered it risky to impose this kind of prior when training the SR framework. We emphasize that Kernel Collage is a way to train a pixel-level kernel estimation framework, rather than an exact way to synthesize real LR images.

B. OVERALL FRAMEWORK
By using the kernel maps generated as described in the previous section, synthetic LR-HR pairs can be obtained. Once the data pairs are prepared, the SR network (G) is trained first. For the SR network, we adopt SFTMD [4], which is known to utilize kernel information well in the SR process. Since the original SFTMD architecture was designed for image-level kernel estimation, we slightly modify it so that it can utilize pixel-level kernel estimation. Once the SR network is fully trained, the Predictor (P), which estimates the kernel map from a given LR image, is trained.
Finally, once the Predictor is fully trained, the SR network is fine-tuned, so that it becomes robust to incorrect kernel estimations.
At test time, the LR image is first given to the Predictor and it makes an estimation on the kernel map. This estimated kernel map is given along with the LR image to the SR network (G), which outputs the final super-resolved image (I SR ).

C. OBJECTIVE FUNCTIONS

1) SUPER-RESOLUTION NETWORK
In SR network (G) training, a non-uniformly degraded LR image ($I^{LR}$) and the corresponding ground-truth kernel map ($K$) are given as input. The network is trained to output a super-resolved image ($I^{SR}$) close to the ground-truth HR image ($I^{GT}$); we use the $L_1$ distance as the objective function.

2) PREDICTOR
The Predictor (P) estimates the kernel map from the given LR image. We train the Predictor indirectly, via an SR reconstruction loss: the kernel map $\hat{K}$ output by the Predictor is given to the SR network (G), and the loss is computed on the resulting super-resolved image $I^{SR}_{\hat{K}}$. The loss function can be formulated as:

$$\mathcal{L}_{indirect} = \| G(I^{LR}, \hat{K}) - G(I^{LR}, K) \|_1. \qquad (3)$$

The simplest way to train the Predictor would be to use the $L_p$ distance between the ground-truth kernel map $K$ and the predicted kernel map $\hat{K}$ directly:

$$\mathcal{L}_{direct} = \| \hat{K} - K \|_p. \qquad (4)$$

This may be effective in a uniform degradation scenario, but we find it unsuitable in non-uniform scenarios. With Equation 4, the Predictor is penalized at featureless regions where kernel estimation is extremely challenging, such as a cloudless sky or a pattern-less wall. These regions can be penalized heavily even though their kernel estimates are irrelevant to SR quality, hindering the overall training process. Since our ultimate goal is to super-resolve a given LR image, wrong kernel estimates are acceptable as long as the corresponding regions are super-resolved correctly. For uniformly degraded LR images this is not an issue, and Equation 4 can work well, since the network can obtain clues for kernel prediction from regions other than the featureless ones. In a non-uniform setting, however, information from other regions is meaningless, since their kernels may differ, and the problem becomes harder. Thus, to avoid penalizing regions irrelevant to SR quality, we train the Predictor via a reconstruction loss. Here, there are two choices for the target of the super-resolved image. One is to simply use the ground-truth HR image ($I^{GT}$) as the target:

$$\mathcal{L}_{rec} = \| G(I^{LR}, \hat{K}) - I^{GT} \|_1. \qquad (5)$$

The other is to use the SR result produced by the SR network with the ground-truth kernel map $K$, as in Equation 3. We choose the latter option (Equation 3) to train the Predictor.
The reason is that, by using Equation 3, the Predictor can focus solely on kernel map estimation. If $I^{GT}$ is set as the target, erroneous loss unrelated to the kernel map estimate may flow into the Predictor: it may be penalized even when it has made a correct kernel estimate, because the error is due to the limits of the main SR network G itself (see Figure 3 for a visualized explanation).
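The distinction between the three losses can be made concrete with a small sketch; here `G` is a stand-in for the SR network, and the function names are ours:

```python
import numpy as np

def indirect_loss(G, lr, k_pred, k_gt):
    """Indirect loss (Eq. 3): L1 distance between the SR results obtained
    with the predicted and the ground-truth kernel maps. In training,
    gradients would reach the Predictor through k_pred while G is fixed."""
    sr_pred = G(lr, k_pred)   # SR using the predicted kernel map
    sr_gt = G(lr, k_gt)       # SR using the ground-truth kernel map (target)
    return np.mean(np.abs(sr_pred - sr_gt))

def direct_loss(k_pred, k_gt):
    """Direct loss (Eq. 4, p=1): penalizes every pixel's kernel error, even
    in featureless regions where the estimate barely affects SR quality."""
    return np.mean(np.abs(k_pred - k_gt))
```

Note that a wrong kernel estimate in a featureless region leaves the indirect loss near zero (both SR outputs look the same there) while the direct loss still penalizes it, which is exactly the behavior discussed above.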

D. FINE-TUNING
Once the Predictor and the SR network are trained, the framework could be used immediately. However, using them straightforwardly has a weakness: the Predictor's estimates may not be perfect and can be incorrect at some pixels, whereas the pre-trained SR network expects perfectly accurate kernel maps as input along with the LR image. Consequently, the SR network is sensitive to incorrect kernel estimates and may underperform in practice. We therefore fine-tune the SR network so that it becomes robust to wrong estimates. With the weights of the Predictor fixed, the Predictor estimates the kernels, these estimates are given to the SR network along with the LR image, and the SR network is trained to recover the original GT image. Here, Equation 5 is used as the objective function for training the SR network.
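A minimal PyTorch sketch of one step of this fine-tuning, with tiny stand-in convolutions in place of the real P and G and no upscaling, for brevity; the channel counts (a 9-channel per-pixel kernel map) are illustrative assumptions:

```python
import torch

# Hypothetical stand-ins for the trained Predictor P and SR network G.
P = torch.nn.Conv2d(3, 9, 3, padding=1)    # LR image -> per-pixel kernel map
G = torch.nn.Conv2d(12, 3, 3, padding=1)   # LR image + kernel map -> SR image

# Freeze the Predictor so only the SR network adapts to its errors.
for p in P.parameters():
    p.requires_grad = False

opt = torch.optim.Adam(G.parameters(), lr=1e-4)
lr_img = torch.rand(1, 3, 16, 16)
gt_img = torch.rand(1, 3, 16, 16)          # toy target (no upscaling here)

kmap = P(lr_img)                           # estimated (possibly wrong) kernels
sr = G(torch.cat([lr_img, kmap], dim=1))   # SR conditioned on the estimate
loss = torch.nn.functional.l1_loss(sr, gt_img)  # Eq. 5 against the GT image
loss.backward()                            # gradients reach G only
opt.step()
```

Because P's outputs are imperfect but fixed, G learns to compensate for the kinds of estimation errors it will actually see at test time.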

V. EXPERIMENTS

A. IMPLEMENTATION DETAILS
For training, we use the HR images from the DIV2K [27] and Flickr2K [28] datasets and generate synthetic training image pairs via Kernel Collage. For simplicity, we consider isotropic and anisotropic Gaussian kernels in the experiments of this paper. The training data is augmented with random horizontal and vertical flips, along with random 90° rotations. We use an LR patch size of 64 and a batch size of 16. The SR network and the Predictor are trained with initial learning rates of 2e-4 and 1e-4, respectively, both reduced by a factor of 0.1 after 2 × 10^5 iterations. We use Adam [29] as the optimizer. The SR network and the Predictor are trained for 3 × 10^5 and 2.5 × 10^5 iterations, respectively. The fine-tuning process takes 1 × 10^5 iterations, with a learning rate starting from 1e-4 that is halved every 2.5 × 10^4 iterations.
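The learning-rate schedules described above can be written as small helpers; the function names and the choice to express them as pure functions of the iteration count are ours:

```python
def main_lr(iteration, base):
    """Schedule for the SR network (base 2e-4) and Predictor (base 1e-4):
    multiply the base rate by 0.1 after 2e5 iterations."""
    return base * (0.1 if iteration >= 2 * 10**5 else 1.0)

def finetune_lr(iteration, base=1e-4):
    """Fine-tuning schedule: halve the rate every 2.5e4 iterations."""
    return base * 0.5 ** (iteration // (25 * 10**3))
```

For example, the SR network's rate drops from 2e-4 to 2e-5 at iteration 2 × 10^5, and the fine-tuning rate reaches 5e-5 after the first 2.5 × 10^4 iterations.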

1) NETWORK ARCHITECTURES
For our main super-resolution network (G), we slightly modify SFTMD [4]. The original SFTMD is not suitable for pixel-level kernel estimation, since it expects one kernel per image: the γ and β values used in its SFT layers are single values, applied in a spatially uniform manner. We therefore make the SFT layers output maps of γ and β. We use four convolutional layers to process the input kernel map and two more convolutional layers to compute γ and β, respectively, with LeakyReLU [30] (α = 0.2) activations. The Predictor is a stack of 8 convolutional layers: the first six use 64 filters of size 3 × 3, while the last two use 1 × 1 filters; the seventh layer has 64 filters and the final (eighth) layer has 3 filters. All activations are LeakyReLU (α = 0.2) except the final one, which is a Sigmoid.
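The Predictor described above can be sketched in PyTorch as follows; the input channel count and the interpretation of the 3 output channels (e.g., per-pixel Gaussian kernel parameters in [0, 1]) are our assumptions:

```python
import torch.nn as nn

def make_predictor(in_ch=3, out_ch=3):
    """Sketch of the Predictor: 8 conv layers in total. Six 3x3 convs with
    64 filters, then two 1x1 convs (64 filters, then out_ch filters).
    LeakyReLU(0.2) after every layer except the last, which uses Sigmoid.
    All layers are 1x1 or padded 3x3, so the output is a per-pixel map of
    the same spatial size as the LR input."""
    layers = []
    ch = in_ch
    for _ in range(6):                      # six 3x3 conv layers, 64 filters
        layers += [nn.Conv2d(ch, 64, 3, padding=1), nn.LeakyReLU(0.2)]
        ch = 64
    layers += [nn.Conv2d(64, 64, 1), nn.LeakyReLU(0.2)]   # 7th: 1x1, 64
    layers += [nn.Conv2d(64, out_ch, 1), nn.Sigmoid()]    # 8th: 1x1, 3
    return nn.Sequential(*layers)
```

The fully convolutional design is what makes pixel-level estimation possible: every output pixel is a kernel estimate computed from its local receptive field, with no image-level pooling.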

B. EXPERIMENTS ON SYNTHETIC TEST IMAGES
Evaluation Datasets: We evaluate our approach (PerPix) both quantitatively and qualitatively. Experiments in this paper are done at ×4 scale unless otherwise mentioned, as most baseline methods provide pre-trained models and code at this scale. Note that PerPix can be regarded as a deblurring network when the scale factor is set to ×1, as in DAN [21]. PerPix is evaluated in two settings: uniform and non-uniform degradation scenarios. First, we compare our approach to other methods on DIV2KRK [2], which consists of 100 LR-HR image pairs where each LR image is uniformly downsampled with a different kernel.
Furthermore, since there is no adequate benchmark dataset for non-uniform degradation scenarios, we synthesized evaluation datasets using three types of non-uniform degradation: 1) Grid, 2) Smooth, and 3) Grid-segmented smooth. In the Grid setting, an image is divided into patches in a grid form and each patch is degraded uniformly. In the Smooth setting, the kernels vary at each pixel, changing smoothly across space. In the Grid-segmented smooth setting, kernels vary smoothly across space but change abruptly at certain points; the smoothly varying kernel map is segmented in a grid form as in 1). We apply 16, 64, and 144 different kernels for each degradation method; these numbers refer to the number of kernels used in the kernel map generation process. For 1) Grid and 3) Grid-segmented smooth, each image is divided into 16, 64, or 144 patches in a grid form; for 2) Smooth, 16, 64, or 144 kernels are used as the base kernels to generate a smoothly interpolated kernel map. We used the validation set of DIV2K [27] to generate these data. Illustrations of these degradation methods are visualized in the Appendix.

TABLE 1. Evaluation on DIV2KRK (uniform degradation scenario). The best and second-best performances are indicated in red and blue. Although our model is trained for non-uniform degradation scenarios, it also performs successfully in uniform degradation scenarios, achieving state-of-the-art performance both quantitatively and qualitatively. It even outperforms non-blind algorithms given the ground-truth (GT) kernels and algorithms based on repetitive corrections of kernel estimates. PerPix also shows outstanding image quality by a noticeable margin. The quantitative results other than P + SFTMD, IKC, and PerPix (ours) are from [21]; the results of P + SFTMD and IKC are from the official implementations.
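As one illustration of the Smooth setting described above, a smoothly interpolated kernel map can be generated by bilinear interpolation between base kernels placed on a coarse grid; the interpolation scheme and function name are our assumptions, as the paper leaves the construction unspecified:

```python
import numpy as np

def smooth_kernel_map(base_kernels, H, W):
    """Sketch of the Smooth setting: place base kernels on a coarse g x g
    grid and bilinearly interpolate between them, so the per-pixel kernel
    varies smoothly across space. base_kernels: (g, g, h, h), g >= 2."""
    g = base_kernels.shape[0]
    ys = np.linspace(0, g - 1, H)          # fractional grid coordinates
    xs = np.linspace(0, g - 1, W)
    kmap = np.zeros((H, W) + base_kernels.shape[2:])
    for i, y in enumerate(ys):
        y0 = min(int(y), g - 2); wy = y - y0
        for j, x in enumerate(xs):
            x0 = min(int(x), g - 2); wx = x - x0
            kmap[i, j] = ((1 - wy) * (1 - wx) * base_kernels[y0, x0]
                          + (1 - wy) * wx * base_kernels[y0, x0 + 1]
                          + wy * (1 - wx) * base_kernels[y0 + 1, x0]
                          + wy * wx * base_kernels[y0 + 1, x0 + 1])
    return kmap
```

The Grid-segmented smooth setting would then be obtained by generating such a map per grid cell, so kernels vary smoothly inside each cell but jump at cell borders.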
In addition, we have synthesized two more versions of non-uniform degradation datasets: a) Instance-segmented and b) Instance-segmented smooth. These are similar to Grid degradation and Grid-segmented smooth degradation; the difference is that the non-uniformity follows the instance segmentation masks. We used the image data and labels from the panoptic segmentation task of COCO [31] to synthesize the datasets for this setting.
Metrics: For quantitative evaluation, we use three metrics: the conventional PSNR and SSIM [32] (higher is better), along with the recently proposed Learned Perceptual Image Patch Similarity (LPIPS) [33] (lower is better), a metric estimating the perceptual similarity between a pair of images. PSNR and SSIM are evaluated on the Y channel of YCbCr, and LPIPS is evaluated on the RGB channels.
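Evaluating PSNR on the Y channel, as described above, can be sketched as follows; the BT.601 luma coefficients for the RGB-to-Y conversion are a common convention and our assumption here:

```python
import numpy as np

def psnr_y(img1, img2):
    """PSNR on the Y (luma) channel of YCbCr, for RGB arrays in [0, 255].
    Uses ITU-R BT.601 'studio swing' luma coefficients."""
    def to_y(rgb):
        r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
        return 0.257 * r + 0.504 * g + 0.098 * b + 16.0
    mse = np.mean((to_y(img1.astype(float)) - to_y(img2.astype(float))) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(255.0 ** 2 / mse)
```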

1) EXPERIMENTS ON UNIFORM DEGRADATION
We first evaluate our method on DIV2KRK, a benchmark for blind SR. Since the LR images from this dataset are downsampled uniformly, with one kernel per image, this setting is more favorable to image-level kernel estimation. We compare our method against four classes of methods: Class 1) state-of-the-art SR models trained on bicubically downsampled images, Class 2) winners of the blind-SR challenges of NTIRE'18 [24] and NTIRE'20 [34], Class 3) kernel estimation-based blind SR algorithms, and Class 4) non-blind SR methods given the ground-truth kernel.

TABLE 2. Evaluations on synthesized non-uniform degradation settings. Different kernels are applied in a grid form, in a smoothly varying manner, and in a smoothly varying manner that is incoherent at the borders of the grid. Our model consistently shows the best performance in all metrics by a large margin compared to previous techniques. All results are evaluated with the official implementations.
Surprisingly, although our method is not designed for the uniform degradation scenario, it outperforms methods that are, achieving state-of-the-art performance in PSNR and SSIM. PerPix outperforms IKC and DAN, which are based on repetitive corrections of kernels, even though our method estimates kernels in a single shot. Moreover, PerPix even outperforms non-blind algorithms given the ground-truth (GT) kernels, achieving remarkable results both quantitatively and qualitatively. Results on DIV2KRK can be found in Table 1.

2) EXPERIMENTS ON NON-UNIFORM DEGRADATION
In addition to the uniform degradation setting, we conduct evaluations on non-uniform degradation scenarios. As mentioned previously, we experiment with 9 different non-uniform degradation conditions (Table 2) and 2 non-uniform degradation conditions based on instance segmentation (Table 3). Our approach consistently shows the best performance by a large margin in all three metrics. Not only does our approach perform well in uniform degradation settings, it also performs excellently in non-uniform ones.

C. ABLATION STUDY

1) PATCH-WISE KERNEL ESTIMATION AND SR
In this section, we compare our approach (pixel-level kernel estimation) to state-of-the-art methods applied in a patch-wise manner. We first compare against a method explicitly designed for patch-wise kernel estimation and SR [20], for a direct comparison. As there are few such methods, we also conduct an experiment comparing kernel estimation at image-level, patch-level, and pixel-level: we divide the non-uniformly degraded images into patches such that each patch is uniformly degraded, though non-uniform when combined as a whole, and give the patches to state-of-the-art methods that assume uniform degradation.

a: DIRECT COMPARISON WITH METHOD DESIGNED FOR PATCH-WISE KERNEL ESTIMATION
To our knowledge, [20] is the only method explicitly designed for patch-wise kernel estimation and SR. As the official implementation of [20] only provides pre-trained weights for the ×2 scale, we experiment at ×2. For evaluation, we adopt the DIV2KRK benchmark for the uniform degradation scenario and the Grid degradation (64) setting of our synthesized dataset for the non-uniform scenario. Experiments in both scenarios show the superiority of our approach over [20] (Table 4).

b: PATCH-WISE EXPERIMENT WITH METHODS FOR UNIFORM DEGRADATION
We have also conducted a study comparing PerPix (pixellevel kernel estimation) to state-of-the-art kernel estimation VOLUME 9, 2021 FIGURE 4. Qualitative result on a real-world image from [35]. Our approach shows the best perceptual quality with minimum artifacts. Our method has a sharper image quality than results of RCAN and DAN and has less artifacts than IKC.

TABLE 4.
Evaluation on uniform degradation(DIV2KRK) and non-uniform degradation(Grid-64) with comparison to [20] at ×2 scale. Our approach shows improvement by a noticeable margin.

TABLE 5. Evaluation on Grid degradation setting (64).
Divide-and-conquer form of approach at patch-level turns out to be unsuccessful and rather shows a better performance when done at image-level.  and SR methods (IKC [4] and DAN [21]) conducted in a patch-wise manner. We report the performance on Grid degradation (64) of our synthetic non-uniform dataset. For the dividing process of images to patches, we divide the patches FIGURE 5. Qualitative results on real-world images from [7]. At the upper example, IKC over-sharpens the red region, while the green region is still blurry compared to PerPix. DAN shows a blurrier result compared to ours on real world images. Best on screen.

FIGURE 6.
Average MSE of kernels per image. We evaluate on Grid-Segmented (GS) smooth data (16,64,144). Our proposed indirect loss shows the least error of kernel estimations compared to the other two losses. Although the use of Eq.5 in Predictor training shows a fairly good performance in terms of SR (Table 6), it seems that it is actually not doing a good job in kernel estimation. On the other hand, the direct loss (Eq. 4) shows a less error than Eq.5, since it tends to estimate an average kernel to minimize errors on kernel estimation. along the borders of the grid -where the kernels change abruptly when synthesizing LR data.
FIGURE. Grid degradation (16) is visualized. A kernel map is divided into 16 regions, and each region is uniformly filled with a single type of kernel. As both magnified examples show, pixels within a region share the same kernel, while the kernels change abruptly, without coherency, at the borders of the regions. Variants differ in the number of regions: Grid degradation (64, 144) splits the kernel map into 8 × 8 and 12 × 12 regions, respectively, in our experiments. In the experiment of Section V-B1, we conduct an ablation study against patch-wise kernel estimation and SR. To perform kernel estimation and SR patch-wise, we divide the images along the borders of the grid (red line), so that each patch is degraded with exactly one kernel, i.e., a uniform degradation scenario. The problem of kernel estimation and SR on one non-uniformly degraded LR image thus becomes multiple problems of kernel estimation and SR on uniformly degraded LR images. Of all the non-uniform degradation scenarios in this work, this is the setting in which the comparison between patch-wise and pixel-level kernel estimation is the most fair.

Although the problem of super-resolving a non-uniformly degraded image has thereby been transformed into a problem of super-resolving several uniformly degraded images, the condition under which models designed for uniform degradation should perform at their best, patch-wise kernel estimation and SR fails in practice, performing even worse than image-level kernel estimation and SR, which estimates one kernel per image (Section V-B1). We suspect the main reason for this performance drop is the small size of the patches: dividing the image leaves each kernel prediction with much less information, making kernel estimation and SR harder than in the original setting.
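To make the grid setting concrete, the per-pixel kernel assignment behind Grid degradation can be sketched as follows. This is an illustrative reconstruction, not the paper's code; the function name and the use of kernel indices (rather than full kernels) are our assumptions.

```python
import numpy as np

def grid_kernel_map(h, w, grid, n_kernels, seed=0):
    """Per-pixel kernel-index map for Grid degradation: the image is
    split into grid x grid cells, each cell is uniformly assigned one
    kernel index, so the kernel is constant inside a cell and changes
    abruptly at cell borders (h and w must be divisible by grid)."""
    rng = np.random.default_rng(seed)
    cell_ids = rng.integers(0, n_kernels, size=(grid, grid))
    # expand each cell id to an (h/grid) x (w/grid) block of pixels
    return np.kron(cell_ids, np.ones((h // grid, w // grid), dtype=int))
```

Patch-wise estimation corresponds to cutting the image along the cell borders of such a map, while pixel-level estimation predicts one kernel per entry of the map.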

2) STUDY ON INDIRECT LOSS
To verify the effectiveness of our proposed indirect loss, we compare it against Equations 4 and 5 in two experiments. First, we conduct SR with a pre-trained SR network and Predictors trained with each loss: the kernel maps estimated by each Predictor are given to the pre-trained SR network, and the SR results are evaluated. Note that, for a fair comparison of pixel-level kernel estimation ability, we do not use the SR network that was fine-tuned after training the Predictor. We use the DIV2KRK dataset for this experiment, and our proposed loss shows the best performance (Table 6).

FIGURE 11. Illustration of Instance-segmented degradation. Using the segmentation labels provided by the COCO dataset [31], all regions sharing the same segmentation mask are degraded with the same kernel. The incoherency of kernels can be observed at the borders of the instances. Instance-segmented smooth degradation is similar, except that the regions sharing a mask are degraded with smoothly varying kernels instead of one kernel per instance.
Additionally, we compute the mean squared error (MSE) of the estimated kernel at each pixel. Since this experiment requires pixel-level ground-truth kernel maps, we use our synthesized non-uniformly degraded datasets and report results on Grid-Segmented (GS) smooth data (16, 64, 144). Our proposed loss shows the lowest kernel-estimation error, while the Predictor trained with Eq. 5 performs worst, despite its acceptable performance in the SR experiment above. The Predictor trained with Eq. 4 shows lower error than that of Eq. 5, because it tends to estimate an average kernel in order to minimize the loss. Together, these two experiments confirm the effectiveness of our proposed indirect loss.
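The per-pixel MSE metric used above can be written in a few lines. The layout of the kernel maps (each pixel holding a flattened k × k kernel) is our assumption for illustration.

```python
import numpy as np

def kernel_map_mse(est, gt):
    """Mean squared error between estimated and ground-truth per-pixel
    kernel maps of shape (H, W, k*k), averaged over all pixels and
    kernel entries; averaging this over a dataset gives a per-image score."""
    assert est.shape == gt.shape, "kernel maps must align pixel-for-pixel"
    return float(np.mean((est - gt) ** 2))
```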

3) STUDY ON THE EFFECT OF FINE-TUNING
Since the pre-trained SR network expects the given kernel maps to be accurate, it is vulnerable to incorrect estimations at some pixels. Therefore, the SR network must be trained to be robust to such incorrect estimations. Fine-tuning improves all metrics (Table 7).

FIGURE 15. Qualitative results on Grid-segmented smooth degradation (16, 64, 144). In all three settings, PerPix performs stably, showing consistently excellent image quality compared to other state-of-the-art models. Methods based on image-level kernel estimation, such as IKC and DAN, inevitably over-sharpen or blur some regions in non-uniform degradation scenarios, while PerPix shows no such artifacts.

D. EXPERIMENT ON REAL IMAGES
Besides the experiments on synthetic test images, we also conduct experiments on real-world images. We use images from [7] and [35] and compare against state-of-the-art SR models. Qualitative results can be found in Figures 4 and 5.

E. INFERENCE SPEED
We compare inference time with other blind-SR models. IKC [4] and DAN [21] take 2.81 s and 0.52 s per image, respectively, while PerPix takes only 0.09 s per image on average. This difference in inference time is mainly due to the absence of an iterative loop in PerPix. Timings are measured on the DIV2KRK dataset using a single RTX Titan GPU.
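A minimal harness for this kind of per-image timing looks as follows; warm-up iterations and device synchronization (inside `model_fn` for GPU models, e.g. via `torch.cuda.synchronize()`) are needed for the numbers to be meaningful. The function is ours, for illustration only.

```python
import time

def avg_inference_seconds(model_fn, inputs, warmup=3):
    """Average wall-clock seconds per input for `model_fn`.
    A few warm-up calls are made first so that one-time costs
    (allocator warm-up, kernel compilation) do not skew the mean."""
    for x in inputs[:warmup]:
        model_fn(x)
    start = time.perf_counter()
    for x in inputs:
        model_fn(x)
    return (time.perf_counter() - start) / len(inputs)
```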

F. PROCESSING TIME OF KERNEL COLLAGE
Uniform degradation applies an identical kernel throughout an image, so the degradation process can be parallelized as a convolution on modern graphics processing units (GPUs). Non-uniform degradation, in contrast, degrades each region with a different kernel, which makes parallelization difficult. Since Kernel Collage is a non-uniform degradation method, applying different kernels at different regions, one may think that it would require a vast amount of computation time. However, Kernel Collage can still be parallelized: as explained in Section IV-A1, identical kernels can be applied at neighboring regions, enabling the use of GPUs for processing.
Specifically, we measured the processing time of Kernel Collage against conventional uniform degradation, using the DF2K dataset and a single RTX Titan GPU. Uniform degradation (one kernel per image) took 0.0024 seconds per batch on average, while Kernel Collage with 4 and 16 degradation kernels per image took 0.0030 and 0.0052 seconds, respectively. We argue that this difference in processing time is negligible, since our proposed model PerPix takes 0.0577 seconds per batch on average and thus dominates the training time. Note that we found 4 degradation kernels per image to perform best, which increases training time by less than 1%.

VI. CONCLUSION
We have shown that image-level and patch-level kernel estimation have limits in handling the non-uniformity of kernels, and have demonstrated that pixel-level kernel estimation can exceed these limits. We have also introduced PerPix, an SR framework based on pixel-level kernel estimation. To train this framework, we used Kernel Collage, a simple yet effective method to synthesize non-uniform degradation, and the indirect loss to train pixel-level kernel estimation. Our approach outperforms previous blind-SR algorithms by a large margin, quantitatively and qualitatively, on both uniform and non-uniform degradation datasets.

APPENDIX A NON-UNIFORMLY DEGRADED DATASETS
Since there is no adequate benchmark dataset for non-uniform degradation scenarios, we have synthesized 11 datasets to evaluate our approach.

APPENDIX B ADDITIONAL QUALITATIVE RESULTS
More qualitative results on our synthesized LR images and on DIV2KRK can be found in Figures 13 to 16.