Cross-Scale KNN Image Transformer for Image Restoration

Numerous image restoration approaches have been proposed based on attention mechanisms, achieving performance superior to their convolutional neural network (CNN) based counterparts. However, they do not leverage the attention model in a form fully suited to image restoration tasks. In this paper, we propose an image restoration network with a novel attention mechanism, called the cross-scale $k$-NN image Transformer (CS-KiT), that effectively considers several factors essential to image restoration: locality, non-locality, and cross-scale aggregation. To achieve locality and non-locality, the CS-KiT builds $k$-nearest neighbor relations of local patches and aggregates similar patches through local attention. To induce cross-scale aggregation, we ensure that each local patch embraces different scale information with scale-aware patch embedding (SPE), which predicts an input patch scale through a combination of multi-scale convolution branches. We show the effectiveness of the CS-KiT with experimental results, outperforming state-of-the-art restoration approaches on image denoising, deblurring, and deraining benchmarks.


I. INTRODUCTION
Image restoration, which aims to recover a clean image from various types of degradation including noise, blur, rain, and compression artifacts, exerts a strong influence on the performance of downstream tasks such as image classification [1], [2], object detection [3], [4], and segmentation [5], [6], to name a few. As numerous solutions can exist for a single degraded input, image restoration is an ill-posed inverse problem.
Although numerous methods have been proposed for image restoration over the past few decades, several challenges still remain due to the various factors to consider in image restoration: locality, non-locality, and cross-scale aggregation. Locality (local textures or edges) is a key factor in restoring a degraded image in that neighboring pixels are highly correlated. With the advances of convolutional neural networks (CNNs), recent restoration works [7], [8], [9] tried to establish a mapping between clean and degraded images by leveraging the representation power of CNNs. The inductive bias of locality is well driven into the network by virtue of the local operation in CNNs, but it inherently lacks the ability to capture long-range dependency, thus disregarding knowledge of global image statistics. This limitation may be mitigated by enlarging the receptive field of the convolution operation, e.g., by increasing network depth [10], dilated convolution [11], or hierarchical architecture [12]. Despite the large receptive field, it is still insufficient to model non-locality, considering that aggregating similar patterns that commonly recur within an entire image boosts the restoration performance significantly. In this context, the non-local operation, which contributed substantially to traditional non-learning restoration methods [13], [14], has once more become a promising solution as a result of the development of non-local neural networks [15]. By computing the response at a single position as a weighted sum of the features at all positions, they capture the long-range dependency within deep networks, but their capacity is limited by quadratic complexity with respect to the input feature resolution. They are therefore only used in relatively low-resolution feature maps of particular layers [16], [17], [18]. More recently, self-attention mechanisms have arisen as a new trend in the field of computer vision, with the Vision Transformer (ViT) [19] drawing attention by achieving a pleasant trade-off between accuracy and computational complexity in the image classification task. In ViT, the global attention mechanism, which can be viewed as a non-local operation, models self-similarity among non-overlapping split patches, as shown in Fig. 1 (a). Naturally, it also suffers from quadratic complexity with respect to the input feature resolution, which makes it nearly infeasible to apply the Transformer to dense prediction tasks. To overcome this limitation, in contrast to ViT, which has a columnar structure in which the feature resolution remains unchanged, a hierarchical architecture was proposed by [20] and [21] to exploit multi-scale feature maps that are suitable for dense prediction tasks. Although these methods capture global self-similarity, they are far less capable of exploring locality than CNNs, which is essential to image restoration.

The associate editor coordinating the review of this manuscript and approving it for publication was Bing Li. VOLUME 11, 2023. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

FIGURE 1. Comparisons of different attention approaches: (a) global attention [19], [20], [21] computes self-similarity between patches globally, (b) local attention [22], [23] measures self-similarity within a single patch at the pixel level, and (c) the proposed method aggregates similar k patches through local attention at the pixel level.
In this context, numerous methods have been proposed to introduce the inductive bias of locality into transformer architectures [22], [23], [24], [25], [26]. Among them, local attention is considered in recent works [22], [23], [27], [28], [29] at the cost of restricting the receptive field in the transformer as shown in Fig. 1 (b). These approaches propose the local self-attention module, achieving a linear complexity to the input feature resolution. Since they constrain the self-attention computation only within a local patch, a shifting approach [22], [23], [29] is additionally applied to exchange information across non-overlapping patches. However, similar to CNNs, it considers only neighboring patches and thus still has an insufficient receptive field.
The aforementioned attention approaches take into account only patches of the same scale, but in various vision tasks, cross-scale aggregation is non-negligible because the scales of similar objects (or patterns) may often vary within an image. A number of studies [24], [30] have explored cross-scale aggregation via interaction among concentric embeddings covering different receptive fields. However, since each patch typically has a single representative scale, considering all interactions between multi-scale embeddings is rather inefficient. This problem becomes more pronounced in dense prediction tasks, which require more parameters and memory consumption.
To tackle these limitations, we propose a transformer-based image restoration network, the cross-scale k-NN image Transformer (CS-KiT), in which locality, non-locality, and cross-scale aggregation are efficiently and comprehensively considered. First, to achieve cross-scale aggregation, we propose a scale-aware patch embedding (SPE) that estimates a representative scale of each patch through combinations of convolutions with different kernel sizes and forms mixed-scale patches. Consequently, cross-scale aggregation can be implicitly accomplished by applying the proposed self-attention method to the mixed-scale patches, where different ranges of receptive fields are covered through the SPE. Second, to capture locality while explicitly establishing non-local connectivity, we propose a novel attention mechanism, called k-NN local attention (KLA), which performs local attention over the k-nearest neighbor (NN) patches. To compensate for the lack of long-range dependency inherent in local attention, the proposed method considers k matched patches that create non-local connectivity between patches at different positions.
To be specific, the KLA first groups a set of similar patches for each base patch with a k-NN search, and then aggregates the k matched patches through local attention at the pixel level, as shown in Fig. 1 (c). This enables our method to apply local attention over an entire image while maintaining linear complexity with respect to the feature resolution. Additionally, the inductive bias of locality contributes to an enhanced capability of local feature extraction. For efficiency, KLA leverages an approximate k-NN search via locality sensitive hashing (LSH) [31], which assigns the same hash value to similar patches with high probability, and then aggregates similar patches with the same hash values. For efficient batch-wise parallel computation, patches are sorted according to hash value and split into chunks of size k so that similar patches are placed in the same chunk. However, in practice, the bucket size, i.e., the number of patches (with the same hash value) belonging to a bucket, varies and is often not divisible by the chunk size k. Thus, each chunk may contain isolated patches with different hash values. To deal with this issue, we propose a chunk shift under the assumption that the relations between patches would be similar in an adjacent block. By shifting and sharing the patch indexes of the current block in the successive block of each stage, as shown in Fig. 2, the chunk shift allows an adjacent chunk to attend to a query patch, dealing with the isolated-patch problem in an efficient way.
We validate the proposed method on various image restoration tasks on image denoising, deblurring, and deraining benchmarks, demonstrating performance superior to state-of-the-art restoration approaches. A preliminary version of this paper appeared as a full paper in [32]. Compared with our previous work, we newly add (1) scale-aware patch embedding for cross-scale aggregation; (2) efficient handling of the isolated-patch problem using chunk shift; and (3) extensive analysis of the proposed method.

II. RELATED WORK
A. IMAGE RESTORATION
1) CNN-BASED IMAGE RESTORATION
Image restoration aims at recovering a clean image from an input image degraded by various factors such as noise, motion blur, rain streaks, or low-light conditions. With the rapid progress of CNNs, image restoration techniques using deep learning achieved evident performance improvements compared to traditional approaches thanks to their representation power. By introducing residual learning that models degradation as a residual image, [7], [33], [34] surpass methods that directly estimate the clean image [35], [36]. With further developments, dense connections between layers within the same block [37], [38] were employed to form deeper networks. As another stream, different from regression networks, adversarial learning [39], [40] has also been explored. Although these approaches have obtained enormous success in image restoration, the convolution operation inherently suffers from a lack of non-locality, which is essential in image restoration. Consequently, this fact led to the necessity of incorporating non-local operations into the network.

2) NON-LOCAL IMAGE RESTORATION
In image restoration, the non-local operation has been widely used. In classical approaches [14], [41], a set of pixels grouped by self-similarity contributes to the output filtered response. Recently, some efforts [16], [17], [18] tried to integrate the non-local operation into CNNs for image restoration tasks by establishing long-range dependency with global self-attention, following the success of non-local neural networks [15]. However, its heavy computational cost limits the spatial resolution of feature maps. Rather than employing full connections within the input feature map, sparse connections were adopted in [42], [43], [44], [45], [46], and [47] to cut down on computational costs. N3Net [42] and GCDN [43] find k-nearest neighbors that are close in the embedding space in a learnable manner, and aggregate them for efficient computation. Depending on the content of the images, DAGL [44] dynamically selects the number of neighbors for each query. IGNN [45] and CPNet [46] find k-NN patches among cross-scale feature maps by considering both sparseness and cross-scale patch recurrence. Nevertheless, the quadratic complexity of the k-NN search in the aforementioned methods significantly slows down the overall procedure. NLSN [47] reduces the complexity of the k-NN search process to be asymptotically linear by performing non-local sparse attention with locality sensitive hashing (LSH). Local information, however, cannot be captured in their attention module because NLSN [47] approximates the full connection of global attention at the pixel level.

3) MULTI-SCALE AGGREGATION FOR IMAGE RESTORATION
Multi-scale information is regarded as an indispensable property in image restoration tasks as it is beneficial to encompass various scales of objects, patterns, or degradations. As a naive way to deal with multi-scale information, exploiting feature maps of multiple scales by building hierarchical architectures [48], [49], [50] or multi-scale parallel branches [43], [51] has been explored. Alternatively, progressive prediction [52], [53], [54], which gradually scales up the model capacity and the difficulty of the problem (e.g., low resolution to high resolution), significantly improved image restoration performance. However, these attempts have a certain limitation in dealing with cross-scale patch recurrence as they only build intra-scale relationships within a single attention. To build cross-scale interactions, emerging approaches [45], [55], [56] focus on the interchange of mutual information across different scales. CSNLN [55] fuses multi-branch projections of cross-scale non-local attention, in-scale non-local attention, and identity mapping. TTSR [56] proposes a cross-scale feature integration module to transfer reference HR textures to the low-resolution image. IGNN [45] aggregates the k-NN high-resolution counterparts corresponding to the low-resolution patch.

B. ATTENTION MECHANISM
Inspired by the human vision system, attention mechanisms have been introduced to allow the network to focus on saliency and benefit various tasks such as recognition [57], [58], [59], [60], object detection [61], [62], and semantic segmentation [63], [64]. In the CNN era, RAM [65], a pioneer of visual attention, exploits visual attention on a recurrent neural network to classify images. Afterward, STN [66] proposes a spatial transformer network that predicts an affine transformation to select the most relevant regions. SENet [57] proposes a squeeze-and-excitation block that squeezes the spatial resolution and captures the channel-wise relationship. CBAM [58] exploits both the spatial relationship in the channel-pooled feature map and the channel-wise relationship in the spatially reduced feature map. Non-local neural networks [15] combine the non-local property and spatial self-attention. By capturing long-range dependencies, they show superior performance on various visual tasks including video classification. As the non-local neural network has quadratic complexity with respect to the input resolution, CCNet [63] proposes axial-wise attention (horizontal and vertical directions) to achieve a lower computational cost for non-local attention. From the finding that the self-similarities modeled by non-local attention are almost the same for different query positions, GCNet proposes a more simplified non-local operation combined with SENet [57].
Recently, in [19], the Transformer architecture [67], originally proposed for natural language processing, was applied to the image classification task. This method, known as the Vision Transformer (ViT), excels at capturing long-range dependencies by applying global attention to image patches, but it is not appropriate for dense predictions due to its quadratic complexity with respect to the input spatial resolution. References [20], [21], [68], [69] adopted hierarchical architectures, where feature resolutions are gradually reduced to enable dense prediction, in contrast to ViT, which maintains a fixed spatial resolution across the whole architecture. PVT [20] constructs pyramid feature maps with the spatial-reduction attention (SRA) layer. In order to recover fine-grained predictions, IPT [68] and DPT [21] propose encoder-decoder architectures. However, these approaches based on global attention lack the ability to exploit locality, which is essential for image restoration.
Lately, the Swin Transformer [22] has boosted object detection and segmentation performance with low complexity by leveraging local attention and a shifting strategy for patch connection. As the attention weights are produced between neighboring elements by local attention, the computation has linear complexity with respect to the spatial resolution, and the inductive bias of locality is incorporated into the attention. Uformer [23] and SwinIR [29] clearly show remarkable performance in image restoration tasks by utilizing local attention. However, by taking only neighboring patches into account, the shifting approach has a limited receptive field, thereby losing non-local connectivity in the process. The proposed method, in contrast, creates non-local connectivity by carrying out local attention over k-NN patches. This enables us to capture locality in the attention module and impose non-local connectivity with linear complexity with respect to the spatial resolution.

III. PROBLEM STATEMENT
It is well known that non-local self-similarity performs well in the task of image restoration. This necessitates the capability to capture long-range dependency since analogous patterns are dispersed across the whole image. The ViT [19] applies the attention mechanism of the original Transformer [67] directly to image patch sequences. For a given input X ∈ R^{HW×C_in}, they split it into non-overlapping patches and reshape them into a sequence of flattened 2D patches X_p ∈ R^{N×r^2C_in}, where HW is the spatial resolution of the input feature map, C_in is the channel dimension of the input feature map, N = HW/r^2, and r is the patch size. The global attention with dot-product between split patches is represented as:

O = softmax(φ(X_p) θ(X_p)^T / √(r^2C)) ψ(X_p),   (1)

where the learnable projection functions φ, θ : R^{N×r^2C_in} → R^{N×r^2C} and ψ : R^{N×r^2C_in} → R^{N×r^2C_out} project X_p into the query, key, and value, respectively. The output O ∈ R^{N×r^2C_out}, where C_out is the output channel size, is obtained as a weighted sum of the projected values using the affinity matrix computed between the projected query and key.
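As an illustration of the global attention described above, the following is a minimal NumPy sketch; the matrices `Wq`, `Wk`, and `Wv` stand in for the learnable projections φ, θ, and ψ, and the concrete sizes are illustrative only.

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_attention(Xp, Wq, Wk, Wv):
    """ViT-style global attention over N flattened patches.

    Xp: (N, r*r*C_in). The (N, N) affinity matrix between all patch
    pairs is what makes the cost quadratic in N.
    """
    q, k, v = Xp @ Wq, Xp @ Wk, Xp @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (N, N) affinity
    return attn @ v                                  # weighted sum of values

rng = np.random.default_rng(0)
N, d_in, d = 16, 48, 32
Xp = rng.standard_normal((N, d_in))
O = global_attention(Xp, *(rng.standard_normal((d_in, d)) for _ in range(3)))
```

Note that the (N, N) attention matrix grows with the square of the number of patches, which is the complexity bottleneck discussed in the text.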
As C, C_in, and C_out are usually set equal, we denote them all as C. Although the global attention mechanism establishes long-range dependency well, its quadratic complexity in the input feature resolution, O(r^2N^2C), makes it hard to exploit global attention for dense prediction tasks.
The local attention mechanism [22], [23], [27], [28], [29] reduces the complexity by computing attention within a local patch. An input feature map X is split into non-overlapping patches, satisfying X = {x_i ∈ R^{r^2×C} | i = 1, ..., N}, and the local attention is represented as:

o_i = softmax(φ(x_i) θ(x_i)^T / √C) ψ(x_i),   (2)

where o_i is the output patch corresponding to x_i. Note that the learnable projection functions φ, θ, and ψ project r^2 elements of size C, unlike ViT, which projects N elements of size r^2C, and are shared across all patches. The local attention achieves linear complexity O(r^4NC) in the input feature resolution. However, as (2) is applied to each patch independently, no information is exchanged between patches. In order to enforce patch connectivity between neighboring patches with an enlarged receptive field, a shifting approach [22], [23], [29] is sequentially applied. Also, both attention approaches overlook the scales of objects or patterns that may vary in real images, considering interactions among features of a single scale only. Recently, some works considered cross-scale attention by introducing multi-scale information in feature embedding [30], linear projection [24], [70], or multi-path structures [71]. However, as an object or a pattern usually has a single representative scale, figuring out correlations between multiple scales calls for superfluous computations. Due to the computational burden of reconstructing high-resolution output in dense prediction tasks, a more efficient and effective attention layer design is required.
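The per-patch local attention described above can be sketched as follows (a minimal sketch: each of the N patches attends only within its own r^2 pixels, with projections shared across patches, so the cost is linear in N; shapes are illustrative).

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_attention(X, Wq, Wk, Wv):
    """Window attention: X is (N, r*r, C). Affinity matrices are
    (r*r, r*r) per patch, computed independently for all N patches,
    so no information flows between patches."""
    q, k, v = X @ Wq, X @ Wk, X @ Wv          # shared projections
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1]))
    return attn @ v                            # (N, r*r, C)

rng = np.random.default_rng(0)
N, r2, C = 64, 16, 32
X = rng.standard_normal((N, r2, C))
O = local_attention(X, *(rng.standard_normal((C, C)) for _ in range(3)))
```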
We overcome those limitations by leveraging k-NN patches of mixed scale in the computation of local attention. A novel non-local image restoration method, called the cross-scale k-NN image Transformer (CS-KiT), successfully equips locality, non-locality, and cross-scale aggregation. The network splits an input feature map X into non-overlapping patches with patch size r, satisfying X = {x_i ∈ R^{r^2×C} | i = 1, ..., N}. In the CS-KTB, scale-aware patch embedding (SPE) is first conducted to introduce cross-scale aggregation into the attention. SPE projects each patch into scale-specific spaces through convolutions of different kernel sizes, separately. We then estimate the scale score α with a learnable scale prediction function in each scale-specific branch. A representative scale of the patch is then defined as the weighted sum of scale-specific patches, leveraging α as weights. The mixed-scale patches projected on different representative scales are normalized and fed into the k-NN local attention (KLA).

IV. PROPOSED METHOD

A. OVERALL PIPELINE
KLA first clusters similar patches with locality sensitive hashing (LSH) [31], an approximate k-NN search algorithm. Hash values assigned by LSH indicate similarity; highly correlated patches receive the same hash value with high probability. The hash buckets, consisting of patches with the same hash value, mostly have a non-uniform distribution, making it hard to aggregate patches within hash buckets in parallel. Therefore, we sort patches according to their hash values and partition them into chunks of size k. Accordingly, each chunk may contain isolated patches whose hash value differs from that of most patches in the chunk, degenerating non-local connectivity with their k-nearest neighbors. To deal with this problem, KiT [32] allows the previous chunks to contribute to the current chunk containing the query patch, but enlarging the attending chunks for attention causes an extra computational burden. As a more efficient alternative, we propose a chunk shift under the assumption that the k-NN relation is similar in an adjacent block. In the chunk shift, the patch indexes of the current block are shared with the successive block and shifted so that isolated patches can be moved to the next chunk. By allowing the previous chunk to attend to the current chunk via the chunk shift, the isolated-patch issue is effectively resolved with no extra computation. Subsequently, KLA treats each chunk as a grouped local patch by rearranging, and conducts local attention on each chunk. Note that, since each chunk includes patches of different positions and different scales, performing simple local attention on it achieves non-local aggregation while maintaining lightweight computational complexity. At the end of the CS-KTB, aggregated features are resized with an interpolation layer to build the hierarchical architecture. After the input feature map passes through all stages of the network, three final convolutions are conducted to predict residuals between the clean image and the degraded image.
The proposed network has a hierarchical U-shaped design for considering patterns at various scales. The aggregated features are processed through a layer of interpolation (downsampling for the encoder and upsampling for the decoder). For the purpose of restoring fine details, input feature maps and corresponding encoder features are concatenated in each stage of the decoder. Three convolutions are performed at the end of the network to predict a restored image from the output feature map.

B. CS-KTB: CROSS-SCALE k-NN TRANSFORMER BLOCK
In the l-th cross-scale k-NN Transformer block (CS-KTB) (l = 1, ..., b), scale-aware patch embedding (SPE) is applied first. After SPE projects patches into the mixed-scale space, layer normalization is applied, followed by KLA. The intermediate feature aggregated by KLA is passed through the remaining steps, applying layer normalization and a feed-forward network. Formally, the CS-KTB is represented as:

X̂_l = KLA(LN(SPE(X_{l−1}))) + X_{l−1},
X_l = FFN(LN(X̂_l)) + X̂_l,

where FFN(X) = MLP(DW(MLP(X))). In the l-th block of each stage (l = 1, ..., b), the output feature map X_{l−1} from the previous block is normalized and fed into KLA. The intermediate feature X̂_l is computed via a non-local aggregation of the k similar patch features and a residual connection. The bottleneck stage is the same as the CS-KTB except that no interpolation layer is employed and k is set to 1.
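The block wiring described above (pre-norm residual structure: attention branch first, feed-forward branch second) can be sketched as follows; `spe`, `kla`, and `ffn` are placeholder callables standing in for the actual modules, used only to show the data flow.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Per-feature layer normalization over the last axis."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def cs_ktb(x, spe, kla, ffn):
    """One CS-KTB step: SPE, then normalized k-NN local attention with
    a residual connection, then a normalized feed-forward network with
    a second residual connection."""
    h = kla(layer_norm(spe(x))) + x      # non-local aggregation + residual
    return ffn(layer_norm(h)) + h        # FFN(X) = MLP(DW(MLP(X))) in the paper

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 32))
# identity/scaling stand-ins just to exercise the wiring
out = cs_ktb(x, spe=lambda t: t, kla=lambda t: t, ffn=lambda t: 0.5 * t)
```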

1) SPE: SCALE-AWARE PATCH EMBEDDING
The goal of scale-aware patch embedding (SPE) is to generate mixed-scale patches such that each patch embraces different scale information. Fig. 3 describes the detailed flow of scale-aware patch embedding. To build mixed-scale patches, we conduct depth-wise convolution operations with different kernel sizes S = {s_1, ..., s_{n_s}}, where n_s is the number of scale spaces, which equals the number of kernels. The embedded scale-specific patches are defined as {x_{i,j} | i = 1, ..., N and j = 1, ..., n_s}, which are the responses at the different scales S. For cross-scale aggregation, naively aggregating all patches of various scales is computationally heavy. As a more efficient way, we assume that a single patch has a representative scale, similar to image descriptors such as SIFT [72]. With pre-defined discrete scales, we generate a mixed-scale patch with a weighted combination. Specifically, we define a learnable scale prediction function g : R^{r^2×C} → R^1 which estimates the scale score of an input patch. The estimated scale scores from g(·) are used as weights for merging the scale-specific patches:

x^S_i = Σ_{j=1}^{n_s} δ(g(x_{i,j})) · x_{i,j},

where δ is the softmax operation for normalizing each weight over the n_s branches. Finally, the scale-aware patch embedding X^S = {x^S_1, ..., x^S_N} is obtained by applying this weighted combination to every patch.

2) k-NEAREST NEIGHBOR SEARCH

To find k-NN patches, a brute-force approach computes the pair-wise distances between all patches. As this pair-wise distance involves quadratic complexity in the input length, extensive works on k-NN search have been developed to reduce its cost [31], [73], [74]. We adopt locality sensitive hashing (LSH) [31] for the k-NN search due to its linear computational complexity. LSH is an approximate k-NN search algorithm that hashes similar elements into the same buckets by using random rotation matrices. Here, the number of buckets is much smaller than the total number of elements to ensure each bucket contains multiple elements.
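The softmax-weighted mixing of scale branches in SPE, described above, can be sketched as follows. This is a minimal sketch: simple per-branch maps stand in for the depth-wise convolutions of different kernel sizes, and the mock score functions stand in for the learnable scale predictor g.

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scale_aware_embedding(x, scale_embeds, scale_scores):
    """x: (N, r*r, C) patches. scale_embeds[j] stands in for the
    depth-wise convolution of the j-th kernel size; scale_scores[j]
    stands in for g. The mixed-scale patch is the softmax-weighted
    sum over the n_s scale branches, with one weight per patch."""
    branches = np.stack([f(x) for f in scale_embeds])   # (n_s, N, r*r, C)
    scores = np.stack([g(x) for g in scale_scores])     # (n_s, N)
    w = softmax(scores, axis=0)                         # normalize over scales
    return (w[:, :, None, None] * branches).sum(axis=0) # (N, r*r, C)

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 16, 8))
embeds = [lambda t, s=s: s * t for s in (1.0, 0.5, 0.25)]              # mock branches
scores = [lambda t, s=s: t.mean(axis=(1, 2)) + s for s in (0.0, 0.1, 0.2)]
x_mixed = scale_aware_embedding(x, embeds, scores)
```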
In the CS-KTB, in order to build buckets, LSH projects the divided patches onto a unit hyper-sphere. Assuming there are m hash buckets, a hash value L(x^S) is assigned by multiplying a random rotation matrix R ∈ R^{r^2C×(m/2)} with a flattened mixed-scale patch x^S as:

L(x^S) = arg max([x^S R; −x^S R]),

where [·;·] indicates the concatenation of two elements. With this hashing operation, patches with high correlation are very likely to receive the same hash value (i.e., fall in the same hash bucket), and vice versa. However, similar patches may at times fall into different hash buckets as LSH relies on a random rotation matrix. Multi-round LSH, in which LSH is applied with h different rotation matrices, is employed to cope with this problem.
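The hashing step described above can be sketched as follows (a minimal NumPy sketch of this style of angular LSH; the bucket count m and the feature dimension are illustrative assumptions).

```python
import numpy as np

def lsh_hash(xs, rotation):
    """Angular LSH: project each flattened patch with a random rotation,
    concatenate [xR; -xR], and take the argmax as the bucket id.
    xs: (N, D); rotation: (D, m // 2); returns bucket ids in [0, m)."""
    proj = xs @ rotation                       # (N, m/2)
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

rng = np.random.default_rng(0)
N, D, m = 32, 24, 8
xs = rng.standard_normal((N, D))
xs /= np.linalg.norm(xs, axis=-1, keepdims=True)   # project onto unit sphere
R = rng.standard_normal((D, m // 2))
buckets = lsh_hash(xs, R)
# a patch and a positively scaled copy of it always share a bucket
assert lsh_hash(xs[:1] * 0.999, R)[0] == buckets[0]
```

Because the rotation is random, two similar but non-identical patches can still straddle a bucket boundary, which is why multi-round LSH with several rotation matrices is used.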

C. KLA: k-NN LOCAL ATTENTION
k-NN local attention (KLA) aims at achieving locality, non-locality, and cross-scale aggregation at the same time.
As shown in Fig. 4, KLA first rearranges patches so that similar patches are located near each other. As patches embrace mixed-scale information through SPE, applying local attention to the rearranged mixed-scale patches implicitly induces cross-scale aggregation. In addition, we partition patch sequences with a chunk size of k for efficient parallel computation in the aggregation. However, as hash buckets have a non-uniform distribution in practice, patches with the same hash value may fall into adjacent chunks, resulting in isolated patches as shown in Fig. 5. To handle the problem of isolated patches, we propose a chunk shift that reuses the sorted patch indexes of the current block in the successive block and shifts them.

1) LOCAL ATTENTION WITH SPATIAL REARRANGE
To restrict the patches contributing to a query patch to those with the same hash value, we first sort patches based on their hash values, and then divide the sorted patches into chunks each containing k patches (equal to the number of NN patches) for batching purposes, so that only patches in the same chunk are considered in the local attention. We denote by π : {1, ..., N} → {1, ..., N} a permutation that sorts the patches in ascending order of hash values:

L(x^S_{π(1)}) ≤ L(x^S_{π(2)}) ≤ · · · ≤ L(x^S_{π(N)}).

For the sake of simplicity, we define x̂^S as the sorted mixed-scale patches, where x̂^S_p is equal to x^S_{π(p)}. Then, the i-th chunk P_i for i = 1, ..., N/k contains k patches, P_i = {x̂^S_p | p = (i−1)k + 1, ..., ik}, and KLA performs local attention by regarding each chunk as a grouped local patch. Each chunk holds k patches and has size R^{k×r^2×C}. We reshape each chunk into a 2-dimensional patch P̂_i = ρ(P_i), where ρ : R^{k×r^2×C} → R^{kr^2×C} is the spatial rearrange function.
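The sort-and-chunk step described above can be sketched as follows (a minimal sketch; the sizes are illustrative, and the returned permutation would be needed to restore the original patch order after attention).

```python
import numpy as np

def sort_and_chunk(xs, buckets, k):
    """Sort patches by hash value and split into chunks of k so that
    similar patches share a chunk. xs: (N, r*r, C); buckets: (N,).
    Returns (N/k, k, r*r, C) chunks plus the sorting permutation."""
    perm = np.argsort(buckets, kind="stable")          # the permutation pi
    sorted_xs = xs[perm]                               # ascending hash order
    chunks = sorted_xs.reshape(-1, k, *xs.shape[1:])   # (N/k, k, r*r, C)
    return chunks, perm

rng = np.random.default_rng(0)
N, r2, C, k = 16, 4, 8, 4
xs = rng.standard_normal((N, r2, C))
buckets = rng.integers(0, 4, size=N)
chunks, perm = sort_and_chunk(xs, buckets, k)
# spatial rearrange: flatten each chunk into one grouped local patch
flat = chunks.reshape(N // k, k * r2, C)
```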
As the rearranged patches P̂ are of mixed scale and spatially grouped, aggregating cross-scale and cross-position patches can be done with simple local attention:

o_i = softmax(φ(P̂_i) θ(P̂_i)^T / √C) ψ(P̂_i),

where o_i ∈ R^{kr^2×C} represents the i-th output patch of KLA, and φ, θ, and ψ : R^{kr^2×C} → R^{kr^2×C} are learnable projection functions.

2) CHUNK SHIFT
FIGURE 5. The proposed chunk shift: as the distribution of hash values is non-uniform, the number of patches in each bucket estimated by LSH is often different from the chunk size k. In the l-th block, the sorted indexes may contain isolated patches whose hash values differ from those of the other patches in the chunk. By shifting chunks and reusing them in the next (l+1)-th block, we can tackle this problem efficiently with no increase in computation. The number in each patch represents the assigned hash value.

Chunking patch sequences with a size of k enables efficient batch-wise parallel computation. However, as the number of patches in a hash bucket is often not divisible by the chunk size in practice, patches with the same hash value may fall into nearby chunks, incurring isolated patches as shown in Fig. 5. The isolated patches, which have a different hash value from the majority of patches in the chunk, may weaken the non-local connectivity of the KLA. We propose the chunk shift, an efficient way to deal with the isolated-patch problem with no increase in computation. We denote by Π_l the indexes of patches sorted by hash values in the l-th block. The shifted indexes Π_l^shift are defined by cyclically shifting Π_l by k/2. We form consecutive CS-KTBs such that the shifted indexes are utilized in the successive block (Π_{l+1} = Π_l^shift) under the assumption that the feature distributions of adjacent blocks would be similar. As shown in Fig. 2 (c), k-NN local attention (KLA) and KLA with shifted chunks are alternately conducted in the network. This configuration enables us to deal with the isolated-patch problem and to save computation in the k-NN search.
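The cyclic shift of the sorted indexes can be sketched in one line (a minimal sketch; the index array and chunk size are illustrative).

```python
import numpy as np

def chunk_shift(sorted_idx, k):
    """Cyclically shift the sorted patch indexes by k // 2 so that, in
    the next block, each chunk straddles two previous chunks; an
    isolated patch at a chunk boundary then gets grouped with its
    true neighbors at no extra cost."""
    return np.roll(sorted_idx, k // 2)

idx = np.arange(12)            # indexes sorted by hash value in block l
shifted = chunk_shift(idx, k=4)  # indexes reused in block l + 1
```

With k = 4, the first chunk of the next block becomes [10, 11, 0, 1]: it mixes the tail of the last chunk with the head of the first, so boundary patches get a second chance to meet their neighbors.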

D. TRAINING LOSS
Following existing image restoration approaches [7], [17], the proposed network predicts a residual image I_r from the degraded input image I_d. The objective is to recover the clean image I satisfying I = I_d + I_r. We leverage the Charbonnier loss [75] L_char and an edge loss L_edge for optimizing the network:

L_char = √(‖I − (I_d + I_r)‖^2 + ε^2),
L_edge = √(‖ΔI − Δ(I_d + I_r)‖^2 + ε^2),

where ε is empirically set to 10^−3 for all experiments and Δ represents the Laplacian operator. The total loss L is defined as

L = L_char + λ L_edge,

where the hyper-parameter λ controls the ratio of the two losses.
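The training loss can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: it uses a mean instead of a norm inside the Charbonnier term, a simple 4-neighbor discrete Laplacian as a stand-in for the paper's Laplacian operator, and an illustrative λ value (the paper leaves λ as a hyper-parameter).

```python
import numpy as np

def charbonnier(pred, target, eps=1e-3):
    """Charbonnier loss: a smooth, differentiable relaxation of L1."""
    return np.sqrt(np.mean((pred - target) ** 2) + eps ** 2)

def laplacian(img):
    """4-neighbor discrete Laplacian with edge padding (a stand-in
    for the Laplacian operator used in the edge loss)."""
    p = np.pad(img, 1, mode="edge")
    return p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:] - 4 * img

def total_loss(clean, degraded, residual, lam=0.05):
    """L = L_char + lam * L_edge, with I restored as I_d + I_r."""
    restored = degraded + residual            # I_d + I_r
    l_char = charbonnier(restored, clean)
    l_edge = charbonnier(laplacian(restored), laplacian(clean))
    return l_char + lam * l_edge

rng = np.random.default_rng(0)
clean = rng.random((16, 16))
degraded = clean + 0.1 * rng.standard_normal((16, 16))
loss = total_loss(clean, degraded, residual=np.zeros_like(clean))
```

At a perfect restoration the loss does not reach zero but bottoms out at ε(1 + λ), a property of the Charbonnier relaxation.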

E. EXPERIMENTS
We used the AdamW optimizer to train the entire network with batches of 64 images cropped to 128 × 128 for 800k iterations. The learning rate was initially set to 2 × 10^−4 and decreased to 10^−6 using a linear warm-up strategy followed by cosine annealing. The chunk size k (equal to the number of NN patches) and the patch size r were both set to 4 by default. In the bottleneck stage, k was set to 1 since there are only a few patches (e.g., the number of patches is 8 × 8 when HW is 256 × 256). The number of scale spaces n_s was set to 3, with depth-wise convolution kernel sizes of 3, 5, and 7, respectively. The number of CS-KTBs in each stage, b, was set to 2 in all stages. The number of hashes h was set to 4 for multi-round LSH. We validated the performance of the proposed method on various image restoration tasks such as image denoising, deblurring, and deraining. For the performance evaluation, PSNR and SSIM were measured in the RGB space for denoising and deblurring. For deraining, the evaluation was done on the Y channel of the YCbCr color space, following previous works [52], [78].
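The learning-rate schedule (linear warm-up, then cosine annealing from 2 × 10^−4 down to 10^−6) can be sketched as follows; the warm-up length of 10k steps is an assumed value chosen for illustration, as the text states the strategy but not that number.

```python
import math

def lr_at(step, total=800_000, warmup=10_000, peak=2e-4, floor=1e-6):
    """Linear warm-up from 0 to the peak LR over `warmup` steps, then
    cosine annealing from the peak down to the floor by `total` steps."""
    if step < warmup:
        return peak * step / warmup
    t = (step - warmup) / (total - warmup)          # progress in [0, 1]
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * t))

schedule = [lr_at(s) for s in (0, 5_000, 10_000, 400_000, 800_000)]
```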

1) IMAGE DENOISING
We trained the CS-KiT on the SIDD [76] dataset, which consists of 320 high-resolution real noisy images. The quantitative comparison of image denoising methods on the SIDD [76] and DND [79] datasets covers the classical denoising method [14], CNN-based methods [7], [50], [51], [52], [77], [80], [81], [82], [85], [89], self-attention-based methods [44], transformer-based methods [23], and our previous work [32]. As the DND [79] dataset does not provide ground-truth labels, its results were obtained from the official benchmark. CS-KiT performs favorably against existing image denoising approaches. Specifically, compared to the previous winning method, Uformer, CS-KiT achieves gains of 0.1 dB on the SIDD dataset and 0.03 dB on the DND dataset. Fig. 6 shows a visual comparison of the proposed method with previous algorithms. While most of the other methods fail to restore the exact number from noise, CS-KiT correctly recovers the shape of the number, demonstrating its effectiveness in noise removal.

2) IMAGE DEBLURRING
We compared against state-of-the-art methods in image deblurring on the GoPro [49] and HIDE [90] datasets. This pattern also appears in the visual comparison shown in Fig. 7.
Although the results of Uformer [23] and KiT restore the shape of objects, they are still blurry and miss high-frequency details, whereas CS-KiT restores sharper textures than the other competing methods.

F. ABLATION STUDY
We conducted ablation studies to analyze the effectiveness of our method in various aspects. All experiments were conducted on SIDD [76] for the image denoising task.

1) VISUALIZATION OF THE k-NN PATCHES
Our method aims to preserve fine details while efficiently achieving non-local connectivity, which is realized by aggregating patches of different scales with similar characteristics.
To validate this visually, we visualize the patches belonging to the same chunk in Fig. 9. The leftmost images are divided into non-overlapping patches, where the patches marked with colored boxes represent query patches for the visualization. Since the KLA utilizes LSH for the k-NN search, similar k patches are clustered into a chunk, as shown in the figures on the right. Boxes of the same color denote the k patches belonging to the same chunk. The patches with blue boxes contain non-textured regions, while the patches with red and green boxes have similar patterns. This shows that the LSH effectively finds visually analogous patches. We also provide a visualization of the learned attention produced by dot products between the center pixel of the query patch and the k-NN patches.
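A minimal sketch of how random-projection LSH can group similar patches into chunks, assuming a single hash round and a patch count divisible by k (function names and the bit width are ours):

```python
import numpy as np

def lsh_hash(patches, n_bits=8, rng=None):
    """Random-projection LSH: project patch features onto random hyperplanes
    and pack the signs into an integer hash code."""
    rng = rng or np.random.default_rng(0)
    planes = rng.standard_normal((patches.shape[1], n_bits))
    bits = (patches @ planes > 0).astype(int)
    return (bits * (2 ** np.arange(n_bits))).sum(axis=1)

def group_into_chunks(patches, k):
    """Sort patches by hash code so similar patches land in the same chunk
    of size k; a stable sort keeps equal-hash patches in input order."""
    order = np.argsort(lsh_hash(patches), kind="stable")
    return order.reshape(-1, k)
```

Patches with similar feature vectors fall on the same side of most random hyperplanes, so they receive close (often identical) hash codes and end up adjacent after sorting, i.e., in the same chunk.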

2) THE HYPER-PARAMETERS k AND h
In CS-KiT, we also validated two hyper-parameters: the chunk size k and the number of hash rounds h. k determines the maximum number of patches used for the local attention, and h reduces the probability that similar patches fall into different hash buckets. These two scalable hyper-parameters trade off computational complexity against network capacity. Table 4 shows the denoising performance of the proposed method according to the two hyper-parameters. Similar to KiT, the best performance was achieved when both hyper-parameters were set to 16, but we set k and h to 4 as this setting yields comparable performance with relatively low computation.

3) COMPUTATIONAL COST
We provide a performance comparison with state-of-the-art image restoration methods with respect to accuracy and computational cost. Fig. 10 depicts the performance and computational cost of state-of-the-art methods. Other approaches are denoted by green circles, our previous work by a red circle, and the proposed CS-KiT by a red triangle. The x-axis and y-axis of the graphs indicate the performance evaluated by the PSNR and the computational cost measured in Multiply-Accumulate operations (MACs), respectively. The MACs in all graphs are measured for an input resolution of 256 × 256. The proposed method outperforms Uformer [23] at a competitive computational cost in image denoising on the SIDD dataset [76]. In image deraining and deblurring, the KiT shows slightly better performance with much less computational cost, and the CS-KiT further improves the performance.

4) CHUNK SHIFT
When the number of patches is indivisible by the chunk size, a chunk may contain isolated patches whose hash value differs from the other patches of the same chunk, as shown in Fig. 5. We deal with this problem by 1) sharing patch indexes in the successive block and 2) shifting the shared patch indexes to bridge a connection between adjacent chunks. Table 5 shows the denoising results on the SIDD dataset according to chunk shifting and sharing. When chunk indexes are only shared in the successive block and not shifted, no performance drop was observed, which implies that the k-NN relations of adjacent chunks are nearly identical. In addition, applying both shifting and sharing of chunk indexes in the successive block resulted in a slight increase in performance, while the computational cost was slightly reduced owing to the omission of the LSH in the successive block. By applying the chunk shift to the successive block, connectivity between similar patches belonging to different buckets was established without extra computation.

5) SCALE PREDICTION
Scale-aware patch embedding assumes that each patch has a representative scale in a continuous space. Thus, the softmax was adopted to estimate a continuous scale when merging scale-specific scores. We compared the performance of soft scale estimation (softmax) and hard scale estimation (Gumbel softmax [103]) in Table 6. Soft prediction achieves better performance than hard prediction, which supports our assumption that the representative scale estimator should yield a continuous value. Moreover, even the cross-scale aggregation with the Gumbel softmax surpasses our previous work [32], which aggregates patches of the same scale only, implying that cross-scale aggregation is essential in image restoration.
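The soft scale estimation can be sketched as a softmax-weighted merge of the n_s scale-specific branch outputs; the shapes and function names below are illustrative, not the paper's exact implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def soft_scale_merge(branch_outputs, scale_scores):
    """Merge scale-specific branch outputs with softmax weights,
    yielding a continuous (soft) scale estimate per patch.

    branch_outputs: (n_s, d) features from the multi-scale conv branches
    scale_scores:   (n_s,)  predicted score per scale
    """
    w = softmax(scale_scores)                 # continuous weights summing to 1
    return (w[:, None] * branch_outputs).sum(axis=0)
```

Replacing `softmax` with a Gumbel softmax sample would make the weights approach a one-hot vector, i.e., the hard scale selection compared against in Table 6.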

V. CONCLUSION
We presented a transformer-based image restoration network, the cross-scale k-NN image transformer (CS-KiT), that meets the essential conditions of locality, non-locality, and cross-scale aggregation through a novel attention mechanism, k-NN local attention (KLA). The core idea of KLA is to group similar patches in the whole image and conduct local attention on the spatially grouped patches. To handle the quadratic computational complexity of a brute-force k-NN search, we adopt locality-sensitive hashing (LSH), an approximate k-NN method with linear complexity. In addition, scale-aware patch embedding projects each patch to different scales to form mixed-scale patches. By feeding mixed-scale patches into a transformer block, cross-scale aggregation is carried out while conducting self-attention. The chunk shift handles the problem of isolated patches that occur when the patch sequence is indivisible by the chunk size. By sharing and shifting patch indexes in the successive block, the KLA enhances non-locality while saving k-NN computations. We demonstrated that the proposed CS-KiT achieves performance superior to the state-of-the-art methods on various image restoration benchmarks, both quantitatively and qualitatively.

A. FUTURE WORKS
Due to the lack of inductive bias, transformer-based approaches typically require more training data than their CNN counterparts. In visual recognition tasks, self-supervised pre-training with a large-scale dataset (ImageNet) shows significant improvements compared to training from scratch.
As not many datasets exist for image restoration, a few pre-training strategies have been investigated to solve the data-hungry issue. This has been partly addressed by pre-training the networks with synthetic degradations such as Gaussian noise or rain streaks, but two limitations remain unresolved. First, the domain gap between synthetic and real degradation makes transferring the pre-trained network to the downstream task less effective. Second, most works pre-train separate networks for different tasks. On account of this complex pre-training stage, pre-training has not been widely used in image restoration. In future work, we will continue to investigate pre-training strategies leveraging real-world degradation datasets and a unified image restoration model across different degradation factors.