Lightweight Single Image Super-Resolution via Efficient Mixture of Transformers and Convolutional Networks

In this paper, we propose a Local Global Union Network (LGUN), which effectively combines the strengths of Transformers and Convolutional Networks to develop a lightweight and high-performance network suitable for Single Image Super-Resolution (SISR). Specifically, we draw on the advantages of Transformers, which provide input-adaptive weighting and global context interaction, and of Convolutional Networks, which contribute spatial inductive biases and local connectivity. In the shallow layers, local spatial information is encoded by Multi-order Local Hierarchical Aggregation (MLHA). In the deeper layers, we utilize Dynamic Global Sparse Attention (DGSA), which is based on the Multi-stage Token Selection (MTS) strategy, to model global context dependencies. Moreover, we conduct extensive experiments on both natural and satellite datasets, acquired through optical and satellite sensors, respectively, demonstrating that LGUN outperforms existing methods.


Introduction
Single Image Super-Resolution (SISR) is a prominent research field in computer vision that focuses on enhancing the visual details and overall appearance of low-resolution (LR) images by generating high-resolution (HR) versions. It has diverse applications across domains such as surveillance [1][2][3][4], medical imaging [5,6], satellite imagery [7,8], and monitoring [9,10]. Recent advancements in SISR techniques have leveraged advanced algorithms and deep learning models to effectively recover missing high-frequency details and textures from LR inputs, enabling significant improvements in resolution and visual quality.
Convolutional Networks are widely adopted for various visual tasks, including SISR [11,12]. The inherent properties of convolutional operations, such as the ability to aggregate information from adjacent pixels or regions, e.g., 3 × 3 windows, make them effective at capturing spatially local patterns. These properties, including translation invariance, local connectivity, and the sliding-window strategy, provide valuable inductive biases. However, Convolutional Networks suffer from two main limitations. Firstly, they have a local receptive field, restricting their ability to model global context. Secondly, the interaction between spatial locations is fixed through a static convolutional kernel during inference, limiting their flexibility to adapt to varying input content. Transformers, on the other hand, offer a solution to address these limitations. By introducing self-attention (SA) in Vision Transformers (ViTs), global interactions can be explicitly modeled, and the importance of each token can be dynamically adjusted through attention scores computed between all pairs of tokens during inference. However, the computational complexity of Transformers, which grows quadratically with the token length N (or spatial resolution HW), poses challenges for real-world applications on resource-constrained hardware. This leads to the following natural question: How can we effectively combine the strengths of Convolutional Networks and ViTs to develop a lightweight and high-performance network suitable for resource-constrained devices?
In this work, we address the aforementioned question by focusing on the design of a lightweight and high-performance network for SISR tasks. Figure 1 compares the performance of our work with that of other methods. Our proposed approach, named LGUN, leverages the advantages of Convolutional Networks, such as spatial inductive biases and local connectivity, as well as Transformers, which offer input-adaptive weighting and global context interaction. Our core concept is illustrated in Figure 2. Compared to uni-dimensional information communication, e.g., spatial-only communication such as EIMN [13] or channel-only communication such as Restormer [14], our method can achieve local spatial-wise aggregation and global channel-wise interaction simultaneously, both of which are crucial for SISR tasks. As is commonly known, the shallow layers of a Convolutional Network employ convolutional filters with smaller receptive fields, capturing local patterns and features such as edges, corners, and textures. These low-level features are extracted in the initial layers, providing local information about the input data. By stacking multiple building blocks, Convolutional Networks gradually enlarge their receptive fields, enabling the capture of large-range spatial context information. Based on this prior knowledge, as shown in Figure 3, we divide the core module, named Local Global Union (LGU), into two stages: Multi-order Local Hierarchical Aggregation (MLHA) and Dynamic Global Sparse Attention (DGSA). In the shallow layers, we employ MLHA to encode local spatial information efficiently. This approach feeds each sub-branch with only a subset of the entire feature, facilitating the explicit learning of distinct feature patterns through the Split-Transform-Fusion (STF) strategy. In the deep layers, we introduce DGSA to model long-range non-local dependencies while obtaining an effective receptive field of H × W.
DGSA operates across the feature dimension, utilizing interactions based on the cross-covariance matrix between keys and queries. Considering the potential negative impact of irrelevant or confusing information in the attention matrix, which other methods [14] fail to consider, we incorporate the Multi-stage Token Selection (MTS) strategy into DGSA, which selects multiple top-k similar attention matrices and masks out insignificant elements allocated with lower weights. This reduces redundancy in attention maps and suppresses interference from cluttered backgrounds. The proposed design is robust to changes in the input token length and decreases the computational complexity to O(NC²), where C ≪ N.
Our contributions can be summarized as follows: (1) We propose LGUN, a hybrid structure designed for resource-constrained devices. It combines the strengths of Convolutional Networks and ViTs, allowing the proposed LGU to effectively encode both local processing and global interaction throughout the network. (2) In the shallow layers, we employ MLHA to focus on encoding local spatial information. By using the STF strategy, MLHA promotes the learning of different patterns while also saving computational resources. In the deep layers, we utilize DGSA, based on the MTS strategy, to model global context dependencies. This enhances the network's ability to model complex image patterns with high adaptability and representational power. (3) Experimental results on popular benchmark datasets demonstrate the superiority of our method compared to other recently advanced Transformer-based approaches. Our method outperforms in both quantitative and qualitative evaluations, providing evidence for the effectiveness of the MLHA-with-STF strategy and the DGSA-with-MTS strategy. LGUN exhibits robustness to changes in the input token length and significantly reduces the computational complexity to O(NC²), where C ≪ N.

Convolutional Networks
Classical SISR. Since the introduction of SRCNN [15], Convolutional Networks have emerged as superior solutions for SISR tasks [16]. Over the past decade, numerous novel ideas have been proposed or introduced in this field. These include residual learning [11], densely connected networks [17], neural architecture search (NAS) [18], knowledge distillation [19], channel attention [20], spatial attention [21], non-local attention [22], SA [23], etc. The general trend towards achieving higher performance in SISR is to design deeper and more complex networks. However, these methods often come at the cost of increased computational requirements, making it challenging to deploy them on resource-constrained mobile devices for practical applications.
Efficient SISR. To make Convolutional Networks suitable for computationally limited platforms such as mobile devices, methods such as pruning, NAS, knowledge distillation, reparameterization, and efficient design of convolutional layers have been proposed. Pruning removes insignificant connections or neurons from a network to reduce its size and complexity, thereby improving generalization ability and computational speed. NAS [24] automates the search for the optimal neural structure by exploring various combinations of structures across platforms with varying computational capabilities. Knowledge distillation [19], a method for training smaller models, transfers knowledge from larger, more complex models to enhance performance while reducing computational requirements. Structural reparameterization [25] utilizes a multi-branch architecture during training and switches to a plain network during testing to achieve faster inference speed. Efficient convolutional layers, such as depth-wise convolution [26] and convolutional factorization [27], reduce computational resources while maintaining high performance. These design concepts have significantly contributed to the advancement of SISR. However, many existing methods either focus on local spatial information and lack global context understanding, or have high computational complexity that limits their applicability to edge devices. In this work, we propose a hybrid structure called LGUN that combines the strengths of Convolutional Networks (e.g., spatial inductive biases and local connectivity) and Transformers (e.g., input-adaptive weighting and global context processing). Notably, our approach achieves a superior trade-off between complexity and performance (Parameters/Multi-Adds @ PSNR/SSIM: 675K/141G @ 38.24/0.9618).
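To make the parameter savings of depth-wise designs concrete, the following sketch compares the parameter count of a standard convolution with that of a depth-wise separable one. The layer sizes (3 × 3 kernel, 64 input and output channels) are hypothetical illustrations, not LGUN's actual configuration:

```python
# Parameter count: standard conv vs. depth-wise separable conv.
# Layer sizes are illustrative assumptions, not LGUN's configuration.
k, c_in, c_out = 3, 64, 64

standard = k * k * c_in * c_out                    # one k x k filter per (in, out) channel pair
depthwise_separable = k * k * c_in + c_in * c_out  # k x k per input channel + 1x1 point-wise mix

print(standard, depthwise_separable)               # 36864 4672
print(round(standard / depthwise_separable, 1))    # 7.9 (roughly 8x fewer parameters)
```

The same arithmetic applies per layer to Multi-Adds, which is why depth-wise convolutions recur in the efficient designs surveyed above.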

Transformers
Pioneer work. Recently, Transformers have attracted significant interest in the computer vision community, thanks to their success in the natural language processing (NLP) field. Several studies have explored the benefits of using Transformers in vision tasks, e.g., FAT [28] and RISTRA [29]. The seminal Vision Transformer (ViT) [30] applies a standard Transformer architecture directly to 2D images for visual recognition and demonstrates promising results. The Image Processing Transformer (IPT) [23] leverages the power of the Transformer to achieve superior performance on various image restoration tasks, such as SR, denoising, and deraining. However, the quadratic computational cost makes it difficult to apply the SA mechanism to the SISR task.
Efficient Transformers. Numerous efforts have been made to reduce complexity while maintaining performance in order to make Transformers more suitable for vision tasks. For instance, Swin Transformer [31] and SwinIR [32] limit the SA calculation to non-overlapping local windows instead of the global scope and introduce a shift operation for cross-window interaction. This approach significantly reduces computational complexity on HR feature maps while capturing local context. Similarly, Shuffle Transformer [33] and HaloNet [34] utilize spatial shuffle and halo operations, respectively, instead of shifted window partitioning. MobileViT [35] employs element-wise operations as replacements for computationally and memory-intensive operations, such as batch-wise matrix multiplication and softmax, to compute context scores. Linformer [36] substitutes self-attention with low-rank approximation operations. Axial self-attention [37] achieves longer-range dependencies in the horizontal and vertical directions by performing SA within each single row or column of the feature map. CSWin [38] proposes a cross-shaped window SA region that includes multiple rows and columns, while Pale Transformer [39] performs SA within a pale-shaped region composed of the same number of interlaced rows and columns of the feature map. Although these methods achieve a trade-off in performance across various vision tasks, the dependencies in the SA layer are limited to local regions to reduce computational complexity, resulting in insufficient context modeling. This limitation restricts the modeling capacity of the entire network. In this study, we propose DGSA, which models long-range non-local dependencies while achieving an effective receptive field of H × W that operates across the feature dimension. The interactions are based on the cross-covariance matrix between keys and queries. Importantly, the computational complexity is only linear, O(NC²), rather than quadratic, O(N²C), where C is much smaller than N.
Sparse Transformers. In addition, global attention computes attention matrices over all image patches (tokens), prompting the question of whether every element in the sequence actually needs to be attended to. The answer is no. The inherent dense calculation pattern of the SA mechanism amplifies the weights of relatively low similarities, rendering the feature interaction and aggregation process susceptible to implicit noise. Consequently, redundant or irrelevant representations continue to influence the modeling of global feature dependencies. Numerous studies have demonstrated that the adoption of sparse attention matrices can enhance model performance while reducing memory usage and computational requirements. For instance, Sparse Transformer [40] employs a factorized operation to mitigate complexity and suggests reducing the spatial dimensions of attention's key and value matrices. Explicit Sparse Transformer [41] improves attention concentration on the global context by explicitly selecting the most relevant segments in natural language processing (NLP) tasks. EfficientViT [42] further addresses redundancy in attention maps by explicitly decomposing the computation of each head and feeding the heads with diverse features. In this study, instead of computing the attention matrix for all query-key pairs as in the conventional SA mechanism, we adopt a selective approach in the proposed DGSA. Specifically, we choose the top-k most similar keys and values for each query. However, a single predefined k value amounts to hard coding, potentially impeding the relational learning between pairwise pixels. To mitigate this issue, we generate multiple attention matrices with different degrees of sparsity by employing multiple k values. These matrices are then weighted by adaptively learned coefficients for fusion. Our approach can give higher attention to high-contributing regions while applying stronger suppression to low-contributing regions.

Combination of Transformers and Convolutional Networks
Several works have incorporated classical design principles of Convolutional Networks into Transformers. These include (1) preserving the locality property [43][44][45][46][47][48] and (2) adopting specific network architectures such as U-Net [14,[49][50][51], hierarchical pyramid-like structures [52][53][54], and two-stream architectures [55]. On the other hand, MobileViT [35] and MobileFormer [56] successfully combine MobileNet [57] and ViT [30] to achieve competitive results on mobile devices. HAT [58] introduces a hybrid network with parallel branches for channel attention and multi-head self-attention (MHSA) to reconstruct individual pixels or small regions. ACT [59] utilizes both Transformer and convolution branches and implements a fuse-split strategy to efficiently aggregate local-global information at each stage. In this work, we propose a novel hybrid structure, named LGUN, which leverages the advantages of Convolutional Networks, such as spatial inductive biases and local connectivity, and combines them with Transformers' input-adaptive weighting and global context processing. By encoding shallow, fine-grained local information and effectively interacting with deep global contextual information, our approach achieves a superior complexity-performance trade-off (Parameters/Multi-Adds @ PSNR/SSIM: 542K/113G @ xxx).

Overall Architecture
The proposed network architecture consists of three primary components: (1) feature extraction FE(·), (2) nonlinear mapping NLM(·), and (3) reconstruction REC(·). The input and output of the model are denoted as I_LR ∈ R^(H×W×3) and I_SR ∈ R^(H×W×3), respectively. In the initial stage, I_LR undergoes an overlapped image patch embedding process, where a 3 × 3 convolution layer is applied at the beginning of the network. This results in feature maps F_embed ∈ R^(H×W×C). Subsequently, F_embed passes through N stacked blocks to facilitate the learning of local and global relationships. The final reconstructed result is obtained as I_SR = REC(NLM(FE(I_LR))).
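At shape level, the three-stage pipeline can be sketched as follows. This is a minimal NumPy mock in which random projections stand in for the learned FE, NLM, and REC components, and the ×2 upscaling via pixel-shuffle-style reshaping is an assumption for illustration:

```python
import numpy as np

H, W, C, scale = 24, 24, 64, 2
rng = np.random.default_rng(0)

def fe(i_lr):
    # FE: overlapped 3x3 patch embedding, mocked as a per-pixel channel projection
    w = rng.standard_normal((3, C))
    return i_lr @ w                               # (H, W, 3) -> (H, W, C)

def nlm(f):
    # NLM: N stacked LGU blocks, mocked as residual identity updates
    for _ in range(4):
        f = f + 0.1 * np.tanh(f)
    return f

def rec(f):
    # REC: project to scale^2 * 3 channels, then pixel-shuffle to the HR grid
    w = rng.standard_normal((C, scale * scale * 3))
    x = (f @ w).reshape(H, W, scale, scale, 3)
    return x.transpose(0, 2, 1, 3, 4).reshape(H * scale, W * scale, 3)

i_lr = rng.random((H, W, 3))
i_sr = rec(nlm(fe(i_lr)))
print(i_sr.shape)                                 # (48, 48, 3)
```

The mock only verifies the tensor bookkeeping of I_SR = REC(NLM(FE(I_LR))); the real components are learned convolutional and attention layers.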

LGU
The core modules of LGU, as depicted in Figure 3, include Multi-order Local Hierarchical Aggregation (MLHA) and Dynamic Global Sparse Attention (DGSA). The MLHA module efficiently encodes local spatial information by feeding each sub-branch with a subset of the entire feature, facilitating the explicit learning of distinct feature patterns. On the other hand, the DGSA module aims to model long-range non-local dependencies by leveraging interactions across feature dimensions, resulting in an effective global receptive field. This design ensures robustness to changes in the input token length while reducing computational complexity to O(NC²), where C ≪ N. More specific details are provided below. In the deep layer, each block applies Z′ = Z + DGSA(Norm(Z)).

Multi-Order Local Hierarchical Aggregation (MLHA)
In the shallow layer of our method, we employ MLHA to focus on encoding local spatial information.By using the Split-Transform-Fusion (STF) strategy, MLHA promotes the learning of different patterns while also saving computational resources.
Given the input feature X ∈ R^(H×W×C), it passes through three consecutive units: Linear-MLHA-Linear. The specific details of MLHA are as follows. Firstly, split. The input feature F_in ∈ R^(H×W×C) is divided into m subparts denoted by x_i, where i ∈ {1, 2, ..., m}. Each subpart has the same spatial size of H × W and a channel number of C/m.
Secondly, transform. Each subpart feature x_i is individually processed by a large kernel convolutional sequence (LKCS), denoted as LKCS_i(·), which performs self-adaptive recalibration of the subpart features. Each LKCS_i(·) has a similar structure: a k1 × k1 depth-wise convolution (DW-Conv), a k2 × k2 depth-wise dilated convolution (DW-D-Conv), and a k3 × k3 convolution (Conv).
Finally, fusion. MLHA integrates multiple re-weighting LKCS_i(·) processes, enabling the modeling of spatial pixel relationships and the interaction of multi-order context information for input-content self-adaptation. Specifically, each subpart feature x_i (i > 1) is added to the output of LKCS_{i-1}(·) and then passed to the next branch LKCS_i(·) for further processing. The output feature y_i of LKCS_i(·) corresponds to the input x_i and is passed to the concatenation layer. The concatenation layer aggregates large-range spatial relationships and multi-order context information, treating them as weight matrices for self-adaptive modulation of the input feature F_in. By effectively mining the underlying relevance of F_in, positions with high scores receive adequate attention while insignificant positions are suppressed. This flexible and effective modulation of the feature representation promotes the modeling of complex image patterns with high adaptability and representational power. The process can be expressed as y_1 = LKCS_1(x_1), y_i = LKCS_i(x_i + y_{i-1}) for i > 1, and F_out = F_in ⊙ Concat(y_1, ..., y_m).
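The split-transform-fusion flow above can be sketched at tensor level as follows. This is a simplified NumPy mock in which tanh stands in for each learned LKCS_i branch; the branch count and feature shapes are illustrative assumptions:

```python
import numpy as np

def mlha_stf(f_in, m=4):
    """Split-Transform-Fusion sketch; np.tanh mocks the learned LKCS_i branches."""
    outs, prev = [], None
    for i, x_i in enumerate(np.split(f_in, m, axis=-1)):  # split: m subparts of C/m channels
        if i > 0:
            x_i = x_i + prev       # cascade: add output of LKCS_{i-1} into branch i
        y_i = np.tanh(x_i)         # stand-in for LKCS_i (DW-Conv, DW-D-Conv, Conv)
        outs.append(y_i)
        prev = y_i
    weights = np.concatenate(outs, axis=-1)  # fusion: concat multi-order branch outputs
    return f_in * weights                    # self-adaptive modulation of the input

f_in = np.random.default_rng(0).random((8, 8, 16))
print(mlha_stf(f_in).shape)                  # (8, 8, 16)
```

Because each branch only sees C/m channels, the per-branch convolution cost shrinks accordingly, which is the computational motivation behind the STF strategy.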

Dynamic Global Sparse Attention (DGSA)
The token-based SA mechanism calculates the weight matrix along the token dimension. However, the quadratic increase in computational complexity as the sequence length N grows makes it unsuitable for long sequences and high-resolution images. To address this, compromise solutions have been proposed with two approaches: (1) replacing global SA with local SA, which restricts the SA calculation to local windows, and (2) reducing the sequence length of the key and the value through pooling or strided convolution. However, the former method can only capture dependencies within a limited local range, thus constraining the modeling capacity of the entire network to a local region. The latter method, on the other hand, may result in excessive downsampling, leading to information loss or the confusion of relationships, which contradicts the purpose of SISR. In this work, we present an efficient solution that enables global interactions in SA with linear complexity. Instead of considering global interactions between all tokens, we propose the use of Dynamic Global Sparse Attention (DGSA), which operates across feature channels rather than tokens. In DGSA, the interactions are based on the cross-covariance matrix computed over the key and query projections of the token features. The specific details are as follows. Consider an input token sequence X ∈ R^(N×D), where N and D denote the length and dimension of the input sequence, respectively. DGSA first generates the query Q, key K, and value V from X using linear projection layers, where W_q, W_k, and W_v ∈ R^(D×D_h) are learnable weight matrices and D_h is the number of projected dimensions. Next, the output of DGSA is computed as a weighted sum over the N value vectors. Importantly, DGSA has a linear complexity of O(N), rather than the O(N²) of vanilla SA.
As mentioned in the Introduction, to address the potential negative impact of irrelevant or confusing information in the SISR task, we introduce a Multi-stage Token Selection (MTS) strategy. As shown in Figure 4, this strategy involves selecting the top-k similar tokens from the keys for each query in order to compute the attention weight matrix. To achieve this, we employ multiple different k values in parallel, resulting in multiple attention matrices with varying degrees of sparsity. The final output is obtained by combining these matrices through a weighted sum. In the DGSA with MTS, w_1, w_2, and w_3 represent the assigned weights, which are obtained through dynamic adaptive learning by the network with an initial value of 0.1, and T_{k_n}(·) is the dynamic learnable row-wise top-k selection operator. We set the Multi-stage Token Selection thresholds k_1, k_2, and k_3 to 1/2, 2/3, and 3/4, respectively. In conclusion, DGSA offers two significant advantages. Firstly, it enables the modeling of global correlations by selecting the most similar tokens from the entire attention matrix while effectively filtering out irrelevant ones. Secondly, by employing a weighted sum of multiple attention matrices with varying degrees of sparsity, the model can adequately capture the underlying relevance between all pairs of positions. This approach assigns higher weights to positions of greater importance while suppressing insignificant positions. Consequently, it facilitates the identification of crucial features and their effective utilization in subsequent processing steps. Through this mechanism, our method adaptively selects high-contributing scores from input elements, promoting the modeling of complex image patterns with enhanced adaptability and representational power.
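The channel-wise attention with multi-stage top-k selection can be sketched as follows. This NumPy mock uses identity Q/K/V projections in place of the learned W_q, W_k, and W_v; the threshold fractions 1/2, 2/3, 3/4 and the initial weights of 0.1 follow the text, while everything else is an illustrative assumption:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dgsa_mts(x, ks=(1/2, 2/3, 3/4), ws=(0.1, 0.1, 0.1)):
    """x: (N, C) token sequence. Identity projections mock the learned W_q, W_k, W_v."""
    q = k = v = x
    cov = q.T @ k                            # (C, C) cross-covariance: cost linear in N
    out = np.zeros_like(cov)
    for frac, w in zip(ks, ws):              # multi-stage top-k selection T_{k_n}
        kk = max(1, int(frac * cov.shape[-1]))
        thresh = np.sort(cov, axis=-1)[:, -kk][:, None]
        masked = np.where(cov >= thresh, cov, -np.inf)  # mask low-similarity entries row-wise
        out += w * softmax(masked, axis=-1)             # weighted sum of sparse attention maps
    return (out @ v.T).T                     # aggregate values -> (N, C)

x = np.random.default_rng(0).random((100, 16))
print(dgsa_mts(x).shape)                     # (100, 16)
```

Note that the (C, C) attention matrix is independent of the token length N, which is what makes the mechanism robust to input-resolution changes and yields the O(NC²) cost.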

Feed-Forward Network (FFN)
The original Feed-Forward Network (FFN) has limitations in modeling local patterns and spatial relationships, which are crucial for SISR. The inverted residual block (IRB) incorporates a depth-wise convolution between two linear transform layers, enabling the aggregation of local information among neighboring pixels within each channel. Building upon this idea, we adopt the IRB's design paradigm and replace the point-wise convolutional layers in the vanilla FFN with a combination of depth-wise convolutions and squeeze-and-excitation modules. This modification captures local patterns and structures effectively. Further details are provided below:

FFN(X) = Linear(σ(SAL(Linear(X)))) (10)

where σ indicates the GELU nonlinear activation function and SAL indicates the spatial awareness layer.
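A minimal NumPy sketch of this FFN variant follows. A 3-tap smoothing along the token dimension stands in for the depth-wise spatial awareness layer SAL, and the expansion ratio and tanh-based GELU approximation are assumptions for illustration:

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def sal(h):
    # SAL mock: depth-wise 3-tap smoothing along the token axis (per channel)
    p = np.pad(h, ((1, 1), (0, 0)), mode="edge")
    return (p[:-2] + p[1:-1] + p[2:]) / 3.0

def ffn(x, expand=2):
    n, c = x.shape
    rng = np.random.default_rng(0)
    w1 = rng.standard_normal((c, expand * c)) / np.sqrt(c)           # Linear (expand)
    w2 = rng.standard_normal((expand * c, c)) / np.sqrt(expand * c)  # Linear (project back)
    return gelu(sal(x @ w1)) @ w2    # FFN(X) = Linear(sigma(SAL(Linear(X))))

x = np.random.default_rng(1).random((32, 8))
print(ffn(x).shape)                  # (32, 8)
```

The ordering matches Equation (10): inner Linear, then SAL, then the nonlinearity σ, then the outer Linear.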

Discussion
As mentioned earlier, our method combines the strengths of Convolutional Networks, such as spatial inductive biases and local connectivity, with Transformers, which provide input-adaptive weighting and global context processing. This integration allows us to achieve a favorable balance between complexity and performance. The advantages of our approach can be summarized as follows: (1) Fine-grained local modeling. The MLHA incorporates a re-weighting process into both the sub-branch and entire features. By utilizing the extracted convolutional features as weight matrices, we can self-adaptively re-calibrate the input representations, effectively capturing spatial relationships and enabling multi-order feature interactions. This approach ensures that important positions receive appropriate focus while insignificant positions are suppressed. It is worth noting that each sub-branch feature x_i can receive features from all subparts x_j, j ≤ i, which pass through large kernel convolutional sequences, resulting in a larger receptive field.
(2) Efficient global interaction. The DGSA is capable of modeling long-range non-local dependencies while obtaining an effective global receptive field. The interactions in DGSA operate across feature dimensions and are based on the cross-covariance matrix between keys and queries. To avoid interference with subsequent super-resolution tasks, our MTS strategy selects multiple top-k similarity scores between queries and keys for attention matrix calculation. This strategy masks out insignificant elements with lower weights, reducing redundancy in attention maps and suppressing cluttered-background interference, thereby facilitating better feature aggregation.
(3) Linear complexity. Our method remains robust to changes in the input token length while achieving a linear computational complexity of O(NC²), where C ≪ N. This enables flexible and effective modeling of feature representations, promoting the capture of complex image patterns with high representational power.

Implementation Details
Our proposed method comprises 16 fundamental building blocks, with each block having 64 channels. Minor channel adjustments are made only in the image reconstruction part for the ×2, ×3, and ×4 scales. To evaluate the effectiveness of our proposed method, we tested it on five common benchmark datasets: Set5 [60], Set14 [61], BSD100 [62], Urban100 [63], and Manga109 [64]. We measured the average peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) on the luminance (Y) channel of the YCbCr space. Our method was implemented using PyTorch 1.12.0 and trained on a single NVIDIA RTX 3090 GPU. Further hyper-parameters of the training process are shown in Table 1.

Comparison with State-of-the-Art (SOTA) Methods
To validate the effectiveness of our method, we present the reconstruction results obtained by various SR models on both natural and satellite remote sensing images. These images were captured using common optical sensors (e.g., CMOS) as well as satellite sensors (e.g., millimeter-wave sensors). First, we verify the effectiveness of our proposed method on natural images. In Section 4.2.3, we verify its effectiveness on satellite remote sensing images.
In Figure 5, we present the qualitative comparison results for different methods at an upscale factor of ×4. For the images "img 024", "img 067", "img 071", "img 073", and "img 076" in the Urban100 dataset, our method demonstrates superior reconstruction of lattice and text patterns with minimal blurriness and artifacts compared to other methods. This observation confirms the usefulness and effectiveness of our approach. Taking the image "img 024" as an example, our method accurately generates stripes with the correct direction and minimal blurring, while the other methods produce incorrect stripes and noticeable blur over a wide range.

LAM Results.
In Figure 6, we analyze the local attribution map (LAM [76]) results for SwinIR [32], AAN [72], LMAN [26], and our method to investigate the range of pixels in the input image utilized during the reconstruction of the selected area. We employ the diffusion index (DI) as an evaluation metric to assess the model's ability to extract features and utilize relevant information. As illustrated in Figure 6, our method utilizes a larger range of pixel information in reconstructing the area outlined by the red box. This observation demonstrates that our approach achieves a larger receptive field through efficient local and global interaction.
To facilitate intuitive comparisons, we present a heat map, as shown in Figure 7, illustrating the differences in interest areas between the SR networks (referred to as "Diff"). It can be observed that the proposed LGUN exhibits a more extensive diffusion region compared to CARN [70], EDSR [12], SwinIR [32], and AAN [72]. This indicates that our designs enable the exploitation of a greater amount of intra-frame information while maintaining limited network complexity. This is primarily attributed to the MLHA and DGSA employed in LGUN, which facilitate the learning of diverse information ranges and the selective retention of spatial textures deemed useful. (In Figure 7, the red areas represent the LAM interest areas of CARN [70], EDSR [12], SwinIR [32], and AAN [72], while the blue areas represent the additional LAM interest areas of the proposed LGUN, which has a higher diffusion index.)

Remote Sensing Image Super-Resolution
Satellite sensors play a vital role in remote sensing by capturing images and data of the Earth's surface from space. These sensors are mounted on Earth-orbiting satellites and are specifically designed to gather information across multiple wavelengths of the electromagnetic spectrum. Remote sensing images obtained from satellite sensors offer valuable insights for a wide range of applications, including environmental monitoring, land use classification, disaster management, and climate studies.
One crucial task in remote sensing is SISR, which aims to enhance the resolution of satellite images. Higher-resolution images provide more accurate and detailed information about the Earth's surface, which is crucial for various applications. Therefore, SISR plays a pivotal role in maximizing the usefulness of remote sensing data. To demonstrate the effectiveness of our proposed method in enhancing remote sensing images obtained from satellite sensors, we present the SISR results of different networks in Figure 8. Our network exhibits clear advantages in recovering remotely sensed images, particularly in capturing texture details, lines, and repetitive structures. In contrast, other comparison algorithms often introduce artifacts and blending issues when dealing with remote sensing images that have complex backgrounds. At the same time, our network effectively mitigates blurring artifacts and reconstructs edge details with higher fidelity.

Ablation Study
In Table 3, we present the results of the ablation study for our method. Below, we discuss the ablation results based on the following aspects. The influence of the structure configuration. The primary objective of this study was to efficiently encode local spatial information, model long-range non-local dependencies, and achieve a global receptive field by leveraging the strengths of Convolutional Networks, which provide spatial inductive biases and local connectivity, and Transformers, which offer input-adaptive weighting and global context interaction. To validate the effectiveness of the two core modules, namely MLHA and DGSA, we conducted experiments in which one module was removed while the other was retained. The results, presented in Table 3(a), demonstrate a significant decrease in model performance when either module is removed. These findings indicate that the model benefits from both the global interaction introduced by the DGSA module and the fine-grained local modeling achieved by MLHA.
The influence of the MLHA part. In the initial layers of our model, we utilize MLHA to efficiently encode local spatial information. This is achieved by feeding each sub-branch with a specific subset of the complete feature. The effectiveness of the STF strategy is demonstrated in Table 3(b), where it is shown to enhance the explicit learning of distinct feature patterns within the network, leading to improved performance compared to models trained without the STF strategy. The influence of the design of LKCS in the MLHA part. We conducted an experiment to verify the effectiveness of the three LKCS modules in our MLHA. Each LKCS module consists of three convolution layers: a DW-Conv layer, a DW-D-Conv layer, and a Conv layer; the three modules differ in the kernel sizes of these layers. In the first LKCS module, the kernel sizes are 3, 5, and 1; in the second, 5, 7, and 1; and in the third, 7, 9, and 1. To demonstrate the benefit of extracting features with different kernel sizes, we also conducted an experiment in which the three LKCS modules were identical, with the kernel sizes of all three convolution layers set to 5, 7, and 1. The results, shown in Table 3(d), confirm the effectiveness of our proposed LKCS module.

Applications
There are many potential applications of lightweight image super-resolution. For example, in surveillance, SR techniques can enhance video resolution, making images sharper and clearer so that details such as facial features and license plate numbers can be identified more easily, thereby improving security. In medical imaging, SR technology can improve the clarity of medical images and help doctors diagnose conditions more accurately. In satellite imagery, SR technology can improve image quality and make remote sensing data analysis more accurate, with applications in environmental monitoring, urban planning, and other fields. The lightweight SR method is particularly suitable for resource-constrained devices and real-time processing scenarios owing to its low computation and storage requirements.

Conclusions
The aim of this study is to develop a lightweight and high-performance network for SISR by effectively combining the strengths of Transformers and Convolutional Networks. To achieve this objective, we propose a novel lightweight SISR method called LGUN.
LGUN encodes local spatial information within MLHA and utilizes the Split-Transform-Fusion (STF) strategy to facilitate the learning of diverse patterns. Additionally, it models global context dependencies through its core module, DGSA. DGSA selects multiple top-k similar attention matrices and masks out elements with lower weights, thereby reducing redundancy in attention maps and suppressing interference from cluttered backgrounds. The experimental results, evaluated on popular benchmarks, demonstrate the superior quantitative and qualitative performance of our method.

Figure 1. Trade-off between performance and model complexity on the Set5 ×4 dataset. Multi-Adds are calculated on 1280 × 720 HR images.

Figure 2. Compared to uni-dimensional information communication, e.g., spatial-only or channel-only, our method can achieve local spatial-wise aggregation and global channel-wise interaction simultaneously, both of which are crucial for SISR tasks.

Figure 3. The architecture of our proposed method, LGUN, consists of three main parts: feature extraction, nonlinear mapping, and image reconstruction. The core modules, named LGU, include two stages: MLHA and DGSA. In the shallow layers, MLHA efficiently encodes local spatial information by utilizing subsets of the entire feature, enabling explicit learning of distinct feature patterns through the STF strategy. In the deep layers, DGSA is employed to model long-range non-local dependencies while achieving a global effective receptive field. DGSA operates across the feature dimension and leverages interactions based on the cross-covariance matrix between keys and queries. Moreover, we incorporate the MTS strategy into DGSA, which selects multiple top-k similar attention matrices and masks out elements with lower weights. This reduces redundancy in attention maps and suppresses interference from cluttered backgrounds. LGUN is robust to changes in the input token length and significantly reduces the computational complexity to O(NC²), where C ≪ N.

Figure 4. Multiple attention matrices. Taking one head as an example (D = D_h), w1, w2, w3, and w4 represent the assigned weights, which are obtained through dynamic adaptive learning of the network. We set the Multi-stage Token Selection thresholds k1, k2, k3, and k4 to 1/2, 2/3, 3/4, and 4/5, respectively.
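The weighted combination described in this caption can be sketched as below: for each threshold, keep only the largest entries per row of the C × C attention logits, mask the rest before the softmax, and blend the resulting maps. Fixed equal weights stand in for the learned w1–w4, a hypothetical simplification.

```python
import torch

def mts_attention(attn, ks=(1/2, 2/3, 3/4, 4/5),
                  weights=(0.25, 0.25, 0.25, 0.25)):
    """Multi-stage Token Selection (sketch).

    attn: (B, C, C) pre-softmax attention logits. For each ratio k_i,
    the top round(k_i * C) entries per row are kept, the rest are set
    to -inf, and the softmaxed maps are blended with weights w_i
    (fixed here; learned in the paper).
    """
    C = attn.size(-1)
    out = 0.0
    for k, w in zip(ks, weights):
        keep = max(1, int(round(k * C)))
        # Per-row cut-off: value of the keep-th largest entry.
        thresh = torch.topk(attn, keep, dim=-1).values[..., -1:]
        masked = attn.masked_fill(attn < thresh, float('-inf'))
        out = out + w * masked.softmax(dim=-1)
    return out
```

Since the per-stage weights sum to one, each row of the blended map remains a valid probability distribution.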

Figure 6. Results of local attribution maps. A more widely distributed red area and a higher DI represent a larger range of pixel utilization.

Figure 7. The heat maps exhibit the areas of interest for different SR networks. The red regions are noticed by CARN [70], EDSR [12], SwinIR [32], and AAN [72], while the blue areas represent the additional LAM interest areas of the proposed LGUN (LGUN has a higher diffusion index).

Figure 8. Qualitative comparison of state-of-the-art methods on the AID dataset.

The influence of the DGSA part. In the deeper layers of our model, we introduce DGSA to effectively model long-range non-local dependencies and achieve a global receptive field of H × W. To reduce redundancy in attention maps and mitigate interference from cluttered backgrounds, we employ the MTS strategy, which selects multiple top-k similar attention matrices and masks out elements with lower weights. In Table 3(c), we display the results of a series of experiments assessing the effectiveness of the DGSA module. These experiments cover scenarios with no sparse attention (w/o top-k), sparse attention (w/ top-k), and sparse attention with the MTS strategy (top-k with MTS). The results indicate that employing sparse attention with the MTS strategy leads to improved performance.

Table 1. Hyper-parameters of the training process.

Table 2. Quantitative comparison with SOTA methods on five popular benchmark datasets. The bold font shows the best value in every group. 'Multi-Adds' is calculated with a 1280 × 720 HR image.
Figure 5. Qualitative comparison of state-of-the-art methods on Urban100 [63]. Our method achieves better performance with fewer artifacts and less blur.

Table 3. Ablation experiments on the micro structure design. The bold font shows the best value in every group. (a) Results for the MLHA and DGSA modules.