Self-FuseNet: Data Free Unsupervised Remote Sensing Image Super-Resolution

Real-world degradations deviate from ideal ones, as most deep learning scenarios synthesize the low-resolution (LR) counterpart images ideally, via the popularly used bicubic interpolation. Moreover, supervised learning approaches rely on many high-resolution (HR) and LR image pairs to reconstruct missing information from their association, developed through long hours of complex deep neural network training. Additionally, a trained model's generalizability across image datasets with different distributions is not guaranteed. To overcome these challenges, we propose our novel Self-FuseNet, particularly for extremely poor-resolution satellite images. The network also exhibits strong generalization performance on additional datasets (both "ideal" and "nonideal" scenarios). It is especially suited to image datasets suffering from the following two significant limitations: 1) unavailability of ground truth HR images; 2) lack of a large unpaired dataset for deep neural network training. The benefit of the proposed model is threefold: 1) it does not require any extensive training data, either paired or unpaired, but only a single LR image without prior knowledge of its distribution; 2) it is a simple and effective model for super-resolving very poor-resolution images, saving computational resources and time; 3) using UNet, the processing of data is accelerated by the network's wide skip connections, allowing image reconstruction with fewer parameters. Rather than using an inverse approach, as is common in most deep learning scenarios, we introduce a forward approach to super-resolve exceptionally LR remote sensing images. Experiments demonstrate its superiority over recently proposed state-of-the-art methods for unsupervised single real-world image blind super-resolution.

BGUSat, the first Israeli research CubeSat, is a cooperative nanosatellite venture between the Ben-Gurion University of the Negev (BGU), Israel Aerospace Industries (IAI), and the Israeli space agency. This 3U CubeSat in low Earth orbit (LEO) collects remote sensing images in the short-wave infrared (SWIR). The satellite is fully functional and has gathered more than 2000 images of Earth from different geographical locations. BGUSat is a state-of-the-art CubeSat that has completed its technological mission and is now working on its scientific mission. Many EO applications benefit from satellite imaging, including: 1) improving land cover classification utilizing low-resolution (LR) satellite images; 2) precision agriculture (monitoring and upkeep); 3) improved multispectral satellite imaging quality, leading to improved detection and classification; 4) detection and tracking of glacier cracks in the mountains; 5) natural catastrophe damage monitoring (earthquakes, tsunamis, etc.); 6) forest/reforestation planning and monitoring; 7) pollution and global warming monitoring; 8) forest fire monitoring; 9) exact and comprehensive maps; 10) urban planning and other related activities. We super-resolve the images in an unsupervised way to cope with all real-time applications exploiting LR data with a spatial resolution of 600 m and a revisit frequency of less than a day, a high temporal resolution that is useful for observing rapid events. The intention is to super-resolve these LR images so we can use them in many applications. Traditional methods for single image super-resolution (SISR), like bicubic interpolation [2] and Lanczos resampling [3], are good in terms of computational speed but lag in terms of accuracy and perceptual quality.

[Fig. 3 caption: Image-specific feature extraction. (a) Externally trained networks on other paired datasets, whose image distribution looks similar to the test data. (b) Internally trained networks that generate fake LR from the test image itself. (c) Self-FuseNet: our proposed idea, which requires neither an external paired dataset nor fake LR images to produce SR; all possible hidden frequencies in an image are extracted and fused in appropriate ratios, followed by a customized UNet for image reconstruction.]

Learning-based methods, in contrast, have become popular because of their speed and tremendous improvement in perceptual quality. These methods primarily benefit from high-resolution (HR) and artificially created LR pairs. If the training dataset is already at the improved spatial resolution, this notion of generating fake LR samples is a useful one. However, it does not perform well on extremely poor-resolution images of the real world, like the images from our BGUSat dataset. Fig. 2 shows one of the BGUSat real-world images (Israel Dead Sea) and its associated patches at different locations. As the image quality is quite low, which is understandable considering the 600 m spatial resolution, it is observed from Fig. 2(c) and (d) that the edges are greatly affected by blurring and blocky patterns. The texture in Fig. 2(a) can also be described as very rough.
Image super-resolution (SR) aims to extract as much visible intelligence and finer detail as possible from a scene of interest by increasing the pixel count per unit area [4]. Furthermore, managing ground truth (GT) HR images for any dataset not included in the training process is extremely difficult and costly. When it comes to real-world image SR, supervised models are less practical and effective.
Feature extraction has been used successfully in past works, translating images into corresponding feature vectors, each corresponding to a different dimension. Similar techniques with different features are used in [5]. Others have used different kinds of filters [6], [7], [8], [9] to extract various features and attempt to reproduce texture, smoothness, and edge details.
We have broadly classified unsupervised SISR techniques into the following three potential categories, as shown in Fig. 3: 1) Features extracted via transfer learning from externally pretrained supervised models on similar datasets whose image distribution matches the test dataset; still, these architectures fail to generalize to datasets with other image distributions. 2) Internally image-specific trained convolutional neural networks (CNNs), like zero-shot SR (ZSSR) [10] and meta transfer learning for zero-shot SR (MZSR) [11], that generate further fake LR images from already degraded images; models based on this idea do not work well for higher scaling factors, say 3, 4, and 8, and the idea is not practical for real-world images that already exist at LR. 3) Our proposed network, Self-FuseNet, which requires neither a GT dataset nor fake LR images for training. The idea is to extract low-frequency (image smoothness details), mid-frequency (textural and roughness details), and high-frequency (edge and sharpness details) information from the same LR image and fuse them in an appropriate ratio such that the perceptual quality improves. The proposed network Self-FuseNet is shown in Fig. 4 and is designed in the following two phases: 1) Nonlearning phase: The first step includes the self-fusion block, which is intended to extract numerous featured images: the low-frequency feature relates to image smoothness, the mid-frequency feature to image textural features, and the high-frequency feature to image edge details. Algorithm 3 fuses these featured images in an optimum ratio based on no-reference image quality as a loss function. 2) Learning phase: A customized UNet architecture for single image reconstruction, whose data processing is accelerated by the network's wide skip connections. This enables image reconstruction with fewer parameters and the retention of image characteristics in forward layers that were left out in previous layers, followed by the upsampling-2-D layer.

A. Challenges in Unsupervised SISR

1) One cannot get precise information regarding data sorting, since practically we do not have HR images to train a network on. 2) Lower accuracy than supervised models, which is expected, as the original GT data are missing. 3) Unsupervised learning is computationally complex. 4) Real-world degradations differ from the ideal degradations assumed in most supervised learning cases, like producing counterpart fake LR images via bicubic interpolation, which works well on these synthetically generated LR images but fails practically on real-world images. 5) Other models that need multiple unpaired image datasets rely on collecting bulk images whose image distribution is similar to the tested real image; this is practically impossible every time and consumes a lot of time in searching and generating artificial data.

B. Our Contribution
Our proposed network, Self-FuseNet, is one of a kind, as it retrieves low-, mid-, and high-frequency features from a single LR image and fuses them with optimum fusion weights. The manifold contributions of our approach are as follows: 1) The proposed network is unsupervised, with only a single image for SR and without kernel and distortion estimation. There is no need for a large amount of unpaired and labeled data: we do not employ any HR images from other satellite datasets, nor does the network need bulk unpaired images from other standard datasets. Hence, we get rid of manually arranging datasets of similar distribution. 2) It does not require any transformation between different color spaces, saving additional complexity in our network, unlike previous fusion-based methods that need to do a color conversion.
3) Existing fusion-based algorithms need image registration for satellite data. Here, there is no need to georegister or georeference the images, as the featured images are extracted from the same single LR image. 4) The proposed self-fusion-based network is computationally resource- and time-efficient, as it consumes only a few minutes to process without requiring a GPU. Other existing fusion networks, like IFCNN [12], take a long training time (44 h) and occupy a large amount of GPU memory (9 GB). Hence, we save computation time and resources too. 5) We customized the existing UNet architecture for single image SR to benefit from its wide skip connections to enhance texture details and image reconstruction. 6) We show that by using traditional image processing tools with the extra effort of deep learning for image reconstruction, one can generate a better SR image from only a single LR image without unnecessary extensive training. The results are perceptually and quantitatively superior to existing purely learning-based unsupervised methods. 7) The network has good generalization ability on other datasets too, such as RGB colored images (both "ideal" and "nonideal" scenarios) and multispectral band images.

The rest of this article is organized as follows. Section II reviews previous work on SISR. Section III presents our proposed network Self-FuseNet and introduces the submodules for feature extraction and image reconstruction. Section IV gives a brief introduction to the no-reference image quality metrics (NR-IQM) used as loss functions. Section V presents the results of the ablation experiments, a runtime comparison with existing state-of-the-art SISR methods, and visual and quantitative comparisons on different datasets (RGB and multispectral band images) with existing unsupervised SISR methods, followed by Section VI, which discusses the results. Finally, Section VII concludes this article.

II. RELATED WORK FOR UNSUPERVISED SISR
There are two types of SR algorithms: 1) learning-based; 2) reconstruction-based. Reconstruction-based SR algorithms deal with the SR problem in an inverse way: a fake LR image is first created by blurring followed by downsampling, mostly bicubic interpolation, as presented in (1):

$$I_{LR} = (I_{HR} \ast k)\downarrow_{s} \tag{1}$$

where $k$ is the standard Gaussian kernel, $I_{HR}$ is the original HR image, $I_{LR}$ is the artificially created LR image, $\ast$ denotes convolution, and $\downarrow_{s}$ denotes downsampling by the desired downscaling factor $s$. To deal with real-world image SR using deep learning, various existing approaches have been proposed; they are broadly classified in Fig. 1.
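For concreteness, a minimal sketch of this synthetic degradation pipeline, assuming OpenCV; the kernel width and σ below are illustrative, not values from this article:

```python
import cv2

def make_fake_lr(i_hr, s=2, ksize=5, sigma=1.0):
    """Classic 'inverse approach' degradation of (1):
    Gaussian blur with kernel k, then bicubic downsampling by factor s."""
    blurred = cv2.GaussianBlur(i_hr, (ksize, ksize), sigma)
    h, w = blurred.shape[:2]
    return cv2.resize(blurred, (w // s, h // s), interpolation=cv2.INTER_CUBIC)
```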

A. Self-Supervised Learning
Self-supervised learning-based methods that rely on heavy training try to learn a distribution of images that matches the distribution of real-world LR images. Moreover, a huge number of unpaired training samples is required, and they should be of at least better resolution than the real LR images' perceptual quality. Methods like CinCGAN [13] and Cycle-CNN [14] on natural images, LGCNet based on a local-global combined relationship [15] on remote sensing images, an enhanced upscaling module for image SR [16], and a bidirectional convolutional LSTM neural network for remote sensing image SR [17] used fake LR images generated in bulk from already existing standard datasets. In [18], an unpaired SR network was proposed using a generative adversarial network (GAN), where a correction network removes noise from the LR images, followed by an SR network for upscaling. In [19], it was shown that self-supervised pretraining on remote sensing images can lead to better outcomes than supervised pretraining on natural scene images for the downstream task of remote sensing image classification. In [20], a self-supervised transfer learning network, SelfS2, is proposed for Sentinel-2 multispectral image SR to generate an entire dataset at 10 m from the 20 m and 60 m spatial resolutions. Such methods perform well if they already have datasets at visibly better resolutions, say 20 and 60 m, because the fake LR images are again generated for self-supervision by an ideal Gaussian blur followed by bicubic degradation. Nevertheless, these requirements are insufficient for real-world image data at extremely poor resolution with real, complex distortions.

B. Few-Shot and Zero-Shot Learning
Few-shot learning is also effective when we have a very limited count of paired image samples, say 10-50 image pairs. In [21], a few pairs of images are used to train iteratively and learn relational classifiers that extract new facts about the new relations. In [22], a mechanism is proposed to apply self-supervised scale prediction that leverages the multiscaling property of an image and estimates image relations in a few-shot setting. In 2018, the first CNN-based unsupervised SISR method, called ZSSR [10], was proposed; it creates fake LR samples from the original real LR image so that it can utilize the paired relation. Based on zero-shot learning, the authors of [23] proposed DualSR for real-world SR, employing an adversarial loss and a GAN. Another work, MIP [24], designed a GAN for image reconstruction. Nevertheless, GAN-based image SR methods introduce artifacts that do not exist in the original image and are not helpful for remote sensing applications. ZSSR is efficient for a scaling factor of 2 but not perceptually effective at higher scaling factors like 3, 4, and 8. Also, processing a single image consumes a lot of training time. To improve on ZSSR, meta zero-shot learning (MZSR) [11] was proposed, which significantly speeds up the image-specific training process. Again, these methods rely on bicubic interpolation for LR images and on at least a few paired datasets for SR, and they are not effective for higher scaling factors.

C. Image-Specific Networks
In [25], an image-specific degradation simulation network was proposed. For training, the network generates an image's depth information, which indicates the natural sizes of local image patches, to extract the unpaired HR/LR patch collection. Another way is estimating image-specific kernels: nonparametric kernel estimation [26], [27], [28], iterative kernel correction [29], which iteratively corrects the image toward an ideal scenario, internal GAN [30], and the correction filter [31], which tries to make the LR image mimic one generated by bicubic interpolation. This is an effective, independent way of extracting particular dataset-based features rather than applying a general kernel to all types of datasets. These approaches consume a lot of computation time and resources. The iterative kernel prediction-based methods aim to estimate image-specific blur and noise.
Then, this kernel is used on real-world images to generate a new training set. Still, iterative kernel estimation-based methods face the following challenges.
1) Time-consuming due to the iterative loop.
2) At least one SR patch or ideal assumption is required.

D. Fusion-Based Methods
In [32], a comprehensive review of multiexposure image fusion techniques is provided; these need multiple images of different exposure areas and enhance them using different fusion rules. In 2017, the first deep learning-based fusion network, called DeepFuse [33], was proposed, particularly for multiexposure images. The model converts input images to the YCbCr color space, followed by a CNN composed of the following three types of layers: 1) feature layers; 2) a fusion layer; 3) reconstruction layers. At the output end, the image is converted back from YCbCr to RGB format. In 2020, U2Fusion [34], an end-to-end unsupervised fusion network, was proposed to generate a fused output image for different fusion-based applications like multimodal, multiexposure (MEF), and multifocus fusion. The network extracts features with a pretrained VGG-16 and fuses the input images with DenseNet. Its major drawback is that it needs good-quality images at the input; by good quality, we mean the image should have at least visibly distinguishable features. In 2020, IFCNN [12] was the first CNN-based neural network to propose a general image fusion architecture. However, its drawbacks are the following: it needs GT images of better resolution than the input, it needs a large number of data samples for training, and it did not use any postprocessing. In 2021, [35] proposed the UMEF network, which can fuse multiple images rather than the two in DeepFuse, using a CNN for feature extraction and fusion. They also employed two loss functions, the multiexposure fusion structural similarity index measure (MEF-SSIMc) and an unreferenced gradient loss, unlike DeepFuse, which uses only the MEF-SSIMc loss. Besides CNNs, a few GAN-based unsupervised MEF methods have been proposed; for example, Chen [36] used an attention mechanism and adversarial training to correct the moving pixels and then fuse two input images. Further, two more GAN-based networks were proposed: MEF-GAN [37] and GANFuse [38]. Except for UMEF, all of the unsupervised MEF networks mentioned above needed color space conversion. In [39], an architecture called PAN-GAN was proposed to pan-sharpen the image. The multispectral dataset used is already at a good resolution (1.8 m and 3.2 m), and GT data of better resolution from WorldView-II and GF-2 (0.5 m and 0.8 m, respectively) were used. In [40], the author used an unsupervised deep learning-based architecture requiring multiple sets of unpaired images in the training process. Another successful use of a fusion network is VIF-NET [41], which enhances the image in an unsupervised way by taking advantage of a visible image for texture details and an infrared image of the same scene for night-time visibility and suppression of highly dynamic regions. All these networks need at least two or more images of different lighting conditions as input sources to the fusion network. Nevertheless, arranging multiple input images of the same scene under different lighting conditions is not always feasible at the input end. Therefore, we propose a fusion-based network, named self-fuse, that relies entirely on a single LR image and extracts multiple featured images from it to best fuse them.

III. PROPOSED METHOD
We present here a forward-approach, fusion-based network called Self-FuseNet, as shown in Fig. 4, to cope with real-world images. The proposed network is designed in the following two phases.
1) The nonlearning phase is composed of the proposed self-fusion subblock in Fig. 5, which is intended to extract numerous featured images: the low-frequency feature relates to image smoothness, the mid-frequency feature to image textural features, and the high-frequency feature to image edge details. Algorithm 3 fuses these featured images in an optimum ratio based on no-reference image quality as a loss function.
2) The learning phase, as shown in Fig. 6, is the customized UNet [42] architecture for single image reconstruction, whose data processing is accelerated by the network's wide skip connections. This enables image reconstruction with fewer parameters and the retention of image characteristics in forward layers that were left out in previous layers.

The Self-FuseNet consists of the following four modules: 1) image enhancement module; 2) feature extraction module; 3) feature fusion module; 4) image reconstruction module. Rather than using the conventional "inverse approach", which constructs a counterfeit LR from a given GT HR and then inverts the process to get back to the original GT, we employ a "forward approach" that improves the image at the original resolution and then super-resolves it. Our proposed network preprocesses the LR image to develop an HR version based on a self-fusion subblock that utilizes feature extraction from the same single LR image, and subsequently uses a custom UNet followed by an upsampling Conv2D layer to output the desired-scale SR image.
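The overall flow of the two phases can be summarized in the following sketch; the five callables are hypothetical stand-ins for the modules detailed in the subsections below:

```python
import numpy as np

def self_fusenet_pipeline(i_lr: np.ndarray,
                          enhance, bilateral, gabor, fuse, reconstruct) -> np.ndarray:
    """Orchestration sketch of Self-FuseNet (names are illustrative)."""
    # Nonlearning phase: extract featured images from the single LR input.
    i_enhance = enhance(i_lr)      # high-frequency features (Algorithm 1)
    i_bilateral = bilateral(i_lr)  # low-frequency features (bilateral filter)
    i_gabor = gabor(i_lr)          # mid-frequency features (Algorithm 2)
    # Fuse them in an optimum ratio driven by NR-IQM losses (Algorithm 3).
    i_fused = fuse(i_bilateral, i_gabor, i_enhance)
    # Learning phase: custom UNet + upsampling Conv2D, trained on i_fused alone.
    return reconstruct(i_fused)
```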

A. Image Enhancement Module
The proposed Algorithm 1 is used to generate an enhanced image I_Enhance. We use the gradient property to determine the change in pixel intensity in both the horizontal and vertical directions, where k_x and k_y are the first-order derivative kernels in the x and y directions, respectively, allowing us to recognize sections in an image that already appear to have edge structure (high-frequency information). The histogram of an image is essential as a visual representation of the pixel intensity distribution and is hence necessary for discarding irrelevant gray-level values that are not important for image reconstruction. To ensure this, we recommend a threshold α computed from image histogram parameters, with only significant edges over the threshold value being accepted. These accepted intensities contribute to the formation of an image featuring edge elements (high-frequency information). To extract more edge information, the method is repeated on the generated image. For edge extraction and improvement, we found that two iterations were sufficient. One can use and extend the steps of our proposed Algorithm 1 according to the accepted visual clarity, without artificially generating over-enhanced images. β and γ are constants updated in an iterative loop based on the popular visual information fidelity (VIF) [43] metric, an image enhancement assessment metric for which a higher score is better. In our case, we repeat the process for two iterations, as demonstrated in Algorithm 1. The idea is to iterate as long as the VIF score improves, comparing VIF(I_LR, I_edge1) of the first iteration with VIF(I_LR, I_edge2) of the second, and stopping once a successive iteration no longer yields a higher VIF score than the previous one.
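Since the article does not spell out every constant of Algorithm 1, the following is only a simplified sketch, assuming Sobel kernels for k_x and k_y, a mean-plus-standard-deviation threshold for α, a fixed stand-in for the edge-emphasis constant, and a standard 3 × 3 sharpening kernel for K_sharp:

```python
import cv2
import numpy as np

def enhance_image(i_lr: np.ndarray, iterations: int = 2) -> np.ndarray:
    """Simplified sketch of Algorithm 1: keep only strong gradients
    (edge structure) and sharpen; two iterations per the article."""
    img = i_lr.astype(np.float32)
    k_sharp = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], np.float32)
    for _ in range(iterations):
        gx = cv2.Sobel(img, cv2.CV_32F, 1, 0, ksize=3)   # k_x
        gy = cv2.Sobel(img, cv2.CV_32F, 0, 1, ksize=3)   # k_y
        grad = np.hypot(gx, gy)
        alpha = grad.mean() + grad.std()                 # histogram-based threshold
        edges = np.where(grad > alpha, grad, 0.0)        # significant edges only
        edges /= edges.max() + 1e-8
        img = np.clip(img + 40.0 * edges, 0, 255)        # edge emphasis (stand-in)
        img = np.clip(cv2.filter2D(img, -1, k_sharp), 0, 255)  # K_sharp
    return img.astype(np.uint8)
```

In the article, the stopping point is instead governed by the VIF score between successive iterations.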

B. Feature Extraction Module
Filtering is one of the most fundamental operations in image processing and computer vision. We used a bilateral filter to recover low-frequency information like image smoothness and a Gabor filter to retrieve mid-band frequency information like texture and contour information from a single LR image. The bilateral filter [44] is a modified version of the traditional Gaussian filter, but it has the unique property of smoothing an image and reducing noise without impairing edge features, which is a crucial requirement in SR tasks. Bilateral filtering is based on the idea that two pixels are close not just in terms of physical proximity but also in terms of photometric range. The spatial distribution of image intensity has nothing to do with range filtering: integrating strengths from the entire image makes sense, since the distribution of image values beyond a range should not affect the overall value at a particular pixel location. Range filtering without domain filtering, on the other hand, merely changes the gray map of an image and is of limited utility. It was shown in [45] how bilateral blur outperforms Gaussian blur in making better LR pairs in CNN-based architectures for SR applications, and this idea was utilized for SRCNN in [46], proving that convergence time can be reduced by achieving similar image quality in fewer iterations. Mathematically, the bilateral filter and its normalization factor are expressed in (2) and (3), respectively. In Gaussian smoothing, we take a weighted average of nearby pixel values, with weights inversely proportional to distance from the neighborhood's center. The bilateral filter adds weight to these spatial weights, such that pixel values close to the center pixel value are weighted more than pixel values further apart. Because of this weighting, the bilateral filter preserves edges (considerable tonal variations) while smoothing out the flatter areas (minor tonal differences).

1) Bilateral Filter: The bilateral filter is expressed as

$$BF[I]_p = \frac{1}{W_p}\sum_{q \in S} G_{s}(\lVert p-q \rVert)\, G_{r}(\lvert I_p - I_q \rvert)\, I_q \tag{2}$$

$$W_p = \sum_{q \in S} G_{s}(\lVert p-q \rVert)\, G_{r}(\lvert I_p - I_q \rvert) \tag{3}$$
where the two weight parameters σ_s and σ_r control the filtering intensity: σ_r increases the extent of the blur, and σ_s smooths out the large features in the image.
Here, G_r is a range Gaussian (with standard deviation σ_r) that minimizes the impact of pixels q in the neighborhood S with an intensity value different from I_p, whereas G_s is a spatial Gaussian (with standard deviation σ_s) that minimizes the influence of pixels far from the targeted pixel p.
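In practice, this filter is a single OpenCV call; d = 5 matches the kernel spatial width used later in Section V-B, while the two σ values below are illustrative:

```python
import cv2

def bilateral_features(i_lr, d=5, sigma_r=75, sigma_s=75):
    """Low-frequency (smoothness) featured image I_Bilateral:
    edge-preserving smoothing per (2)-(3)."""
    return cv2.bilateralFilter(i_lr, d, sigma_r, sigma_s)
```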
2) Gabor Filter: The Gabor filter [47] is a combination of a Gaussian and a sinusoidal function. Gabor filters serve the purpose of extracting texture characteristics (mid-band frequency information) by using different frequencies and orientations. In [48], the authors used it for lung cancer prediction with an extended KNN algorithm. In [49], the authors used it for feature detection in medical images for deep learning-based fusion. In [50], building and vegetation textures are extracted for the classification of airborne images and LIDAR data. Motivated by the application of the Gabor filter in various settings, particularly for textural feature extraction, we use it in the proposed Algorithm 2 to extract texture from each image fragment, yielding Gabor featured images $I_{G_i} \in \{I_{G_1}, I_{G_2}, \ldots, I_{G_n}\}$, where $1 \le i \le n$ and $n$ is the total number of generated images. The important featured images are those with essential, nonredundant, significant observable features that are accepted by the NR-IQM used as a loss function; this accepted set is called I_filtered.
The featured images are fused according to their relevance, which is decided by iterative correction of the loss function between the original LR image and the accepted Gabor filtered images I_filtered. The different textural featured images are generated by varying the parameters of the Gabor kernel in (4):

$$g(x, y; k, \theta, \sigma, \lambda, \gamma, \phi) = \exp\!\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right)\cos\!\left(\frac{2\pi x'}{\lambda} + \phi\right) \tag{4}$$

with $x' = x\cos\theta + y\sin\theta$ and $y' = -x\sin\theta + y\cos\theta$, where x and y are the pixel positions in the horizontal and vertical directions, respectively, k is the kernel size, θ is the angle, σ is the standard deviation of the Gaussian envelope, λ is the wavelength, γ is the gamma (aspect ratio) value, and φ is the phase offset that extracts features from different orientations. The filter creates numerous featured images depending on the combination of parameters. However, we are only interested in nonrepeated, unique, and meaningful features with the most negligible loss compared to the original LR image. To measure the loss, we are the first to use NR-IQM as loss metrics.
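A sketch of the candidate-generation loop, using the parameter ranges reported in the ablation study (Section V-B) and OpenCV's Gabor kernel; the γ step size is illustrative:

```python
import cv2
import numpy as np

def gabor_feature_images(i_lr, k=5, psi=0.0):
    """Generate candidate Gabor featured images I_G1..I_Gn by sweeping
    theta, sigma, lambda, and gamma; NR-IQM screening (Algorithm 2)
    then keeps only the nonredundant ones as I_filtered."""
    featured = []
    for i in range(8):                       # theta = (i/4) * pi
        theta = (i / 4.0) * np.pi
        for sigma in (1, 3, 5, 7, 9):
            # start at pi/4 to avoid a degenerate zero wavelength
            for lam in np.arange(np.pi / 4, np.pi, np.pi / 4):
                for gamma in np.arange(0.05, 0.5, 0.15):
                    kern = cv2.getGaborKernel((k, k), sigma, theta, lam, gamma, psi)
                    featured.append(cv2.filter2D(i_lr, -1, kern))
    return featured
```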

C. Feature Fusion Module
Image fusion aims to incorporate the critical features of numerous input images into a single, complete image. To fuse, features are extracted: a low-frequency detail from the bilateral filter, a mid-frequency detail from the Gabor featured images, and a high-frequency detail from the enhanced image generated by the above-proposed algorithms. A similar concept is employed in [51], where a new transformer-based feature fusion network, TR-MISR, has been proposed that super-resolves images utilizing several sequences.

[Algorithm 1: Algorithm to Get Enhanced Image I_Enhance. Input: low-resolution image I_LR. Output: enhanced high-resolution image I_Enhance. Parameters: k_x, k_y (first-order derivative kernels), α (threshold), β, γ (constants updated from the iterative loop), K_sharp (image sharpening filter).]
We assume there are at least N (N ≥ 2) featured images to fuse wisely via the proposed Algorithm 3. The fused featured images are denoted as I_Bilateral, generated from the bilateral filter; I_Gabor, generated from Algorithm 2; and I_Enhance, generated from Algorithm 1. The featured images extracted from the Gabor filter are selected based on the NR-IQM blind or referenceless image spatial quality evaluator (BRISQUE) [52] and natural image quality evaluator (NIQE) [53] as loss functions. The loss function is computed for each set: first, between the LR image and all filtered sets of Gabor feature images. For both NR-IQM, lower is better. Algorithm 2 iteratively calculates the two losses for both image sets and accepts only images with a lower value for both losses than the input LR image. If, in the whole process, we cannot get a single accepted Gabor featured image, we update the Gabor filter parameters and iterate again. The set of accepted Gabor featured images is denoted as I_filtered. Each extracted featured image has its particular feature property, which should be nonredundant and observable. Utilizing observable details, we show a few Gabor featured images in Fig. 10.

[Algorithm 2: Algorithm to Extract Gabor Featured (Mid-Frequency) Images. Input: I_LR. Output: fused I_Gabor. Parameters: kernel size k, θ, σ, λ, γ, φ. Initialization: kernel size ← 5; phase-offset φ = 0 (0.8 for the hidden image set); for σ in (1, 3, 5, 7, 9) do; for λ in range(0, π, π/4) do; for γ in range(0.05, 0.5) do; {accept only those kernels with desired, nonredundant features}.]

In Fig. 10, all the images have significant details that are nonredundant too. This is verified visually and using the NR-IQM loss in Algorithm 2. In Fig. 10(i), I_G8 is entirely dark; this featured image is utterly undesirable due to its loss of information, and hence we do not accept it into I_filtered.

It is imperative to know the correct proportion of each feature to fuse. To decide this, we propose a fusion rule in Algorithm 3 to decide appropriate fusion weights for corresponding extracted features based on NR-IQM as loss functions. The objective of the proposed algorithm is to select the optimum combination of fusion weights to reconstruct the best fused HR image, both in terms of visual perceptual quality and quantitative results.
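A minimal sketch of such a fusion rule: a grid search over convex weights, scored by a no-reference metric. Here brisque_score is a stand-in for any BRISQUE implementation (the article's actual Algorithm 3 also incorporates NIQE):

```python
import itertools
import numpy as np

def fuse_features(i_bilateral, i_gabor, i_enhance, brisque_score, step=0.1):
    """Pick fusion weights w1 + w2 + w3 = 1 minimizing an NR-IQM loss
    (lower is better). brisque_score: callable image -> float."""
    best, best_loss = None, np.inf
    ws = np.arange(0.0, 1.0 + 1e-9, step)
    for w1, w2 in itertools.product(ws, ws):
        w3 = 1.0 - w1 - w2
        if w3 < -1e-9:
            continue  # keep the weights convex
        fused = np.clip(w1 * i_bilateral + w2 * i_gabor + w3 * i_enhance,
                        0, 255).astype(np.uint8)
        loss = brisque_score(fused)
        if loss < best_loss:
            best, best_loss = fused, loss
    return best
```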

D. Image Reconstruction Module
The second phase, called the learning phase of the network, uses the extensive wide skip connectivity of a customized UNet, which was first used in medical image segmentation [42]; we are the first to use UNet for the single-image reconstruction process without sacrificing texture information. We designed our custom UNet architecture, shown in Fig. 6, to take advantage of the wide skip connections and of training for single image reconstruction. The fused HR image (produced by the self-fusion subblock) is reconstructed and upscaled to produce an SR image via an additional upscaling convolution layer (a separate one for each scaling factor). To achieve improved reconstruction from the HR image to the SR image, the architectural details of our customized UNet are as follows.
1) In the customized UNet, we use a depth of four convolutional blocks for both the encoder and decoder networks. 2) We maintain homogeneity at the encoder by employing seven layers in a consistent pattern for each Conv2D block. 3) Our custom UNet receives the processed HR image at size 256 × 256 from our proposed self-fusion subblock (nonlearning phase). 4) To encode the image at the encoder side, the number of filters at each Conv-Block from first to last is assigned as: K1 = 64, K2 = 128, K3 = 256, and K4 = 512. 5) The image is represented in 1024 high-dimensional feature maps at the bottleneck layer between the encoder and decoder networks. 6) At the decoder, for each DeConv-Block, we maintain uniformity by using seven layers at each block in a regular fashion, like the encoder network. 7) The seven layers at the decoder network, in sequence, are as follows: a) Conv2D-Transpose layer; b) Conv2D layer (Conv1); c) batch-normalization (BN1) layer; d) ReLU activation layer; e) Conv2D layer (Conv2); f) batch-normalization (BN2) layer; g) ReLU activation layer. 8) Again, to decode the image at the decoder side, the number of filters at each DeConv-Block from bottom to top is assigned as: K4 = 512, K3 = 256, K2 = 128, and K1 = 64 at the last Conv-Block, bringing the image back to its original size of 256 × 256. 9) The output of the nonlinear ReLU activation is further input to the upsampling-2-D layer. We employ three upsampling-2-D layers, one for each scaling factor 2, 3, and 4, to output an SR image. The UNet is trained only on the single fused HR image generated from our proposed nonlearning self-fusion subblock, as shown in Fig. 5. For training, we use 1000 epochs, the Adam optimizer, the ReLU activation function, a stride of 2, and mean squared error as the loss function. A compact sketch of this architecture follows.
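A compact Keras sketch of the stated architecture, assuming TensorFlow and a single-band (SWIR) input; padding and kernel sizes are illustrative, and the single-image training loop is omitted:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    """Conv2D-BN-ReLU repeated twice (the recurring block pattern)."""
    for _ in range(2):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    return x

def build_custom_unet(scale=2, size=256):
    inp = layers.Input((size, size, 1))
    skips, x = [], inp
    for f in (64, 128, 256, 512):            # encoder filters K1..K4
        x = conv_block(x, f)
        skips.append(x)                      # wide skip connections
        x = layers.MaxPooling2D(2)(x)        # stride-2 downsampling
    x = conv_block(x, 1024)                  # 1024 bottleneck feature maps
    for f, skip in zip((512, 256, 128, 64), reversed(skips)):
        x = layers.Conv2DTranspose(f, 2, strides=2, padding="same")(x)
        x = layers.Concatenate()([x, skip])
        x = conv_block(x, f)                 # decoder DeConv-Blocks K4..K1
    x = layers.UpSampling2D(scale)(x)        # upsampling-2-D for scale 2/3/4
    out = layers.Conv2D(1, 3, padding="same")(x)
    model = Model(inp, out)
    model.compile(optimizer="adam", loss="mse")  # Adam + MSE, 1000 epochs
    return model
```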

A. Blind or Referenceless Image Spatial Quality Evaluator
The metric is based on hand-crafted features derived from the mean in (6) and variance in (7) statistics of an image, called mean subtracted contrast normalized (MSCN) coefficients. The authors proved that the NR-IQM BRISQUE [52], based on natural scene statistics, is statistically better than the popular full-reference image quality metrics, peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM). BRISQUE's computational complexity is minimal, making it ideal for real-time applications. BRISQUE features can also be utilized to identify distortions. Unlike traditional image quality assessment metrics that rely on calculating distortion-specific properties like ringing, blur, or blockiness, BRISQUE also takes scene information into account to quantify potential losses of "naturalness" in the image. Under a spatial natural scene statistical model, the underlying properties are detected from an empirical distribution of luminances and products of locally normalized luminances. The metric is simple to apply and has a low computational complexity, making it ideal for assessing image quality in real time.
For a test image $I(i,j)$, the MSCN coefficients $\hat{I}(i,j)$ are produced as

$$\hat{I}(i,j) = \frac{I(i,j) - \mu(i,j)}{\sigma(i,j) + C} \tag{5}$$

$$\mu(i,j) = \sum_{k=-K}^{K}\sum_{l=-L}^{L} w_{k,l}\, I(i+k,\, j+l) \tag{6}$$

$$\sigma(i,j) = \sqrt{\sum_{k=-K}^{K}\sum_{l=-L}^{L} w_{k,l}\,\bigl(I(i+k,\, j+l) - \mu(i,j)\bigr)^2} \tag{7}$$

where i and j are the row and column indices, 1 ≤ i ≤ M and 1 ≤ j ≤ N, respectively, and M and N are the image height and width, respectively. C is a constant to prevent instability from zero denominators (e.g., when an image is completely smooth, like sky or sea), w_{k,l} is a circularly symmetric Gaussian weighting function along both 2-D dimensions, and μ(i,j) is the local mean of the test image. K and L set the extent of the weighting window, which is sampled out to a fixed number of standard deviations and rescaled to unit volume.
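The MSCN preprocessing of (5)-(7) can be sketched in a few lines of NumPy/OpenCV; the 7 × 7 window and σ = 7/6 are values commonly used in BRISQUE implementations, stated here as assumptions:

```python
import cv2
import numpy as np

def mscn_coefficients(img, ksize=7, sigma=7/6, C=1.0):
    """Mean subtracted contrast normalized coefficients per (5)-(7):
    local mean (6) and deviation (7) via a circular-symmetric Gaussian."""
    img = img.astype(np.float64)
    mu = cv2.GaussianBlur(img, (ksize, ksize), sigma)                 # (6)
    var = cv2.GaussianBlur(img * img, (ksize, ksize), sigma) - mu * mu
    sigma_map = np.sqrt(np.abs(var))                                  # (7)
    return (img - mu) / (sigma_map + C)                               # (5)
```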

B. Natural Image Quality Evaluator
NIQE [53] is based on statistics of the natural scene in the spatial domain. Unlike "general purpose" no-reference image quality algorithms, which require information regarding expected distortions in the form of training examples or human opinion scores, the authors developed a blind IQA model that uses observable deviations (quality-aware features) derived from the statistical regularities of a natural scene statistics (NSS) [56] model observed in real pictures. The results of their study demonstrated that it outperforms both PSNR and SSIM. For this metric, the lower the score, the better the image quality.

A. Dataset Detail
BGUSat, Israel's first state-of-the-art research nanosatellite, is a collaboration between BGU, IAI, and the Israeli space agency. The CubeSat was successfully launched into LEO at a height of 500 km. The satellite acquires images at 600-m spatial resolution with a revisit frequency of less than a day, a high temporal resolution useful for observing rapid events. The images are a single SWIR band. Also, we do not have any GT HR data. Our idea is to super-resolve these LR images to use them in further real-time applications. The dataset consists of real-world poor-resolution satellite images. We successfully super-resolve them in a completely unsupervised manner, without any GT, using our proposed Self-FuseNet, with the extra effort of deep learning as postprocessing for image reconstruction. We have more than 2000 images from different regions of the world at 600-m spatial resolution. We tested our proposed fusion-based network on the following two BGUSat images in different scenarios: 1) Israel's Dead Sea image; 2) China's border image.

B. Ablation Study
We conduct ablation experiments on the China border image to investigate how the proposed Self-FuseNet and the featured images after every module refine the SR output, as measured by BRISQUE, PIQUE, and NIQE.
1) Low-Frequency Featured Image: When we say a low-frequency feature in an image, we are referring to the image's overall smoothness. Numerous filters can extract the smoothness effect from an image, but the bilateral filter is preferred since it preserves the edge component, which is critical for high-quality visual reconstruction. Additionally, it removes unnecessary image noise, so we get a noise-free smooth image without affecting the edges. We choose a kernel spatial width of k = 5. In Fig. 9, we can observe the following two images: 1) the BGUSat input LR image I_LR; 2) the smooth, noise-free I_Bilateral image.
2) Mid-Frequency Featured Images: A mid-frequency characteristic of an image refers to the overall texture quality and contours. Using Algorithm 2, we generated 384 Gabor featured images by combining the parameters: kernel size k = 5, orientation θ = (i/4)π for i in range(0, 8), standard deviation σ ∈ (1, 3, 5, 7, 9), wavelength λ in range(0, π) with a step size of π/4, and γ in range(0.05, 0.5). One can observe the different textural Gabor featured images in Fig. 10. We show all the accepted Gabor featured images, plus one undesired featured image (I_G8) in Fig. 10(i).
3) High-Frequency Featured Image: When we refer to a high-frequency feature in an image, we emphasize the shape-defining aspects of an image, such as edges and sharpness. The detailed description is given in Section III-A. Compared to the input BGUSat LR image, the improved sharpness of the enhanced image I_Enhance from Algorithm 1 can be observed in Fig. 11.

4) Fused-HR Image:
We compared images before (original input LR) and after self-fusion block (fused HR) in Figs. 12 and 13 for both the Israel dead sea image and the China border image at three different regions in each, respectively. One can observe a noticeable difference between the original BGUSat LR image and the fused HR image from our proposed self-fusion block (nonlearning phase of Self-FuseNet) at three different patches for both tested scenes.

C. Runtime Comparison
We use the source codes of existing state-of-the-art unsupervised methods to evaluate the inference runtime (on our BGUSat dataset) on the same machine, with an Intel(R) Core(TM) i9-10900 CPU @ 2.80 GHz (64-GB RAM) and an NVIDIA GeForce RTX 3090 GPU (24-GB memory). We compared the training runtimes as reported in the corresponding references. Since ZSSR trains an image-specific CNN at test time, its training and inference times are almost identical. Table I shows that our network does not require several days of training on a large dataset; it can be trained on a CPU with 1000 epochs within 40 min, with an inference time of less than 1 s. Hence, our network, Self-FuseNet, is effective in terms of both time and resources. Training and inference times may differ slightly according to the local machine specification and GPU (not necessary in our case). However, when comparing our training time, which needs only 5 min on GPU and 40 min on CPU, to LapSRN [57], which required three days on a GPU, one can see that only LapSRN's inference time is close to ours. Also, our visual and quantitative results are far better than LapSRN's. Compared to EDSR [58], our network outperforms in both training and inference time. All inference times are calculated for a scaling factor of 2 and an image size of 256 × 256. The time complexity increases marginally with higher upsampling factors.

A. The"Ideal" Case
Even though this is not the aim of Self-FuseNet, we verified our proposed network through the well-known Wald's protocol [59] on the following two standard datasets: 1) UC-MERCED [54], an RGB remote sensing dataset; 2) PROBA-V [55], a multispectral dataset in two bands, NIR and RED, each with an HR image and multiple LR images for the same scene. We chose the NIR band images for demonstration purposes. For both datasets, we used a Gaussian blur (kernel size = 5) followed by the "ideal", well-known bicubic downsampling. The visual (subjective) results at a scaling factor of 2 can be seen in Fig. 7 for the UC-MERCED dataset and Fig. 8 for the PROBA-V dataset. A quantitative comparison is also presented in Table IV using the full-reference image quality metrics PSNR [60] and SSIM [61]; for both, a higher score indicates better image quality. A sketch of this evaluation protocol is given below.
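This is a minimal sketch of the evaluation under Wald's protocol, assuming scikit-image for the two metrics and a gray-scale input; super_resolve stands in for any SR method, including Self-FuseNet:

```python
import cv2
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_ideal(i_hr, super_resolve, s=2, ksize=5):
    """Wald's protocol: degrade the GT HR ('ideal' Gaussian blur + bicubic
    downsampling), super-resolve back, and score with PSNR/SSIM."""
    blurred = cv2.GaussianBlur(i_hr, (ksize, ksize), 0)
    h, w = blurred.shape[:2]
    i_lr = cv2.resize(blurred, (w // s, h // s), interpolation=cv2.INTER_CUBIC)
    i_sr = super_resolve(i_lr, s)   # must return an image the size of i_hr
    return (peak_signal_noise_ratio(i_hr, i_sr),
            structural_similarity(i_hr, i_sr))
```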

B. The "Nonideal" Case
Real-world images are not always ideally generated. We designed the proposed Self-FuseNet so that it can also deal with nonideal circumstances arising from either of the following: 1) nonideal downscaling kernels (that deviate from the well-known bicubic kernel); 2) extremely blurry and blocky structured images when zoomed in. In such nonideal cases, our proposed Self-FuseNet is validated by comparing the findings via potential NR-IQM: BRISQUE [52], PIQUE [63], NIQE [53], and DOM [64]. BRISQUE, PIQUE, and NIQE are based on the NSS [56] model, which is designed around the perceptual quality of the human visual system. DOM [64] evaluates image sharpness in both the x and y directions based on local correlation coefficients. For all three of BRISQUE, PIQUE, and NIQE, a lower score indicates better quality. The compared methods are bicubic [2], LapSRN [57], ZSSR [10], EDSR [58], and MIP [24].

In the "ideal" case, the LR images are generated with a bicubic kernel whose GT is available, as can be observed from Table IV. These "ideal" LR images' visual results are shown in Figs. 7 and 8 for the UC-MERCED and PROBA-V datasets, respectively. From Table IV, EDSR and MIP can be observed as the second-best performing networks in most scenarios, and MIP produces better results than ZSSR for all datasets. In such "ideal" scenarios, the proposed Self-FuseNet provides significantly better results than the compared SR methods in most cases (PSNR by 1-2 dB and SSIM by 0.01-0.1).
Ultimately, a single network cannot handle every possible image distortion. Because the following scenarios are uncommon, we list them as limitations: 1) Since we did not include any preprocessing to handle cloud and snow in images, the network also enhances the white texture along with the background image, which is not desired. Hence, the network produces white, over-enhanced textured blocks and does not deliver promising outcomes in regions of cloud and snow. However, one could apply a cloud and snow removal technique followed by our method for efficient results. 2) Texture detail is also known as low-contrast fine detail in images. For example, in gray-scale images, a highly dense textured patch indicates fine features or texture. It is common to find that a few small textural features are lost during the noise reduction procedure in the preprocessing step, especially when the spatial resolution is low (above 100 m). However, by minimizing unwanted noise, our approach produces superior and sharper visual results in most cases. Only in rare scenarios (especially in gray-scale images) might the reconstruction process also reduce crucial texture features, resulting in unpleasing image quality, while enhancing similar neighboring, newly generated textures. This can be seen in Fig. 8(c) for imgset0608 of the PROBA-V dataset, where the edges are solid and well-shaped, yet the texture is highly dense; only some textural details are recovered during the noise reduction procedure because of the significant variability in the contrast of nearby pixels.

Finally, other than the above limitations, which are rare, our model works well in the following scenarios: 1) single-band satellite images (our case of BGUSat data, as shown in Fig. 14); 2) multispectral band images (PROBA-V NIR band results, as shown in Fig. 8); 3) RGB remote sensing images (UC-MERCED data, as shown in Fig. 7); 4) real-world colored image datasets (DRealSR [62], as shown in Fig. 16). Our model has good generalization ability on other image datasets too, beyond our single-band SWIR BGUSat data. Also, Table I shows that our network does not take more computational time and resources than the existing compared methods and is a lightweight network.

VII. CONCLUSION
We proposed a novel state-of-the-art self-fusion-based network, Self-FuseNet, which follows a "forward approach": it first preprocesses the LR image to produce a fused HR image and then an SR image, rather than the traditional "inverse approach" of downscaling the HR image with a bicubic kernel to generate fake LR samples and upscaling them to generate the SR image. Self-FuseNet is a combination of the following two phases: 1) a nonlearning phase; 2) a learning phase. The nonlearning phase is the self-fusion subblock that generates a fused HR image from the input LR image. The learning phase is the customized UNet, followed by an additional upsampling convolutional layer, to generate an SR image from the fused HR image (the output of the self-fusion block). The benefits of the proposed network are as follows.
1) The proposed network is completely unsupervised, with only a single image for SR and without kernel and distortion estimation. There is no need for a large amount of unpaired and labeled data; we also do not employ any HR images from other satellite datasets, hence getting rid of manually arranging datasets of similar image distribution. 2) The model does not use a complex neural network for training and tweaking weights and is hence computationally light, without requiring expensive GPUs for training. 3) The model does not require a GT image to extract features; therefore, there is no requirement for georegistered reference images either. 4) The network has good generalization ability on other datasets too, such as RGB colored images (both "ideal" and "nonideal" scenarios) and multispectral band images.

This is the first time a model has used the BRISQUE and NIQE NR-IQM as loss functions to super-resolve a single LR image in a completely unsupervised manner. The work is an excellent example not just of single image blind SR but also of image fusion. The applications can be extended to super-resolve medical images, natural photographs, and other datasets in addition to remote sensing images. We have demonstrated a significant generic network for employing this self-fusion using filters that suit our requirements; however, one may use the same network with alternative filters or alternative deep neural networks to extract the low-, mid-, and high-frequency features. We have provided both qualitative and quantitative evidence for why we consider a forward approach rather than focusing solely on the inverse approach. The work is novel, as it focuses particularly on real-world single-image SR for extremely poor-quality datasets in a completely unsupervised setting.