Multi-Stage Frequency Attention Network for Progressive Optical Remote Sensing Cloud Removal

Cloud contamination significantly impairs optical remote sensing images (RSIs), reducing their utility for Earth observation. The traditional cloud removal techniques, often reliant on deep learning, generally aim for holistic image reconstruction, which may inadvertently alter the intrinsic qualities of cloud-free areas, leading to image distortions. To address this issue, we propose a multi-stage frequency attention network (MFCRNet), a progressive paradigm for optical RSI cloud removal. MFCRNet hierarchically deploys frequency cloud removal modules (FCRMs) to refine the cloud edges while preserving the original characteristics of the non-cloud regions in the frequency domain. Specifically, the FCRM begins with a frequency attention block (FAB) that transforms the features into the frequency domain, enhancing the differentiation between cloud-covered and cloud-free regions. Moreover, a non-local attention block (NAB) is employed to augment and disseminate contextual information effectively. Furthermore, we introduce a collaborative loss function that amalgamates semantic, boundary, and frequency-domain information. The experimental results on the RICE1, RICE2, and T-Cloud datasets demonstrate that MFCRNet surpasses the contemporary models, achieving superior performance in terms of mean absolute error (MAE), root mean square error (RMSE), peak signal-to-noise ratio (PSNR), and structural similarity index (SSIM), validating its efficacy regarding the cloud removal from optical RSIs.


Introduction
Remote sensing technology serves as an indispensable tool, enabling the continuous and swift acquisition of geometric and physical information pertaining to Earth's surface [1]. Accompanying the rapid advancements in remote sensing technologies, optical RSIs have emerged as the predominant medium for Earth observation [2]. Nonetheless, the presence of clouds, which envelop approximately 55% of Earth's surface, markedly hinders the interpretation and utility of optical RSIs [3]. Consequently, cloud removal from optical RSIs presents a significant challenge [4].
The traditional cloud removal techniques primarily rely on the principle of local correlation among adjacent pixels, predicated on the assumption that cloud-covered areas possess similar characteristics to their adjacent cloud-free regions [5]. As a result, these techniques utilize the data from cloud-free regions to reconstruct or substitute the cloud-covered areas, employing methods such as interpolation [6], filtering [7], and exemplar-based strategies [8], among others. Zhu et al. [9] introduced a novel cloud removal approach utilizing a modified neighborhood similar pixel interpolator (NSPI). Similarly, Siravenha et al. [10] employed a high boost filter alongside homomorphic filtering for scattered cloud elimination, while He et al. [11] proposed an innovative image completion technique through the analysis of similar patch offsets. These methods fundamentally assume a resemblance in the features between cloud-covered and cloud-free areas. Nonetheless, they might introduce discontinuities or artifacts at the boundaries between the newly generated cloud-free areas and the original cloud-covered regions, leading to unnatural effects in the restored images [12]. Moreover, these approaches often require manual parameter adjustments to achieve optimal results, adding complexity and potentially resulting in inconsistent performances [13].
The advent of machine learning techniques, including decision trees (DTs) [14], support vector machines (SVMs) [15], random forests (RFs) [16], and others, has significantly mitigated the constraints of the conventional cloud removal methods. By leveraging extensive datasets comprising both cloudy and cloud-free imagery, machine learning models adeptly discern intricate patterns and distinctions between these regions. Consequently, they possess the capability to distinguish between the cloudy and cloud-free areas in images, facilitating effective cloud restoration. Lee et al. [17] introduced the multicategory support vector machine (MSVM) as a promising, efficient algorithm for cloud removal preprocessing. Hu et al. [18] developed a thin cloud removal algorithm for contaminated RSIs, combining multidirectional dual-tree complex wavelet transform (M-DTCWT) with domain adaptation transfer least square support vector regression (T-LSSVR), ensuring the preservation of the ground object details while eliminating thin clouds. Tahsin et al. [19] devised an innovative optical cloud pixel recovery (OCPR) method using an RF trained on multi-parameter hydrological data to restore the cloudy pixels in Landsat NDVI imagery. However, machine learning approaches typically do not directly derive high-level feature representations from raw data. Instead, they depend on the expertise and prior knowledge of domain experts for feature selection. The effectiveness of image reconstruction is profoundly influenced by the quality and selection of these handcrafted features, introducing a degree of subjectivity and inherent limitations.
The recent integration of deep learning's robust nonlinear modeling capabilities, particularly through convolutional neural networks (CNNs), has revolutionized cloud removal efforts. These networks can autonomously learn feature representations from raw data, eliminating the necessity for manual feature extraction or selection. This end-to-end learning paradigm significantly diminishes the need for manual input in cloud removal processes. Zhang et al. [20] introduced DeepGEE-S2CR, a method combining Google Earth Engine (GEE) data with a multi-level feature-connected CNN, to efficiently clear the clouds in Sentinel-2 imagery using Sentinel-1 synthetic aperture radar imagery as supplementary data. Meanwhile, Ma et al. [21] proposed the innovative cloud-enhancement GAN (Cloud-EGAN) strategy, which incorporates saliency and high-level feature enhancement modules within a cycle-consistent generative adversarial network (CycleGAN) framework. The widespread implementation of CNNs has significantly improved the cloud removal models' ability to discern complex features and variations between the cloud-covered and cloud-free regions in images, leading to the enhanced reconstruction of cloudy imagery.
In traditional CNN architectures, the model's effective receptive field is constrained by the network's depth and the size of the convolutional kernels, limiting the ability to capture comprehensive features and contextual information [22]. This restriction hampers the model's capacity to process global information from images. To overcome this limitation, attention mechanisms [23] have been integrated into cloud removal models to augment their capability to assimilate global information, enabling dynamic focus adjustment across different image regions. Such enhancements allow these models to prioritize the reconstruction of cloudy areas, thereby significantly improving their contribution to image analysis. Xu et al. [24] introduced AMGAN-CR, which effectively leverages attention maps alongside attentive recurrent and residual networks, coupled with a reconstruction network, to tackle the challenges of cloud removal. Wu et al. [25] proposed Cloudformer, a transformer-based model that integrates convolution and self-attention mechanisms with locally enhanced positional encoding (LePE), adeptly managing cloud removal by extracting features across various scales and augmenting the positional encoding capabilities.
Despite the promising results demonstrated by the existing deep learning cloud removal models, they still face significant limitations. Specifically, these models [26,27] primarily rely on a per-pixel approach during the training process, which neglects the global coherence between pixels. This leads to difficulties in seamlessly integrating the reconstructed regions with the surrounding cloud-free areas at the semantic level. To address this challenge, studies such as [28,29] have introduced mask techniques aimed at more accurately distinguishing cloud-covered areas from cloud-free regions, thereby guiding the detailed reconstruction of local features and transitions. However, the accuracy of these masks remains a critical factor limiting their performance. Additionally, the attention mechanisms currently employed in cloud removal tasks mainly focus on channel and spatial attention, largely overlooking the crucial role of the frequency features within the image [30]. Frequency features, as another important dimension of image information, are essential for capturing both the global structure and local details of an image. Therefore, effectively integrating frequency features into cloud removal models to achieve a unified reconstruction of the cloud-covered and cloud-free regions, thereby enhancing the quality and semantic consistency of the reconstructed images, remains a pressing research problem.
To address these problems, we introduce a frequency-domain attention mechanism utilizing the fast Fourier transform (FFT) to enhance the spatial information processing and performance of cloud removal models, given that frequency information is critical for cloud removal. Low frequencies capture the overall content of an image, while high frequencies encode the edge contours and texture details; leveraging high-frequency information is therefore essential. Moreover, precise boundary delineation is vital in cloud removal to accurately identify cloud-covered regions, enabling targeted reconstruction efforts that preserve the integrity of cloud-free areas. We propose a multi-stage reconstruction strategy to refine the boundary features and design a collaborative optimization loss function to concentrate on the boundaries of cloud-covered areas while minimizing unnecessary reconstruction in cloud-free zones. The principal contributions are summarized as follows: (1) We propose a frequency cloud removal module (FCRM) that is adept at recovering the details while preserving the original characteristics of non-cloud regions in the frequency domain. The FCRM utilizes frequency-domain attention to focus on the differences in the frequency-domain information between cloudy and cloud-free images to refine the boundary information of the image. Additionally, it introduces the non-local attention block to capture the local and non-local relationships and enhance the contextual connections through global dependency relationships.
(2) We introduce a collaborative optimization loss function, consisting of Charbonnier loss for global robustness, edge loss for edge-preserving precision, and FFT loss for frequency-aware adaptability, which penalizes boundary shifts while ensuring subject consistency and retaining intricate image details and textures.
(3) The multi-stage frequency attention network (MFCRNet) is structured around an encoder-decoder architecture, specifically designed for reconstructing areas obscured by clouds. Utilizing FCRM modules in the preceding N − 1 layers enables meticulous cloud removal from input images. To minimize the information loss from up-sampling operations, a variant ResNet is directly applied to the input image in the N-th layer.
(4) A series of experiments are conducted on the RICE1, RICE2 [31], and T-Cloud [32] datasets, demonstrating the feasibility and superiority of the proposed method.It exhibits superior performance in both the quantitative and qualitative assessments compared to the other cloud removal methods.

Cloud Removal
This section introduces several techniques for removing clouds, which can be broadly classified into single-stage and multi-stage approaches based on their architectural designs.

Single-Stage Approach
Predominantly, cloud removal techniques employ a single-stage design, where the models directly learn the mapping from raw input data to the final output. For example, Bermudez et al. [33] used cloud-free optical image regions and the corresponding synthetic aperture radar (SAR) data to train conditional generative adversarial networks (cGANs) to clear clouds relying solely on SAR data. Meraner et al. [34] developed a deep residual neural network tailored for cloud removal tasks in SAR-optical fusion. Zhou et al. [35] created a cloud removal framework based on a generative adversarial network (GAN), which integrates coherent semantics and local adaptive reconstruction considerations. Feng et al. [36] utilized self-attention to capture global dependencies, proposing a method that incorporates global-local fusion for cloud removal. Nonetheless, single-stage cloud removal approaches often struggle to adequately extract and leverage the information from the input images, which may result in suboptimal outcomes, especially in complex cloud coverage scenarios. This limitation underscores the need for exploring alternative, more sophisticated methods that offer nuanced and adaptable solutions for cloud removal.

Multi-Stage Approach
Contrary to the single-stage approach, the multi-stage approach breaks down the cloud removal task into several phases, offering more precise control over the image cloud removal process. Zheng et al. [4] developed a two-stage methodology for the complex task of single-image cloud removal, utilizing a U-Net architecture for the thin cloud elimination in the initial stage and a GAN for the dense cloud removal in the later stage, acknowledging the distinct strategies required for different types of clouds. Jiang et al. [37] introduced a sophisticated network capable of dividing the cloud removal process into coarse and refined phases, thereby enabling a more thorough and detailed approach to the task. Darbaghshahi et al. [38] proposed a dual-stage GAN for converting SAR-to-optical images and for subsequent cloud removal. Similarly, Tao et al. [39] presented a two-stage strategy for sequential cloud-contaminated area reconstruction, focusing initially on global structure recovery and subsequently on enhancing the content details for a more nuanced restoration.

Attention Mechanisms
In traditional CNN models, each neuron is connected only to its immediate neighbors, restricting the model's capacity to grasp long-range dependencies [40]. To address this limitation, the attention mechanism, inspired by human selective attention, has been introduced. It allows the model to dynamically compute a weight distribution at each processing step based on contextual information, thereby focusing selectively on the most pertinent segments of the input relevant to the task at hand [41]. This enhancement significantly improves the model's ability to perceive long-range dependencies and has gained widespread application across the machine learning and artificial intelligence domains [42].
In the context of cloud removal, the adoption of attention mechanisms is similarly extensive. Li et al. [43] proposed a hierarchical spectral and structure-preserving fusion network (HS2P) utilizing a hierarchical fusion of optical RSIs with SAR data. This approach includes a channel attention mechanism and a collaborative optimization loss function. Jin et al. [44] introduced HyA-GAN, a novel deep learning model for cloud removal in RSIs that integrates channel and spatial attention mechanisms into a generative adversarial network. Wang et al. [45] developed a method that leverages SAR images as supplementary data, integrating spatial and channel attention mechanisms alongside gated convolutional layers to enhance the cloud removal from optical images. In our proposed model, frequency-domain attention is employed to discern the edge details in missing regions, proving advantageous for the cloud removal process.

Learning in Frequency Domain
The frequency domain is utilized to analyze signals or functions based on their frequency components. Recently, deep learning has been introduced to efficiently process various tasks by utilizing frequency-domain information [46]. Rao et al. [47] proposed the global filter network, a technique designed to learn the long-term spatial dependencies within the frequency domain, with a specific focus on improving image classification tasks. Yang et al. [48] introduced an unsupervised domain adaptation strategy that seeks to reduce the disparity between source and target distributions by exchanging low-frequency spectra in semantic segmentation tasks. Zhong et al. [49] embedded clique structures in super-resolution processes, utilizing the inverse discrete wavelet transform (IDWT) for resizing feature maps. Chen et al. [50] developed the frequency spectrum modulation tensor complement method and used the Fourier transform in the time dimension to execute the low-rank complement for each frequency component, especially in the context of cloud removal. Building on the success of these frequency-based approaches, our work employs the fast Fourier transform as an effective tool for modeling the frequency information in cloud removal tasks. This enables comprehensive learning of the edge details within cloud-covered regions, facilitating a more focused and precise restoration strategy.

Materials and Methods
This section introduces the proposed MFCRNet method for cloud removal in optical RSIs. Section 3.1 provides an overview of the overall architecture. Subsequently, Sections 3.2-3.4 provide detailed descriptions of each module within the architecture. Finally, Section 3.5 describes the loss function employed by the model.

Overview
In Figure 1, a comprehensive approach is illustrated for maximizing the utilization of contextual information in cloud removal tasks through the construction of a deep architectural framework in MFCRNet. The shallow feature extraction block (SFEB) in each stage of MFCRNet encompasses a convolutional layer combined with a channel attention mechanism, which plays a pivotal role in extracting fundamental features. The frequency attention block (FAB) and non-local attention block (NAB) at the preceding N − 1 stages operate sequentially in both frequency and spatial domains, progressively refining the cloud removal process. In the N-th stage, a variant ResNet is designed to generate pixel-wise accurate estimates while preserving the fine details. Furthermore, to enhance the recovery of spectral and structural features from global information, a collaborative optimization loss function is devised for training purposes.

Multi-Stage Progressive Architecture
As depicted in Figure 1, following N stages of progressive cloud removal, we derive the ultimate output of MFCRNet. To provide a clearer method description, we designate the initial N − 1 stages as the frequency-domain-based cloud removal modules (FCRMs), with the final stage termed the variant ResNet cloud removal module (VCRM). The primary information flow can be succinctly expressed by the following equations:

$$\mathrm{FCRM}_k^{out} = \mathrm{FCRM}_k\big(\mathrm{FCRM}_{k-1}^{out}\big), \quad k = 1, 2, \ldots, N-1,$$

$$F_{out} = \mathrm{VCRM}\big(\mathrm{FCRM}_{N-1}^{out}\big),$$

where I represents the input cloudy image (taken as $\mathrm{FCRM}_0^{out}$), $F_{out}$ represents the final restored cloudless image, VCRM denotes the operation of the final stage, $\mathrm{FCRM}_k$ denotes the operation of the k-th stage, $\mathrm{FCRM}_k^{out}$ represents the intermediate result output after the k-th FCRM module operation, and N is the total number of stages in MFCRNet. The result of each stage is progressively passed on to the next stage until reaching the N-th stage. The final output is the result of gradual cloud removal at all stages, with high-resolution features retained while mitigating the effect of clouds on the image. Inspired by previous work [51], we used a long skip connection to efficiently capture more information.
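This stage-wise information flow can be sketched as a simple composition (a toy Python sketch: the stage functions below are placeholder stand-ins for the learned FCRM and VCRM networks, and the long skip connection is omitted):

```python
import numpy as np

def run_mfcrnet(image, fcrm_stages, vcrm):
    """Compose N-1 FCRM stages followed by the final VCRM stage.

    fcrm_stages: list of callables, one per FCRM (stages 1..N-1).
    vcrm: callable for the final variant-ResNet stage.
    """
    features = image
    for fcrm_k in fcrm_stages:       # progressive refinement
        features = fcrm_k(features)  # FCRM_k^out feeds stage k+1
    return vcrm(features)            # F_out, the restored image

# Toy stand-ins: each "stage" nudges pixel values toward a clean target.
target = np.full((4, 4), 0.5)
stage = lambda x: x + 0.25 * (target - x)
cloudy = np.ones((4, 4))
restored = run_mfcrnet(cloudy, [stage, stage], stage)
```

Each pass moves the intermediate result closer to the cloud-free target, mirroring how each real stage removes part of the cloud contamination before handing its output to the next stage.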

Frequency Attention Block
As illustrated in Figure 2a, the structure of the FAB follows an encoder-decoder architecture with the following components. First, to fully extract both high- and low-frequency information from the input images and simultaneously capture long-term and short-term interactions, we introduce the residual FFT-Conv block in Figure 2b. In addition to a normal spatial residual stream comprising two 3 × 3 convolutional layers followed by ReLU activation, another channel-wise FFT stream is integrated to capture the global context in the frequency domain. Initially, the transformation from the spatial domain to the frequency domain begins with the computation of the 2D real FFT of the input features Z, which decomposes the input features into their constituent frequency components. Upon performing the FFT operation, the resulting frequency-domain representation comprises both real and imaginary components. To preserve all the information within these components, they are concatenated along the channel dimension. Subsequently, the concatenated frequency-domain tensor undergoes a series of transformations (a 3 × 3 convolutional layer, ReLU activation, and a 3 × 3 convolutional layer) to refine its feature representation. Finally, the inverse 2D real FFT is employed to convert the refined frequency-domain representation back into the spatial domain, which reconstructs the spatial structure of the image, incorporating the refined frequency-domain information to produce a final feature map. To strike a balance between efficiency and effectiveness in cloud removal, each encoder block and decoder block comprises three residual FFT-Conv blocks. Second, inspired by FCANet [52], which cleverly combines the channel attention mechanism with the discrete cosine transform (DCT) to generate controllable frequency-domain components, the feature maps at the U-Net skip connections are processed with the FCA. In FCA, the input feature X ∈ R C×H×W is divided into n partitions along the channel dimension. For
each partition, a 2D discrete cosine transform (DCT) is applied:

$$\mathrm{Freq}^i = \mathrm{2DDCT}^{u_i, v_i}\big(X^i\big) = \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} X^i_{:,h,w} \cos\!\left(\frac{\pi u_i}{H}\Big(h + \frac{1}{2}\Big)\right) \cos\!\left(\frac{\pi v_i}{W}\Big(w + \frac{1}{2}\Big)\right),$$

where $u_i \in \{0, 1, \ldots, H-1\}$ and $v_i \in \{0, 1, \ldots, W-1\}$ are the frequency component 2D indices corresponding to $X^i$. The whole multi-spectral channel attention framework can be written as

$$\mathrm{FCA}(X) = \mathrm{sigmoid}\Big(\mathrm{fc}\big(\mathrm{cat}\big([\mathrm{Freq}^0, \mathrm{Freq}^1, \ldots, \mathrm{Freq}^{n-1}]\big)\big)\Big) \cdot X.$$

By using FCA as the skip connection in U-Net, the frequency-based representations obtained provide complementary information to the spatial representations, giving the model access to a richer and more diverse set of features and potentially enhancing its ability to capture complex patterns and structures in the input data.
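The channel-wise FFT stream of the residual FFT-Conv block can be illustrated in a few lines (a minimal numpy sketch; the per-frequency scale and shift below are toy stand-ins for the Conv-ReLU-Conv stack applied to the concatenated real and imaginary channels):

```python
import numpy as np

def fft_stream(z, weight=1.0, bias=0.0):
    """Frequency branch: 2D real FFT -> transform real/imag parts -> inverse FFT.

    z: (H, W) feature map. weight/bias: toy per-frequency transform standing
    in for the learned convolutional refinement of the frequency tensor.
    """
    h, w = z.shape
    freq = np.fft.rfft2(z)                             # 2D real FFT
    real, imag = freq.real, freq.imag                  # split components
    real = weight * real + bias                        # "refine" real channel
    imag = weight * imag + bias                        # "refine" imag channel
    refined = real + 1j * imag                         # recombine components
    return np.fft.irfft2(refined, s=(h, w))            # back to spatial domain

z = np.arange(16, dtype=float).reshape(4, 4)
out = fft_stream(z)  # identity transform: the round-trip recovers z
```

With the identity transform, the rfft2/irfft2 round-trip reconstructs the input exactly, confirming that no information is lost when moving between the spatial and frequency domains.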

Non-Local Attention Block
As depicted in Figure 1, the final step of the FCRM involves the utilization of the NAB to further refine the restoration results. The details of the NAB are demonstrated in Figure 3. The NAB operates by taking the input cloud image I and the feature map FABout generated by the FAB as inputs. It computes attention maps on FABout by utilizing a non-local block (NLB) [53] that considers both local and non-local relationships. These attention maps are then used to enhance the features of the input image I. By computing pairwise similarities between feature vectors at all spatial positions, the NAB identifies relevant contextual information from distant regions in the image, enhancing the discriminative power of the maps. The attention-enhanced features generated by the NAB are then forwarded to the next stage for additional processing.
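The pairwise-similarity computation at the heart of the NLB can be sketched as follows (a minimal embedded-Gaussian non-local operation over flattened spatial positions; the learned 1 × 1 projections and residual connection of the full NLB are omitted for brevity):

```python
import numpy as np

def non_local(features):
    """features: (N, C) array of feature vectors at N spatial positions.

    Each output position is a softmax-weighted sum over ALL positions,
    so distant but similar regions contribute contextual information.
    """
    sim = features @ features.T               # pairwise similarities (N, N)
    sim -= sim.max(axis=1, keepdims=True)     # numerical stability
    attn = np.exp(sim)
    attn /= attn.sum(axis=1, keepdims=True)   # row-wise softmax
    return attn @ features                    # aggregate global context

x = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
y = non_local(x)
```

Positions 0 and 2 carry identical features, so they attend to the image identically and receive identical context-enhanced outputs, regardless of how far apart they sit spatially.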

Collaborative Optimization Loss
In the current cloud removal tasks, the predominant utilization of the L2 loss function for content reconstruction often overlooks the importance of structural and global information. To address this limitation, we have devised a collaborative optimization loss function for training, which encompasses Charbonnier loss $L_c$ for global robustness, edge loss $L_e$ for edge-preserving precision, and FFT loss $L_f$ for frequency-aware adaptability. The collaborative optimization loss function is defined as

$$L = L_c + \lambda_1 L_e + \lambda_2 L_f,$$

where $\lambda_1$ and $\lambda_2$ are the weights of the loss terms, and we empirically set the values of these to 0.05 and 0.01, respectively. By penalizing the discrepancy between predicted and ground truth pixel values using a robust L1 norm over global information, the Charbonnier loss $L_c$ promotes the generation of restoration results that are less sensitive to outliers and exhibit greater resilience to noise. It is defined as

$$L_c = \sqrt{\lVert O - Y \rVert^2 + \varepsilon^2},$$

where O is the final output of MFCRNet, Y represents the ground-truth image, and $\varepsilon$ is set to $10^{-3}$ empirically.
In order to preserve edge details and structural integrity within the restored images, ensuring that boundaries between cloud and non-cloud regions remain sharp and well-defined, the edge loss $L_e$ is designed to penalize deviations in edge locations and gradients between the predicted and ground truth images:

$$L_e = \sqrt{\lVert \triangle(O) - \triangle(Y) \rVert^2 + \varepsilon^2},$$

where △ denotes the Laplacian operator.
To encourage the model to faithfully reproduce both low- and high-frequency components, preserving the overall structure and texture of the scene, the FFT loss $L_f$ leverages the frequency-domain representation of images to promote the restoration of global information and fine-grained structures, leading to more coherent and visually consistent outcomes. It is defined as

$$L_f = \lVert \mathrm{FT}(O) - \mathrm{FT}(Y) \rVert_1,$$

where FT represents the FFT operation. By combining Charbonnier loss $L_c$, edge loss $L_e$, and FFT loss $L_f$ into a unified collaborative optimization framework, our approach effectively addresses the shortcomings of traditional L2 loss-based methods by incorporating structural and global information into the training process. The collaborative optimization function ensures that the model learns to balance content fidelity, structural integrity, and global coherence, resulting in more accurate and visually pleasing cloud removal results.
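The three loss terms can be sketched numerically (a numpy sketch with mean-reduced variants of each term; a 3 × 3 cross-shaped Laplacian kernel with zero padding is assumed for △, and the weights 0.05 and 0.01 follow the text):

```python
import numpy as np

EPS = 1e-3  # Charbonnier smoothing constant, per the text

def charbonnier(o, y):
    """Mean-reduced Charbonnier penalty: smooth, outlier-robust L1-like loss."""
    return np.sqrt(np.mean((o - y) ** 2) + EPS ** 2)

def laplacian(img):
    """3x3 Laplacian via zero-padded 4-neighbor differences."""
    p = np.pad(img, 1)
    return p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:] - 4 * img

def edge_loss(o, y):
    """Penalize deviations between the Laplacians of prediction and target."""
    return charbonnier(laplacian(o), laplacian(y))

def fft_loss(o, y):
    """L1 distance between the frequency-domain representations."""
    return np.mean(np.abs(np.fft.fft2(o) - np.fft.fft2(y)))

def collaborative_loss(o, y, lam1=0.05, lam2=0.01):
    return charbonnier(o, y) + lam1 * edge_loss(o, y) + lam2 * fft_loss(o, y)

o = y = np.ones((8, 8))
zero_case = collaborative_loss(o, y)  # identical images: only the EPS terms remain
```

For identical images the loss reduces to the two ε floors (1.05 × 10⁻³), and any pixel-wise discrepancy raises it, so the minimum sits at the ground truth as intended.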

Experiments
In this section, the dataset, evaluation metrics, and experimental setup are presented, followed by the experimental results and the ablation study.

Dataset
To thoroughly validate our proposed method's effectiveness, we conducted experiments using the publicly available RICE and T-Cloud optical remote sensing datasets.

RICE Dataset
The RICE dataset has two subsets, namely RICE1 and RICE2, specifically designed for cloud removal tasks. The RICE1 dataset, sourced from Google Earth, includes 500 pairs of images: cloudy images and corresponding cloud-free images. The RICE2 dataset is derived from Landsat 8 OLI/TIRS and contains 736 sample pairs, each with a cloudy image, a cloud-free image, and the corresponding cloud mask. The time difference between the images with clouds and their cloud-free counterparts does not exceed fifteen days. The size of the images in each dataset is 512 × 512. For the RICE1 dataset, we partitioned 80% of the images (400) for training purposes, while the remaining 20% served as the test set. For the RICE2 dataset, we selected 589 images for training and 147 images for testing.

T-Cloud Dataset
The T-Cloud dataset is a large-scale dataset of 2939 pairs of real remote sensing images captured by the Landsat 8 satellite. Each pair contrasts a thin-cloud-covered scene with its clear counterpart 16 days later, featuring diverse ground scenes, complex texture details, and challenging non-uniform cloud distributions. In our experiments, we follow an 8:2 ratio for the division, where the training set contains 2351 pairs of images, the test set contains 588 pairs of images, and the size of each image is 256 × 256.

Evaluation Metrics
To quantify the quality of the final restored images, we employed commonly used evaluation metrics: peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), mean absolute error (MAE), and root mean square error (RMSE). PSNR is a widely accepted metric for assessing the fidelity of the restored image against the ground truth image. SSIM measures the similarity between the restored and ground truth images concerning structure, luminance, and contrast. MAE evaluates the average magnitude of errors between the restored and ground truth images, providing a straightforward measure of the overall accuracy of the restoration process. RMSE calculates the square root of the mean squared differences between corresponding pixels of the restored and ground truth images, penalizing larger errors more heavily than MAE. It is worth noting that higher image quality is indicated by larger PSNR and SSIM values, as well as smaller MAE and RMSE values. The PSNR, SSIM, MAE, and RMSE are defined as follows:

$$\mathrm{MAE} = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \lvert x(i,j) - y(i,j) \rvert,$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \big(x(i,j) - y(i,j)\big)^2},$$

$$\mathrm{PSNR} = 20 \log_{10}\!\left(\frac{\mathrm{MAX}}{\mathrm{RMSE}}\right),$$

$$\mathrm{SSIM} = \frac{(2\mu_x \mu_y + \theta_1)(2\sigma_{xy} + \theta_2)}{(\mu_x^2 + \mu_y^2 + \theta_1)(\sigma_x^2 + \sigma_y^2 + \theta_2)},$$

where x and y represent the two images being evaluated, and H and W, respectively, represent the height and width of the images. x(i, j) and y(i, j) indicate the pixel values of the two images at pixel position (i, j), and MAX is the maximum possible pixel value. $\mu_x$ and $\mu_y$ are the means of images x and y; $\sigma_x$ and $\sigma_y$ are the standard deviations of x and y, and $\sigma_{xy}$ is the covariance of x and y. $\theta_1$ and $\theta_2$ are constants that stabilize the division.
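The pixel-wise metrics can be implemented directly (a numpy sketch assuming images normalized to [0, 1], so MAX = 1 in the PSNR formula; SSIM is omitted since practical implementations compute it over local windows):

```python
import numpy as np

def mae(x, y):
    """Mean absolute error between two images."""
    return np.mean(np.abs(x - y))

def rmse(x, y):
    """Root mean square error between two images."""
    return np.sqrt(np.mean((x - y) ** 2))

def psnr(x, y, max_val=1.0):
    """Peak signal-to-noise ratio in dB, assuming [0, max_val] pixel range."""
    err = rmse(x, y)
    return float("inf") if err == 0 else 20 * np.log10(max_val / err)

x = np.zeros((4, 4))
y = np.full((4, 4), 0.1)           # uniform 0.1 error everywhere
scores = (mae(x, y), rmse(x, y), psnr(x, y))
```

For a uniform error of 0.1 on a [0, 1] scale, MAE and RMSE are both 0.1 and PSNR is 20 dB, illustrating how the three measures move together for constant errors but diverge when large localized errors dominate.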

Experimental Setup
We implemented the proposed method with the PyTorch framework on an NVIDIA A40 GPU. The Adam optimizer was employed to update the learnable parameters, with the exponential decay rates β1 and β2 set to 0.9 and 0.999, respectively. The eps parameter was set to 1 × 10−8. The initial learning rate of the Adam optimizer was set to 2 × 10−4, and a cosine annealing strategy was then used to dynamically update the learning rate. The training process ran for 250 epochs with a batch size of 4.
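The cosine annealing schedule has a simple closed form (a sketch of the standard schedule under the assumption of a minimum learning rate of 0 and decay over the full 250 epochs; in practice PyTorch's CosineAnnealingLR scheduler implements this):

```python
import math

def cosine_annealing_lr(epoch, total_epochs=250, lr_max=2e-4, lr_min=0.0):
    """Standard cosine annealing: lr_max at epoch 0, decaying to lr_min."""
    return lr_min + 0.5 * (lr_max - lr_min) * (
        1 + math.cos(math.pi * epoch / total_epochs)
    )

start = cosine_annealing_lr(0)     # full initial learning rate
middle = cosine_annealing_lr(125)  # halfway point of the decay
end = cosine_annealing_lr(250)     # annealed to the minimum
```

The schedule keeps the learning rate near 2 × 10−4 early on for fast progress and smoothly anneals it toward zero, which tends to stabilize the final epochs of training.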

Comparison with Other Methods
In this section, we compare the proposed MFCRNet with six existing methods: Dark Channel Prior (DCP) [54], Pix2Pix [55], RCAN [56], SpAGAN [57], CVAE [32], and CMNet [58]. Among them, the DCP algorithm utilizes the characteristics of the dark channel in images and the physical model of fog to remove haze from images, representing a traditional approach based on a handcrafted prior. Both the Pix2Pix and SpAGAN methods are based on GAN models. Pix2Pix employs a conditional generative adversarial network to learn how to transform cloudy images into clear ones. SpAGAN integrates spatial attention mechanisms with generative adversarial networks to remove clouds from high-resolution satellite images. The RCAN method employs channel attention mechanisms and residual learning for cloud removal. The CVAE method generates multiple plausible cloud-free images using conditional variational autoencoders and addresses the cloud removal problem by analyzing uncertainty across multiple predictions. CMNet uses a cascaded memory network that combines local spatial detail extraction and global detail restoration.

Results on RICE1 Dataset
Table 1 presents the quantitative experimental results of various methods on the RICE1 dataset, with the best results highlighted in bold. It is evident from the observations that each method's performance across the four evaluation metrics is distinct. Notably, our proposed MFCRNet method achieves the best outcomes across all the evaluation metrics. Specifically, our method achieves MAE/RMSE/PSNR/SSIM values of 0.0140/0.0167/37.0148/0.9763. Compared to the other methods, the improvements in MAE/RMSE/PSNR/SSIM are at least 0.0073/0.0005/1.9675/0.0131, underscoring the exceptional performance of our MFCRNet approach in cloud removal tasks. Different methods' cloud removal results on the RICE1 dataset are presented in Figure 4. In order to enhance the visual recognition and clarity, each selected area is clearly labeled by an orange box, and these areas are individually enlarged and subsequently arranged directly below the respective original image. We showcase the restoration outcomes for both greenery and mountainous regions. Our proposed approach not only performs excellently in terms of the local details and overall content preservation but also exhibits remarkable color fidelity. In the second and sixth rows of Figure 4, the results generated by the DCP (b) show severe image distortion, failing to accurately reproduce the color information of the ground scenes. This could be attributed to DCP's primary focus on the dark channels containing high-brightness elements like sky and clouds, potentially leading to color distortion when removing these regions. Both Pix2Pix (c) and CVAE (f) yield subpar cloud removal results, exhibiting blurry effects. The spatial-attention-based SpAGAN (d) faces similar color distortion issues as DCP, possibly because it solely emphasizes local information while overlooking global contextual cues. On the other hand, RCAN (e) effectively alleviates the color distortion problem observed in SpAGAN by leveraging residual learning. However,
despite achieving similar color tones as the reference cloud-free images, RCAN still exhibits some blurriness in certain edge regions.Although CMNet (g) demonstrates significant results in cloud removal, as shown in the second row, our method exhibits superior performance in restoring the clarity of the ground object contours, providing more accurate contour details.This is attributed to our utilization of the FAB for extracting detailed information such as image edges and the incorporation of NAB to enable the model to fully consider global dependencies, facilitating a comprehensive understanding of the image characteristics throughout the learning process.
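For reference, the four evaluation metrics reported above can be computed as follows. This is a minimal numpy sketch assuming images normalized to [0, 1]; the SSIM here is a simplified single-window version, whereas reported scores conventionally use the locally windowed SSIM:

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between two images in [0, 1]."""
    return np.mean(np.abs(pred - gt))

def rmse(pred, gt):
    """Root mean square error."""
    return np.sqrt(np.mean((pred - gt) ** 2))

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB."""
    mse = np.mean((pred - gt) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def ssim(pred, gt, c1=0.01 ** 2, c2=0.03 ** 2):
    """Global (single-window) SSIM; standard practice averages over local windows."""
    mu_x, mu_y = pred.mean(), gt.mean()
    var_x, var_y = pred.var(), gt.var()
    cov = ((pred - mu_x) * (gt - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```

Lower MAE/RMSE and higher PSNR/SSIM indicate a reconstruction closer to the cloud-free reference, which is the direction of improvement reported in Table 1.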

Results on RICE2 Dataset
As shown in Table 2, the quantitative results of the various image restoration methods on the RICE2 dataset exhibit notable discrepancies across the MAE, RMSE, PSNR, and SSIM metrics. Consistent with the findings on the RICE1 dataset, our proposed method maintains excellent performance, underscoring the substantial advantages of MFCRNet on the RICE2 dataset. Figure 5 illustrates the cloud removal results of the different methods on the RICE2 dataset. Compared to RICE1, the cloud images in RICE2 are denser and thicker, which leads to a significant loss of detailed information. Nevertheless, our method still preserves the details and color information of the images well. In the regions marked in the third and eighth rows of Figure 5, it is evident that the DCP (b) fails to effectively reconstruct the cloudy areas. This is because the thick cloud regions are much brighter than the underlying terrain, causing the pixel values in the dark channel to saturate and hindering accurate cloud removal. Compared with traditional methods such as DCP, the deep-learning-based Pix2Pix (c) improves cloud removal; however, it still struggles to remove some clouds when the coverage is extensive. Although SpAGAN (d) achieves satisfactory cloud removal, there are noticeable differences in color distribution between its reconstructions and the ground truth. RCAN (e) fails to accurately reconstruct edge features, leaving the outlines of some terrain features blurred. While CVAE (f) delineates terrain outlines more accurately, it tends to produce blurry results. The comparison in the sixth row makes clear that, when the cloud thickness almost completely obscures the ground objects, CMNet's recovery is suboptimal compared to our method, which handles these extreme cloud cover situations better. Our method fully exploits contextual and boundary information, yielding images with more restored details, fewer artifacts, and more realistic colors.

Results on T-Cloud Dataset
Table 3 provides a comprehensive comparison of MFCRNet with a variety of existing methods on the T-Cloud dataset, demonstrating through quantitative evaluation that MFCRNet achieves the top results on all four key metrics. Figure 6 further visualizes the qualitative results of MFCRNet on the T-Cloud dataset alongside the other methods. Echoing its performance on the RICE1 and RICE2 datasets, MFCRNet again proves its capability: it accurately recovers cloud-obscured regions while preserving and enhancing the detailed information of the image during recovery, ensuring the overall quality of the image and the integrity of its information.

Effects of Different Stage Numbers
In the MFCRNet network, we further determined the number of stages N. Table 4 shows the scores for the different evaluation metrics. For the RICE1 dataset, the network's performance improves as the number of stages N increases up to N = 6. Although the PSNR at N = 6 is 1.26% higher than at N = 5, the other metrics do not perform as well as at N = 5. Additionally, the model's parameter count grows with the number of stages. We therefore chose N = 5 as a balance between performance and parameter count.

Effects of Critical Module
To fully demonstrate the effectiveness of the proposed MFCRNet method, we conducted a series of ablation experiments to validate the contribution of each of its modules.
To illustrate the role of the FAB module in the MFCRNet architecture, we generated heatmaps from the input and output feature maps of the FAB module in the fourth stage, as shown in Figure 7. The second column shows that, without the FAB module, the extracted contours appear blurry, whereas the third column shows that the FAB module yields clearer contours. This comparison verifies that the FAB, by transforming the spatial domain into the frequency domain and leveraging both low-frequency and high-frequency information, improves the quality of the image contours and enhances the representation of key features.
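The FAB architecture itself is defined earlier in the paper; as a hedged illustration of the underlying idea, a feature map can be decomposed into complementary low- and high-frequency components with a radial mask in the FFT domain. The `cutoff` value below is an arbitrary assumption for illustration only; the FAB learns its own frequency weighting:

```python
import numpy as np

def frequency_split(feat, cutoff=0.25):
    """Split a 2-D feature map into low- and high-frequency parts using a
    centred radial mask in the FFT domain (illustrative decomposition only)."""
    h, w = feat.shape
    spec = np.fft.fftshift(np.fft.fft2(feat))          # DC moved to the centre
    yy, xx = np.mgrid[-(h // 2):h - h // 2, -(w // 2):w - w // 2]
    radius = np.sqrt((yy / h) ** 2 + (xx / w) ** 2)    # normalized frequency radius
    low_mask = (radius <= cutoff).astype(float)
    low = np.fft.ifft2(np.fft.ifftshift(spec * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spec * (1.0 - low_mask))).real
    return low, high
```

Because the two masks are complementary, the components sum back to the input; the high-frequency part carries the edge and contour detail that the heatmaps in Figure 7 show the FAB emphasizing.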
In addition, we conducted ablation experiments to quantitatively evaluate the contributions of the proposed modules on the RICE1, RICE2, and T-Cloud datasets. The baseline uses the U-Net network for feature extraction and regular convolutional modules for feature reconstruction. The results are shown in Table 5: removing either the FAB or the NAB module degrades the model's performance metrics. The visualization results of these ablation experiments are shown in Figures 8-10. In the highlighted areas, our proposed method handles object edges smoothly and naturally, whereas the ablated variants appear blurry, demonstrating the effectiveness of our approach in improving reconstruction quality.

Effects of Different Loss Functions
To verify the importance of the loss function in our work, we conducted ablation experiments with different loss functions on two datasets. As the Charbonnier loss L_c is commonly used in cloud removal tasks, we adopted it as our baseline and gradually added the boundary loss L_e and the frequency-domain loss L_f. Table 6 presents the quantitative results. Using only L_c yields the lowest evaluation scores; introducing either L_e or L_f individually has a positive effect, while incorporating both simultaneously maximizes all the evaluation metrics. This indicates that the proposed loss functions help the network learn more features during training and are therefore more beneficial for recovering detailed information. The visualization results of the loss-function ablation experiments are shown in Figures 11-13. In the regions marked with red boxes in Figure 12, the (b) and (c) results, trained without the L_f loss, reconstruct part of the green swamp in the input image as a lake even though that region is not obscured by clouds; the L_f loss thus minimizes unnecessary reconstruction of the cloud-free regions.
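As a sketch of how such a collaborative loss can be assembled, the following combines the Charbonnier term with illustrative stand-ins for the boundary and frequency-domain terms: a finite-difference gradient loss and an FFT-magnitude loss. The paper's exact formulations of L_e and L_f may differ; only the overall weighted structure is taken from the text:

```python
import numpy as np

def charbonnier(pred, gt, eps=1e-3):
    """L_c: a smooth, differentiable L1 variant robust near zero."""
    return np.mean(np.sqrt((pred - gt) ** 2 + eps ** 2))

def boundary_loss(pred, gt):
    """L_e (stand-in): L1 distance between finite-difference image gradients."""
    gx = np.abs(np.diff(pred, axis=1) - np.diff(gt, axis=1)).mean()
    gy = np.abs(np.diff(pred, axis=0) - np.diff(gt, axis=0)).mean()
    return gx + gy

def frequency_loss(pred, gt):
    """L_f (stand-in): L1 distance between FFT magnitude spectra."""
    return np.mean(np.abs(np.abs(np.fft.fft2(pred)) - np.abs(np.fft.fft2(gt))))

def total_loss(pred, gt, lam1=0.05, lam2=0.01):
    """Collaborative loss L = L_c + lam1 * L_e + lam2 * L_f."""
    return charbonnier(pred, gt) + lam1 * boundary_loss(pred, gt) \
        + lam2 * frequency_loss(pred, gt)
```

The default weights here are the values the sensitivity experiments identify as optimal; in training, the same structure would be expressed in an autodiff framework rather than numpy.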
In addition, we paid special attention to the impact of the weighting parameters λ_1 and λ_2 on the experimental results and designed a series of parameter sensitivity experiments. First, we kept λ_2 fixed while setting λ_1 to 0.01, 0.05, and 0.1 in turn, and tested on the three datasets. The comparison in Figure 14 shows a clear trend: with λ_1 = 0.05, the model demonstrates optimal performance on all metrics. We then fixed λ_1 at this optimal value and explored the effect of λ_2 on model performance, again testing the values 0.01, 0.05, and 0.1 on the same datasets. As shown in Figure 15, the model reaches its best performance at λ_2 = 0.01. Combining these results, we determined the optimal weighting parameters to be λ_1 = 0.05 and λ_2 = 0.01.

Conclusions
In this study, we introduced a novel cloud removal network leveraging a multi-stage frequency-domain attention mechanism designed to reconstruct the information obscured by cloud coverage in optical images. The framework progressively restores image detail by hierarchically deploying FCRMs. Within each FCRM, the FAB provides more distinguishable feature vectors between the cloud-covered and cloud-free regions, and the NAB propagates informative context to the next stage. Additionally, we employ a collaborative optimization loss that integrates semantic, boundary, and frequency-domain information to enhance reconstruction accuracy. Extensive testing on the RICE1, RICE2, and T-Cloud datasets corroborated the efficacy of this method. Both the quantitative and qualitative evaluations indicate that our approach not only effectively reconstructs detailed information in cloud-obscured regions but also achieves high accuracy, confirming its potential for practical applications in remote sensing image processing. In the future, we will extend our framework to handle time-series data, enabling it to reconstruct scenes captured over multiple time points.

Figure 1 .
Figure 1. The framework of the MFCRNet.

Figure 2 .
Figure 2. The structure of the FAB.

Figure 3 .
Figure 3. The structure of the NAB.

Figure 4 .
Figure 4. Results of different methods on RICE1 dataset. (a) Cloudy images; (b) results of the DCP; (c) results of the Pix2Pix; (d) results of the SpAGAN; (e) results of the RCAN; (f) results of the CVAE; (g) results of the CMNet; (h) results of ours; and (i) ground truth. Each orange box highlights a selected area, which is enlarged and displayed below the original image.

Figure 5 .
Figure 5. Results of different methods on RICE2 dataset. (a) Cloudy images; (b) results of the DCP; (c) results of the Pix2Pix; (d) results of the SpAGAN; (e) results of the RCAN; (f) results of the CVAE; (g) results of the CMNet; (h) results of ours; and (i) ground truth. Each orange box highlights a selected area, which is enlarged and displayed below the original image.

Figure 6 .
Figure 6. Results of different methods on T-Cloud dataset. (a) Cloudy images; (b) results of the DCP; (c) results of the Pix2Pix; (d) results of the SpAGAN; (e) results of the RCAN; (f) results of the CVAE; (g) results of the CMNet; (h) results of ours; and (i) ground truth. Each orange box highlights a selected area, which is enlarged and displayed below the original image.

Figure 7 .
Figure 7. Heatmaps obtained before and after the FAB. (a) Cloudy images; (b) heatmaps obtained before the FAB; (c) heatmaps obtained after the FAB; and (d) ground truth. The orange box highlights the difference in detail before and after using FAB.

Figure 8 .
Figure 8. Qualitative ablation study on different components of RICE1 dataset. (a) Cloudy images; (b) results of the baseline; (c) results of the MFCRNet without FAB; (d) results of the MFCRNet without NAB; (e) results of ours; and (f) ground truth. Each orange box highlights a selected area, which is enlarged and displayed below the original image.

Figure 9 .
Figure 9. Qualitative ablation study on different components of RICE2 dataset. (a) Cloudy images; (b) results of the baseline; (c) results of the MFCRNet without FAB; (d) results of the MFCRNet without NAB; (e) results of ours; and (f) ground truth. Each orange box highlights a selected area, which is enlarged and displayed below the original image.

Figure 10 .
Figure 10. Qualitative ablation study on different components of T-Cloud dataset. (a) Cloudy images; (b) results of the baseline; (c) results of the MFCRNet without FAB; (d) results of the MFCRNet without NAB; (e) results of ours; and (f) ground truth. Each orange box highlights a selected area, which is enlarged and displayed below the original image.

Figure 11 . Figure 12 .
Figure 11. Qualitative ablation study on different components of the loss functions in MFCRNet on RICE1 dataset. (a) Cloudy images; (b) results of the L_c; (c) results of the L_c + L_e; (d) results of the L_c + L_f; (e) results of ours; and (f) ground truth. Each orange box highlights a selected area, which is enlarged and displayed below the original image.

Figure 13 .
Figure 13. Qualitative ablation study on different components of the loss functions in MFCRNet on RICE2 dataset. (a) Cloudy images; (b) results of the L_c; (c) results of the L_c + L_e; (d) results of the L_c + L_f; (e) results of ours; and (f) ground truth. Each orange box highlights a selected area, which is enlarged and displayed below the original image.

Table 1 .
Quantitative results of different methods on RICE1 dataset, where ↓ means a lower score indicates a better effect, ↑ means a higher score indicates a better effect, and the bold text indicates the best results.

Table 2 .
Quantitative results of different methods on RICE2 dataset, where the bold text indicates the best.

Table 3 .
Quantitative results of different methods on T-Cloud dataset, where the bold text indicates the best.

Table 4 .
Quantitative results of different stage numbers, where the bold text indicates the best.

Table 5 .
Ablation study on different modules in MFCRNet, where the bold text indicates the best.

Table 6 .
Ablation study on different components of the loss functions in MFCRNet, where the bold text indicates the best.