Cloud-EGAN: Rethinking CycleGAN From a Feature Enhancement Perspective for Cloud Removal by Combining CNN and Transformer

Cloud cover presents a major challenge for geoscience research on remote sensing images: thick clouds cause complete obstruction and information loss, while thin clouds blur the ground objects. Deep learning (DL) methods based on convolutional neural networks (CNNs) have recently been introduced to the cloud removal task. However, their performance is hindered by their weak capabilities in contextual information extraction and aggregation. Unfortunately, such capabilities play a vital role in characterizing remote sensing images with complex ground objects. In this work, the conventional cycle-consistent generative adversarial network (CycleGAN) is revitalized from a feature enhancement perspective. More specifically, a saliency enhancement (SE) module is first designed to replace the original CNN module in CycleGAN, recalibrating channel attention weights to capture detailed information in multi-level feature maps. Furthermore, a high-level feature enhancement (HFE) module is developed to generate contextualized cloud-free features while suppressing cloud components. In particular, HFE is composed of both CNN- and transformer-based modules. The former enhances the local high-level features by employing residual learning and multi-scale strategies, while the latter captures the long-range contextual dependencies with the Swin transformer module to exploit high-level information from a global perspective. Capitalizing on the SE and HFE modules, an effective Cloud-Enhancement GAN, namely Cloud-EGAN, is proposed to accomplish thin and thick cloud removal tasks. Extensive experiments on the RICE and the WHUS2-CR datasets confirm the impressive performance of Cloud-EGAN.


I. INTRODUCTION
Earth observation technology has facilitated the acquisition of remote sensing images. These images have been successfully used to extract land surface information in many critical applications, including object detection [1], [2], [3], scene classification [4], [5], [6], and semantic segmentation [7], [8], [9], [10]. However, such optical satellite images are inevitably susceptible to atmospheric and illumination conditions, which incurs degradation in image quality. In particular, remote sensing images commonly suffer from the contamination of cloud layers, significantly diminishing the signal quality obtained by satellite sensors. Specifically, the cloud layers heavily reduce the visibility and saturation of images, hindering subsequent image applications [11]. While thin-cloud-covered regions still exhibit limited ground features, the contextual information beneath thick clouds is completely lost. Compared with natural digital images, remote sensing images contain more complex spatial structures and richer spectral information for ground object characterization, making cloud removal more challenging. Therefore, the development of efficient signal processing algorithms is strongly desired to accurately recover the genuine land surface information from remote sensing images distorted by cloud layers. In the literature, existing cloud removal methods can be classified into two approaches, namely conventional methods based on hand-crafted features and deep learning (DL)-based methods [12], [13], [14], [15], [16], [17], [18].
Conventional methods, such as multitemporal dictionary learning (MDL) [19], thin cloud removal using homomorphic filter (TCHF) [20], and signal transmission principles and spectral mixture analysis (ST-SMA) [21], require hand-crafted features to estimate the cloud distribution. In particular, MDL learned dictionaries of cloud-covered and cloud-free regions separately in the spectral domain, whereas TCHF utilized a classic homomorphic filter in the frequency domain. Furthermore, ST-SMA was developed based on signal transmission and spectral mixture analysis. Despite their many advantages, these methods were designed for thin cloud removal while overlooking thick cloud scenarios. Moreover, their feasibility and performance are typically limited by irregular cloud distribution and the choice of hand-crafted features.
Driven by the rapid development of DL techniques, DL-based cloud removal methods have attracted substantial research attention, owing to the superior performance of DL models in mining representative features from remote sensing images [22]. Most existing DL-based cloud removal methods in the literature were built upon convolutional neural networks (CNNs) by exploiting abstract and conceptual representations of remote sensing images. Generally speaking, DL-based networks for cloud removal can be divided into two categories, namely the pure encoder-decoder methods [11], [23], [24] and the generative adversarial network (GAN)-based networks [12], [25], [26], [27], [28], [29], [30]. For the pure encoder-decoder networks, multiscale features-CNN [23] explored the multiscale high-level features to detect thin-cloud, thick-cloud, and no-cloud pixels simultaneously, while residual learning and channel attention mechanism [11] integrated residual connections with a channel attention mechanism to capture details in different convolutional layers. Furthermore, conditional variational autoencoders (CVAE) [24] applied a probabilistic graphical model with CVAE to restore cloud-free images according to the image degradation process. The abovementioned encoder-decoder models employ the encoder to extract enriched features from remote sensing images, while the decoder is exploited to interpret abstract information before recovering the detailed information of cloud-free images. However, these methods are handicapped by the weak feature representation capability of CNNs. As a result, additional efforts are required to enhance the feature representation capability of CNNs to generate high-quality cloud-free images.
Similar to the encoder-decoder methods, the GAN-based models also consist of two parts, i.e., the generator and the discriminator [31]. Owing to its remarkable capability of modeling the relationship between input and output data, GAN has gained tremendous popularity in computer vision. For the cloud removal task, conditional GAN (cGAN) [25] employed a simple UNet-based structure as the generator while adopting PatchGAN [32] as the discriminator. Furthermore, a hybrid loss function using the structural similarity (SSIM) loss [33] was designed to improve the SSIM of the generated images with respect to the ground truth. Recently, spatial attention GAN (SpAGAN) [27] was proposed to remove clouds by integrating local-to-global spatial attention into the generator, whereas MSDA-CR [29] proposed a grid network based on cloud-distortion-aware representation learning to model the effects of cloud reflection and transmission. In addition, AMGAN-CR [30] generated attention maps through an attentive recurrent network and employed an attentive residual network to remove clouds according to the attention maps. These methods have improved the GAN-based frameworks by enhancing the encoder or loss function design through a single-directional mapping, i.e., from cloudy images to cloud-free images.
Recently, the cycle-consistent GAN (CycleGAN) model [34] has been widely applied to transfer image styles. CycleGAN attempts to learn a bidirectional mapping between domains while incorporating a cycle-consistency loss and an identity loss to effectively retain the color composition and texture. CloudGAN [12] introduced CycleGAN into cloud removal to learn the mapping of feature representations between cloudy images and their corresponding cloud-free images in a cyclic structure. In the cloud removal task, it is also necessary to learn the global color composition and texture outside the cloud area before predicting the objects under the cloud in the forward process. The reverse stage in the cycle process can promote the learning of these global representations in the forward process by restoring the original cloud map. However, CloudGAN suffers from blurred edges due to its straightforward encoding structure and the lack of modeling of channel and spatial relationships. On this basis, SAR-to-optical image translation using SSIM and perceptual loss-based CycleGAN [26] introduced the least squares loss function [35] into the CycleGAN to improve its training stability in image translation. Furthermore, multimodal GAN (MMGAN) [28] was developed to generate multiple most likely cloud-free outputs before selecting the best generated cloud-free images through a perception-based image quality evaluator. Despite their many advantages, these methods suffer from poor performance in reconstructing detailed features of remote sensing images, as they are straightforward extensions of models originally devised for natural images. Compared with natural scene images, remote sensing images exhibit more severe spectral heterogeneity and more complex spatial relationships of ground objects [36], [37]. Typically, undesired cloud layers have various thicknesses, and images are acquired under different lighting conditions [38]. As a result, the performance of those image restoration models developed for natural scene images is usually poor if directly applied to cloud removal. Furthermore, it is challenging for these models to handle large-scale cloud removal tasks due to their prohibitively expensive computational complexity.
To improve the representation capability of CNNs and GANs with long-range contextual information, the newly developed transformer has been introduced into cloud removal tasks. Empowered by its nonlocal attention mechanism, the transformer can establish long-range dependencies with impressive scalability [39], [40]. For instance, SAR-enhanced cloud removal with global-local fusion [15] added a Swin transformer layer [41] after each convolutional layer for cross-window feature interaction. CloudTran [42] replaced the CNN-based encoder with an axial transformer [43] to estimate the low-resolution cloud-free images. However, the transformer is only regarded as a feature extractor to exploit global information while lacking the capability to fully extract enriched local features. Compared with the transformer, CNN exploits and aggregates enriched local features using the local receptive fields in the convolutional layers [3], [10]. One trivial approach to take advantage of both transformer and CNN is to directly construct a dual-branch encoder to extract global and local information by transformer and CNN, respectively [44], [45], [46], [47]. More recently, the authors in [48] and [49] proposed to use CNN to extract multiscale features while exploring the ability of the transformer to enhance these multiscale features. In contrast, Fang et al. [50] further integrated the Swin transformer layers and the convolutional layers by exploiting spatial attention after each Swin transformer layer. However, all the aforementioned methods failed to explore the potential enhancement of high-level semantic features provided by exploiting the synergy of CNN and transformer. Thus, it is of great practical interest to investigate how to fill this gap by combining CNN and transformer in the cloud removal task.

[Fig. 1 caption fragment: ... G_X2Y and G_Y2X are the two generators and D_X and D_Y are the two discriminators. X, X̃, Y, and Ỹ represent the authentic cloudy images, the generated cloudy images, the authentic cloud-free images, and the generated cloud-free images, respectively. (b) The discriminator follows the PatchGAN structure. (c) The generator is based on the UNet framework with different output sizes from each SE block. Note that the two generators, G_X2Y and G_Y2X, are designed with the same structure. Similarly, the two discriminators, D_X and D_Y, have the same structure.]
Motivated by the aforementioned challenges, this work introduces a CycleGAN-based model for thin and thick cloud removal from two different enhancement perspectives. First, the backbone is enhanced by a saliency enhancement (SE) module to extract hierarchical discriminant features with more saliency. Furthermore, in sharp contrast to the existing models that utilize CNN to enhance high-level features [6], [9], [23], this work proposes to explore enriched high-level features by jointly exploiting CNN and transformer. The main contributions of this work can be summarized as follows:
1) An SE module is utilized to generate enhanced hierarchical feature maps derived from each convolutional block by recalibrating the attention weights of feature channels. As a result, cloud-covered components and blurred edges are reduced.
2) A high-level feature enhancement (HFE) module is devised between the encoder and the decoder to effectively explore and aggregate high-level features. Specifically, HFE is composed of a CNN-based HFE (CHFE) module and a transformer-based HFE (THFE) module. CHFE is designed to exploit high-level local features to harvest sufficient detailed information, while THFE captures long-range contextual information. CHFE and THFE are integrated under the cloud-enhancement GAN (Cloud-EGAN) framework to retain the global features of the restored cloud-clear images.
3) Extensive experimental results on the RICE and WHUS2-CR datasets verify the superiority of Cloud-EGAN in segregating clouds and preserving high-quality land surface information.
The rest of this article is organized as follows. Section II elaborates on the proposed model while extensive experimental results are presented and analyzed in Section III. Finally, Section IV concludes this article.

II. METHODOLOGY
In this work, a CycleGAN-based architecture with SE and HFE modules in the generator is proposed to extract and aggregate enhanced local and global features from remote sensing images. In the following, an overview of the proposed Cloud-EGAN is presented before each of its key components is elaborated. Finally, hybrid loss functions employed in the proposed model are devised.

A. Framework
As depicted in Fig. 1(a), the proposed Cloud-EGAN is developed based on CycleGAN, which consists of two generators G_X2Y and G_Y2X and two discriminators D_X and D_Y. More specifically, for a supervised cloud removal task, the authentic cloudy image X serves as the input to the generator G_X2Y to reconstruct the predicted cloud-free image Ỹ, which is then discriminated by D_Y against the authentic cloud-free image Y. Meanwhile, according to the cyclic consistency principle, the generator G_Y2X is employed to generate the cloudy image X̃ from Ỹ. The same operation is performed on the input Y in the Cloud-EGAN.
As illustrated in Fig. 1(b), the discriminators D_X and D_Y adopt a PatchGAN structure with stacked hierarchical convolutional blocks to determine the authenticity of Ỹ. Furthermore, the generator is developed based on a UNet architecture [51] by capitalizing on symmetrical concatenations between an encoder and a decoder, as shown in Fig. 1(c). Specifically, the generator combines the SE and HFE modules: SE exploits hierarchical features by reassigning attention weights to feature maps at each level, and the resulting high-level feature maps are then fed into the HFE module to further enhance feature representation through the combination of CNN and transformer. After that, a convolutional prediction head is utilized at the end of the generator to recover cloud-clear images. More details about the SE and HFE modules are elaborated in the following sections.
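To make the cyclic data flow concrete, the following PyTorch-style sketch traces the forward and reverse processes described above. The class and variable names are placeholders rather than the authors' implementation; the two generators are passed in as generic modules.

```python
import torch
import torch.nn as nn

class CycleFramework(nn.Module):
    """Sketch of the bidirectional cycle: X -> Ỹ -> X̂ and Y -> X̃ -> Ŷ."""
    def __init__(self, G_X2Y: nn.Module, G_Y2X: nn.Module):
        super().__init__()
        self.G_X2Y, self.G_Y2X = G_X2Y, G_Y2X

    def forward(self, x_cloudy: torch.Tensor, y_clear: torch.Tensor):
        # Forward process: cloudy -> predicted cloud-free -> reconstructed cloudy.
        y_fake = self.G_X2Y(x_cloudy)   # Ỹ, judged by D_Y against the authentic Y
        x_rec = self.G_Y2X(y_fake)      # X̂, used by the cycle consistency loss
        # Reverse process: cloud-free -> predicted cloudy -> reconstructed cloud-free.
        x_fake = self.G_Y2X(y_clear)    # X̃, judged by D_X against the authentic X
        y_rec = self.G_X2Y(x_fake)      # Ŷ
        return y_fake, x_rec, x_fake, y_rec
```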

B. Saliency Enhancement
Following the classical channel attention mechanism [52], the SE module adaptively exploits more salient features from remote sensing images at multiple feature levels by assigning learnable attention weights to feature channels. As a result, SE can enhance information restoration from heavily cloudy regions and generate high-quality cloud-free features. Fig. 2(a) illustrates the encoding process, in which $u_k \in \mathbb{R}^{D_k \times H_k \times W_k}$ denotes the $k$th-level feature map generated from the first convolutional block (SE_Conv), where $D_k$ is the channel dimension, $H_k = H/2^k$, and $W_k = W/2^k$. Furthermore, a global average pooling (GAP) layer is applied as a channel descriptor to exploit enriched features and produce the output
$$ z_k = \mathcal{G}(u_k) = \frac{1}{H_k W_k} \sum_{i=1}^{H_k} \sum_{j=1}^{W_k} u_k(i, j) $$
where $\mathcal{G}(\cdot)$ stands for the GAP function. After that, two $1 \times 1$ SE_Conv blocks are utilized to compute the attention weights through convolution operations, with the output $s_k \in \mathbb{R}^{D_k \times 1 \times 1}$ given by
$$ s_k = \delta\big(W_2 \ast (W_1 \ast z_k)\big) $$
where $W_1$ and $W_2$ are the parameters of the two convolutional blocks and $\delta(\cdot)$ is the sigmoid function. Finally, the SE output is
$$ \tilde{u}_k = s_k \odot u_k $$
where $\odot$ represents the point multiplication operation. The decoding process depicted in Fig. 2(b) is similar to Fig. 2(a), with the convolutional block (SE_Conv) replaced by an upsampling SE_Conv.
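A minimal PyTorch sketch of this channel re-calibration is given below. The reduction ratio and the intermediate ReLU between the two 1 × 1 convolutions are assumptions borrowed from the standard squeeze-and-excitation design [52], not details taken from the paper.

```python
import torch
import torch.nn as nn

class SaliencyEnhancement(nn.Module):
    """Channel re-calibration sketch following the SE description above."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)  # global average pooling G(.)
        self.weights = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),  # W_1
            nn.ReLU(inplace=True),                                      # assumed nonlinearity
            nn.Conv2d(channels // reduction, channels, kernel_size=1),  # W_2
            nn.Sigmoid(),                                               # delta(.)
        )

    def forward(self, u_k: torch.Tensor) -> torch.Tensor:
        s_k = self.weights(self.gap(u_k))   # attention weights, D_k x 1 x 1
        return u_k * s_k                    # point multiplication, broadcast over H_k x W_k
```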

C. High-Level Feature Enhancement
The HFE module is designed to learn enriched high-level local and nonlocal features by combining CHFE and THFE, as shown in Fig. 3. As a result, it is beneficial to further characterize cloud-free representations and propagate contextual information across the feature maps from a global perspective, which can maintain the spatial structure of the restored features identical to the ground truth.
More specifically, a residual learning module [53] and a dilated convolutional module [54] are used in CHFE to process the high-level features in parallel. In particular, the high-level features $F_h \in \mathbb{R}^{D \times \frac{H}{16} \times \frac{W}{16}}$ are fed into the residual learning module, which contains three successive residual blocks named HFE_ResConv, to extract critical ground information while reducing the feature discrepancy between cloud-covered and cloud-free images. Meanwhile, $F_h$ is passed through a convolutional block with a residual structure named HFE_Conv, and three dilated convolutional blocks named HFE_DilatedConv with different dilation rates, to exploit multiscale contextual information while alleviating cloud-covered features. After that, the concatenated outputs are further enhanced through an HFE_Conv block to restore the original feature size. Finally, the outputs of the residual learning module and the dilated convolutional module are added together to form the refined feature maps $F \in \mathbb{R}^{D \times \frac{H}{16} \times \frac{W}{16}}$. Following the approach of the classical Swin transformer [41], THFE splits $F$ into nonoverlapping patches in the patch partition module before projecting the patches to an arbitrary dimension $D$ using a linear embedding layer. The patches are then fed into a successive Swin transformer block and a patch merging layer to generate higher level feature representations. More specifically, as depicted in Fig. 3(b), each successive Swin transformer block consists of the residual architecture, four layer-normalization (LN) layers, a window-based multihead self-attention (WMSA) module, a shifted WMSA (SWMSA) module, and two multilayer perceptron (MLP) layers with the GELU function.
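The parallel structure of CHFE can be sketched as follows in PyTorch. The kernel sizes, dilation rates, normalization layers, and the exact composition of HFE_Conv and HFE_ResConv are assumptions for illustration only.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, dilation=1):
    # 3x3 conv + instance norm + LeakyReLU; these details are assumed, not reported.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=dilation, dilation=dilation),
        nn.InstanceNorm2d(c_out),
        nn.LeakyReLU(0.2, inplace=True),
    )

class ResBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = conv_block(channels, channels)
    def forward(self, x):
        return x + self.body(x)   # HFE_ResConv-style residual connection

class CHFE(nn.Module):
    def __init__(self, channels=256, dilations=(2, 4, 8)):
        super().__init__()
        self.res_branch = nn.Sequential(*[ResBlock(channels) for _ in range(3)])
        self.pre = conv_block(channels, channels)                              # HFE_Conv
        self.dilated = nn.ModuleList([conv_block(channels, channels, d) for d in dilations])
        self.fuse = conv_block(channels * len(dilations), channels)            # restore size

    def forward(self, f_h):
        res = self.res_branch(f_h)                               # residual learning branch
        x = self.pre(f_h)
        multi = torch.cat([m(x) for m in self.dilated], dim=1)   # multiscale dilated branch
        return res + self.fuse(multi)                            # refined feature maps F
```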
The operation of the successive Swin transformer blocks is shown in Fig. 3(b). For each head of the WMSA and SWMSA, the input features $F_S$ are fed into the Swin transformer block to calculate the multihead self-attention (MSA) as follows:
$$ Q_S = F_S W_Q, \quad K_S = F_S W_K, \quad V_S = F_S W_V $$
and
$$ \mathrm{Att}(F_S) = \varphi\!\left(\frac{Q_S K_S^{\top}}{\sqrt{d}} + B_S\right) V_S $$
where $Q_S$, $K_S$, and $V_S$ denote the projected query, key, and value features, respectively, while $W_Q$, $W_K$, and $W_V$ are the corresponding parameter matrices. Furthermore, $B_S$ is the learnable relative position embedding term in the Swin transformer, whereas $\mathrm{Att}(F_S)$ represents the output of self-attention for each head. In addition, $\varphi(\cdot)$ is the softmax function and $d = D/4$ is the channel dimension for each head. After that, the features of each $2 \times 2$ group of neighboring patches generated by the Swin transformer block are concatenated by the patch merging layer. We denote by $H/32$, $W/32$, and $4D$ the height, width, and channel dimension after the patch merging layer, respectively. Finally, after two Swin transformer blocks and a reshape operation to restore the same size as the input $F_h$, the output of the HFE module $\tilde{F}_h \in \mathbb{R}^{D \times \frac{H}{16} \times \frac{W}{16}}$ is obtained.
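For clarity, a single attention head of the (S)WMSA computation above can be written as the following sketch, operating on tokens that have already been partitioned into local windows; window partitioning and window shifting are omitted here.

```python
import torch

def window_attention_head(f_s, w_q, w_k, w_v, b_s):
    """One WMSA head inside a local window, following the equations above.
    f_s: (num_windows, tokens, D); w_q/w_k/w_v: (D, d); b_s: (tokens, tokens)."""
    q, k, v = f_s @ w_q, f_s @ w_k, f_s @ w_v                               # Q_S, K_S, V_S
    d = q.shape[-1]
    attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5 + b_s, dim=-1)  # phi(.)
    return attn @ v                                                         # Att(F_S)
```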

D. Loss Functions
In this work, a novel hybrid loss function comprising the adversarial loss $\mathcal{L}_{adv}$, the cycle consistency loss $\mathcal{L}_{cyc}$, the perceptual loss $\mathcal{L}_{per}$, and the identity loss $\mathcal{L}_{id}$ is introduced to guide the training of our proposed model. It is notable that $\mathcal{L}_{adv}$ is utilized to train both generators and discriminators, while $\mathcal{L}_{cyc}$, $\mathcal{L}_{per}$, and $\mathcal{L}_{id}$ are employed for training the generators. The hybrid loss function $\mathcal{L}$ can be formulated as follows:
$$ \mathcal{L} = \mathcal{L}_{adv} + \lambda_{cyc} \mathcal{L}_{cyc} + \lambda_{per} \mathcal{L}_{per} + \lambda_{id} \mathcal{L}_{id} $$
where $\lambda_{cyc}$, $\lambda_{per}$, and $\lambda_{id}$ are adjustable weights of the three loss components. More details about each loss function are provided in the following sections. 1) Adversarial Loss: The adversarial loss aims to make the reconstructed cloud-free images close to the corresponding ground truth. Adopting a structure similar to the classical CycleGAN, we define the adversarial loss of the forward process as
$$ \mathcal{L}_{adv}^{X2Y} = \mathbb{E}_{y \sim P_{data}(y)}\big[\log D_Y(y)\big] + \mathbb{E}_{x \sim P_{data}(x)}\big[\log\big(1 - D_Y(G_{X2Y}(x))\big)\big] $$
where $x$ and $y$ are the input cloudy and cloud-free image samples, respectively. Furthermore, $P_{data}(x)$ and $P_{data}(y)$ represent the distributions of cloudy and cloud-free images. The total adversarial objective $\mathcal{L}_{adv}$ is comprised of $\mathcal{L}_{adv}^{X2Y}$ and $\mathcal{L}_{adv}^{Y2X}$, used to train the forward process and the reverse process, respectively.
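As a rough illustration, the adversarial and hybrid objectives can be assembled as below. The use of a binary cross-entropy (log-likelihood) adversarial loss is an assumption, as some CycleGAN variants use a least-squares loss instead; the weight values follow the settings reported in the implementation details.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()   # assumed log-likelihood objective

def d_loss(d_real_logits, d_fake_logits):
    # Discriminator: label authentic images as real and generated images as fake.
    return bce(d_real_logits, torch.ones_like(d_real_logits)) + \
           bce(d_fake_logits, torch.zeros_like(d_fake_logits))

def g_adv_loss(d_fake_logits):
    # Generator: fool the discriminator into labeling generated images as real.
    return bce(d_fake_logits, torch.ones_like(d_fake_logits))

def hybrid_loss(l_adv, l_cyc, l_per, l_id, lam_cyc=10.0, lam_per=1.0, lam_id=9.0):
    # L = L_adv + lambda_cyc * L_cyc + lambda_per * L_per + lambda_id * L_id
    return l_adv + lam_cyc * l_cyc + lam_per * l_per + lam_id * l_id
```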
2) Cycle Consistency Loss: The cycle consistency loss measures the pixel-wise difference between the generated images and their corresponding ground truth. It is adopted to reduce blurring regions and keep the reconstructed images closer to the ground truth. The cycle consistency loss takes the following form:
$$ \mathcal{L}_{cyc} = \mathbb{E}_{x \sim P_{data}(x)}\big[\big\| G_{Y2X}(G_{X2Y}(x)) - x \big\|_1\big] + \mathbb{E}_{y \sim P_{data}(y)}\big[\big\| G_{X2Y}(G_{Y2X}(y)) - y \big\|_1\big] $$
where $G_{X2Y}$ and $G_{Y2X}$ are the two generators in the Cloud-EGAN and $\|\cdot\|_1$ stands for the L1-norm of the enclosed quantity.

3) Perceptual Loss:
Based on the computation of losses in pixel colors and edges, the perceptual loss [55] is introduced to measure the consistency between the convolutional outputs of the ground truth and the restored images, obtained by a pretrained network, e.g., VGG19 pretrained on ImageNet [56]. Moreover, the capability of extracting perceptual semantic features via the convolutional layers can be evaluated. Mathematically, the perceptual loss can be defined as
$$ \mathcal{L}_{per} = \sum_{k} \frac{1}{C_k H_k W_k} \Big( \big\| \phi_k(x) - \phi_k(\tilde{x}) \big\|_2^2 + \big\| \phi_k(y) - \phi_k(\tilde{y}) \big\|_2^2 \Big) $$
where $\phi_k$ denotes the feature map extracted from the $k$th layer of the pretrained VGG19 network, and $C_k$, $H_k$, and $W_k$ denote the number of channels, height, and width of the $k$th feature map, respectively. Moreover, $x$ and $\tilde{x}$ represent the pixel intensities in the original cloudy images and the cloudy images generated by Cloud-EGAN, respectively. Meanwhile, $y$ and $\tilde{y}$ stand for the pixel intensities in the ground-truth cloud-free images and the cloud-free images generated by Cloud-EGAN, respectively.
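A hedged PyTorch sketch of such a VGG19-based perceptual loss is shown below. The selected layer indices are illustrative rather than the authors' exact choice, and the mean over feature elements absorbs the 1/(C_k H_k W_k) normalization.

```python
import torch
import torch.nn as nn
from torchvision import models

class PerceptualLoss(nn.Module):
    """Compare VGG19 feature maps of a restored image and its reference image."""
    def __init__(self, layer_ids=(3, 8, 17, 26)):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)     # the pretrained network is kept frozen
        self.vgg = vgg
        self.layer_ids = set(layer_ids)

    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        loss = pred.new_zeros(())
        x, y = pred, target
        for i, layer in enumerate(self.vgg):
            x, y = layer(x), layer(y)
            if i in self.layer_ids:
                loss = loss + torch.mean((x - y) ** 2)  # squared L2 over C_k*H_k*W_k elements
        return loss
```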

4) Identity Loss:
The identity loss aims to retain the color consistency between the input and the output. For the cloud removal task, the clouds are expected to be eliminated in the generated cloud-free images and the cloud-free regions are expected to remain unchanged in texture details and color compositions. The proposed model can avoid color distortion in cloud-free regions by applying the identity loss. It can be formulated as follows:
$$ \mathcal{L}_{id} = \mathbb{E}_{y \sim P_{data}(y)}\big[\big\| G_{X2Y}(y) - y \big\|_1\big] + \mathbb{E}_{x \sim P_{data}(x)}\big[\big\| G_{Y2X}(x) - x \big\|_1\big] $$
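Both L1-based terms, the cycle consistency loss defined earlier and the identity loss above, can be sketched as follows; the generator callables and tensor names are placeholders.

```python
import torch.nn as nn

l1 = nn.L1Loss()

def cycle_consistency_loss(x, x_cycled, y, y_cycled):
    # ||G_Y2X(G_X2Y(x)) - x||_1 + ||G_X2Y(G_Y2X(y)) - y||_1
    return l1(x_cycled, x) + l1(y_cycled, y)

def identity_loss(g_x2y, g_y2x, x, y):
    # Feeding a cloud-free image to G_X2Y (and a cloudy one to G_Y2X) should
    # leave it unchanged, which preserves color composition in cloud-free regions.
    return l1(g_x2y(y), y) + l1(g_y2x(x), x)
```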

III. EXPERIMENTAL RESULTS
In this section, experimental datasets will be first described. After that, the parameter settings and evaluation metrics are introduced before the comparisons with other DL-based models are reported and analyzed.

A. Datasets
In this section, the proposed model is evaluated on the RICE dataset [57] and the WHUS2-CR dataset [58]. Specifically, the RICE dataset comprises two subdatasets named RICE1 and RICE2. In particular, RICE1 contains 500 pairs of cloud-covered and cloud-free images from Google Earth, with a ground resolution of 5 m/pixel. Most of the samples in RICE1 contain thin clouds, where the ground objects are mostly identifiable. In sharp contrast, the RICE2 dataset includes 736 groups of Landsat-8 images with a ground resolution of 30 m/pixel. The images in this dataset contain abundant thick clouds, where the ground objects are hardly identifiable. Taking into account the large discrepancy in terms of cloud thickness and image resolution, we perform our evaluation on these two subdatasets separately. Furthermore, in sharp contrast to MSDA-CR [29] and CR-MSS [58], which utilize multispectral data as input, we mainly focus on visible (RGB) bands in our evaluation. This is because RGB images are more commonly available [11], [12], [28], [59]. However, we also perform supplementary experiments to demonstrate that the proposed model can work well with multispectral data by exploiting both RGB and near-infrared (NIR) data.
The images in the RICE dataset are of size 512 × 512 pixels each. Moreover, the WHUS2-CR dataset involves 848 pairs of Sentinel-2 image patches of size 256 × 256 pixels. The acquisition time lag of the cloud-covered images and their corresponding cloud-free images is less than 10 days. Furthermore, 400 and 100 image pairs were chosen for training and testing in the RICE1 dataset, respectively. In addition, 589 and 147 pairs were adopted as the training and testing set in the RICE2 dataset, respectively. For the WHUS2-CR dataset, 679 pairs were obtained as training data, and the remaining 169 pairs were reserved for testing. Some typical samples in the RICE1, RICE2, and WHUS2-CR datasets are displayed in Fig. 4.

B. Implementation Details
In the generator of the Cloud-EGAN, four convolutional layers with a kernel size of 4 × 4 and a stride of 2 are utilized in the encoder and decoder, with {32, 64, 128, 256} channels for the former and {256, 128, 64, 32} for the latter. After that, a convolutional layer with a kernel size of 4 × 4, a stride of 1, and 3 channels is utilized to restore the cloud-free images with the same size as the input. In the discriminator, four convolutional layers with a kernel size of 4 × 4 and a stride of 2 are exploited with {64, 128, 256, 512} channels. Meanwhile, a convolutional layer with a kernel size of 4 × 4, a stride of 1, and a channel number of 1 is used to discriminate whether the generated cloud-free images are authentic or not. Notably, these convolutional layers are followed by instance normalization [34] and the Leaky ReLU function [60] parameterized by 0.2, except for the classifier in the decoder and the discriminator.
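One plausible reading of this channel plan is sketched below. The transposed-convolution upsampling, padding choices, and the omission of the UNet skip concatenations (which would widen the decoder inputs) are simplifying assumptions.

```python
import torch.nn as nn

def enc_block(c_in, c_out):
    # 4x4 convolution, stride 2, followed by instance normalization and LeakyReLU(0.2).
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
        nn.InstanceNorm2d(c_out),
        nn.LeakyReLU(0.2, inplace=True),
    )

def dec_block(c_in, c_out):
    # Upsampling counterpart; a transposed convolution is assumed here.
    return nn.Sequential(
        nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
        nn.InstanceNorm2d(c_out),
        nn.LeakyReLU(0.2, inplace=True),
    )

# Channel plan as reported above, reading {256, 128, 64, 32} as decoder outputs.
encoder = nn.Sequential(enc_block(3, 32), enc_block(32, 64),
                        enc_block(64, 128), enc_block(128, 256))
decoder = nn.Sequential(dec_block(256, 256), dec_block(256, 128),
                        dec_block(128, 64), dec_block(64, 32))
head = nn.Conv2d(32, 3, kernel_size=4, stride=1, padding='same')  # prediction head
```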
In Cloud-EGAN, the learning rate α was initially set to 0.0001 before being decayed by half after every 20 epochs. Furthermore, the batch size was set to 4. In addition, the Adam optimizer [61] with default momentum parameters, i.e., β1 = 0.9 and β2 = 0.999, was adopted. Finally, λ_cyc, λ_per, and λ_id in the loss function were set to 10, 1, and 9, respectively. All experiments were implemented on a single NVIDIA GeForce RTX 3090 GPU with 24-GB RAM.
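The reported training setup can be mirrored with the following sketch; the placeholder generators are only there to keep the snippet self-contained.

```python
import torch
import torch.nn as nn

# Placeholder generators; the real Cloud-EGAN generators would be used instead.
G_X2Y, G_Y2X = nn.Conv2d(3, 3, 1), nn.Conv2d(3, 3, 1)

params = list(G_X2Y.parameters()) + list(G_Y2X.parameters())   # D_X and D_Y are handled analogously
optimizer = torch.optim.Adam(params, lr=1e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)  # halve lr every 20 epochs

batch_size = 4
lam_cyc, lam_per, lam_id = 10.0, 1.0, 9.0   # loss weights reported above
```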

C. Metrics
Two widely used metrics, SSIM [62] and peak signal-to-noise ratio (PSNR) [63], were utilized for quantitative evaluation. Specifically, SSIM between a generated image $x$ and its ground truth $y$ is expressed as
$$ \mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} $$
where $\mu_x$ and $\mu_y$ represent the averages, $\sigma_x^2$ and $\sigma_y^2$ the variances, and $\sigma_{xy}$ the covariance. $C_1$ and $C_2$ are constants for stabilizing the division with a weak denominator. A larger SSIM value stands for greater similarity between the generated cloud-free images and the ground truth, which indicates a higher quality of the generated cloud-free images. Moreover, PSNR is defined as
$$ \mathrm{PSNR} = 20 \log_{10} \frac{\mathrm{MAX}_I}{\sqrt{\mathrm{MSE}}} $$
where
$$ \mathrm{MSE} = \frac{1}{3MN} \sum_{c=1}^{3} \sum_{i=1}^{M} \sum_{j=1}^{N} \big[ I(i, j, c) - J(i, j, c) \big]^2 $$
and $\mathrm{MAX}_I$ represents the maximum possible pixel value in the generated cloud-free image $I$. Moreover, the generated cloud-free image $I$ and the corresponding ground truth $J$ are of size $M \times N \times 3$, and $(i, j)$ represents the pixel index in $I$ and $J$.
A larger PSNR value represents less image distortion in the reconstructed cloud-free images. Finally, we evaluated the computational complexity of the proposed method using the following metrics, namely the floating point operation count (FLOPs), the number of parameters (M), and the frames per second (FPS). More specifically, FLOPs is used to evaluate the model complexity, whereas M measures the memory requirement. In addition, FPS is used to evaluate the execution speed. For computationally efficient models, FLOPs and M should be small while FPS should be large.
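For reference, the PSNR computation above can be implemented as the short sketch below; for SSIM, an off-the-shelf implementation is typically used rather than re-deriving the formula.

```python
import numpy as np

def psnr(pred: np.ndarray, gt: np.ndarray, max_i: float = 255.0) -> float:
    # PSNR = 20 * log10(MAX_I / sqrt(MSE)), with MSE averaged over all pixels and bands.
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    return 20.0 * np.log10(max_i / np.sqrt(mse))

# For SSIM, an off-the-shelf routine such as
# skimage.metrics.structural_similarity(pred, gt, channel_axis=-1, data_range=255)
# can be used instead of re-implementing the formula.
```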

D. Performance Comparison
As illustrated in Fig. 5, the results obtained by Cloud-EGAN achieved lower spectral distortion and higher SSIM with respect to the ground truth in the thin-cloud-covered scenarios. Moreover, the results obtained by cGAN and CloudGAN suffered from considerable loss of texture details and some blurred areas, while failing to thoroughly restore the land surface information in the generated cloud-free images. Compared with cGAN and CloudGAN, SpAGAN and MSDA-CR showed better results with more explicit texture details. However, some color distortions were observed. As a result, the color information of the ground surface could not be fully restored. Finally, despite the fact that the results of CVAE and MMGAN achieved color compositions similar to the ground truth, some slightly blurred edges were noticed in several patches.
In contrast, Cloud-EGAN performed best among all methods under evaluation in the thick-cloud-covered scenarios of the RICE2 dataset, as shown in Fig. 6. It generated images with better texture structures and color compositions. In comparison, cGAN and CloudGAN could not remove clouds thoroughly, which generated some edge-blurring and color-distortion areas in noncloudy regions. Moreover, the results obtained by SpAGAN exhibited severe loss of details since the structure information of ground scenarios could not be completely recovered. Furthermore, it was observed that the results of CVAE, MSDA-CR, and MMGAN showed more spatial features similar to the ground truth, though some slight color distortions were observed.
For the WHUS2-CR dataset, as depicted in Fig. 7, the color compositions and the texture details of the cloud-free images generated by Cloud-EGAN were more similar to the ground truth. In contrast, cGAN and CloudGAN showed the worst performance due to their limited feature extraction capability. Compared with cGAN and CloudGAN, SpAGAN and MSDA-CR obtained better results with more evident backgrounds and details, but some color distortion remained in cloudless regions. Finally, the color tones in the results of CVAE and MMGAN were visually close to the ground truth. However, some contextual information was lost, and clouds were not segregated thoroughly by these models.
The quantitative results in Table I further confirm these observations, owing to the exploration and aggregation of enriched local and global features in hierarchical and deep contextualized space. In particular, the results labeled as Cloud-EGAN* were generated with the proposed Cloud-EGAN using four input bands, i.e., RGB and NIR. It is evident from Table I that the proposed Cloud-EGAN can also work well in multispectral scenes. Furthermore, it is shown that the NIR band could indeed further improve the performance of cloud removal. Therefore, the cloud-covered and cloud-free regions could be more accurately characterized, which aided in maintaining the recovered images close to the ground truth.

E. Ablation Study
1) Cycle Consistency:
In order to evaluate the necessity of cycle consistency, which requires two generators and two discriminators, we conducted an ablation experiment, shown in the second row from the bottom of Table II, using a conventional GAN framework with only one generator-discriminator pair. The experimental results show that the cycle-consistent mechanism enables the generator to learn better global representations, promoting the prediction of the ground objects in cloud-free areas.

2) Model Components:
We compared the quantitative results by eliminating various modules in the proposed Cloud-EGAN framework, as shown in Table II. Note that eliminating the SE module, i.e., only utilizing convolutional layers followed by instance normalization and the leaky ReLU function, still follows the same UNet-based architecture as in Fig. 1(c). Inspection of Table II reveals that integrating all feature enhancement modules resulted in the best performance in terms of both PSNR and SSIM. Accordingly, this confirms the benefits of these modules in aggregating enriched contextualized features and sufficiently restoring ground surface information. Notably, SE could be further developed with other channel-based or spatial-based attention modules. We provide a unique perspective on using the squeeze-and-excitation module [52] to comprehensively enhance convolutional networks. Considering its versatility and complexity, we finally chose this classical channel attention module as the feature enhancement structure in this work. Moreover, the quantitative results without the SE modules were better than those obtained without HFE. In other words, HFE plays a critical role in cloud removal performance, further demonstrating the necessity of enhancing high-level features for remote sensing images. More specifically, THFE enables the model to learn more global representations, facilitating better prediction of the objects under the cloudy area. As shown in Table II, the results generated with THFE were better than those without THFE. Similar observations regarding CHFE can be made in Table II, which suggests that models with CHFE can learn more detailed representations.
3) Effectiveness of Adding Perceptual Loss: To evaluate the effectiveness of the perceptual loss, we compared the proposed hybrid loss function with the loss function in the classical CycleGAN, as shown in Table III. The adjustable weights λ_cyc and λ_id of the loss function in the classical CycleGAN were set to 10 and 9, respectively. It is observed that there was a non-negligible improvement in terms of PSNR and SSIM after incorporating the perceptual loss.

F. Complexity Analysis
Table IV shows the complexity evaluation results of all methods considered in our work. SpAGAN achieved the best performance on these metrics since it only uses simple CNN-based modules, although its generation performance is poor. Compared to most other methods, the proposed Cloud-EGAN achieves significantly improved performance with low computational complexity by exploiting convolution operations and the highly efficient WMSA module. Meanwhile, we added more modules, including the CHFE and THFE modules, to enhance the high-level features. Therefore, the proposed Cloud-EGAN achieved better cloud removal performance at the cost of a larger number of parameters and a lower inference speed.

G. Discussion
The experimental results have demonstrated that Cloud-EGAN performs better than existing DL-based models in the cloud removal task. This superior performance can be attributed to the cyclic structure and the integration of the SE and HFE modules. More specifically, Cloud-EGAN learns the mapping of feature representations between cloudy images and the corresponding cloud-free images in a cyclic-consistent way, which is conducive to strengthening the model capability of feature representation. Moreover, the combination of SE and HFE can effectively extract and aggregate contextual information, which is conducive to generating high-quality cloud-free images similar to the ground truth. The effectiveness of introducing SE and HFE can be validated from the feature maps shown in Fig. 8. Notably, the informative feature details are further enhanced through SE and HFE. As a result, cloud-removed scenes with enriched ground information can be preserved in Cloud-EGAN.
In addition, we compared the training loss convergence using Cloud-EGAN and the classical CycleGAN on the RICE1, RICE2, and WHUS2-CR datasets. It is observed in Fig. 9(a)-(c) that Cloud-EGAN obtained better convergence performance than CycleGAN due to the novel framework and the incorporation of the perceptual loss.

IV. CONCLUSION
In this work, a novel CycleGAN-based architecture, named Cloud-EGAN, has been proposed to perform supervised cloud removal tasks; it can effectively remove thin and thick clouds while preserving spectral and spatial consistency with the land surface. Compared with existing DL-based models developed for removing clouds, the proposed Cloud-EGAN utilizes a cyclic architecture while integrating the SE and HFE modules to enhance the ability to characterize remote sensing images with complex ground objects. While the SE module is designed to recalibrate the weights of hierarchical feature channels, the HFE module is employed to further aggregate local and global high-level contextualized features. As a result, the proposed Cloud-EGAN can more effectively exploit multilevel enriched features with more saliency to highlight ground information while suppressing cloud components and blurred edges through the integration of CNN and transformer. Extensive simulation results on the RICE and WHUS2-CR datasets have confirmed the superior cloud removal performance achieved by Cloud-EGAN as compared to existing DL-based methods for removing thin and thick clouds.
There are several extensions of this study that can be further explored. First, it is of great practical interest to further investigate how to construct a more computationally efficient model for various cloud-covered scenarios. Furthermore, it is interesting to consider applying the proposed Cloud-EGAN to large-scale remote sensing datasets, such as Sentinel-2 and Landsat-9 images in an unsupervised or semisupervised manner. Finally, end-to-end designs of cloud removal and other downstream tasks, such as semantic segmentation, will be explored in future research.