Removing Snow from a Single Image Using a Residual Frequency Module and Perceptual RaLSGAN

Removing snow particles from an image is a complicated task due to the shape, size, and color of the particles. The latest snow removal methods remove snow from a single image but still retain some snow and salt-and-pepper particles. Other approaches, while trying to remove snow from a single image, produce blurry artifacts. In this paper, we solve these problems by designing a network model that consists of a residual generative network, a snow-free image generative network, and a perceptual relativistic discriminative network. In both generative networks, we assign the residual frequency network (ReFNet) as our bottleneck module. Our network model learns two mappings. First, the input snowy image is trained to map to the snow mask image in the dataset. Then, a retained image, obtained by subtracting the estimated residual image from the input image, is concatenated with the input snowy image and mapped to the desired snow-free ground truth. Moreover, we use a perceptual identical-paired adversarial network based on a relativistic discriminative network to make our training results more robust. Our method achieves better performance than state-of-the-art methods on both synthetic and real-world snowy images.


I. INTRODUCTION
Bad weather influences outdoor vision systems, such as security or surveillance cameras, auto-driving systems, and traffic monitors. Its influence can cause misjudgment by humans or vision analytic applications, such as object detection, segmentation, and action recognition. In general, these applications require sharp and clean images to handle their related tasks and produce correct outputs. However, in a real situation, the acquired images are not always clean due to bad weather interferences such as haze, fog, rain, and snow. Therefore, image enhancement and restoration techniques must be investigated.
Numerous approaches have been proposed to try and remove atmospheric particles from images. Early approaches tried to deal with haze particles, such as [1][2][3][4][5][6], which used prior-based methods. However, those methods did not work well in certain cases due to the inaccurate estimation of a transmission map, haze-relevant prior, or heuristic cues. During the rise of deep learning, many deep learning-based methods were also proposed, including DehazeNet [7], residual learning methods [8], densely connected pyramid dehazing networks [9], and FFA-Net [10]. These approaches came closer to achieving the goal of dehazing. Another essential challenge is rain removal; while haze is uniformly accumulated over an entire image, rain has non-uniform densities, diverse shapes, streaks, and various orientations. Like dehazing, its solutions started with traditional methods. Some tried to deal with a certain type of rainy scene [11] regardless of the shape, size, and density of the rain streaks, which caused issues like under de-raining, over de-raining, and over-smoothing of the artifacts. Deep learning-based methods, such as [12][13][14][15][16][17][18], obtained greater performance than previous traditional methods. Fu et al. [12] and Yang et al. [17] built large synthetic rain datasets consisting of various levels of rain density, shapes, and orientations for training with learning-based methods. Though some of the learning-based methods mentioned above can be considered generic methods for removing atmospheric particles, Liu et al. [19] claimed that it is difficult to adapt those methods to snow removal due to the complications of snow characteristics, including uneven density, diverse particle shapes and sizes, transparency, and irregular trajectory. They introduced a synthetic Snow100K dataset consisting of 100K synthesized snowy images, corresponding snow-free ground truths, snow masks, and an additional 1,329 real-world snowy images.
They produced 5.8K base masks via Photoshop to synthesize the snowy images. With this dataset, they also designed a multistage network, named DesnowNet, for translucent and opaque snow particle removal. For more accurate estimation, they differentiated snow into attributes of translucency and chromatic aberration. Snow has complicated characteristics, and manually synthesizing snow via Photoshop cannot produce all types of snow characteristics. Consequently, DesnowNet obtained effective evaluation results on synthetic test images, but it failed to produce satisfactory snow-free images from real-world snowy images. Li et al. [20] used the same Snow100K dataset to train their model. They proposed composition generative adversarial networks (CGANs) to remove snowflakes of different sizes from a single image. They stated that the mean absolute error (MAE) function generates an image that is close to the ground truth faster, as it deals with only the pixel domain. Though CGAN outperformed DesnowNet on both synthetic and real-world snowy images, it still led to blurry artifacts in the resultant images. Isola et al. [21] claimed that using ℓ1 in their objective loss function improved the training speed, but they also produced blurry reconstructed images.
In this paper, we design a snow removal model that can preserve details and avoid blurry artifacts after image reconstruction, regardless of the diverse snow characteristics (i.e., orientation, shape, and translucency). The proposed model consists of three networks: a residual generator network, a snow-free generator network, and a perceptual discriminator network. Within these networks, we propose a bottleneck module, which we refer to as the residual frequency network (ReFNet). ReFNet consists of a discrete wavelet transform (DWT) [22][23][24] and CNNs. The network has a structure similar to Res2Net [25], but the number of its scales is fixed to four; in the module, each scale represents a feature-frequency component. Based on DWT, there are four frequency components, i.e., one low-level frequency and three high-level frequencies. The conventional discrete wavelet transform [22, 26] removes salt-and-pepper noise effectively. Liu et al. [14] realized the advantages of DWT and proposed a multi-level wavelet CNN for image restoration. Their method not only removes salt-and-pepper noise, but also preserves detailed information in the restored images. Therefore, we adapt DWT into our bottleneck module. Structurally, it is constructed as a pooling layer [24] in ReFNet. It serves our purpose of removing snow particles with various characteristics, such as a salt-like shape, which were disregarded in existing data-driven methods. Furthermore, it can preserve details after reconstruction. ReFNet helps maintain the high-frequency areas or objects of a reconstructed image that are mapped to a ground truth, while the whole network learns whether snowflakes belong to the high-frequency domain, i.e., whether the snowflakes themselves are high-frequency content. We also modify RaLSGAN [27] into perceptual RaLSGAN. We adapt the iPANs discriminator's logic [28] and the perceptual adversarial network [29] to build our adversarial loss functions for perceptual RaLSGAN.
In addition, we add the mixed loss [30] and composition loss to the generative adversarial loss function to improve the robustness of our final results.
Our approach contributes to the realm of image enhancement and restoration as follows:
• We propose a novel network architecture that produces an estimated snow mask and a snow-free image, outperforming state-of-the-art methods on both synthetic and real-world snowy images.
• We also propose ReFNet as our bottleneck module. It is applied in both the residual generator network and the snow-free image generator network. ReFNet enables our networks to remove snow of different characteristics and preserve the delicate parts of images.
• We update the existing relativistic discriminator network into a perceptual relativistic discriminator network to improve the robustness of our training results.

The rest of this paper is organized as follows. In Section II, we describe related works relevant to our approach. We present our proposed method in Section III, including the network architecture and equations. Section IV describes our training environment, parameter settings, and experimental results. We conclude the paper in Section V.

II. RELATED WORKS
Below, we review related snow removal methods, discrete wavelet transforms, and the relativistic discriminator.

A. Snow Removal
Single-image snow removal is a technique used to remove snowflakes or restore a clear and clean image from a snowy image. Pei et al. [31] assumed that snowflakes are located in high-frequency and bright areas. They analyzed the color and frequency features of snowy images, extracted the high-frequency part, and detected snowflakes using an intensity prior. However, using intensity characteristics led to the misidentification of highlighted regions of the background as snowflakes. Zheng et al. [32] used the same assumption. The high-frequency region is decomposed into background edges and snowflakes using a guided filter, with the low-frequency components used as the guide image. This method followed rain characteristics and disregarded the transparency, scale, shape, trajectory, and distribution of snow. To move away from rainfall-driven feature approaches, Liu et al. [19] proposed DesnowNet, based on a multistage network, to remove translucent and opaque snow particles. As mentioned in the introduction, their great contribution to snow removal is the Snow100K dataset, which covers diverse snow characteristics. However, 5.8K snow base masks are insufficient to cover all snow characteristics; therefore, DesnowNet still produces unclean images from real-world snowy images. Li et al. [20] proposed CGAN in order to generate a cleaner image. They trained their network on the same Snow100K dataset and used an up-to-date loss function based on the least squares error. They also used the mean absolute error loss function; the snow particles are well removed, but, as mentioned in [28, 33], even though this loss function can bring the predicted image closer to its ground truth, it leads to a blurry effect.

B. Discrete Wavelet Transform and Wavelet-CNN
The discrete wavelet transform (DWT) [22] can decompose, de-noise, and analyze vibration signals. It decomposes a signal into low-frequency and high-frequency components via low-pass and high-pass filters. DWT is well known for preserving information. Wavelet-SRNet [34] predicts a series of corresponding wavelet coefficients and designs losses in the wavelet domain to capture global topology information and local textural details. Liu et al. [14] proposed a multi-level wavelet CNN (MWCNN) based on the U-Net [35] architecture. DWT is built inside the contracting subnetwork of MWCNN to replace the pooling operation and guarantee that all of an image's information can be preserved. Building on these works, we create a residual frequency module based on the Haar wavelet transform, which is capable of learning high-frequency components and distinguishing between the frequency of the background and that of snow particles. More importantly, the detailed texture of our generated image can be well preserved.

C. Relativistic Discriminator
The relativistic discriminator [27] was proposed to fix and improve the standard discriminator [36, 37], which lacked properties related to prior knowledge, divergence minimization, and the gradient. While the standard GAN has a discriminator D(x), where x is either real or fake data, the relativistic discriminator depends on both real and fake data, i.e., the pair (x_r, x_f); see Equations (13) and (14) in Section III. It has been shown that the relativistic discriminator is generally stable and improves data quality. Several relativistic discriminator variants are proposed in [27], i.e., the relativistic standard GAN (RSGAN), the relativistic average standard GAN (RaSGAN), and the relativistic average least squares GAN (RaLSGAN). Among them, RaLSGAN provides the best results and is more stable than the others. According to iPANs [28], the perceptual adversarial loss produces better results than the vanilla adversarial loss. Thus, in this study, RaLSGAN is extended into perceptual RaLSGAN (pRaLSGAN) with respect to the iPANs discriminator, which is discussed in Section III.D in detail.

III. PROPOSED METHOD
We now present our proposed method for removing snow from a snowy image. We describe our network architecture and explain how DWT and its inverse are applied to input feature maps. We also derive our methodological equations, i.e., the residual ℒ2 loss, composition loss, mix-loss, and iPANs-based pRaLSGAN loss functions.

A. Network Architecture
We show our network architecture in Fig. 2, which consists of three parts, i.e., a residual image generator, a snow-free-image generator, and a discriminator.
The residual image generator is a network used for acquiring a binary snow image. First, an RGB snowy image of size 64 × 64 is fed into the network. The initial layer consists of a convolution, batch normalization (BN), and the LeakyReLU activation function. Then, DWT transforms the output feature map into four frequency feature maps, i.e., the low-level frequency LL and the high-level frequencies HL, LH, and HH. The size of each frequency feature is 64 × 32 × 32. We discuss DWT further in the next section (Section III.B). Each component is sent through five ReFNets (frequency blocks), and then the feature outputs are concatenated. Next, the joint wavelet-transformed features are inverted and input into the final hidden layer, which has convolution, BN, and PReLU components.
The generated residual image is subtracted from the input snowy image to obtain a black-snow image.
Here, l is the layer index, and the symbol ⊚ indicates the convolution operator between the weight W and the input features or image x. The variables σ and δ represent LeakyReLU and PReLU, respectively, where δ is the activation function at the last layer of the residual image generator.
B(·) is the batch normalization, DWT(·) is a DWT pooling layer, and IWT(·) is its inversion. F represents the residual frequency blocks. After transforming through the DWT layer and splitting the input feature map into four different frequency features, i.e., LL, HL, LH, and HH, each of these frequency features is input into a residual frequency block F, expressed as F_LL, F_HL, F_LH, and F_HH. In this generator, we use five frequency modules (ReFNets) for each frequency stream, e.g., F_LL: F_1(… F_5(x_LL) …). The ReFNet module is described in Section III.C in detail. C then represents the concatenation of all the frequency outputs before the inversion of the wavelet transformation. Subtracting a residual snow image r from a snowy image x does not produce the desired snow-free image, but leads to a black-snow image, as shown in Fig. 3. Logically, there is nothing but the default black color behind each falling snowflake of the input image x.
The image obtained from (x − r) is concatenated with the snowy input image x and input into another generative network: the snow-free generator. Concatenating white snow and black snow here allows our network to learn about the synthetic snow in the dataset as well as a type of pepper-like snow, which has been disregarded in existing works. The concatenated image has six channels, and we acquire 64 channels at a size of 64 × 64 in the first layer of the snow-free generative network. Like the residual image generator, we pass the output feature maps through 16 blocks of ReFNets. So, except for the number of ReFNets, we use the same sizes and layers as in the residual image generator. We express the snow-free image generator as shown in Equation (2).

FIGURE 2. Our network architecture. It consists of a residual image generator network, a snow-free image generator network, and a perceptual discriminator network. In the residual generator, low-frequency and high-frequency features are decomposed and fed into ReFNets. The feature maps are concatenated, then reconstructed and fed into the last CNN layer. The snow-free generator is fed by c(x, (x − r)), where c is the concatenation and x is the input snowy image; this is fed through 16 ReFNets. After the last layer of the snow-free generator, y is reconstructed. Lastly, perceptual RaLSGAN is in charge of discriminating between real and fake identical pairs.
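The two-stage wiring described above can be sketched as follows. This is a minimal sketch of the data flow only; the generator internals below are stand-in lambdas, not the paper's networks.

```python
import numpy as np

def desnow_pipeline(x, residual_gen, snowfree_gen):
    """Data flow of the proposed model: estimate the residual (snow)
    image, subtract it to obtain the black-snow image, concatenate both
    views, and map them to a snow-free image. x has shape (3, H, W)."""
    r = residual_gen(x)                           # estimated residual / snow image
    black_snow = x - r                            # snow pixels collapse toward black
    z = np.concatenate([x, black_snow], axis=0)   # 6-channel input to stage two
    return snowfree_gen(z)                        # estimated snow-free image y

# Illustrative stand-ins only: a zero residual and a "generator" that
# returns the first three channels unchanged.
x = np.random.rand(3, 64, 64)
y = desnow_pipeline(x, lambda t: np.zeros_like(t), lambda z: z[:3])
```

With real networks, `residual_gen` and `snowfree_gen` would be the two trained generators of Fig. 2; only the subtraction and 6-channel concatenation follow the text directly.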

B. Discrete Wavelet Transform Layer
We borrow the idea of decomposing an input into four sub-band frequency features, i.e., LL, HL, LH, and HH, from [23]. This is done by using the 2D DWT with four convolutional filters, i.e., the low-pass filter f_LL and the high-pass filters f_HL, f_LH, and f_HH, based on the Haar wavelet transformation technique:

f_LL = [[1, 1], [1, 1]], f_HL = [[-1, 1], [-1, 1]], f_LH = [[-1, -1], [1, 1]], f_HH = [[1, -1], [-1, 1]].

The operation of DWT is defined as

x_k = (f_k ⊛ x) ↓2, for k ∈ {LL, HL, LH, HH},   (3)

where ↓2 represents the standard down-sampling operator with a factor of 2. In our network architecture, we use DWT once in the residual generative network and once in ReFNet. Each DWT is reconstructed by the inverse wavelet transformation (IWT). For example, the DWT in the residual generative network is inverted after getting results from the ReFNet blocks; however, all outputs must be concatenated first. Following the work in [23], IWT exactly recovers x from the four sub-bands, since the Haar filters are mutually orthogonal. We use the same DWT and IWT technique in ReFNet.
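For the Haar case, the decomposition and its exact inverse reduce to stride-2 sums and differences of each 2 × 2 block. The numpy sketch below follows the sub-band naming in the text; note that sign conventions for the HL/LH bands vary across implementations.

```python
import numpy as np

def haar_dwt(x):
    """One-level 2D Haar DWT of an (H, W) array with even H, W.
    Returns the LL, HL, LH, HH sub-bands, each (H/2, W/2)."""
    a = x[0::2, 0::2]  # top-left of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0   # low-pass average
    hl = (-a + b - c + d) / 2.0  # horizontal detail
    lh = (-a - b + c + d) / 2.0  # vertical detail
    hh = (a - b - c + d) / 2.0   # diagonal detail
    return ll, hl, lh, hh

def haar_iwt(ll, hl, lh, hh):
    """Inverse of haar_dwt; reconstructs the input exactly, which is
    why no information is lost across the pooling layer."""
    h, w = ll.shape
    x = np.zeros((2 * h, 2 * w), dtype=ll.dtype)
    x[0::2, 0::2] = (ll - hl - lh + hh) / 2.0
    x[0::2, 1::2] = (ll + hl - lh - hh) / 2.0
    x[1::2, 0::2] = (ll - hl + lh - hh) / 2.0
    x[1::2, 1::2] = (ll + hl + lh + hh) / 2.0
    return x
```

Because the four 2 × 2 Haar filters form an orthonormal basis, the transform also preserves the signal's energy, which is the property the text appeals to for detail preservation.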

C. Residual Frequency Network
We use ReFNet as our bottleneck network. As shown in Fig. 2 and Fig. 4, ReFNet starts with a convolution, BN, and LeakyReLU at the size of 64 × 32 × 32. Similar to the multi-scale feature approach of Res2Net [25], we have four feature scales. However, our features are defined in specific formats using the DWT pooling layer explained in Section III.B, i.e., the low-pass LL and the high-pass HL, LH, and HH. Unlike Res2Net, we do not divide the input and output channels by the scale size in the conv3×3 and BN layers, because the four features are independent after decomposition, and the output is intentionally reconstructed to the same size as the input feature after IWT. We express ReFNet with Equation (6). In Equation (6), the feature is input into the first layer of conv1×1, BN, and LeakyReLU. Then, DWT decomposes the first feature map into LL, HL, LH, and HH. While LL is maintained, HL, LH, and HH are trained with conv3×3, BN, and LeakyReLU, as shown in Equation (6) and Fig. 4. The convolved output of HL is added to LH before LH's convolutional layer, and the convolved output of LH is likewise added to HH. All the feature outputs are concatenated and reconstructed with IWT. Lastly, the inverted output is passed through conv1×1, BN, and LeakyReLU to obtain the final output feature of ReFNet.
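The cross-scale wiring described above can be sketched as follows. This is our reading of Fig. 4 under a Res2Net-style assumption; the conv3×3 + BN + LeakyReLU blocks are identity stand-ins, and only the pass-through of LL, the cross-scale additions, and the concatenation before IWT follow the description.

```python
import numpy as np

def conv_bn_lrelu(x):
    """Identity stand-in for a learned conv3x3 + BN + LeakyReLU block."""
    return x

def refnet_scales(ll, hl, lh, hh):
    """Sketch of the four-scale ReFNet body on pre-decomposed sub-bands,
    each of shape (C, H, W). LL is kept as-is; each high-frequency band
    receives the previous band's processed output before its own block
    (assumed Res2Net-style wiring)."""
    y_hl = conv_bn_lrelu(hl)
    y_lh = conv_bn_lrelu(lh + y_hl)   # cross-scale addition
    y_hh = conv_bn_lrelu(hh + y_lh)   # cross-scale addition
    # In the real module the four outputs are concatenated along the
    # channel axis and then inverted by IWT.
    return np.concatenate([ll, y_hl, y_lh, y_hh], axis=0)
```

With learned blocks in place of the identity stand-ins, the concatenated output would feed IWT so that the block's output matches the input feature size, as stated in the text.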

D. Discriminative Network
There are two major focuses in our discriminator networks: the structure of the networks and the formation of loss functions.
First, we use the iPANs-based discriminator networks proposed by Sung et al. [28], which use duplicate ground truths as paired inputs for real discrimination and a generated image with the ground truth (desired output) for fake discrimination. iPANs adapted a perceptual linear network [39] into its discriminator network and jointly modified PANs (perceptual adversarial networks) [29] to obtain its desired perceptual adversarial loss. PANs calculate the perceptual loss by directly subtracting the layers of one network from the layers of another network using ℓ2. Additionally, Zhang et al. [39] use a perceptual linear network to calculate the perceptual loss linearly. Thus, iPANs replaces the direct feature subtraction (ℓ2) in PANs with a perceptual linear network (see the discriminator network in Fig. 2).
Secondly, although iPANs suggested using a least squares loss function (LSGAN) for its adversarial loss, RaLSGAAN's relativistic formulation [27] is applied in our discriminator instead. This is termed perceptual RaLSGAN (pRaLSGAN). The loss function derivation is explained in Section III.E in detail.

E. Loss Functions
In Equation (1) of the residual image generator, we obtain r, which represents an estimated residual image. Here, the L2 loss function is used:

ℒ_2 = (1/n) Σ_{i=1}^{n} ‖r_i − m_i‖²

where n is the total number of input data, i iterates over 1, 2, …, n, and m_i is the snow mask ground truth from the dataset.
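As a concrete instance, the batch L2 residual loss can be computed as below. This is a sketch; we normalize by the batch size n only, which is one common convention.

```python
import numpy as np

def residual_l2_loss(r_pred, m_gt):
    """Sum of squared errors between estimated residual images r_pred
    and snow-mask ground truths m_gt, averaged over the batch dimension.
    Both arrays have shape (n, C, H, W)."""
    n = r_pred.shape[0]
    return float(np.sum((r_pred - m_gt) ** 2) / n)
```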
In Equation (2), we obtain y as our desired snow-free image. To refine our result, we use multiple loss functions: mix-loss, composition loss, and perceptual adversarial loss.
Mix-loss was originally proposed by Zhao et al. [30]. This loss function is a combination of the multi-scale structural similarity (MS-SSIM) and ℒ1 loss functions:

ℒ_mix = α · ℒ_ms-ssim + (1 − α) · ℒ_1

where α = 0.84 is empirically suggested by the authors of [30].
MS-SSIM is assigned as the loss function ℒ_ms-ssim = 1 − MS-SSIM(y, g). Three similarity functions, i.e., the luminance, contrast, and structure components, are combined at different scales. The computation is done in a sliding G × G Gaussian-weighted window; we use G = 11, i.e., an 11 × 11 window. The luminance component is calculated only at the highest scale M. For convenience, we set the exponents α = β = γ = 1. Refer to [40] for more details about multi-scale structural similarity.
The last member of the ℒ_mix function is ℒ_1, which is expressed in Equation (11) below:

ℒ_1 = (1/N) Σ_p |y(p) − g(p)|   (11)

where p indexes the N pixels, y is the estimated snow-free image, and g is its ground truth.
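The mix-loss can be sketched as below. For brevity, this sketch uses a simplified single-scale SSIM with global statistics in place of the multi-scale, 11 × 11 Gaussian-windowed version of [40]; only the α-weighted combination follows the text exactly.

```python
import numpy as np

def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified single-scale SSIM over whole-image statistics.
    The paper uses the windowed, multi-scale version of [40]."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def mix_loss(y_est, g, alpha=0.84):
    """Mix-loss of [30]: alpha * (1 - SSIM) + (1 - alpha) * L1."""
    l1 = np.abs(y_est - g).mean()
    return alpha * (1.0 - ssim_global(y_est, g)) + (1.0 - alpha) * l1
```

The SSIM term drives structural agreement while the ℒ1 term keeps pixel values close; identical images yield a loss of zero.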
The composition loss function using ℒ 1 is part of our total loss function. This is to guarantee that the estimated snow-free image does not encounter any information loss issues caused by artifacts of the estimated residual image at the pixel level.
Furthermore, we use a discriminative network where we can find adversarial loss functions. First, we must find generative and discriminative losses depending on our network architecture.
We use the relativistic average LSGAN (RaLSGAN) loss function based on [27], which was originally written as:

ℒ_D = E_{x_r}[(C(x_r) − E_{x_f}[C(x_f)] − 1)²] + E_{x_f}[(C(x_f) − E_{x_r}[C(x_r)] + 1)²]   (13)

ℒ_G = E_{x_f}[(C(x_f) − E_{x_r}[C(x_r)] − 1)²] + E_{x_r}[(C(x_r) − E_{x_f}[C(x_f)] + 1)²]   (14)

Equations (13) and (14) are the RaLSGAN loss functions, where C(•) is the non-transformed discriminator output, and x_f and x_r are the estimated snow-free image and the snow-free ground truth, respectively.
Our network architecture uses an identical-paired discriminator, which has two network streams concatenated at the last layers. We thus obtain non-transformed discriminator outputs for the real identical pair (g, g) and the fake pair (y, g), and Equations (13) and (14) can be rewritten with these paired outputs. Our goal for the discriminator is to create a perceptual adversarial loss; however, we first need to find the perceptual loss. According to our discriminative network and [28], the discriminator has two network streams A and B with features h_A, h_B ∈ ℝ^(C×H×W). We calculate the feature distance by extracting feature stacks from L layers and unit-normalizing them along the channel dimension. We scale the activations channel-wise (⊗) by a vector weight w ∈ ℝ^C, and then the ℓ2 distance is computed.
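The channel-normalized, weighted ℓ2 feature distance just described can be sketched as follows. This is a sketch of one layer's contribution; the spatial averaging and the ε guard against division by zero are our own choices.

```python
import numpy as np

def perceptual_distance(feat_a, feat_b, w, eps=1e-10):
    """Distance between two feature stacks of shape (C, H, W):
    unit-normalize along the channel axis, scale channel-wise by the
    learned weight vector w (shape (C,)), then take the squared L2
    distance averaged over spatial positions."""
    def unit_norm(f):
        return f / (np.sqrt((f ** 2).sum(axis=0, keepdims=True)) + eps)
    diff = w[:, None, None] * (unit_norm(feat_a) - unit_norm(feat_b))
    return float((diff ** 2).sum(axis=0).mean())
```

In the full discriminator, such distances from several layers would be summed, with w learned per channel.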
Then, given a positive margin m, we can define the perceptual loss for the discriminator as

ℒ_percep,D = [m − ℒ_percep]+

where [•]+ = max(0, •). For more information related to this equation, refer to [29]. From Equations (17) to (20), we obtain the pRaLSGAN loss functions, where λ1, λ2, λ3, and λ4 are hyper-parameters balancing the loss functions. Finally, our total objective loss function for snow-free image estimation combines the residual ℒ2 loss, mix-loss, composition loss, and the pRaLSGAN generator loss, where ℒ_D is adversarial against ℒ_G during training.
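Equations (13) and (14) reduce to a few lines when applied to arrays of critic scores. In this sketch, `c_real` and `c_fake` stand for pre-computed non-transformed discriminator outputs C(x_r) and C(x_f); in our model these would come from the identical-paired streams.

```python
import numpy as np

def ralsgan_losses(c_real, c_fake):
    """RaLSGAN losses from non-transformed critic outputs, given as 1-D
    score arrays, following [27]: each side is pushed to be more (or
    less) realistic than the *average* score of the opposite side."""
    mr, mf = c_real.mean(), c_fake.mean()
    l_d = ((c_real - mf - 1) ** 2).mean() + ((c_fake - mr + 1) ** 2).mean()
    l_g = ((c_fake - mr - 1) ** 2).mean() + ((c_real - mf + 1) ** 2).mean()
    return float(l_d), float(l_g)
```

Note the symmetry: the generator loss also depends on real scores, which is what distinguishes the relativistic average formulation from plain LSGAN.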

IV. EXPERIMENTS

A. Environment
Our network is trained on a Titan X (Pascal) GPU; we implemented our method using PyTorch 1.2.0 and CUDA 10.0 in a Python 3.6 environment. We resize the training images to 256 × 256 and train on randomly cropped 64 × 64 patches. We set the batch size to 32 and the learning rate to 2 × 10⁻⁴. We use the Adam optimizer with β1 = 0.5 and β2 = 0.9. The learning rate starts to decay after 50 epochs, and we train the network for 200 epochs.
We use the Snow100K dataset, the same as the one used in [19]. The dataset consists of 100K synthesized snowy images with corresponding snow-free images and snow masks, as well as 1,329 realistic snowy images. The synthetic data is divided into small, medium, and large snow subsets. Table I shows more detail about the number of training and test images in the dataset.

B. Experimental Results
Fig. 5 exhibits the visual results of DesnowNet and CGAN, along with our results in the right-most column. Visually, in the top row, we can see that DesnowNet removed distant lights, mistaking them for snowflakes. On the other hand, CGAN deformed the lights and made them look blurry, while our method maintains the lights' information and is free of the blurry effect. For the second image, in the middle row, DesnowNet could not remove snow well in the region of the red box at the bottom, and it produced blurry numbers and letters in the region of the red box at the top. CGAN, even though it removes more snow than DesnowNet, still yields blurry artifacts. The image produced by DesnowNet in the bottom row contains many flakes of snow that do not look clean. CGAN's result looks cleaner, but the image is blurred. As shown in the figure, our method 1) produces a cleaner (snow-free) image, 2) shows no blur effect, and 3) avoids information loss. We adopt two commonly used metrics, i.e., PSNR and SSIM, to evaluate our snow removal results quantitatively. Using the Snow100K test dataset, we evaluate the difference between the estimated snow-free image and the ground truth. We also evaluate the difference between the generated snow mask and the ground-truth mask.
TABLE II displays the comparison of our approach with existing state-of-the-art methods on the synthetic test dataset. It shows the quantitative comparison results under three test subsets, i.e., small, medium, and large falling snow conditions. We also add an overall column to display the average results over the entire test dataset. DeepLab [41], JORDER [17], DesnowNet [19], and CGAN (composition GAN) [20] are included in the comparison. There is a big performance gap between DesnowNet and both DeepLab and JORDER. CGAN outperforms DesnowNet in terms of the overall results, as shown in TABLE II, but their results under each subset are competitive; these two approaches have distinct pros and cons. For instance, DesnowNet reconstructed an output with fewer artifacts than CGAN, but its output remains snowy. Conversely, CGAN removes snow better than DesnowNet, but it produces blurry artifacts. Our approach solves both of these existing problems. As a result, we obtain higher performance with fewer artifacts compared to DesnowNet and CGAN.
Furthermore, we also compare our estimated snow mask in TABLE III with the existing works. CGAN is excluded from this comparison due to the lack of resources and source code provided by the authors. The result of our residual generative network yields the greatest SSIM value (0.646) but has a lower PSNR than DesnowNet (20.536 vs. 22.005).
We show our sample results for the estimated snow-free image and estimated snow mask image from the real-world snowy image input in Fig. 1; our method can handle large and heavy snow, as well as small and medium particle snow.

C. Ablation Study
In this section, we investigate the effect of using different modules in our network architecture. The proposed ReFNet module is motivated by the Res2Net [25] module, as mentioned in Section III.C. Therefore, we train our network architecture with Res2Net and compare its results with the proposed ReFNet architecture. To be fair, we set the scale dimension of Res2Net to s = 4, as we also have four frequency components after DWT decomposition. Furthermore, the first and last layers are kept the same for both modules. The network is trained for 200 epochs. Fig. 6 depicts the visual results of the two modules. At first glance, Res2Net seems to produce clearer results than ReFNet, but if we look closer, we can see that there is remaining snow and noise, as well as color changes, in the estimated images. On the other hand, ReFNet maintains the image's color and removes snow and noise better than Res2Net. Fig. 7 and Fig. 8 show the comparison results for the proposed method trained with the Res2Net and ReFNet modules using the SSIM and PSNR metrics. Based on the SSIM metric in Fig. 7, ReFNet shows superior performance compared to Res2Net in almost every epoch. However, the results are very competitive under the PSNR metric, as shown in Fig. 8.
According to the comparison results above, our method led to greater performance than the latest state-of-the-art methods on both synthetic and real-world snowy images.

V. CONCLUSION
In this paper, we proposed a novel network architecture with two generators, a residual generator and a snow-free generator, and an iPANs-based relativistic discriminator. Both generators use the same bottleneck module: the proposed ReFNet. Perceptual RaLSGAN improves the robustness and balance of our training.
Our results suggest that our network is superior to current data-driven, state-of-the-art methods. While existing works depend strongly on the accuracy of the dataset, the proposed module uses a DWT layer embedded with CNNs, which allows our method to remove unknown types of snow that are not found in the synthetic dataset. Moreover, DWT and IWT also serve as down- and up-sampling layers, preventing outputs from losing information during down-sampling and up-sampling. Finally, the proposed pRaLSGAN also helps avoid blurry artifacts.

ACKNOWLEDGMENT
The authors would like to acknowledge the financial support received from the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (GR 2019R1D1A3A03103736).