Practical Blind Image Denoising via Swin-Conv-UNet and Data Synthesis

While recent years have witnessed a dramatic upsurge of exploiting deep neural networks toward solving image denoising, existing methods mostly rely on simple noise assumptions, such as additive white Gaussian noise (AWGN), JPEG compression noise and camera sensor noise, and a general-purpose blind denoising method for real images remains unsolved. In this paper, we attempt to solve this problem from the perspectives of network architecture design and training data synthesis. Specifically, for the network architecture design, we propose a swin-conv block to incorporate the local modeling ability of the residual convolutional layer and the non-local modeling ability of the swin transformer block, and then plug it as the main building block into the widely-used image-to-image translation UNet architecture. For the training data synthesis, we design a practical noise degradation model which takes into consideration different kinds of noise (including Gaussian, Poisson, speckle, JPEG compression, and processed camera sensor noises) and resizing, and also involves a random shuffle strategy and a double degradation strategy. Extensive experiments on AWGN removal and real image denoising demonstrate that the new network architecture design achieves state-of-the-art performance and the new degradation model can help to significantly improve the practicability. We believe our work can provide useful insights into current denoising research. The source code is available at https://github.com/cszn/SCUNet.


Introduction
Image denoising, which is the process of recovering a latent clean image x from its noisy observation y, is perhaps the most fundamental image restoration problem. The reason is at least three-fold. First, it can help to evaluate the effectiveness of different image priors and optimization algorithms [8]. Second, it can be plugged into variable splitting algorithms (e.g., half-quadratic splitting [1] and alternating direction method of multipliers [3]) to solve other problems (e.g., deblurring and super-resolution) [23]. Third, it could be the very first step for other vision tasks.
The degradation model of image denoising can be mathematically formulated by

y = x + n,    (1)

where n is the noise to be removed. Recently, deep neural networks have become the mainstream method for image denoising. To improve deep image denoising performance, researchers mainly focus on two research directions. The first one is to improve the performance under the assumption that n is additive white Gaussian noise (AWGN). The second one largely focuses on training data or noise modeling. Both directions can contribute to the ultimate goal of improving the practicability for real images. The common assumptions of n are AWGN, JPEG compression noise, Poisson noise, and camera sensor noise, among which AWGN is the most widely-used one due to its mathematical convenience. However, it is known that a deep image denoising model trained on AWGN performs poorly on most real images due to noise assumption mismatch [4,42]. Nevertheless, AWGN removal remains a fair testbed for the effectiveness of different network architecture designs. In recent years, various network architecture designs have been proposed. Some representative ones are DnCNN [60], N3Net [43], NLRN [32], DRUNet [58], and SwinIR [30]. Indeed, network architecture designs can help to capture image priors for better denoising performance. For example, N3Net [43] and NLRN [32] are specifically designed to capture the non-local image prior. Although the PSNR performance on benchmark datasets has been largely improved, e.g., SwinIR [30] outperforms DnCNN [60] by an average PSNR of 0.57dB on the Set12 dataset for noise level 25, it is still interesting to raise the first question of whether the PSNR performance can be further improved by advanced network architecture design.
In order to facilitate the practicability of deep denoising models, a flurry of work has been devoted to noise modeling. The motivation behind this is to make the noise assumption consistent with the degradation of real images. Plotz and Roth [42] establish a realistic Darmstadt Noise Dataset (DND) with consumer cameras, which is composed of different pairs of real noisy and almost noiseless reference images in the RAW and sRGB domains. They further show that the model retrained with accurate degradation can significantly outperform that with AWGN on the sRGB DND dataset [43]. By leveraging the physics of digital sensors and the steps of an imaging pipeline, Brooks et al. [4] design a camera sensor noise synthesis method and provide an effective deep raw image denoising model. Although the above attempts have emphasized the importance of degradation models, they mainly focus on camera-sensor-induced noise removal. Yet, little work has been done on training a deep model for general-purpose blind image denoising. It is interesting to raise the second question of how to improve the training data for blind denoising.
We attempt to answer the above two questions with a novel network architecture design and novel training data synthesis. For the network architecture design, motivated by the facts that 1) different methods for image denoising have complementary image prior modeling abilities which can be combined to boost performance [6], and 2) DRUNet [58] and SwinIR [30] exploit very different network architecture designs while achieving very promising denoising performance, we propose a swin-conv block to combine the local modeling ability of the residual convolutional layer [19] and the non-local modeling ability of the swin transformer block [33], and then plug it as the main building block into the UNet architecture. In order to test its effectiveness, we evaluate its PSNR performance on benchmark datasets for AWGN removal. Since real image noise could be introduced by other types of noise, such as JPEG compression noise and processed camera sensor noise, and be further affected by resizing, it is too complex to model with a parametric probability distribution. To resolve this problem, we propose a random shuffle of different kinds of noise (including Gaussian, Poisson, speckle, JPEG compression, and processed camera sensor noises) and resizing operations (including the commonly used bilinear and bicubic interpolations) to make a rough approximation of real image noise.
Our contributions are listed as follows: 1) We propose a novel denoising network by plugging novel swin-conv blocks into multiscale UNet to boost the local and non-local modeling ability.
2) We propose a hand-designed noise synthesis model, which can be used to train a general-purpose blind image denoising model.
3) Our blind denoising model trained with the proposed noise synthesis model can significantly improve the practicability for real images.

4) Our work provides a strong baseline for both synthetic Gaussian denoising and practical blind image denoising. The codes will be released upon acceptance.

Deep Blind Image Denoising
Compared to non-blind image denoising, where the noise type and noise level are assumed to be known, blind denoising tackles the case where the noise level of a certain noise type is unknown or even the noise type itself is unknown. During the past few years, several attempts have been made to solve the problem of deep blind denoising. Zhang et al. [60] demonstrate that a single deep model can handle Gaussian denoising with various noise levels and can even handle JPEG compression with different quality factors and single image super-resolution with different scale factors. Chen et al. [10] propose to adopt generative adversarial networks (GANs) to generate noise from clean images and then construct paired training data for subsequent training. Guo et al. [17] propose a convolutional blind denoising network (CBDNet) with a noise estimation subnetwork and propose to train the model with a realistic noise model and real-world noisy-clean image pairs. Krull et al. [25] propose a blind-spot network which can be trained without noisy image pairs or clean target images. Yue et al. [56] propose a variational inference method for blind image denoising which incorporates both noise estimation and image denoising into a unique Bayesian framework. While achieving promising results, the above methods are mainly evaluated on synthetic Gaussian noise or processed camera sensor noise such as the DND dataset [42]. Since real noise is far more complex, the above methods cannot be readily applied in real applications. It is still unclear how to establish more practical noisy/clean image pairs for training a deep blind model.

Deep Architecture for Non-Local Prior
State-of-the-art model-based image denoising methods mostly exploit the non-local self-similarity (NLSS) prior, which refers to the fact that a local patch often has many non-local similar patches across the image [5]. Some representative methods include BM3D [12], LSSC [36] and WNNM [15]. Inspired by the effectiveness of the NLSS prior, some deep learning methods attempt to explicitly model the correlation among non-local patches via the network structure. Sun and Tappen [49] propose a gradient-based discriminative non-local range Markov Random Field (MRF) method to exploit the advantages of BM3D and non-local means. Inspired by non-local variational methods, Lefkimmiatis [27] designs an unrolled network that can perform non-local processing for better denoising performance. However, the above methods adopt non-differentiable KNN matching in fixed feature spaces. To resolve this, Plotz and Roth [43] propose a fully end-to-end trainable neural nearest neighbor block to leverage the principle of non-local self-similarity. Liu et al. [32] propose a non-local recurrent network (NLRN) to incorporate non-local operations into a recurrent neural network. Chen et al. [9] propose the image processing transformer (IPT) to exploit the non-local modeling of the transformer. However, IPT works on a fixed image patch size and tends to produce border artifacts. Liang et al. [30] address this issue by adopting the swin transformer as the main building block. It has been shown that transformer-based methods perform particularly well on images with repetitive structures, which verifies the effectiveness of the transformer's non-local modeling ability.

Method
Since we focus on learning a deep blind model with paired training data, it is necessary to revisit the Maximum A Posteriori (MAP) inference for a better understanding. Generally, for the problem of blind image denoising, the estimated clean image x̂ can be obtained by solving the following MAP problem with a certain optimization algorithm:

x̂ = arg min_x D(x, y) + λP(x),    (2)

where D(x, y) is the data fidelity term, P(x) is the prior term, and λ is the trade-off parameter. So far, one can see that the key to solving blind denoising lies in modeling the degradation process of the noisy image as well as designing the image prior of the clean image.
By treating the deep model as a compact unrolled inference of Eq. (2), deep blind denoising generally aims to solve the following bi-level optimization problem [11,48]:

min_W Σ_i L(x̂_i, x_i)   s.t.   x̂_i = arg min_x D(x, y_i) + λP(x),    (3)

where W denotes the network parameters to be learned. As discussed below, the modeling ability of a deep model depends on the network architecture, the model size, and the training data; while the latter two factors are relatively easy to improve, how to improve the network architecture requires further study.
From the above discussions and analyses, we can conclude that the network architecture and the training data are two important factors for improving the performance of a deep blind denoising model. In the following, we will separately detail our attempts to improve these two factors.

Swin-Conv-UNet
Fig. 1 illustrates the network architecture of our proposed Swin-Conv-UNet (SCUNet). The main idea of SCUNet is to integrate the complementary network architecture designs of DRUNet and SwinIR. To be specific, SCUNet plugs novel swin-conv (SC) blocks into a UNet [46] backbone. Following DRUNet [58], the UNet backbone of SCUNet has four scales, each of which has a residual connection between 2×2 strided convolution (SConv) based downscaling and 2×2 transposed convolution (TConv) based upscaling. The number of channels from the first scale to the fourth scale is 64, 128, 256 and 512, respectively. The main difference between SCUNet and DRUNet is that SCUNet adopts four SC blocks rather than four residual convolution blocks in each scale of the downscaling and upscaling.
As shown in the dashed line of Fig. 1, an SC block fuses a swin transformer (SwinT) block [33] and a residual convolutional (RConv) block [19,31] via two 1×1 convolutions, split and concatenation operations, and a residual connection. To be specific, an input feature tensor X is first passed through a 1×1 convolution. Subsequently, it is split evenly into two feature map groups X1 and X2. We formulate such a process as

{X1, X2} = Split(Conv1×1(X)).

Then, X1 and X2 are separately fed into a SwinT block and an RConv block, giving rise to

Y1 = SwinT(X1),  Y2 = RConv(X2).

Finally, Y1 and Y2 are concatenated as the input of a 1×1 convolution which has a residual connection with the input X. As such, the final output Z of the SC block is given by

Z = Conv1×1(Concat(Y1, Y2)) + X.

It is worth pointing out that our proposed SCUNet enjoys several merits due to some novel module designs. First, the SC block fuses the local modeling ability of the RConv block and the non-local modeling ability of the SwinT block. Second, the local and non-local modeling ability of SCUNet is further enhanced via the multiscale UNet. Third, the 1×1 convolutions can effectively and efficiently facilitate information fusion between the SwinT block and the RConv block. Fourth, the split and concatenation operations act like a group convolution with two groups, reducing the computational complexity and the number of parameters. We note that SCUNet essentially functions as a hybrid convolutional neural network (CNN)-Transformer network, and there also exist several other works that integrate CNNs and Transformers for effective network architecture design. It is also worth pointing out the differences between our proposed SCUNet and two recent works, Uformer [54] and Swin-Unet [7]. First, the motivation is different. Our SCUNet is motivated by the fact that the state-of-the-art denoising methods DRUNet [58] and SwinIR [30] exploit very different network architecture designs, and thus SCUNet aims to integrate the complementary network architecture designs of DRUNet and SwinIR. By contrast, Uformer and Swin-Unet aim to
combine transformer variants with UNet. Second, the main building blocks are different. Our SCUNet adopts a novel swin-conv block which integrates the local modeling ability of the residual convolutional layer [19] and the non-local modeling ability of the swin transformer block [33] via 1×1 convolutions, split and concatenation operations. By contrast, Uformer adopts a new transformer block by combining depth-wise convolution layers [29], while Swin-Unet utilizes the swin transformer block as its main building block.
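The split, branch, concat and residual dataflow of the SC block can be sketched as follows. This is a minimal NumPy sketch, not the released implementation: the SwinT and RConv branches are stubbed out as identity functions, and the 1×1 convolutions are realized as per-pixel channel mixing.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w):
    # 1x1 convolution as per-pixel channel mixing: x is (C_in, H, W), w is (C_out, C_in)
    return np.einsum('oc,chw->ohw', w, x)

def swin_t(x):   # placeholder for the swin transformer block (identity here)
    return x

def rconv(x):    # placeholder for the residual convolutional block (identity here)
    return x

def sc_block(x, w_in, w_out):
    c = x.shape[0]
    h = conv1x1(x, w_in)               # first 1x1 convolution
    x1, x2 = h[: c // 2], h[c // 2:]   # even channel split into two groups
    y1, y2 = swin_t(x1), rconv(x2)     # parallel SwinT / RConv branches
    y = np.concatenate([y1, y2], 0)    # concatenation
    return conv1x1(y, w_out) + x       # second 1x1 convolution + residual connection

C, H, W = 4, 8, 8
x = rng.standard_normal((C, H, W))
w_in, w_out = np.eye(C), np.eye(C)     # identity weights for the sanity check
out = sc_block(x, w_in, w_out)
```

With identity weights and identity branches, the block reduces to out = x + x, which makes the residual structure easy to verify.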

Training Data Synthesis
Instead of establishing a large variety of real noisy/clean image pairs, which is laborious and challenging, we attempt to synthesize realistic noisy/clean image pairs. The main idea is to add different kinds of noise and also include resizing, as well as to incorporate a double degradation strategy and a random shuffle strategy, which we detail next.
Gaussian noise. Additive white Gaussian noise (AWGN) is the most widely-used assumption for denoising. While it can model the read noise of an image sensor well, it usually does not match real noise and would deteriorate the practicability of trained deep denoising models. Nevertheless, it has been shown that a deep denoising model (e.g., FFDNet [61]) trained with AWGN can remove non-Gaussian noise by setting a larger Gaussian noise level, at the cost of over-smoothing textures and edges. Instead of using the simplified AWGN, we adopt the 3D generalized zero-mean Gaussian noise model [40] with a 3×3 covariance matrix to model the noise correlation between the R, G, and B channels. One underlying reason is that the color image demosaicing step in the camera ISP pipeline can correlate the noise across channels. Depending on the cross-channel dependencies, such a generalized Gaussian model has two extreme cases: the widely-used additive white Gaussian color noise and grayscale Gaussian noise. We uniformly sample the noise levels from {2/255, 3/255, ..., 50/255}. Since in-camera denoising algorithms generally remove the color noise for better perceptual quality, grayscale Gaussian noise is a good choice to model the remaining noise. For this reason, we sample the two extreme cases and the general case with probabilities 0.4, 0.4 and 0.2, respectively.
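For illustration, cross-channel correlated Gaussian noise can be sampled via a Cholesky factor of a 3×3 covariance matrix. This is a hedged sketch; the covariance construction below is a hypothetical example rather than the exact sampling scheme of our pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, sigma = 64, 64, 25 / 255.0

# Hypothetical 3x3 cross-channel covariance: a random PSD matrix rescaled so that
# the average per-channel variance equals sigma^2
A = rng.standard_normal((3, 3))
cov = (A @ A.T) * (3 * sigma**2 / np.trace(A @ A.T))
L = np.linalg.cholesky(cov)

white = rng.standard_normal((H, W, 3))
color_noise = white @ L.T                 # noise correlated across R, G, B channels

# The two extreme cases: white color noise and grayscale noise (same in all channels)
white_color_noise = sigma * rng.standard_normal((H, W, 3))
gray_noise = sigma * rng.standard_normal((H, W, 1)) * np.ones((1, 1, 3))
```

The grayscale case corresponds to a rank-one covariance with identical noise in each channel, matching the in-camera-denoised scenario described above.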
Poisson noise. Poisson noise generally refers to the photon shot noise which originates from the discrete nature of electric charge. It occurs severely in low-light conditions, such as night-time photography, medical imaging, optical microscopy imaging and astronomy imaging [18]. Different from Gaussian noise, which is signal-independent, Poisson noise is signal-dependent. Traditional model-based methods mostly apply a variance-stabilizing transformation (VST) to transfer the noise into an approximately signal-independent one, and then tackle the problem with Gaussian denoising methods. However, such methods need to know the noise type beforehand, which is generally impossible for real images. Hence, removing the Poisson noise directly via the deep model would be a good choice. To sample different noise levels for an image, we first multiply the clean image by 10^α, where α is uniformly chosen from [2, 4], and then divide back by 10^α after adding the signal-dependent Poisson noise. Our Poisson noise can be mathematically modeled as

y = Poisson(x · 10^α) / 10^α.

Following the Gaussian noise, we also consider grayscale Poisson noise by converting the clean color image into a grayscale image. After that, we add the same grayscale noise to each channel of the given image.
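The scaling trick above can be sketched in a few lines of NumPy (illustrative only; the grayscale variant follows the description above):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((32, 32, 3))             # clean image in [0, 1]
alpha = rng.uniform(2.0, 4.0)           # sampled noise-level exponent

scale = 10.0 ** alpha
y = rng.poisson(x * scale) / scale      # y = Poisson(x * 10^alpha) / 10^alpha

# Grayscale variant: generate noise from the gray image, add it to every channel
g = x.mean(axis=-1, keepdims=True)
gray_noise = rng.poisson(g * scale) / scale - g
y_gray = x + gray_noise
```

Smaller α gives fewer effective photon counts and hence stronger, more visible noise.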
Speckle noise. Speckle noise is a multiplicative noise which usually appears in coherent imaging systems, such as synthetic aperture radar (SAR) imaging and medical ultrasonic imaging [44,51]. It can be modeled as the multiplication between the latent clean image and Gaussian noise. We thus simply modify the above Gaussian noise synthesis strategy by multiplying the generated Gaussian noise with the clean image.
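A minimal sketch of the speckle synthesis, assuming the same kind of Gaussian noise generator as above:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((32, 32, 3))                  # clean image in [0, 1]
sigma = 25 / 255.0
n = sigma * rng.standard_normal(x.shape)     # Gaussian noise
y = x + x * n                                # multiplicative speckle: y = x * (1 + n)
```

In contrast to AWGN, the noise magnitude here scales with the local image intensity, so bright regions are noisier than dark ones.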
JPEG compression noise. Image compression helps to reduce the storage and bandwidth requirements for digital images. Among various image compression standards and formats, JPEG has been the most widely used since it is simple and allows for fast encoding and decoding. However, it reduces image quality by introducing 8×8 blocking artifacts that become more severe as the compression degree increases. Such a trade-off is controlled by the quality factor, which ranges from 0 to 100. Due to its pervasiveness on the Internet and social media, we add this kind of noise by uniformly sampling the quality factor from [20, 95].
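The JPEG degradation can be reproduced with any JPEG codec; below is a sketch using Pillow (assuming Pillow is available; not necessarily the codec used in our implementation):

```python
import io

import numpy as np
from PIL import Image

rng = np.random.default_rng(0)
x = (rng.random((64, 64, 3)) * 255).astype(np.uint8)   # clean image
q = int(rng.integers(20, 96))                          # quality factor in [20, 95]

buf = io.BytesIO()
Image.fromarray(x).save(buf, format="JPEG", quality=q)  # lossy compression
buf.seek(0)
y = np.asarray(Image.open(buf))                         # decoded image with blocking artifacts
```

Lower quality factors yield coarser quantization of the 8×8 DCT blocks and hence stronger blocking artifacts.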
Processed camera sensor noise. The noise in the output RGB image of modern digital cameras is mainly caused by passing the read and shot noise in raw sensor data through an image signal processing (ISP) pipeline. Hence, the distribution of the processed camera sensor noise varies with the read and shot noise model and the ISP model. Inspired by [4], we synthesize this kind of noise by generating a raw image from a clean image via a reverse ISP pipeline, adding read and shot noise to the raw image, and then processing the noisy raw image via the forward ISP pipeline. For the read and shot noise model, we exactly borrow the one proposed in [4]. For the ISP model, we adopt the one proposed in [59], which consists of demosaicing, exposure compensation, white balance, camera-to-XYZ (D50) color space conversion, XYZ (D50)-to-linear RGB color space conversion, tone mapping and gamma correction. It is still worth pointing out the following details about the adopted ISP model. First, the order of gamma correction and tone mapping, as well as the tone mapping curves, are different from those in [4]. Our ISP adopts gamma correction as the final step, whereas [4] uses tone mapping as the final step. While it is known that the tone mapping curves of different cameras usually differ, [4] uses a fixed tone curve. By contrast, our ISP selects the best-fitted tone curve from [14] for each camera based on the error between the reconstructed output and the camera's ground-truth RGB output. Second, since the forward-reverse tone mapping is not exactly invertible, it may cause a color shift with respect to the original image; we resolve this by also applying the reverse-forward tone mapping to the clean image.
Resizing. Image resizing is one of the basic digital image editing tools. It can be used to fit an image into a certain space on a screen or to downscale an image to reduce its storage size. While resizing does not introduce noise to clean images, it does affect the noise distribution of noisy images. For example, upscaling makes AWGN spatially correlated, while downscaling makes processed camera sensor noise less signal-dependent. To model such resizing-induced noise, we randomly apply the widely-used bilinear and bicubic resizing operations, with the scaling factor uniformly chosen from [0.5, 2]. It is especially noteworthy that we apply the same resizing to both the noisy image and its clean counterpart, since resizing changes the spatial resolution of the latent clean image underlying the noisy image. Hence, it is essentially different from the super-resolution degradation proposed in [53,59].
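A sketch of the resizing step, applying the same randomly chosen scale factor and interpolation to both the noisy image and its clean counterpart (Pillow is used here purely for illustration):

```python
import random

import numpy as np
from PIL import Image

random.seed(0)
rng = np.random.default_rng(0)
noisy = (rng.random((64, 64, 3)) * 255).astype(np.uint8)
clean = (rng.random((64, 64, 3)) * 255).astype(np.uint8)

s = random.uniform(0.5, 2.0)                           # scale factor from [0.5, 2]
interp = random.choice([Image.BILINEAR, Image.BICUBIC])
size = (max(1, round(64 * s)), max(1, round(64 * s)))

# The SAME resizing is applied to the noisy image and its clean counterpart,
# so both stay at the same spatial resolution
noisy_r = np.asarray(Image.fromarray(noisy).resize(size, interp))
clean_r = np.asarray(Image.fromarray(clean).resize(size, interp))
```

Keeping the pair at matching resolutions is what distinguishes this step from a super-resolution degradation, where only the low-quality side would be downscaled.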
In practice, real images might be resized or JPEG compressed several times [22], and JPEG compression might be performed before or after resizing. Inspired by this, our final degradation sequence employs a double degradation strategy and a random shuffle strategy. By doing so, the degradation space is expected to be largely expanded, which can facilitate the generalization ability of the trained deep blind model. Specifically, we perform the above noise additions and resizing twice. We add Gaussian noise and JPEG compression noise with probability 1; for the resizing and the other noise additions, we set the probability to 0.5. Before applying the degradation sequence to a clean image, we first perform a random shuffle on the degradations.
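The double degradation and random shuffle strategies can be sketched as follows. The operation names and probabilities mirror the description above, but the ops themselves are hypothetical placeholders rather than our actual degradation functions.

```python
import random

# Degradation names and per-application probabilities (placeholder labels);
# the whole set is instantiated twice, implementing the double degradation strategy
BASE = [("gaussian", 1.0), ("jpeg", 1.0),
        ("poisson", 0.5), ("speckle", 0.5), ("camera_sensor", 0.5), ("resize", 0.5)]

def sample_degradation_sequence(seed=None):
    r = random.Random(seed)
    # keep each op with its probability; Gaussian and JPEG (p = 1) are always kept
    ops = [name for name, p in BASE * 2 if r.random() < p]
    r.shuffle(ops)  # random shuffle strategy: the order changes per training sample
    return ops

seq = sample_degradation_sequence(seed=0)
```

Each clean image would then be passed through the sampled sequence in order to produce its noisy counterpart.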

Discussion
4.1. Our denoising data synthesis pipeline vs. super-resolution data synthesis pipeline [53,59]

Our training data synthesis pipeline differs from the ones proposed in [53,59] in at least three main aspects. First, the applications are different. Our pipeline is used for deep blind image denoising, whereas the ones proposed in [53,59] are designed for deep blind super-resolution. Second, our pipeline also performs the resizing on the high-quality image to produce the corresponding clean image of the noisy images, whereas the degradation models in [53,59] do not perform such a procedure. The reason is that denoising, unlike super-resolution, does not necessitate removing image blur or enlarging the resolution. Third, our pipeline adopts more kinds of noise, such as speckle noise. Fig. 3 shows some synthesized noisy/clean patch pairs produced by our proposed training data synthesis pipeline. It can be seen that our data synthesis pipeline can produce very realistic noisy images. It is worth noting that the noisy/clean patch pairs are from the same high quality image of size 544×544. Since we also perform the resizing operations on clean image patches, some blurriness can be observed in some of the clean patches.

4.2. Practical blind denoising vs. blind Gaussian denoising and blind camera sensor noise removal for DND and SIDD

Our practical blind denoising is much more difficult than blind Gaussian denoising and blind camera sensor noise removal for DND and SIDD, and is the "true" blind image denoising for practical applications. It is widely known that a deep model trained for blind Gaussian denoising does not perform well on real images due to noise assumption mismatch. For this reason, DND and SIDD were established by capturing noisy and clean image pairs with different cameras. Although these two datasets helped researchers shift to real image denoising, they focus on camera sensor noise, which also deviates significantly from the noise of images found on the Internet in our daily life. Moreover, as shown in Fig. 5, DeamNet, the state-of-the-art model for these datasets, even produces a worse result than Noise Clinic on noisy images from a different kind of camera, which indicates that deep models trained on these two datasets do not generalize well to unseen noise and thus have very limited applications. In contrast, our model is trained on a much more complex degradation model whose degradation space is large enough to cover a large variety of different noise combinations, and thus can significantly improve the practicability. As far as we know, the only existing "true" blind denoising method is the work entitled "The noise clinic: a blind image denoising algorithm." Our model significantly outperforms Noise Clinic and is the first deep model that can be readily applied in real applications.

Experiments
As discussed in Sec. 3, the network architecture and the training data are two important factors for improving the performance of a deep blind denoising model. For the sake of fairness, we first evaluate our SCUNet on synthetic Gaussian denoising. We then evaluate our training data synthesis pipeline with our SCUNet on practical blind image denoising.

Figure 6. Visual results and no-reference image quality assessment metrics (NIQE↓/NRQM↑/PIQE↓) of different methods for real image denoising. The images in each row from top to bottom are "Palace", "Building", and "Stars", respectively.

Synthetic Gaussian Denoising
Implementation details. For the high quality image dataset, we use a training dataset consisting of the Waterloo Exploration Database [35], DIV2K [2], and Flickr2K [31]. The settings of the SwinT and RConv blocks are the same as those in SwinIR and DRUNet, respectively. Following the common setting, we generate the noisy image by adding AWGN with a certain noise level and learn a separate denoising model for each noise level. The parameters are optimized by minimizing the L1 loss with the Adam optimizer [24]. The learning rate starts at 1e-4, decays by a factor of 0.5 every 200,000 iterations, and finally ends at 3.125e-6. The patch size and batch size are set to 128×128 and 24, respectively. We first train the model with noise level 25 and then finetune it for the other noise levels. All experiments are implemented in PyTorch 1.7.1. It takes about three days to train a denoising model on four NVIDIA RTX 2080 Ti GPUs.
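As a sanity check on the schedule, the learning rate can be written as a simple step function; starting at 1e-4 and halving every 200,000 iterations indeed reaches 3.125e-6 after five decays (1e-4 / 2^5 = 3.125e-6):

```python
# Stepwise learning-rate schedule from the training details above
def lr_at(iteration, base=1e-4, step=200_000, floor=3.125e-6):
    # halve the base rate once per completed step, never going below the floor
    return max(base * 0.5 ** (iteration // step), floor)
```

For example, the rate is 1e-4 for the first 200,000 iterations, 5e-5 for the next 200,000, and stays at 3.125e-6 from iteration 1,000,000 onward.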
To qualitatively evaluate the proposed SCUNet, we provide the denoising results of different methods on the classical image "Barbara" from the Set12 dataset with noise level 50 in Fig. 4. Note that we also include the traditional model-based methods BM3D [12] and WNNM [15] for comparison since they are based on non-local priors. We have the following observations. First, WNNM produces much better visual results than some of the deep denoising methods such as DnCNN, FFDNet, RNAN and FOCNet. Second, while DAGL, DRUNet and SwinIR have better PSNR results than WNNM, they fail to recover some of the repetitive lines, which indicates they still have limits in non-local prior modeling. Third, our SCUNet produces more visually pleasant results than the others, which further verifies the effectiveness of SCUNet in modeling image non-locality.
Color Gaussian denoising. Table 2 reports the color image denoising results of different methods on the CBSD68 [37,47], Kodak24 [13], McMaster [62] and Urban100 [20] datasets. The compared methods include DnCNN, FFDNet, DSNet [41], BRDNet [50], RNAN, RDN [64], IPT, DRUNet, SwinIR and Restormer. As one can see, our SCUNet produces the best overall performance. Specifically, SCUNet surpasses DnCNN, FFDNet and DSNet by an average PSNR of 0.5dB on CBSD68, 0.7dB on Kodak24, 1.1dB on McMaster and 1.6dB on Urban100. Interestingly, while SCUNet has a similar PSNR gain over DRUNet across different noise levels, its PSNR gain over SwinIR grows with the noise level. A possible reason is that SwinIR lacks the ability to model the long-range dependency needed for heavy noise removal.
Fig. 5 provides the visual results of different methods on image "163085" from CBSD68 with noise level 50. It can be seen that SwinIR fails to recover the yellow structure along the beak of the bird, while DnCNN, RNAN and DRUNet introduce some over-smoothing. By contrast, SCUNet recovers fine structures and preserves image sharpness.

Results. From Fig. 6, we can observe that our SCUNet and SCUNetG achieve the best visual results in terms of noise removal and detail preservation. For example, both CBDNet and DeamNet fail to remove the processed camera sensor noise for "Palace", while ours can remove such low-frequency noise and recover the underlying edges. However, our results do not show promising no-reference IQA results. As pointed out in [59], such a phenomenon further indicates that no-reference IQA methods should be updated along with the degradation types. Fig. 7 provides more blind denoising results of our SCUNet and SCUNetG on real images from the RNI15 dataset [61]. Note that we do not know the ground-truth noise types and noise levels of these real images. For example, "Boy", "Dog" and "Glass" are likely corrupted by processed camera sensor noise from unknown camera types, and "Flowers" is corrupted by Gaussian-like noise. Surprisingly, our models effectively handle these images, which could be due to the fact that they have been trained to manage a wide range of degradation scenarios created by various types of noise, resizing, and a random shuffle strategy. According to the above results, we can conclude that the proposed training data synthesis pipeline is suitable for training deep blind denoising models for real applications.

Impact of the resizing for data synthesis. Since one of the main differences between our proposed noisy image synthesis and others is that we adopt resizing to diversify the noise distribution, it is interesting to investigate the performance of the trained model without using resizing in the training data synthesis. Fig.
8 provides the visual comparisons on two noisy images upsampled by bicubic resizing with a scale factor of 2. The first noisy image is corrupted by Gaussian noise with noise level 50, while the second one is corrupted by unknown processed camera sensor noise. It can be seen that the model trained without resizing in the training data synthesis fails to completely remove the noise. Thus, we can conclude that the resizing helps to improve the generalization ability.

Conclusion
In this paper, we focus on the problem of practical blind image denoising. Inspired by the Maximum A Posteriori (MAP) inference, which indicates that prior modeling and degradation modeling are essential for the success of deep blind denoising, we propose a new network architecture for better prior modeling and a novel data synthesis method for better practical usage. Specifically, we design a new swin-conv block which incorporates the local modeling ability of the residual convolution block and the non-local modeling ability of the swin transformer block, and plug it as the main building block into a UNet to further enhance the local and non-local modeling ability. Moreover, we design a data synthesis pipeline which considers different kinds of noise and also involves a random shuffle strategy and a double degradation strategy. Extensive experimental results demonstrate the effectiveness of the new architecture design for Gaussian denoising and the practicability of the trained deep blind model for real noisy images.
In Eq. (3), {(y_i, x_i)} represents the training noisy-clean image pairs and L(·) is the loss function. In this sense, the deep blind denoising model should capture the knowledge of both the degradation process and the image prior. On the other hand, the modeling ability of a deep model generally depends on the network architecture, the model size (or the number of parameters), and the training data. It is clear that the degradation process is implicitly defined by the noisy images of the training data, which indicates that the noisy images are responsible for the deep blind denoising model capturing the knowledge of the degradation process. In order to improve the image prior modeling ability of a deep blind denoising model, one should focus on improving the following three factors: the network architecture, the model size, and the clean images of the training data.

Figure 1. The architecture of the proposed Swin-Conv-UNet (SCUNet) denoising network. SCUNet exploits the swin-conv (SC) block as the main building block of a UNet backbone. In each SC block, the input is first passed through a 1×1 convolution and subsequently split evenly into two feature map groups, each of which is then fed into a swin transformer (SwinT) block and a residual 3×3 convolutional (RConv) block, respectively; after that, the outputs of the SwinT block and the RConv block are concatenated and then passed through a 1×1 convolution to produce the residual of the input. "SConv" and "TConv" denote 2×2 strided convolution with stride 2 and 2×2 transposed convolution with stride 2, respectively.
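The dataflow of the SC block described in the caption can be sketched as follows. This is a simplified sketch, not the paper's implementation: the SwinT branch is replaced by LayerNorm plus global multi-head self-attention over flattened spatial tokens, whereas the actual block uses a windowed swin transformer block; the class and module names are hypothetical.

```python
import torch
import torch.nn as nn

class SwinConvBlock(nn.Module):
    """Sketch of the SC block's dataflow: 1x1 conv, even split into two
    groups, a (simplified) SwinT branch and a residual 3x3 conv branch,
    concatenation, 1x1 conv, and a residual connection to the input."""
    def __init__(self, dim):
        super().__init__()
        half = dim // 2
        self.proj_in = nn.Conv2d(dim, dim, 1)     # 1x1 conv before the split
        self.norm = nn.LayerNorm(half)
        self.attn = nn.MultiheadAttention(half, num_heads=2, batch_first=True)
        self.rconv = nn.Sequential(               # residual 3x3 conv branch
            nn.Conv2d(half, half, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(half, half, 3, padding=1))
        self.proj_out = nn.Conv2d(dim, dim, 1)    # 1x1 conv after concat

    def forward(self, x):
        b, c, h, w = x.shape
        t, r = self.proj_in(x).chunk(2, dim=1)    # split into two groups
        # Simplified SwinT branch: tokens of shape (B, H*W, C/2)
        tok = self.norm(t.flatten(2).transpose(1, 2))
        tok = tok + self.attn(tok, tok, tok)[0]
        t = tok.transpose(1, 2).reshape(b, c // 2, h, w)
        r = r + self.rconv(r)                     # RConv branch
        return x + self.proj_out(torch.cat([t, r], dim=1))  # residual output

x = torch.rand(1, 8, 16, 16)
out = SwinConvBlock(8)(x)
print(out.shape)  # torch.Size([1, 8, 16, 16])
```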

2. Poisson noise. Poisson noise generally refers to the photon shot noise which originates from the discrete nature of light.

Figure 2. Schematic illustration of the proposed paired training patches synthesis pipeline. For a high quality image, a randomly shuffled degradation sequence is performed to produce a noisy image. Meanwhile, resizing and reverse-forward tone mapping are performed to produce a corresponding clean image. Paired noisy/clean training patches are then cropped for training the deep blind denoising model. Note that, since Poisson noise is signal-dependent, the dashed arrow for "Poisson" means the clean image is used to generate the Poisson noise. To tackle the color shift issue, the dashed arrow for "Camera Sensor" means the reverse-forward tone mapping is performed on the clean image.
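The signal dependence of Poisson noise noted above (the reason for the dashed "Poisson" arrow) is easy to see in code: the clean image itself sets the rate of each pixel's Poisson draw. This is a minimal sketch; the `peak` parameter is a hypothetical knob, not a value from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_poisson_noise(clean, peak=30.0):
    """Signal-dependent Poisson (shot) noise: each pixel's rate is the
    clean intensity scaled by `peak` (an assumed photon-count knob);
    lower peak means fewer photons and heavier noise."""
    return np.clip(rng.poisson(clean * peak) / peak, 0.0, 1.0)

clean = rng.random((16, 16))
noisy = add_poisson_noise(clean)
print(noisy.shape)
```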

Figure 3 .
Figure 3. Synthesized noisy/clean patch pairs via our proposed training data synthesis pipeline. The size of the high quality image patch is 544×544. The size of the noisy/clean patches is 128×128.
To prevent out-of-range values after each degradation step, we always clip the image into the range [0, 1]. Due to the introduction of resizing, a large high quality image should be used for the paired training data synthesis. Fig. 2 provides a schematic illustration of the proposed training data synthesis pipeline.
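The random shuffle strategy with per-step clipping can be sketched with numpy-only stand-ins for some of the degradations. This is an illustrative sketch, not the paper's pipeline: JPEG compression and processed camera sensor noise are omitted (they need an image codec and an ISP model), the noise levels are arbitrary, and the crude 2× down/up resize stands in for the paper's random resizing.

```python
import random
import numpy as np

rng = np.random.default_rng(0)

# numpy-only stand-ins for some of the paper's degradations
def gaussian(x):  return x + rng.normal(0, 25 / 255, x.shape)
def speckle(x):   return x + x * rng.normal(0, 25 / 255, x.shape)
def poisson(x):   return rng.poisson(np.clip(x, 0, 1) * 30) / 30
def resize(x):    # crude 2x down then 2x up, standing in for random resizing
    return np.kron(x[::2, ::2], np.ones((2, 2)))

def synthesize_noisy(clean):
    """Apply the degradations in a randomly shuffled order, clipping to
    [0, 1] after each step to prevent out-of-range values."""
    degradations = [gaussian, speckle, poisson, resize]
    random.shuffle(degradations)
    y = clean.copy()
    for deg in degradations:
        y = np.clip(deg(y), 0.0, 1.0)
    return y

clean = rng.random((64, 64))
noisy = synthesize_noisy(clean)
print(noisy.shape)
```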

Figure 4. Grayscale image denoising results of different methods on image "Barbara" from the Set12 dataset. The noisy image is corrupted by AWGN with noise level 50.

Figure 7. More blind denoising results of our SCUNet and SCUNetG on real images from the RNI15 dataset. From top row to bottom row: noisy images, results of SCUNet, results of SCUNetG. Please zoom in for a better view.

Figure 8. Comparison between SCUNet and its variant trained without resizing in the training data synthesis for denoising a resized noisy image. (a) Noisy image upsampled by bicubic resizing with a scale factor of 2, (b) denoising result of SCUNet, (c) denoising result of the variant trained without resizing in the training data synthesis.

Table 1. Average PSNR (dB) results of different methods for grayscale image denoising with noise levels 15, 25 and 50 on the widely-used Set12, BSD68 and Urban100 datasets. The best and second best results are highlighted in red and blue, respectively.
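The PSNR metric reported in these tables can be computed as follows; this is a standard definition, sketched here for an image with a given data range (the function name is ours).

```python
import numpy as np

def psnr(x, y, data_range=1.0):
    """Peak signal-to-noise ratio in dB between a reference image x and
    an estimate y, both floats in [0, data_range]."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(data_range ** 2 / mse)

a = np.zeros((8, 8))
b = np.full((8, 8), 0.1)
print(round(psnr(a, b), 2))  # 20.0
```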

Table 2. Average PSNR (dB) results of different methods for color image denoising with noise levels 15, 25 and 50 on the CBSD68, Kodak24, McMaster and Urban100 datasets. The best and second best results are highlighted in red and blue, respectively.

Figure 5. Color image denoising results of different methods on image "163085" from the CBSD68 dataset. The noisy image is corrupted by AWGN with noise level 50.

Table 3. FLOPs, runtime and #Params comparisons on images of size 256×256 on a PC with an Nvidia Titan Xp GPU.

Table 3 shows that our SCUNet achieves the lowest FLOPs due to the combination of the UNet backbone and the SC block. Since SwinIR does not use any downscaling operations, it suffers from high FLOPs and long runtime. In comparison, SCUNet achieves the best trade-off between FLOPs, runtime and #Params. Note that the runtime of SCUNet can be further reduced by a more efficient implementation.
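Two of the quantities compared in Table 3 are straightforward to measure for any PyTorch model; this sketch uses a hypothetical toy network, and FLOPs counting is omitted since it needs a profiler such as fvcore or ptflops.

```python
import time
import torch
import torch.nn as nn

# A toy model standing in for a denoiser; #Params and runtime are
# measured the same way for any nn.Module.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 3, 3, padding=1))
n_params = sum(p.numel() for p in model.parameters())

x = torch.rand(1, 3, 256, 256)  # the 256x256 test size used in Table 3
with torch.no_grad():
    t0 = time.time()
    y = model(x)
    runtime = time.time() - t0
print(n_params, y.shape)
```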
following: First, each high quality image is cropped to a size of 544×544 before being processed into a pair of noisy/clean images. Second, the learning rate is fixed to 1e-4 as it tends to enhance the generalization ability. Third, we