Pixelwise Estimation of Signal-Dependent Image Noise Using Deep Residual Learning

In traditional image denoising, noise level is an important scalar parameter which decides how much the input noisy image should be smoothed. Existing noise estimation methods often assume that the noise level is constant at every pixel. However, real-world noise is signal dependent, or the noise level is not constant over the whole image. In this paper, we attempt to estimate the precise and pixelwise noise level instead of a simple global scalar. To the best of our knowledge, this is the first work on the problem. Particularly, we propose a deep convolutional neural network named “deep residual noise estimator” (DRNE) for pixelwise noise-level estimation. We carefully design the architecture of the DRNE, which consists of a stack of customized residual blocks without any pooling or interpolation operation. The proposed DRNE formulates the process of noise estimation as pixel-to-pixel prediction. The experimental results show that the DRNE can achieve better performance on nonhomogeneous noise estimation than state-of-the-art methods. In addition, the DRNE can bring denoising performance gains in removing signal-dependent Gaussian noise when working with recent deep learning denoising methods.


Introduction
Noise level is an important parameter that determines how much an input noisy image should be smoothed. This parameter is directly required by many well-known denoising algorithms, including Wiener filtering [1], nonlocal means (NLM) [2], BM3D [3], sparse representation-based denoising [4], and deep learning denoising methods [5,6].
Image denoising is often formulated as the removal of Gaussian white noise, which is additive and homogeneous. Homogeneous means that the noise variance is constant for all pixels of the input noisy image and does not change with the position or color intensity of a pixel [7]. This formulation significantly simplifies image denoising, since the noise level is the only parameter required to model the noise.
However, the homogeneous noise assumption is not valid for real-world images [8]. Figure 1 illustrates a linear RGB image without denoising or gamma correction. We can see that noise in the bright part is stronger than noise in the dark part in the linear RGB image ("bright" and "dark" here are common concepts without strict mathematical definition; we just use the two words to describe the phenomenon). Previous works [9,10] have pointed out that noise in raw images or linear RGB images consists of two parts: (1) read noise, which is Gaussian and independent of the signal, and (2) shot noise, which is Poisson with variance equal to the signal level. Therefore, real image noise can be well modeled by a signal-dependent Gaussian distribution:
$$x_p \sim \mathcal{N}\left(y_p,\ \sigma_r^2 + \sigma_s y_p\right), \qquad (1)$$
where $x_p$ is a true noisy measurement and $y_p$ is the true intensity. The noise parameters $\sigma_r$ and $\sigma_s$ are fixed but can vary across images as sensor gain changes [10]. Another observation is that noise in the dark part seems stronger than that in the bright part in standard RGB (sRGB) images. The reason is that sRGB images are gamma corrected, so intensities in the dark part, including the noise, are amplified more than those in the bright part.
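The two observations above can be checked numerically. The following is a minimal sketch, assuming the noise variance at a pixel is σ_r² + σ_s·y, intensities scaled to [0, 1], a simplified power-law gamma of 2.2 instead of the piecewise sRGB curve, and hypothetical parameter values chosen only for illustration.

```python
import numpy as np

def add_signal_dependent_noise(y, sigma_r, sigma_s, rng):
    """Corrupt true intensities y with Gaussian noise of variance sigma_r^2 + sigma_s * y."""
    variance = sigma_r ** 2 + sigma_s * y
    return y + rng.normal(size=y.shape) * np.sqrt(variance)

def gamma_correct(linear, gamma=2.2):
    """Simplified power-law gamma (an assumption; real sRGB uses a piecewise curve)."""
    return np.clip(linear, 0.0, 1.0) ** (1.0 / gamma)

rng = np.random.default_rng(0)
sigma_r, sigma_s = 0.01, 0.002      # hypothetical values for intensities in [0, 1]
dark = np.full(100_000, 0.05)       # a flat dark region
bright = np.full(100_000, 0.80)     # a flat bright region

noisy_dark = add_signal_dependent_noise(dark, sigma_r, sigma_s, rng)
noisy_bright = add_signal_dependent_noise(bright, sigma_r, sigma_s, rng)

# Standard deviation of the noise before and after gamma correction.
linear_std = (noisy_dark.std(), noisy_bright.std())
srgb_std = (gamma_correct(noisy_dark).std(), gamma_correct(noisy_bright).std())
print("linear RGB std (dark, bright):", linear_std)
print("after gamma std (dark, bright):", srgb_std)
```

With these illustrative settings, the bright region is noisier in the linear domain, while the dark region is noisier after gamma correction, because the gamma curve is steepest near zero.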
In the image signal processing (ISP) pipeline, as shown in Figure 2, denoising is performed on linear RGB images before gamma correction. Hence, the signal-dependent model equation (1) is a more precise model in industry than the commonly used signal-independent Gaussian model, and performance gain under the model is of importance.
Since the image noise variance is not constant within an image, the granularity of noise-level estimation needs to be finer in order to improve denoising performance. Pixelwise noise-level estimation is the ultimate form.
Traditional denoising methods may require some modifications to take advantage of pixelwise noise estimation. However, deep learning methods can use pixelwise noise estimation results naturally. Recent deep learning-based image denoising methods [5,6] construct a noise-level map to introduce noise-level information to denoising networks. Currently, the noise-level map is filled with the same scalar. However, the noise-level map can also be constructed with pixelwise noise-level estimation. Our experiments will show that using pixelwise noise-level estimation leads to better performance on nonhomogeneous noise removal.
In this paper, we propose a deep residual convolutional neural network for pixelwise noise-level estimation. The main architecture simply consists of a stack of customized residual blocks, which are better suited to noise estimation. No pooling operation or convolution with a larger stride is adopted in the proposed architecture, because such operations are intended to extract high-level features, such as semantic information. Such refined features are unnecessary for noise estimation, which focuses on low-level features, e.g., boundaries and local variance. Given a noisy image, the proposed method is able to produce a pixelwise noise-level estimation map as well as a global scalar noise level. The contributions of this study are as follows:
(1) Although many works have pointed out that the noise standard deviation is not uniform across the image [8,9], this paper is the first work to give a pixelwise Gaussian noise-level estimation map. The method works on a more precise signal-dependent noise model (1). Moreover, the pixelwise estimation map can also collaborate with recent deep learning-based denoising methods [5], which require a noise-level estimation map as the input.
(2) Deep convolutional neural networks, carefully designed from several residual blocks, are adopted to provide pixel-to-pixel predictions.
(3) In terms of traditional scalar average estimation error, the proposed method is able to compete with the state-of-the-art methods [7] in both traditional

Noise-Level Estimation.
The challenge of noise-level estimation lies in distinguishing high-frequency noise from high-frequency image details. To overcome the difficulty, traditional methods often divide an image into fixed-size patches and search for flat parts to estimate the noise level [11][12][13][14]. They assume that there are sufficient numbers of flat areas in the processed image [15][16][17][18].
Lee and Hoppel [19] treat the patches with the smallest local variance as the selected flat patches. Shin et al. [12] choose the patches whose standard deviation in intensity is close to the minimum one among all patches. Then, the processed image is convolved with filters (e.g., Laplacian) to remove the influence of image content [20]. The second-order nature of the Laplacian tends to make it sensitive to noise while suppressing smoothly varying and uniform regions and enhancing edges. After filtering, the image still consists of both noise and object edges. Therefore, some object edge detectors are proposed to recognize and remove edges. Finally, robust statistical methods such as the median of local estimates [20], the mode of local estimates [21], and the average of several smallest local estimates [22,23] are applied to estimate the final noise variance.
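The flat-patch idea can be illustrated with a short sketch. It is not the implementation of any cited method: it only combines nonoverlapping patches with the "average of several smallest local estimates" rule, and the function name, patch size, and `keep_fraction` are hypothetical choices.

```python
import numpy as np

def estimate_noise_flat_patches(gray, patch=8, keep_fraction=0.1):
    """Estimate a scalar noise level from the flattest patches of a gray image."""
    h, w = gray.shape
    h, w = h - h % patch, w - w % patch          # drop incomplete border patches
    blocks = gray[:h, :w].reshape(h // patch, patch, w // patch, patch)
    # One row per patch, then the sample standard deviation of each patch.
    stds = blocks.transpose(0, 2, 1, 3).reshape(-1, patch * patch).std(axis=1, ddof=1)
    n_keep = max(1, int(keep_fraction * stds.size))
    # Average the smallest local estimates, assuming those patches are flat.
    return float(np.sort(stds)[:n_keep].mean())
```

Because only the smallest local estimates are kept, such an estimator tends to be biased low, and it fails when the image has no flat area, which is the limitation the text attributes to this family of methods.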
In recent years, PCA-based approaches [7,24,25] have attracted great attention since they can successfully deal with scenes containing rich textures. These methods are based on the observation that patches taken from the noiseless image often lie in a low-dimensional subspace, instead of being uniformly distributed across the ambient space. In [24], the authors show that the noise variance can be estimated as the smallest eigenvalue of the image block covariance matrix. In [25], the authors prove that the naive PCA-based approach cannot lead to satisfactory noise estimation, especially in complex scenes. To this end, their approach adds a process of selecting low-rank patches based on the gradients and other statistics of the patches. In [7], the authors estimate the statistical relationship between the noise level and the eigenvalues of the covariance matrix of patches. Also, they observe that the eigenvalues almost follow a Gaussian distribution in redundant dimensions. As a result, the work reported in [7] achieves state-of-the-art performance.
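The basic PCA observation, that the smallest eigenvalue of the patch covariance matrix approximates the noise variance, can be sketched as follows. This is the naive variant only, without the patch selection of [25] or the eigenvalue statistics of [7]; the function name and patch size are hypothetical.

```python
import numpy as np

def estimate_noise_pca(gray, patch=5):
    """Naive PCA estimate: sqrt of the smallest eigenvalue of the patch covariance."""
    # All overlapping patch x patch windows, flattened to row vectors.
    windows = np.lib.stride_tricks.sliding_window_view(gray, (patch, patch))
    vectors = windows.reshape(-1, patch * patch)
    cov = np.cov(vectors, rowvar=False)
    # eigvalsh returns eigenvalues in ascending order for a symmetric matrix.
    smallest = np.linalg.eigvalsh(cov)[0]
    return float(np.sqrt(max(smallest, 0.0)))
```

The sketch works when clean patches really occupy a low-dimensional subspace, such as on smooth gradients. On images where every patch direction carries texture energy, the smallest eigenvalue also contains signal and the estimate is inflated.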
All methods above compute a scalar estimation, which works well with traditional denoising methods including BM3D [3] and NLM [2]. However, those methods all simplify the noise with homogeneous Gaussian variance, which is not in accordance with the real noise model (equation (1)).
Recent works on color image noise estimation consider the difference in noise variance between color channels [26] and estimate three noise variances or a covariance matrix for an image. This estimation, in cooperation with certain denoising methods [27], can improve denoising performance when removing multivariate Gaussian noise.

Deep Learning Denoising Methods.
Recent deep learning-based image denoising methods [5,6] take a noise map and noisy RGB channels as the input. The input goes through carefully designed convolutional layers without pooling or interpolation to directly output clean estimates. Figure 3 briefly illustrates the models. The performance of deep learning methods has surpassed that of traditional denoising methods including BM3D [3] and NLM [2]. Currently, the noise-level map of deep denoising methods is filled with constant values. However, equation (1) suggests real noise is signal dependent and not homogeneous. Feeding the noise-level map with pixelwise estimation is likely to increase the denoising performance.
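To make the role of the noise-level map concrete, the sketch below stacks a map onto the noisy RGB channels, accepting either a constant scalar or a pixelwise estimate. The exact input layout differs between denoising models, so the channel-concatenation scheme and the function name are illustrative assumptions.

```python
import numpy as np

def build_denoiser_input(noisy_rgb, noise_level):
    """Append a noise-level map as a fourth channel.

    noise_level may be a scalar (constant map) or an (H, W) pixelwise map.
    """
    h, w, _ = noisy_rgb.shape
    level_map = np.broadcast_to(np.asarray(noise_level, dtype=noisy_rgb.dtype), (h, w))
    return np.concatenate([noisy_rgb, level_map[..., None]], axis=-1)
```

With a scalar argument, the fourth channel is filled with one value, as in the current practice described above. Passing an (H, W) array instead injects the pixelwise estimate without any other change to the input pipeline.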

Problem Formulation.
Let the noisy RGB image be denoted by $z \in \mathbb{R}^{m \times n \times 3}$ and the corresponding clean RGB image be denoted by $x \in \mathbb{R}^{m \times n \times 3}$. The image degradation model is written as
$$z = x + n, \qquad (2)$$
where $n = [\eta_{ijk}]$ is the noise tensor, which is Gaussian and white. Without loss of generality, we suppose $\eta_{ijk} \sim \mathcal{N}(0, \sigma_{ijk})$ and denote $[\sigma_{ijk}]$ as $s$. If we assume the noise is not signal dependent, i.e., homogeneous, all $\sigma_{ijk}$ take the same value. Otherwise, $\sigma_{ijk}$ may take different values related to the image intensity at different pixel locations. The target of this paper is to learn a dense mapping function $F(z; \theta)$ from the dataset $\{(z_l, s_l) \mid l = 1, \ldots, N\}$ such that the loss in equation (3) is minimized, where $\theta$ is the model parameter set and $N$ is the number of training samples. If we have clean color images $x_l$, the dataset $(z_l, s_l)$ can be easily constructed by randomly setting $s$ and applying equation (2). The mapping function $F(z; \theta)$ will be implemented by a simple yet carefully designed deep residual network, which we will introduce in Section 3.2. The goal of this paper is to produce a pixelwise noise-level estimation map, with which we can visualize the noise level over the image. We can also compute a global noise-level scalar from the map and compare the proposed method with traditional estimation methods.
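The additive degradation model can be sketched in a few lines. The code assumes that each entry of the noise-level tensor is used as the standard deviation of a zero-mean Gaussian at that position; the function name `degrade` is hypothetical.

```python
import numpy as np

def degrade(clean, sigma_map, rng):
    """Apply z = x + n, where n is zero-mean Gaussian noise.

    sigma_map has the same shape as clean (or broadcasts to it) and is
    interpreted as the per-element standard deviation (an assumption).
    """
    noise = rng.normal(size=clean.shape) * sigma_map
    return clean + noise
```

A constant `sigma_map` yields homogeneous noise, while a map that varies with position or intensity yields nonhomogeneous noise. In both cases, the pair (z, sigma_map) is a valid training sample for a pixel-to-pixel estimator.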

Network Architecture.
The 2D convolution samples a regular grid R over the input feature map and computes a weighted sum of the sampled values using trainable weights [28]. The grid R defines the receptive field size and dilation, which ensures the dense correspondence between the input feature map and the output feature map. This correspondence property makes convolution well suited to fitting mappings between densely corresponding images.
Fully convolutional networks (FCNs) inherit the correspondence property from 2D convolution and are therefore good at modeling dense correspondence problems on images. For example, FCNs are suitable for image segmentation [29] and image denoising [30]. Noise estimation is also a dense correspondence problem, where the output ground-truth noise-level map corresponds densely to the input noisy image. Therefore, FCNs are expected to perform well on the problem.
Residual networks [31] are motivated by the difficulty of training deep convolutional neural networks due to vanishing gradients: if a CNN is too deep, its performance may become worse. Residual networks introduce shortcut connections by simply adding the input to the convolutional output, which mitigates the vanishing gradient problem and improves the performance of FCNs.
There are a few layer operations that are not suitable for image denoising. Recent deep denoising methods [5,6,30] use network architectures of almost pure convolution layers. They do not use any pooling or interpolation layers since these layers are destructive for image details and will decrease the performance of image reconstruction. Hence, we also abandon pooling or interpolation layers in our network design.
A deep residual noise estimation (DRNE) convolutional neural network is proposed. Figure 4 illustrates the proposed DRNE network architecture. Figure 4(a) illustrates the whole architecture. The input is a noisy RGB image with three channels, which goes through a stack of residual blocks followed by a convolution and ReLU [32] activation. The output is a noise-level channel. Figure 4(b) illustrates the structure of the residual block. First, the c-channel input tensor is convolved to a w-channel tensor. Then, the w-channel tensor goes through a residual structure with k − 1 rounds of convolution and ReLU and produces a w-channel output tensor. In the proposed architecture, the total number of convolution layers, or network depth, is k × d + 1.
As for implementation, the sizes of all convolution kernels are 3 × 3. We set the network width and depth by setting w = 64, k = 5, and d = 3; thus, the total number of convolution layers is 16.
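The layer count can be verified with a small bookkeeping sketch. It assumes that d is the number of residual blocks, that each block holds one channel-changing convolution followed by k − 1 convolutions inside the residual structure, and that the final convolution outputs a single noise-level channel. The function name is hypothetical, and the receptive-field figure is a derived quantity, not one reported in the text.

```python
def drne_conv_plan(in_channels=3, width=64, k=5, d=3):
    """Return (in_ch, out_ch) for every 3x3 convolution under the stated assumptions."""
    layers = []
    c = in_channels
    for _ in range(d):
        layers.append((c, width))                   # c-channel input -> w-channel tensor
        layers.extend([(width, width)] * (k - 1))   # residual structure: k - 1 conv + ReLU
        c = width
    layers.append((width, 1))                       # final conv + ReLU -> noise-level channel
    return layers

plan = drne_conv_plan()
depth = len(plan)                  # k * d + 1 convolution layers
receptive_field = 1 + 2 * depth    # each 3x3, stride-1 conv widens the field by 2 pixels
print(depth, receptive_field)
```

Under these assumptions, each output pixel depends only on a local window of the input, which matches the focus on low-level, local statistics rather than semantic content.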

Training
3.3.1. Training Images. Training images are required to be clean and richly textured. We select the first 4,000 images from the Waterloo exploration dataset [33] to construct training image pairs using equation (2). Images are cropped into 128 × 128 nonoverlapping patches with a stride of 256.
In theory, every training image needs to be randomly cropped and corrupted with signal-dependent noise levels. However, we find that corrupting every training image with only one noise level in [0, 30] to generate training pairs is sufficient to produce satisfactory results. Namely, $s_l = [\sigma_l]_{ij}$, where $\sigma_l$ is a scalar uniformly distributed in [0, 30].
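A minimal sketch of this data preparation is given below, assuming 8-bit intensity scale (0 to 255) and hypothetical helper names. Each patch receives one scalar noise level drawn uniformly from [0, 30], and the target map is filled with that scalar.

```python
import numpy as np

def crop_patches(image, size=128, stride=256):
    """Cut size x size patches on a regular grid; stride > size leaves gaps, so no overlap."""
    h, w = image.shape[:2]
    return [image[i:i + size, j:j + size]
            for i in range(0, h - size + 1, stride)
            for j in range(0, w - size + 1, stride)]

def make_training_pair(clean_patch, rng, max_sigma=30.0):
    """Return (noisy patch z, constant noise-level map s) for one clean patch."""
    sigma = rng.uniform(0.0, max_sigma)
    s = np.full(clean_patch.shape[:2], sigma, dtype=np.float32)
    z = clean_patch + rng.normal(size=clean_patch.shape) * sigma
    return z.astype(np.float32), s
```

For a 512 × 512 image, `crop_patches` returns four patches, one from each 256 × 256 cell of the grid.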
As a result, our method is trained on homogeneous noisy patches. Nevertheless, it is able to produce signal-dependent pixelwise noise estimation results, as shown in Section 4.3 and Figure 5.

Loss Function.
In theory, equation (3) is a reasonable loss function for the task. In practice, however, we find that the training loss drops with relatively large fluctuations when using equation (3). Figure 6(a) plots the training and validation losses obtained directly with equation (3). In early training epochs, the model is severely underfitting. Later, the losses also suffer from relatively large fluctuations. This phenomenon suggests that the loss function may require regularization. In the training stage, we feed the model with images corrupted by homogeneous noise. Thus, we use the mean value of the elements of the predicted noise matrix to regularize the loss, which gives equation (4), where mean(x) computes the mean value of the elements of the tensor or matrix x and λ is a scalar parameter that adjusts the weights of the two terms. After applying the regularization term, the training and evaluation losses become significantly more stable and converge more easily. Figure 6(b) shows the results.
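The exact form of the regularized loss is not reproduced here, so the following is only one plausible reading: a squared-error data term plus λ times the squared difference between the mean of the predicted map and the mean of the target map. Both terms and the function name are assumptions made for illustration.

```python
import numpy as np

def regularized_loss(pred, target, lam=0.25):
    """Squared-error data term plus a penalty on the difference of map means (assumed form)."""
    data_term = np.mean((pred - target) ** 2)            # pixelwise fit
    mean_term = (pred.mean() - target.mean()) ** 2        # global-level consistency
    return data_term + lam * mean_term
```

Under this reading, a prediction that is off by a constant offset is penalized by both terms, whereas zero-mean pixelwise errors only affect the data term. This may explain why a mean-based term stabilizes training on homogeneous targets.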

Implementation.
We implement the model with TensorFlow [34]; its width is 64 and its depth is 16, as introduced in Section 3.2. We use equation (4) as the loss function and set λ = 0.25. The Adam [35] optimizer is adopted for training. It takes around 140 hours to train the model from scratch on an Nvidia GTX 1080 Ti GPU.
Further ablation studies could be conducted on these parameters, and performance gains might be obtained through intensive trials. Since the experiments in this paper are already extensive and the current parameter set performs well in them, we leave this for future work.

Experiments
In this section, we give quantitative and qualitative evaluations of the proposed method. In quantitative evaluations, we have to use the traditional scalar average error and standard deviation as evaluation criteria since other methods can only produce scalar estimation. In qualitative evaluations, we visualize the estimation map on both simulated data and real noisy images to show the effectiveness of the proposed method. In addition, we apply comparative methods to two deep denoising methods to reveal the denoising performance gain brought by the DRNE.

Figure 4: (a) Whole network architecture. The input is a noisy RGB image with three channels, which goes through a stack of residual blocks followed by a convolution and ReLU [32] activation. The output is a noise-level channel. (b) Structure of the residual block. First, the c-channel input tensor is convolved to a w-channel tensor. Then, the w-channel tensor goes through a residual structure with k − 1 rounds of convolution and ReLU and produces a w-channel output tensor.

Test Datasets and Compared Methods.
We compare the proposed method with state-of-the-art methods on three datasets: Kodak, McMaster [36], and BSD500 [37]. Gaussian white noise is added to clean images from the datasets to construct noisy images with ground truth. Four methods, including Pyatykh's method [24], Liu's method [25], Chen's method [7], and the proposed DRNE, are compared using their source code on the same datasets. The parameters of all compared methods remain unchanged during the comparison.

Comparison on Simulated Homogeneous Noise.
We first evaluate all methods on traditional homogeneous noise. Homogeneous Gaussian white noise of a fixed level is added to clean RGB images of the three datasets. For a fair comparison, we implement a framework to do noise addition and performance evaluation. Each compared method receives exactly the same input noisy images and uploads estimation results to the framework through a wrapper. Note that Pyatykh's method [24] can only handle gray images. Therefore, noisy RGB images are first transformed to gray images in the wrapper and then processed by Pyatykh's method. Table 1 shows the average estimation errors of the compared methods on three datasets with six different noise levels. The proposed DRNE won 6 first places and 11 second places. Meanwhile, Chen's method [7] won 12 first places and 3 second places. Table 2 shows the standard deviation of errors, which implies the stability of the compared methods. The proposed DRNE won 11 first places and 6 second places. Meanwhile, Chen's method [7] won 7 first places and 8 second places.
In general, the performance of Chen's method and the proposed DRNE on homogeneous noise estimation is quite close. The difference is that Chen's method is better in terms of average error, while the proposed DRNE is better in terms of standard deviation.
For example, the average error of Chen's method on the McMaster dataset at noise level 0 is 1.60, which is quite large. More specifically, for the first image of the McMaster dataset shown in Figure 6(a), the estimation result of Chen's method is 4.07, which suggests the method fails to distinguish the difference between high-frequency image details and high-frequency noise. In contrast, the result of the DRNE is 1.42, which is much closer to true noise level 0.

Comparison on Simulated Nonhomogeneous Noise.
We then evaluate all methods on nonhomogeneous noise. Clean images from the datasets are first divided into four rectangular parts. Then, noise of a different level is added to each of the four parts: σ − 4, σ − 2, σ + 2, and σ + 4. Figure 5(b) illustrates the noise addition results, and Figure 5(d) illustrates the ground-truth noise.
Note that the noise pattern is designed for easy comparison, since it is easy to visualize and to compute a global scalar estimation from it. Real noisy images will be evaluated later.
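The four-part pattern can be sketched as follows. The text does not specify how the four rectangles are arranged, so the quadrant layout, the assignment of levels to quadrants, and the function name are assumptions.

```python
import numpy as np

def four_part_sigma_map(height, width, sigma):
    """Ground-truth noise-level map with four rectangular parts around a base level sigma."""
    s = np.empty((height, width), dtype=np.float32)
    h2, w2 = height // 2, width // 2
    s[:h2, :w2] = sigma - 4     # top-left
    s[:h2, w2:] = sigma - 2     # top-right
    s[h2:, :w2] = sigma + 2     # bottom-left
    s[h2:, w2:] = sigma + 4     # bottom-right
    return s
```

When the four parts have equal area, the offsets cancel and the mean of the map is exactly σ. This is what makes the pattern convenient for comparing against methods that only return a scalar.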
For traditional methods, the noise estimation result should be the weighted average of the noise levels of all patches, which is σ in our settings. Tables 3 and 4 show the average errors and standard deviations of errors, respectively. Similar to the performance on homogeneous noise, Chen's method is better in terms of average error, while the proposed DRNE is better in terms of standard deviation. The proposed DRNE is not only able to give a scalar prediction but also able to produce a pixelwise noise-level map. Figures 5(c) and 5(g) illustrate the estimated pixelwise noise-level maps, in which the four rectangular parts of different noise levels are obvious. In addition, the estimation results are signal dependent. The flat parts of images tend to be less noisy, which is in accordance with our prior knowledge since it is difficult to distinguish between high-frequency image details and high-frequency noise.
In the tables, bold fonts denote the best performance and italics denote the second-best performance.

Qualitative Results on Real Images.
To show the effectiveness of the proposed method on real images, we also evaluate our method on real linear RGB images. These images are captured using mobile phones and saved in raw (DNG) format. Then, they go through the early stages of the image processing pipeline without denoising, brightness adjustment, or gamma correction. Figures 7(a) and 7(b) show the processed linear RGB images, and Figures 7(c) and 7(d) show the estimation results of the proposed method.
In general, the dark part in images suffers from weaker noise, which is in accordance with the noise model (1). For example, the black shirt in the image suffers from lower noise levels and the white wall suffers from stronger noise. There are no comparative results of state-of-the-art methods on the task since other methods can only produce a scalar output.

Applying to Deep Learning Denoising.
We mentioned in contributions that pixelwise noise estimation is expected to improve the performance on deep learning denoising methods [5]. Now, we design an experiment to validate the conclusion.
First, we prepare two existing deep denoising models [6,38] which take noisy images and noise-level maps as the input.
Then, homogeneous and nonhomogeneous noisy images are generated using different strategies. Finally, we use Chen's method and the proposed method to feed the noise map to the deep denoising model, and the denoising performance is recorded.
Previous evaluations have shown that the performance of Chen's method [7] and the proposed method is close, while both significantly surpass the other compared methods. Therefore, we focus on comparison with Chen's method. Table 5 demonstrates the denoising performance on homogeneous noise. The performance on combinations of three datasets and three noise levels is reported. We can see that when dealing with homogeneous noise, the performance of Chen's method and the proposed DRNE is comparable, while the proposed DRNE works slightly better in most test cases. We can conclude from Table 5 that the DRNE is an alternative to Chen's method in removing homogeneous noise. Table 6 demonstrates the denoising performance on nonhomogeneous noise. Two noise models are tested: (1) Noise variance σ is uniformly distributed in a fixed range. The performance of the two methods is also close. The DRNE works slightly better in most test cases. (2) Noise variance follows noise model equation (1) with σ_r ∼ U(0, 15) and σ_s ∼ U(0, 3.5). We clip the noise variance to ensure it does not exceed the prediction range of the DRNE.
When the real noise model is adopted, the DRNE got a significant performance gain on Kodak and BSD500 datasets. On the McMaster dataset, the performance of the two methods is close. We can conclude from Table 6 that the DRNE shows generally better performance in removing nonhomogeneous noise when dealing with real noise. Figure 8 shows visual results of noise model (1) on the Kodak dataset from Table 6. We can see obvious artifacts in the background part in Figure 8(a), while Figure 8(b) shows no artifacts, which suggests that Chen's method does not adapt well to the more realistic noise model and shows the effectiveness of the proposed DRNE.
In this section, we show that when working with the CNN denoising model, the proposed DRNE is generally better in removing nonhomogeneous noise with both quantitative and qualitative results.

Running Time Comparison.
In order to handle large images, the DRNE crops images into patches of a fixed size and handles them sequentially. For a fair comparison, we use the McMaster dataset for evaluation and set the crop size to be the same as the input image size, 500 × 500. Table 1 shows the running time comparison of the compared methods on datasets, measured on a desktop with an Intel i7-5930K CPU and an Nvidia GTX 1080 Ti GPU. The DRNE is implemented with Python, and other methods are implemented with MATLAB. Chen's method is the fastest on CPU.
The DRNE is the slowest on CPU, partially because matrix computation in Python is not as heavily optimized as in MATLAB. However, the DRNE is the second fastest with the help of GPU.
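The crop-and-process strategy for large images can be sketched as below. The `estimator` argument is a hypothetical callable that maps an (h, w, 3) tile to an (h, w) noise-level map; in practice it would wrap the trained network, which is not reproduced here.

```python
import numpy as np

def estimate_in_tiles(image, estimator, tile=500):
    """Process an image tile by tile and stitch the per-tile maps back together."""
    h, w = image.shape[:2]
    out = np.zeros((h, w), dtype=np.float32)
    for i in range(0, h, tile):
        for j in range(0, w, tile):
            # Edge tiles may be smaller than `tile`; slicing handles that automatically.
            out[i:i + tile, j:j + tile] = estimator(image[i:i + tile, j:j + tile])
    return out
```

For a purely pointwise estimator, tiling gives the same result as processing the whole image at once. For a convolutional estimator, pixels near tile borders see less context, so overlapping tiles may be preferable in practice.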

Conclusion
In this paper, we propose a deep residual convolutional neural network named "DRNE" for Gaussian noise-level map estimation of images. The main architecture consists of a stack of carefully designed residual blocks. Given a noisy image, the proposed DRNE is able to produce a pixelwise noise-level estimation map as well as an overall scalar noise level. Experiments show that the proposed DRNE is able to compete with state-of-the-art methods such as Chen's method on traditional scalar noise-level estimation. In addition, the DRNE is able to produce a signal-dependent noise-level map, which is in accordance with the linear RGB image noise model (1).
Pixelwise noise-level estimation is helpful for precise noise removal. Recent deep learning-based noise removal methods [5,6] require a pixelwise noise-level estimation map as the input for noise removal. By applying the DRNE and Chen's method to deep learning denoising models, we reveal that the DRNE can bring significant performance gains in removing signal-dependent Gaussian noise (which is closer to real noise than traditional Gaussian noise with fixed variance). Deep learning might be the ultimate method to separate high-frequency image details from noise by automatically mining patterns from image data. For future work, we believe that joint training of the deep noise estimation model and the deep denoising model may surpass all traditional methods in image denoising.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Table 6. We can see obvious artifacts in the background part of (a), while (b) shows no artifacts.