An improved pix2pix model based on Gabor filter for robust color image rendering

Abstract: In recent years, with the development of deep learning, image color rendering has once again become a research hotspot. To overcome the detail problems of color overflow and boundary blurring in robust image color rendering, as well as the training instability of generative adversarial networks, we propose a color rendering method for robust images using a Gabor filter based improved pix2pix model. Firstly, the multi-direction and multi-scale selection characteristic of the Gabor filter is used to preprocess the image to be rendered, which retains the detailed features of the image during preprocessing and avoids feature loss. Moreover, among the Gabor texture feature maps with 6 scales and 4 directions, the texture map with scale 7 and direction 0° gives the best rendering performance. Finally, by improving the loss function of the pix2pix model and adding a penalty term, training is stabilized and an ideal color image is obtained. To reflect the image color rendering quality of different models more objectively, the PSNR and SSIM indexes are adopted to evaluate the rendered images. The experimental results show that robust images rendered by the proposed method have better visual quality, and that the method reduces the influence of light and noise on the image to a certain extent.


Introduction
At present, image color rendering, as a major branch of image processing, has attracted much attention. With the development of deep learning, image color rendering based on neural networks has gradually become a research hotspot [1][2][3][4][5], because traditional color rendering methods require manual intervention and place high demands on reference images. Moreover, when the structure and color of the image are complex, the rendering effect of traditional methods is not ideal [6][7][8][9][10]. Color rendering methods based on deep learning can be easily deployed in actual production environments and overcome the limitations of the traditional methods [11][12][13]. By training a neural network model on a corresponding dataset [14,15], images can be rendered automatically, without being affected by human or other factors [16][17][18][19].
Larsson et al. [20] used a convolutional neural network that takes the brightness of the image as input and decomposes the color and saturation of the image with a hypercolumn model to realize color rendering. Iizuka et al. [21] combined the low-level and global features of the image through a fusion layer in a convolutional neural network to generate image colors and process images of any resolution. Zhang et al. [22] designed an appropriate loss function to handle the multi-modal uncertainty in color rendering and maintain color diversity. However, when grayscale image features are extracted with the above methods, up-sampling is used to make the image sizes consistent, resulting in a loss of image information. Moreover, these network structures cannot adequately extract and understand the complex features of the image, so the rendering effect is limited [23][24][25].
Isola et al. [26] improved conditional generative adversarial networks (CGAN) to achieve image-to-image translation. Their pix2pix model can realize conversion between different kinds of images; for example, color rendering can be realized by learning the mapping between grayscale images and color images [27,28]. However, the pix2pix model, being based on generative adversarial networks (GAN), suffers from training instability. Moreover, current deep learning based image rendering methods do not handle robust (low-quality) images well. A Gabor filter can easily extract texture information at all scales and directions of an image, and can reduce the influence of illumination changes and noise to a certain extent.
Therefore, we propose a color rendering method for robust images using a Gabor filter based improved pix2pix model. The contributions of this paper are three-fold: (1) The improved pix2pix model can not only complete image rendering automatically with good visual effect, but also achieve more stable training and better image quality.
(2) Gabor filter was added to enhance the robustness of model rendered images.
(3) The metric data of a series of experiments show that the proposed method has better performance for robust image.
The rest of the paper is organized as follows. Section 2 introduces the previous work, including the Gabor filter and the pix2pix model. Section 3 describes the method and its design details. Section 4 presents the experiments and comparisons, and evaluates image quality. Section 5 concludes the paper and outlines future work.

Gabor filter
The Fourier transform is a powerful tool in signal processing that transforms images from the spatial domain to the frequency domain and extracts features that are not easy to obtain in the spatial domain. However, after the Fourier transform, frequency features from different image locations are mixed together, whereas the Gabor filter can extract spatially local frequency features, making it an effective texture detection tool [29,30]. The Gabor filter is obtained by multiplying a Gaussian by a cosine function [31][32][33]; it is defined as

g(x, y; λ, θ, ϕ, σ, γ) = exp(−(x′² + γ²y′²) / (2σ²)) · cos(2πx′/λ + ϕ)

where x′ = x cos θ + y sin θ and y′ = −x sin θ + y cos θ. Here, x and y represent the coordinate position of the pixel, λ the wavelength of the filter, θ the orientation of the Gabor kernel, ϕ the phase offset, σ the standard deviation of the Gaussian function, and γ the aspect ratio.
In order to make full use of the characteristics of Gabor filters, it is necessary to design Gabor filters with different directions and scales to extract features. In this study, the Gabor filter extracts the texture features of the image at 6 scales and in 4 directions. Namely, the Gabor scales are 7, 9, 11, 13, 15 and 17, and the Gabor directions are 0°, 45°, 90° and 135°, as shown in Figure 1(a). Effective texture feature sets are extracted from the output of the filter bank; the extracted texture feature sets are shown in Figure 1(b), with 24 texture feature maps in total.
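The filter bank above can be sketched directly from the definition. The mapping from kernel size to σ and λ below is a hypothetical choice for illustration; the paper does not report these values.

```python
import numpy as np

def gabor_kernel(ksize, sigma, theta, lambd, gamma=0.5, psi=0.0):
    """Real Gabor kernel: Gaussian envelope multiplied by a cosine carrier."""
    half = ksize // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_t = x * np.cos(theta) + y * np.sin(theta)    # rotated coordinate x'
    y_t = -x * np.sin(theta) + y * np.cos(theta)   # rotated coordinate y'
    envelope = np.exp(-(x_t ** 2 + (gamma * y_t) ** 2) / (2.0 * sigma ** 2))
    carrier = np.cos(2.0 * np.pi * x_t / lambd + psi)
    return envelope * carrier

# Bank of 6 scales and 4 directions, as in the paper (24 kernels in total);
# sigma = k/3 and lambd = k/2 are illustrative assumptions.
scales = [7, 9, 11, 13, 15, 17]
thetas = [0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]
bank = [gabor_kernel(k, sigma=k / 3.0, theta=t, lambd=k / 2.0)
        for k in scales for t in thetas]
```

A texture feature map is then obtained by convolving the grayscale image with each kernel (e.g. via `scipy.signal.convolve2d`).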

pix2pix model
At present, image rendering based on generative adversarial networks [34] attracts much attention because it can directly generate color images by using learned mapping relations, and it is widely used in image processing, text processing, natural language processing and other fields. The pix2pix model [26] is an image-to-image translation model based on generative adversarial networks that can synthesize images or generate color images well. The main features of the pix2pix model are as follows.
(1) Both the generator and the discriminator use Conv-BatchNorm-ReLU units, namely a convolutional layer, batch normalization and a ReLU activation.
(2) The input of the pix2pix model is a specified image: for label-to-photo translation the input is the label image, and for grayscale-to-color translation the input is the grayscale image. The grayscale image is fed to the generator, and the generator's input and output together are fed to the discriminator, so as to establish the correspondence between the input image and the output image, realize user control, and complete image color rendering.
(3) A PatchGAN is used as the discriminator of the pix2pix model. Specifically, the image is divided into several fixed-size patches, the authenticity of each patch is judged, and the average over all patches is taken as the final output. A network structure similar to U-net is adopted as the generator, with skip connections added between layer i and layer n − i, where n is the total number of layers of the network. The contracting path captures context information, while the symmetric expanding path enables precise localization.
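These two ideas can be illustrated with a minimal numpy sketch; the shapes and the final averaging are illustrative, not the exact pix2pix implementation.

```python
import numpy as np

def patchgan_decision(score_map):
    """PatchGAN scores each fixed-size patch separately; the final
    discriminator output is the mean of the N x N per-patch scores."""
    return float(np.mean(score_map))

def unet_skip(enc_feat, dec_feat):
    """U-net style skip connection: the encoder feature map at layer i is
    concatenated channel-wise with the decoder feature map at layer n - i."""
    return np.concatenate([enc_feat, dec_feat], axis=0)  # axis 0 = channels

scores = np.array([[0.9, 0.8], [0.7, 0.6]])  # toy 2 x 2 patch score map
decision = patchgan_decision(scores)         # average patch score
merged = unet_skip(np.zeros((64, 32, 32)), np.ones((64, 32, 32)))
```

The skip connections let low-level spatial detail from the encoder bypass the bottleneck, which is why the generator can keep sharp edges while still using global context.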
(4) The loss function of the pix2pix model is composed of an L1 loss and the Vanilla GAN loss. Let x be the input image, y the expected output, z the noise, G the generator, and D the discriminator:

L_cGAN(G, D) = E_{x,y}[log D(x, y)] + E_{x,z}[log(1 − D(x, G(x, z)))]
L_{L1}(G) = E_{x,y,z}[‖y − G(x, z)‖₁]
G* = arg min_G max_D L_cGAN(G, D) + λ L_{L1}(G)
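A batch-level numpy sketch of this composite objective, assuming sigmoid discriminator outputs in (0, 1); the weight λ = 100 follows the default of the original pix2pix paper.

```python
import numpy as np

def d_loss(d_real, d_fake):
    """Discriminator maximizes log D(x, y) + log(1 - D(x, G(x, z)));
    negated here so it reads as a loss to minimize."""
    return -(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

def g_loss(d_fake, y, y_hat, lam=100.0):
    """Generator minimizes the adversarial term plus the weighted L1 term."""
    adv = np.mean(np.log(1.0 - d_fake))
    l1 = np.mean(np.abs(y - y_hat))
    return adv + lam * l1
```

The L1 term pulls the output toward the ground-truth colors pixel by pixel, while the adversarial term pushes it toward the distribution of real color images.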

Method framework
In view of the detail problems of generative adversarial network based image color rendering in complex scenes, this paper proposes a color rendering method for robust images using a Gabor filter based improved pix2pix model. The network framework is shown in Figure 2, and the rendering process is shown in Figure 3. After training on the selected dataset, the trained generator is used for color rendering.
Firstly, we preprocess the image with the Gabor filter and extract the texture feature set of the image as input for training and validation. By comparing the 24 Gabor texture feature maps with 6 scales and 4 directions, the texture map with scale 7 and direction 0° is found to have the best color rendering effect. Secondly, this paper utilizes the existing pix2pix architecture for image translation to perform color rendering by learning the mapping between grayscale images and color images. Finally, although the pix2pix model solves some problems of generative adversarial networks, it still trains unstably on large-scale image datasets. Therefore, the least squares loss of LSGAN [35] is used in the objective function of the pix2pix model, and a penalty term similar to that of WGAN-GP [36] is added. With the improved overall model framework, a series of comparison experiments shows that the proposed method performs better on the rendering of robust images.

Improved pix2pix model
The generator in a generative adversarial network aims to make its output data distribution as close as possible to the distribution of the real data, while the discriminator must judge between real data and generator outputs to distinguish real samples from fake ones. Enforcing a Lipschitz constraint on the network allows the loss function to drive the generation of more realistic data. Traditional generative adversarial networks use the cross-entropy (Vanilla GAN) loss as the loss function; classification may be correct, but gradient vanishing occurs when the generator is updated [36,37]. LSGAN uses the squared loss as the objective function: the least squares loss penalizes fake samples that the discriminator classifies as real but that lie far from the decision boundary, dragging them back toward the boundary and thereby improving the quality of the generated images.
Therefore, compared with traditional generative adversarial networks, the images generated by LSGAN have higher quality and the training process is more stable, so the least squares loss function is adopted in the framework of this paper:

min_D V_LSGAN(D) = ½ E_{x∼p_data(x)}[(D(x) − b)²] + ½ E_{z∼p_z(z)}[(D(G(z)) − a)²]
min_G V_LSGAN(G) = ½ E_{z∼p_z(z)}[(D(G(z)) − c)²]

where x is a real sample drawn from the data, z is the input noise, G is the generator, D is the discriminator, a and b are the labels of the generated sample and the real sample, respectively, and c is the value the generator wants the discriminator to assign to generated images so that they are regarded as real data.
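With the common label choice a = 0, b = c = 1 (an assumption; the paper does not state its values), the least squares losses can be sketched as:

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake, a=0.0, b=1.0):
    """Discriminator: push real outputs toward label b, fakes toward a."""
    return 0.5 * np.mean((d_real - b) ** 2) + 0.5 * np.mean((d_fake - a) ** 2)

def lsgan_g_loss(d_fake, c=1.0):
    """Generator: push discriminator outputs on fakes toward c ("real")."""
    return 0.5 * np.mean((d_fake - c) ** 2)
```

Unlike the log loss, this quadratic penalty grows with the distance of a fake sample from the decision boundary, which is what drags well-separated fakes back toward it.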
Generative adversarial networks can model data distributions well, but they suffer from training instability, and improving their training stability is a hot topic in deep learning. Wasserstein GAN (WGAN) [38] uses the Wasserstein distance to build a value function with better theoretical properties than the JS divergence, constraining the Lipschitz constant of the discriminator function; this largely solves the training instability and mode collapse problems of generative adversarial networks and ensures the diversity of generated samples [39]. WGAN-GP further improves on WGAN with a gradient penalty term derived from the Wasserstein distance, with penalty coefficient 10.
The objective function of WGAN-GP is as follows, adding the gradient penalty term to the original critic loss:

L = E_{x̃∼P_g}[D(x̃)] − E_{x∼P_r}[D(x)] + λ E_{x̂∼P_x̂}[(‖∇_{x̂} D(x̂)‖₂ − 1)²]

where x̂ = εx + (1 − ε)x̃ with ε ∼ U[0, 1], P_r and P_g are the real and generated distributions, and λ is the penalty coefficient.
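The gradient penalty can be sketched in numpy for a toy linear critic D(x) = w·x, whose input gradient is simply w (in a real model, autograd supplies ∇_x̂ D(x̂)); the interpolation and the (‖∇D‖ − 1)² penalty follow the WGAN-GP formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def gradient_penalty(w, x_real, x_fake, lam=10.0):
    """WGAN-GP penalty for a toy linear critic D(x) = w @ x."""
    eps = rng.uniform(size=(x_real.shape[0], 1))  # one epsilon per sample
    x_hat = eps * x_real + (1.0 - eps) * x_fake   # random interpolates
    grad = np.broadcast_to(w, x_hat.shape)        # dD/dx is constant = w
    norms = np.linalg.norm(grad, axis=1)
    return lam * np.mean((norms - 1.0) ** 2)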

Experiments
To verify the effectiveness and accuracy of the proposed method, we conducted extensive experiments on the summer dataset [40], with 1231 training images and 309 test images. Experiment 1 tests the effect of applying the Gabor filter and different objective functions in the pix2pix model. Experiment 2 tests the rendering effect when different Gabor texture feature maps are given as input. Experiment 3 tests whether the penalty term should be added to the discriminator. Experiment 4 tests the rendering of low-quality (robust) images, adding noise and dimming the image brightness to assess the robustness of the model.
Training parameters: The experiments were performed on a PC with an Intel(R) Core(TM) i7-9750H CPU @ 2.60 GHz, an NVIDIA GeForce GTX 1650 graphics card, and CUDA+cuDNN for accelerated training. The proposed method is implemented in Python 3.7 with the PyTorch framework. The number of training iterations is 200, the optimizer is Adam, the batch size is 1, the learning rate is 0.0002, and the number of processes is 4. Network structure and implementation details: All models are trained on 256 × 256 images. The input image of the model is 512 × 256: the left half is the original color image and the right half is the texture feature map produced by the Gabor filter, as shown in Figure 4. By default, the pix2pix model uses a U-net-like generator, a PatchGAN discriminator, and the Vanilla GAN loss.
Evaluation metrics: To reflect the image color rendering quality of different models more objectively, the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) indexes are adopted to evaluate the rendered images [41,42]; both are commonly used in image processing evaluation. PSNR is an objective standard for evaluating the quality of the produced color image:

MSE = (1 / (H · W)) Σ_{i=1}^{H} Σ_{j=1}^{W} (X(i, j) − Y(i, j))²
PSNR = 10 · log₁₀((2ⁿ − 1)² / MSE)

where H and W represent the height and width of the image respectively, (i, j) indexes each pixel, n is the number of bits per pixel, and X and Y are the two images being compared.
Because the PSNR index has limitations and cannot fully reflect the consistency between image quality and human visual perception, the SSIM index is used for further comparison. SSIM measures the similarity of two images; by comparing the image rendered by the model with the original color image, the effectiveness and accuracy of the algorithm are demonstrated. It is computed as

SSIM(x, y) = ((2µ_x µ_y + c₁)(2σ_xy + c₂)) / ((µ_x² + µ_y² + c₁)(σ_x² + σ_y² + c₂))

where µ_x and µ_y are the means of the real image and the generated image, σ_x² and σ_y² are their variances, σ_xy is their covariance, c₁ = (k₁L)² and c₂ = (k₂L)² are constants that maintain stability, L is the dynamic range of the pixel values, k₁ = 0.01, and k₂ = 0.03.
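Both indexes can be sketched in a few lines of numpy. Note that this SSIM uses global statistics over the whole image rather than the usual sliding Gaussian window, so its values will differ somewhat from windowed implementations such as scikit-image's.

```python
import numpy as np

def psnr(x, y, n_bits=8):
    """PSNR in dB between two images with n-bit pixels."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    max_val = 2 ** n_bits - 1
    return 10.0 * np.log10(max_val ** 2 / mse)

def ssim_global(x, y, L=255.0, k1=0.01, k2=0.03):
    """Single-window SSIM computed over the whole image."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = np.mean((x - mu_x) * (y - mu_y))
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```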

Using Gabor filters and different objective functions
In this study, the Gabor filter extracts the texture features of the image at 6 scales and in 4 directions. For convenience, according to the texture feature set shown in Figure 1(b), the images are numbered from left to right and from top to bottom, with direction d and scale s, as shown in Figure 5. For example, G1 means "s = 7, d = 0°", i.e. direction 0° and scale 7; G6 means "s = 17, d = 0°", i.e. direction 0° and scale 17. By default, the pix2pix model uses the Vanilla GAN loss. The pix2pix model with the least squares loss function is called LSpix (least squares pix2pix), and the model using Gabor texture map n is called pixGn (pix2pix Gabor n), n = 1, 6, 7, 13. To test the effect of the Gabor filter and of different objective functions in the pix2pix model, we compare adding the Gabor filter (Figures 6(c),(e)) against not adding it (Figures 6(b),(d)), and using the least squares loss (Figures 6(d),(e)) against the Vanilla GAN loss (Figures 6(b),(c)). Comparing the images in Figure 6 confirms that rendering with the least squares loss and Gabor filter preprocessing, i.e. the LSpixG1 model, is better. This is because the Gabor filter preprocesses the images and provides multi-scale, multi-direction features, enabling good and fast feature extraction and learning during training; moreover, compared with other loss functions, the least squares loss saturates at only one point and is less prone to gradient vanishing. Tables 1 and 2 compare the distortion and structural similarity between the rendered image and the ground truth, showing the maximum, minimum, and average indexes as a quantitative complement to Figure 6. The LSpix model has the highest maximum and average PSNR, which are 3.591 dB and 1.083 dB higher than those of the pix2pix model.
Meanwhile, the LSpix model has the highest maximum, minimum, and average SSIM, which are 1.618%, 15.649% and 3.848% higher than those of the pix2pix model, respectively. This shows that our model is closer to the ground truth in structure and reproduces colors more faithfully.

Input of different Gabor texture maps
In order to test the rendering effect when different Gabor texture feature maps are input, we use different feature maps as input. Figure 7 shows how different Gabor texture maps are rendered when the Vanilla GAN loss is the objective function of the pix2pix model. Figures 5(c),(d), i.e. scale 7 with direction 45° or 90°, contain incomplete details of the original image, so the input texture features are incomplete and the generated images are blurred, as shown in Figures 7(a),(b). Although the 7th and 13th texture maps taken together were used as the training set (pixG7+G13 model), with a total of 1231 × 2 images, the rendering effect was not significantly improved, as shown in Figure 7(b). Evidently, comparing the images in Figure 7, the visual effect of Figures 7(c)-(e) is good and not blurred. Tables 3 and 4 show the evaluation indexes for different input feature maps; the data show that an incomplete texture feature map is undesirable as input. To compare the efficiency of different input texture maps, the training times are listed in Table 5 (in hours). Regardless of whether the Gabor filter was used or which texture map was input, the training time was around 9 hours. However, if two texture maps are used for training, as with G1 and G13 in the pixG1+G13 model, the training set doubles and so does the training time; even though the results shown in Figure 7(d) are good, this approach is not desirable. This is because filtering must extract multi-scale and multi-direction features while removing redundant information; once important information is removed, the results are inevitably affected, yielding blurred images.

Adding the penalty term

Figure 8 shows the performance with and without a penalty term in the discriminator, based on the pixG1 model.
Figure 8(a) shows the effect without the penalty term, and Figure 8(b) the effect with it. Obviously, Figure 8(b) has fewer errors in detail and a better visual effect. The penalty term applies a gradient penalty at interpolated points so that the model satisfies the Lipschitz constraint. Adding a penalty term similar to that of WGAN-GP largely solves the training instability and mode collapse problems of the GAN model and ensures the diversity of generated samples. Tables 6 and 7 show the evaluation indexes with and without the penalty term. With the penalty term added, the LSpix GP model achieves the highest minimum PSNR, 0.904 dB higher than that of the original pix2pix model. Evidently, among the texture maps extracted by the Gabor filter, the map with scale 7 and direction 0° gives the best training effect. Furthermore, when the objective function is the least squares loss, the average SSIM and overall performance improve; when the penalty term is added, the maximum and average SSIM are the highest, 1.753% and 1.083% higher than those of the pix2pix model. Therefore, the images rendered by the LSpixG1 GP model are better than those of the original model. To compare the operating efficiency of different objective functions and of adding the penalty term, the running times are listed in Table 8 (in hours). For example, LSpixG6 GP denotes using the least squares loss and adding the penalty term, with direction 0° and scale 17. Regardless of whether the Gabor filter was used, which texture map was input, or whether the Vanilla GAN loss or the least squares loss was the objective function, the training time was approximately 9 h. Although the efficiency with the filter alone is basically unchanged, using the filter after adding the penalty term increases the training time by 2-3 h.
Therefore, this study adopts the LSpixG1 GP model, namely the Gabor texture map with scale 7 and direction 0° as the model input, together with the least squares loss and the penalty term.

Rendering robust images
In order to evaluate the robustness of the model when rendering robust images, the rendering effect on low-quality images was tested by adding noise and dimming the image brightness, as shown in Figure 9. For the noise test, Gaussian noise with mean 0 and variance 10 is added. For the low-illumination test, a power operation with exponent 2.5 is applied to the pixels of the image to generate low-illumination images. We use the PSNR metric to evaluate the rendering results of each model on low-quality images. As shown in Table 9, the images rendered by the LSpix model are of higher quality for noisy inputs. As shown in Table 10, images rendered by the Gabor filter models are generally of good quality for low-illumination inputs. With the Gabor filter, the least squares loss as the objective function, and the penalty term added, the image quality of the LSpixG1 GP model is higher than that of the original model. This is because the Gabor filter avoids, to a certain extent, the interference of noise with the image, and when extracting features it captures depth information that mitigates the influence of illumination. Clearly, the proposed method is robust for color rendering of low-quality images. Note: bold font marks the best value in each column.
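The two degradations can be sketched as follows. The noise variance and the exponent match the values reported above, while the clipping to [0, 255] and normalization are implementation assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def add_gaussian_noise(img, mean=0.0, var=10.0):
    """Additive Gaussian noise with the given mean and variance."""
    noise = rng.normal(mean, np.sqrt(var), img.shape)
    return np.clip(img.astype(np.float64) + noise, 0, 255).astype(np.uint8)

def darken(img, power=2.5):
    """Raise normalized pixels to a power > 1 to simulate low illumination."""
    norm = img.astype(np.float64) / 255.0
    return (255.0 * norm ** power).astype(np.uint8)

gray = np.full((8, 8), 128, dtype=np.uint8)
noisy = add_gaussian_noise(gray)
dark = darken(gray)
```

Since the exponent is applied to values in [0, 1], every pixel below full white is pushed toward 0, which is the intended low-illumination effect.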

Conclusions
We proposed a novel image color rendering method for robust images using a Gabor filter based improved pix2pix model and demonstrated its feasibility and superiority on a variety of tasks. It renders robust images automatically and handles low-quality images robustly. The experimental results on the summer dataset demonstrate that the proposed method achieves high-quality image color rendering. At present, the image resolution of deep learning based image processing is limited, which restricts the practical application of rendering methods. In the future, we will focus on increasing the resolution of the network model's input images.