One-to-One Example-based Automatic Image Coloring Using Deep Convolutional Generative Adversarial Network

Due to the indeterminate nature of the problem, image colorization techniques currently rely heavily on human intuition. Using deep convolutional networks, we can build a system that takes a source image to guide local-dependent feature color mapping and colors a grayscale target image. Unlike most other convolutional neural network approaches, which require a large amount of training data, our proposed system uses only one training image for each target image. Our system is based on deep convolutional generative adversarial networks, which combine concepts of both supervised and unsupervised learning. We propose a model architecture, objective functions, and both preprocessing and postprocessing algorithms for the image coloring process. We evaluated our system on a variety of input images and showed that it produces excellent results.


I. INTRODUCTION
Image colorization, which adds color to a grayscale image, is one of the classical topics of computer vision. Traditionally, there are two approaches to image colorization: a scribble-based approach and an example-induced approach. The former requires the user to provide local color information, that is, a color distribution for each feature in the image to be colored. The latter does not rely on user guidance; rather, it requires color information from a separate image that is not identical to the target grayscale image. Strictly speaking, example-induced colorization transfers the color composition from a full-color image to a grayscale image. The main objective of our work is to color a grayscale image based on a similar color image using example-induced colorization.
Many existing studies use example-based image colorization. Recent approaches use neural network (NN) methods that draw on the color information of a large number of images. In practice, however, it is unrealistic to expect to have a large number of color images that are similar to the target grayscale image. Hence, we focus on one-to-one image colorization, i.e., we use only one source image to color the target grayscale image. To achieve this, our method uses an NN approach that focuses on local image features rather than global image features.
Manuscript received October 21, 2016; revised May 17, 2017.
The Deep Convolutional Generative Adversarial Network (DCGAN), a modified form of the generative adversarial network (GAN), is known to be a suitable model for image generation in the unsupervised setting. Our method adopts a variation of DCGAN, which involves both supervised and unsupervised learning, to generate a set of local colored images from a grayscale target image. The generated local images are then merged into a single colored image using the structural similarity (SSIM) index [1] as the criterion. Subsequently, LAB space separation and Quickshift segmentation [2] are applied to enhance the result of the model.

A. Colorization Methods
Among approaches using user-inserted color scribbles, Levin et al. [3] used optimization techniques based on the premise that neighboring pixels with similar luminosity should have similar color. Yatziv and Sapiro [4] added the concept of color blending, and Nie et al. [5] reduced the computation time using quadtree decomposition. Among example-induced approaches, Welsh et al. [6] transferred the entire color model to the target image by comparing luminance and texture between the source and target images. Gupta et al. [7] used a fast cascade feature matching scheme, and Liu and Zhang [8] performed locally weighted regression on both the grayscale image and the source image for faster computation. Charpiat et al. [9] used the probability distribution of all possible colors instead of choosing the most probable color. Recent papers use neural networks for colorization and provide fully automatic approaches. Cheng et al. [10] applied deep learning techniques and used joint bilateral filtering for post-processing. Iizuka et al. [11] extracted global and local features separately and removed the dependency on segmentation.

B. GAN and Its Extensions
Goodfellow et al. [12] proposed an adversarial network structure in which the generator model and the discriminative model are trained simultaneously. Their paper theoretically formulates the global optimality and convergence of the generated distribution. One notable advancement of the GAN architecture is by Denton et al. [13], who use a cascade of convolutional neural networks (CNNs) [14] within a Laplacian pyramid structure. Radford et al. [15] add unsupervised learning methods to image processing and introduce DCGAN. In their paper, the authors argue that DCGAN is a result of joining GAN and CNN.

III. LOCAL IMAGE COLORIZATION BY DCGAN
We propose a variation of DCGAN in which the generator network maps local images of a target to a generative distribution of colored images. Unlike most neural network methods, our approach does not include any feature extractor, such as a local descriptor or semantic segmentation, in the network configuration. This is because we intend to measure the native performance of DCGAN on an image colorization task.

A. Our Proposed DCGAN Architecture
The original DCGAN architecture [15] consists of two deep neural networks, a discriminator and a generator. We adopt a basic DCGAN but modify it to take two-dimensional random images as input rather than a one-dimensional random vector. Fig. 1 presents the whole architecture of our DCGAN. Note the input and output of both the discriminator and the generator. The discriminator input has 64 rows by 64 columns with 3 dimensions, and the discriminator output has 100 rows with 1 dimension. The generator input has 64 rows by 64 columns with 1 dimension, and the generator output has 64 rows by 64 columns with 3 dimensions.
The discriminator in our architecture is identical to that of the original DCGAN paper [15], which is a CNN-based discriminative model. Within the whole architecture, the discriminator distinguishes imitation images from authentic images. It consists of four convolution layers, each connected to a leaky ReLU activation [16] and batch normalization [17], followed by one linear fully connected layer. The sigmoid function applied at the end of the model converts the output vector into a probability. Table I refers to details of our discriminator architecture.
The generator in our architecture differs slightly from the original model. It takes on the task of generating colored images from monochrome images. It consists of four convolution layers and four transpose convolution layers. After each convolution or transpose convolution layer, a ReLU activation [18] and batch normalization are applied. The hyperbolic tangent function is applied at the end of the generator. Table II refers to details of our generator architecture.
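As a rough sanity check on the shapes described above, the following sketch walks a 64 x 64 x 1 input through four convolutions and four transpose convolutions. Only the input shape, the output shape, and the layer counts come from the text; the kernel size, stride, padding, and channel widths are assumptions chosen for illustration, and the function names are hypothetical.

```python
def conv_out(size, kernel, stride, pad):
    """Spatial output size of a convolution."""
    return (size + 2 * pad - kernel) // stride + 1


def deconv_out(size, kernel, stride, pad, out_pad):
    """Spatial output size of a transpose convolution."""
    return (size - 1) * stride - 2 * pad + kernel + out_pad


def generator_shapes(size=64, in_ch=1, out_ch=3):
    """List (rows, cols, channels) after each layer of a generator-like stack."""
    # Assumed hyperparameters: 5x5 kernels, stride 2, padding 2.
    kernel, stride, pad = 5, 2, 2
    enc_channels = [64, 128, 256, 512]      # assumed channel widths
    dec_channels = [256, 128, 64, out_ch]   # assumed channel widths
    shapes = [(size, size, in_ch)]
    for ch in enc_channels:                 # four convolution layers
        size = conv_out(size, kernel, stride, pad)
        shapes.append((size, size, ch))
    for ch in dec_channels:                 # four transpose convolution layers
        size = deconv_out(size, kernel, stride, pad, out_pad=1)
        shapes.append((size, size, ch))
    return shapes


for shape in generator_shapes():
    print(shape)
```

Under these assumed settings the spatial size halves four times and then doubles four times, so the output returns to 64 x 64 with 3 channels, matching the generator output shape stated in the text.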

B. Objective Function of DCGAN and Training
A discriminator and a generator are the core concepts of GAN; they are distinct but mutually dependent competing networks, which is why the network is called "adversarial." Our model roughly follows this strategy, but we adjust the objective function of the discriminator because we have the original color images and do not need to depend entirely on unsupervised results.
The objective function of the discriminator is given in (1), where $D$ denotes the function defined as the discriminator and $G$ the function defined as the generator; $x_i$ is the $i$-th full-color image among the cut pieces of the source image, and $z_i$ is the $i$-th grayscale image derived from $x_i$. The objective function of the generator is given in (2), where $\lVert \cdot \rVert_F$ denotes the Frobenius norm. Note that the first term of (2) is a kind of cross entropy and the second term of (2) is the mean squared error (MSE).
The values of the objective functions are computed during the training of the discriminator and the generator with a stochastic gradient descent method. Each network's output contributes to the training loss of the other network. The whole process is a type of minimax game and can be formulated by (3) using the same notation as (1) and (2).
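Because the typeset equations are not reproduced in this text, the block below gives one plausible form of (1) to (3) that is consistent with the description above; it is a sketch, not necessarily the exact formulation used. It assumes $D$ and $G$ denote the discriminator and the generator, $x_i$ the $i$-th color piece of the source image, $z_i$ its grayscale version, and $N$ the number of pieces. The names $L_D$ and $L_G$, the averaging over $N$, and the unweighted sum of the two generator terms are all assumptions.

```latex
\begin{align}
L_D &= -\frac{1}{N}\sum_{i=1}^{N}\Big[\log D(x_i) + \log\big(1 - D(G(z_i))\big)\Big] \\
L_G &= \frac{1}{N}\sum_{i=1}^{N}\log\big(1 - D(G(z_i))\big)
     + \frac{1}{N}\sum_{i=1}^{N}\big\lVert G(z_i) - x_i \big\rVert_F^{2} \\
\min_{G}\max_{D}\;&
  \frac{1}{N}\sum_{i=1}^{N}\Big[\log D(x_i) + \log\big(1 - D(G(z_i))\big)\Big]
  + \frac{1}{N}\sum_{i=1}^{N}\big\lVert G(z_i) - x_i \big\rVert_F^{2}
\end{align}
```

In this form the first term of the generator loss is the cross-entropy part and the second is the MSE part. The MSE term does not depend on $D$, so minimizing $L_D$ over $D$ and $L_G$ over $G$ corresponds to the minimax problem in the last line.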

C. Preprocessing and Postprocessing
We fix the input size of our architecture to 64 rows by 64 columns, while the goal of the model is to colorize a grayscale image of 256 rows by 256 columns. The objective of preprocessing is to generate a large number of locally cut training samples to feed the generator. The source image is cut into random 64 x 64 regions, and random geometric transformations, such as rotating, flipping, warping, and resizing, are then applied to these pieces. Brightness or contrast changes can also be applied. The target image is also cut, in a random way, into pieces of the conforming shape arranged on a regular square lattice. For example, a target image is cut into 64 pieces as defined by (4).
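The source-side preprocessing described above can be sketched as follows. This is a minimal illustration assuming NumPy and an RGB image with values in [0, 1]; only random cropping, horizontal flips, and 90-degree rotations are shown, while warping, resizing, and brightness or contrast changes are omitted. The function names, the seed handling, and the BT.601 grayscale weights are assumptions and are not taken from the paper.

```python
import numpy as np


def random_patches(image, n_patches, size=64, seed=0):
    """Cut random size x size patches and apply random flips and rotations."""
    rng = np.random.default_rng(seed)
    height, width = image.shape[:2]
    patches = []
    for _ in range(n_patches):
        top = rng.integers(0, height - size + 1)
        left = rng.integers(0, width - size + 1)
        patch = image[top:top + size, left:left + size]
        if rng.random() < 0.5:
            patch = patch[:, ::-1]                      # horizontal flip
        patch = np.rot90(patch, k=int(rng.integers(0, 4)))  # 0 to 3 quarter turns
        patches.append(patch)
    return np.stack(patches)


def to_gray(patches):
    """Luminance-weighted grayscale (assumed BT.601 weights), keeping a channel axis."""
    weights = np.array([0.299, 0.587, 0.114])
    return (patches @ weights)[..., None]


source = np.random.default_rng(1).random((256, 256, 3))  # stand-in source image
color_patches = random_patches(source, n_patches=8)
gray_patches = to_gray(color_patches)
print(color_patches.shape, gray_patches.shape)
```

Each color patch and its grayscale version form one supervised training pair: the grayscale patch is the generator input and the color patch is the reference.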
In (4), each cut image is identified by its index and is the piece of the target image that spans a given range of row-wise pixel indices and a given range of column-wise pixel indices.
The objective of post-processing is to merge the colorized pieces of the target image into a single complete colorized image and to enhance the colorized result. The difficulty in this merging step is that the 64 pieces overlap. We use SSIM to resolve this. Each colorized piece is converted back to grayscale, and SSIM is calculated from the structural difference between the corresponding local target image and the generated piece. The merging is a weighted average: the merged image is the sum of the generated color distributions, each multiplied by its relative SSIM fraction. A Gaussian filter is applied to the merged image to eliminate the visible boundary lines that form where pieces meet.
Then, to enhance the colorization result, we use the Quickshift algorithm, a two-dimensional image segmentation method based on an approximation of kernelized mean-shift. By applying Quickshift to the original grayscale target image, we obtain segmentation regions, and for each segment label the color of the region is replaced by the median of the discrete color distribution in that region. Finally, the original target image and the colorized result are transformed from RGB space into LAB space, and the A and B channels of the colorized result are merged with the L channel of the target image into a single LAB image.
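The SSIM-weighted merging step described above can be sketched as follows. This is a minimal illustration under stated assumptions: SSIM is computed over each whole patch as a single window, which simplifies the locally windowed index of [1]; the constants 0.01 and 0.03 are the commonly used SSIM defaults; and the function names, the patch corner list, and the grayscale weights are hypothetical. The Gaussian smoothing of seams is not included.

```python
import numpy as np


def global_ssim(a, b, data_range=1.0):
    """SSIM over the whole patch as one window (a simplification)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))


def merge_patches(target_gray, colored, corners, size=64):
    """Blend overlapping colored patches, each weighted by its SSIM score."""
    height, width = target_gray.shape
    acc = np.zeros((height, width, 3))
    weight_sum = np.zeros((height, width, 1))
    gray_weights = np.array([0.299, 0.587, 0.114])  # assumed grayscale weights
    for patch, (top, left) in zip(colored, corners):
        ref = target_gray[top:top + size, left:left + size]
        # Convert the colorized patch back to grayscale and compare structure.
        score = global_ssim(patch @ gray_weights, ref)
        w = max(score, 1e-6)                        # keep weights positive
        acc[top:top + size, left:left + size] += w * patch
        weight_sum[top:top + size, left:left + size] += w
    # Dividing by the summed weights gives the relative SSIM fraction per pixel.
    return acc / np.maximum(weight_sum, 1e-12)
```

Pixels covered by several patches receive a convex combination of their colors, so a patch whose structure matches the target poorly contributes less to the overlap region.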
The learning rate of both the discriminator and the generator is set to 0.0006. We select Adam [20] as the stochastic gradient descent method, with a momentum term of 0.5. The number of generated pieces of the target image is 12,800. For Quickshift, the scale of the local density approximation, the level in the hierarchical segmentation, and the width of the Gaussian kernel used in smoothing the sample density are set to 0, 10, and 2, respectively. In the training step, the number of epochs ranges from 12 to 16.
Fig. 2 shows some results of our model. We can see that the model works well in coloring the target grayscale image from only a single source image. Fig. 3 shows the change of the loss of the discriminator and the generator over the training steps. Fig. 4 shows the change of the results over epochs, and Fig. 5 shows the impact of the Quickshift post-processing step.
Fig. 6 illustrates a limitation of our approach. We suppose that the error in estimating the color distribution comes from two causes. First, our model focuses on the distribution of local grayscale images, so it cannot see the global context of an image. Second, if no piece of the source image resembles a given local grayscale piece of the target image, the generator cannot generate a proper color distribution.
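The segment-wise color replacement used in the Quickshift post-processing step can be sketched as follows. The sketch assumes NumPy and an integer label map that has already been produced by a Quickshift implementation; the segmentation itself is not performed here, and the function name is hypothetical.

```python
import numpy as np


def median_color_per_segment(colored, labels):
    """Replace every pixel of a segment with that segment's per-channel median color."""
    out = np.empty_like(colored)
    for label in np.unique(labels):
        mask = labels == label
        # colored[mask] has shape (n_pixels, 3); take the median over pixels.
        out[mask] = np.median(colored[mask], axis=0)
    return out
```

Using the median rather than the mean makes each region less sensitive to a few badly colored pixels, such as isolated blobs inside an otherwise consistent segment.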

V. CONCLUSION AND FUTURE WORK
We proposed a DCGAN-based model for one-to-one colorization of a grayscale image, using only a single source image that is similar to the target image. Our model has an advantage over other NN approaches, which require numerous image instances. In this paper, we provided the model architecture, the objective functions, and both the preprocessing and post-processing algorithms. We tested our model on a variety of input images and showed that it produced excellent results.
We now propose two possible approaches to overcome the limitations of our model. First, in GAN training, finding proper data augmentation for a specific task always matters. Here we did not explore many types of data augmentation; considering it further could improve the performance of our model. Second, we focus on local image features, so we do not consider any other feature descriptor or image classification task. This is intentional, because we intend to measure the performance of a DCGAN model alone on example-based image colorization. However, if the main purpose is to improve model performance, it may be better to combine the model with other feature descriptors in preprocessing and postprocessing, or in the NN model architecture. This is plausible because such approaches have also been applied to other NN-based example-based colorization models to improve their performance. For example, Cheng et al. [10] and Iizuka et al. [11] introduce those techniques into their own models and report improved colorization performance.
Furthermore, although we present our model as one for an example-based task, it can also be easily adapted to a general scribble-based colorization task. Generally, generating a colored image from a scribbled image is algorithmically harder than the opposite direction. This is because morphological segmentation is easier and more robust than semantic segmentation in recent image processing. If we design a proper preprocessing step that reduces a colored image to a user-like scribbled version of a grayscale image, we can slightly readjust our model and train the adjusted model to solve the scribble-based colorization task. NN approaches are rarely applied to scribble-based colorization, so this is worthy of future work.

Figure 1. Flow of our DCGAN architecture. It consists of two NN architectures. The achromatic part represents the generator and the blue part represents the discriminator. Each plane corresponds to one NN layer. Note that the large blue arrow and the white arrow do not indicate the same thing: the white arrow denotes image input used as training data for the discriminator, whereas the blue arrow denotes the discriminator output that is used as one of the objective function terms in generator training.

After the preprocessing, our model treats the cut pieces of the source image and the target image as training data and testing data, respectively.

Figure 2. Demonstration of the model: (a) Source images, which provide the color distribution to the model. (b) Target images that the model has to colorize. (c) Results of target colorization. These images were selected to show successful results of our model. Most parts of each result are quite plausible.

Figure 3. Changes in the loss of the discriminator and the generator over the training steps. This result comes from the instance shown in Fig. 4. The horizontal axis indicates the epoch. From top to bottom, the vertical axes indicate the loss of the discriminator, the loss of the generator, the cross entropy of the generator, and the MSE of the generator. In all cases, a sharp decline appears in the early epochs. However, as with a usual GAN, it is hard to reduce the loss of the generator and the discriminator.

Figure 4. Changes of the result over epochs. The top left image is the source image and the top right image is the target image. The remaining sixteen images show the change of the result over epochs. In the horizontal direction, the training epoch increases from zero to sixteen. Noticeable differences in performance appear with each epoch.

Figure 5. Quickshift segmentation and its application in our model. The blurred image (left) is transformed into a clear, discernible image (right). A pink blob appears in the upper right part of the blurred image, where the color is obviously assigned wrongly. In the transformed image, this erroneous inference is reduced by Quickshift segmentation and mean color selection.

Figure 6. Example of significant failure of our metric. (left) Source image. (middle) Target grayscale image. (right) Target colored image. In a human's semantic sense, the ground truth result is obvious: the color of the buildings should be gray or blue, and the color of the needle-leaf trees should be brown and green. Note that much of the local color distribution is generated wrongly.
TABLE I. ARCHITECTURE OF THE DISCRIMINATOR

TABLE II. ARCHITECTURE OF THE GENERATOR