1 Introduction

Image style transfer has shown great promise for new forms of image manipulation. The neural artistic style transfer method proposed by Gatys et al. [7] achieved great success with convolutional neural networks and has recently been followed by many works [2, 3, 8, 11, 15, 29, 30, 31, 34]. These methods produce convincing visual results by transferring artistic features from a reference painting onto a content photograph. However, they suffer from a visual distortion problem, even when both the content and reference style images are photographic: the stylized results contain visually intricate distortions that give them a painting-like appearance. Luan et al. [17] point out that the distortions appear only during the style transformation process, and they therefore propose a photorealism regularization term based on locally affine colour transformations to reconstruct fine content details. To avoid unexpected geometric matching, Luan et al. [17] also integrate semantic segmentation masks into Gatys et al.'s method [7]. Although the content spatial structures are preserved in many situations, details and exact shapes of structures are erased when the semantic segmentation is inaccurate or contains overlapping areas. Moreover, computing the matting Laplacian matrix and the semantic segmentation consumes considerable extra time for high-quality output. After investigating the style transformation procedure, we discover that the distortions occur at two stages: the spatial structures of the content image may be lost during the content-preserving process, and unexpected geometric matching can be introduced during the style transformation process. Figures 1 and 2 illustrate that distortions occur at both the content-preserving and style transformation processes. For example, as shown in zoom-ins (c-ii), the buildings of the content image are clearly distorted by the content-preserving process. Moreover, as shown in zoom-ins (c-iii), the buildings are also distorted after the style transformation process. However, the buildings in (c-iii) have different shapes and edges from those in (c-ii), which means the zoomed-in buildings are distorted twice.

To improve photorealism, this paper introduces an additional similarity layer with a corresponding loss function that constrains both the content preservation and style transformation processes. This similarity layer is added at several places in the convolutional neural network to prevent distortions by minimizing a similarity loss function together with the other loss functions proposed in the fast neural style algorithm [13].

The proposed method consists of two stages: a detail reconstruction process and a style transfer process. Our system has two key components: a dual-stream deep convolution network serving as the loss network and edge-preserving filters forming the style fusion model (SFM). The edge-preserving filter extracts details and colour information from the outputs generated by the loss network, so that our scheme combines the details without colour from the content with the colour without details from the reference style. During the optimization process, the content and style features are first captured by the additional layers in the loss network, and then a random white noise image X is passed through both the detail reconstruction and style transfer networks. The final output of the SFM is the stylized result.

The main contributions of this paper are as follows: we investigate the problem of Gatys et al.'s method and find that the loss of photorealism in stylized results is caused by distortions occurring at both the content preservation and style transformation stages; we propose a photographic style transfer method that improves the photorealism of stylized results. A similarity loss function based on the L1-norm is applied to reconstruct finer content details and prevent the geometric mismatching problem, and a style fusion model using an edge-preserving filter is utilized to reduce artefacts.

2 Related work

Global colour transfer methods. Global colour transfer methods tend to utilize spatially invariant objective functions to transfer images. Input images with simple styles can be processed well by these algorithms [9, 12, 22, 23]. For example, the colour shift technique proposed by Reinhard et al. [23] extracts global features in a decorrelated colour space from the reference style image and transfers them onto the content input. Pitié et al. [22] propose an approach that also achieves global style transfer by matching full 3D colour histograms between images with a series of 1D histogram transformations. Although these methods can handle several simple situations like tone curves (e.g., low or high contrast), they are limited in their ability to match complex areas with corresponding colour styles.

Local colour transfer methods. Local colour transfer research proposes spatial colour mapping techniques such as semantic segmentation [10, 16, 17, 21, 27, 28, 32] to handle various applications such as semantic colour gradient transfer (dark and bright) [10, 21, 27, 32], transfer of artistic edits [1, 24, 26, 33], and painting of stylistic features [3, 4, 7, 13, 15, 31, 34]. Many of these works [7, 10, 13, 15, 17, 28, 31, 34] use convolutional neural networks to achieve this goal. Gatys et al. [7] achieve groundbreaking performance in painterly style transfer [15, 34] by using the responses of activation layers to represent features of the input images. This work focuses mainly on photographic style transfer, especially the preservation of the photorealistic attribute, which distinguishes it from their painting-like style transformation [3, 7, 13, 15]. The artistic stylized results are compelling; however, because of the distortion problem, photorealism is lost when these artistic style methods are naively applied to photographic style transfer. To improve photorealism, Luan et al. [17] recently proposed a photographic style transfer method that uses semantic segmentation and a post-processing step to solve the distortion problem. Mechrez et al. [19] propose to use the Screened Poisson Equation to replace Luan et al.'s post-processing step and preserve more precise content details than Luan et al.'s results. Liao et al. [16] propose a photorealistic style transfer method for sophisticated images, which is based on finding the nearest-neighbour field on deep features extracted from a CNN. Our work follows the neural style algorithm [7] and presents better results than the aforementioned methods.

Fig. 1

Given a reference style image and a content image as inputs, photographic style transfer seeks to generate an output with a photorealistic attribute, which should preserve both the context of the content and the style of the reference. Gatys et al. [7] succeed in transferring the style colour but introduce distortions into the context of the output. In comparison, our method transfers faithful style colour while also preserving the photorealistic attribute

Fig. 2

Distortions occur at both content-preserving and style transformation processes. c contains the zoom-in insets of input content, a and b. (c-ii) shows that a introduces distortions into reconstructed content details, and (c-iii) shows that b distorts details of a

Fig. 3

Framework overview. We use the loss network to preserve content and transfer style from inputs to outputs. The loss functions are added to the pre-trained VGG-16 network [25], computed at certain layers and backpropagated through the loss network during the optimization process. For example, \(L_\mathrm{style}^\mathrm{relu1\_2}\) computes the feature representation differences between the random white noise image X and the style image \(I_\mathrm{s}\), where relu1_2 denotes the placement of the style layer in the VGG-16 network. The derivative of \(L_\mathrm{style}^\mathrm{relu1\_2}\) is then propagated back to the ST network

3 Method

This section presents the architecture of our approach and the key loss functions to constrain both detail reconstruction and style transfer processes.

3.1 Architecture

Gatys et al. [7] propose an image transformation network built on convolutional neural networks to transform an input image into an output image. Their network architecture includes a pre-trained VGG-19 network [25] and two loss layers. The layers learn feature representations of the input images and compute the representation differences between a generated image and the inputs. Their algorithm adds two additional layers, a content layer and a style layer, which capture and store the feature representations of the inputs. Then, a random white noise image initialized to the same size as the content input is fed into the network. The loss functions compute the distance between the feature representations of the generated image and those of the content and reference style inputs separately. The derivatives of the loss terms are propagated back through the loss network for the next iteration until the maximum iteration number is reached. Similar to this optimization-based approach, our basic network uses the pre-trained VGG-16 network [25] as the loss network. The content loss function and perceptual loss functions in [13] are used in our network. In addition, we add another layer with a pixel-level loss function to our network, and we add a style fusion model as a post-processing step to reduce artefacts. Our network is an optimization-based approach designed for arbitrary style and content image pairs, so it does not need a training process.
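As a concrete illustration of this optimization-based setup, the following PyTorch sketch (our own hedged illustration, not the authors' code) shows how a pre-trained VGG-16 can serve as a fixed loss network whose activations at named layers are collected for the loss terms defined below; the layer indices are an assumption based on torchvision's standard vgg16.features layout.

```python
import torch
import torchvision

# Assumed indices of relu1_2, relu2_2, relu3_3 and relu4_3 in torchvision's vgg16.features.
VGG16_RELU = {"relu1_2": 3, "relu2_2": 8, "relu3_3": 15, "relu4_3": 22}

vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)   # the loss network is fixed; only the image X is optimized

def extract_features(img, layer_names):
    """Run img (1 x 3 x H x W, ImageNet-normalized) through VGG-16 and collect
    the activations at the requested layers."""
    wanted = {VGG16_RELU[n]: n for n in layer_names}
    feats, x = {}, img
    for idx, module in enumerate(vgg):
        x = module(x)
        if idx in wanted:
            feats[wanted[idx]] = x
        if len(feats) == len(wanted):
            break
    return feats
```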

As shown in Fig. 3, our framework consists of two components: a dual-stream convolution network serving as the loss network, and a style fusion model. The loss network is composed of two parallel deep convolution networks and several additional layers. A scalar value \(L^i(y,y_t)\) of the loss function at layer i is computed to measure the Euclidean distance between the output image y and the target image \(y_t\) (\(y_t\) can be the content image or the reference style image). Within the dual-stream loss network, we refer to the upper deep convolution network as the detail reconstruction network (DR network), which is designed to preserve content details. The lower convolution network is referred to as the style transfer network (ST network), which aims to transfer style information, mainly colour, from the reference style image to the content input. As shown on the right side of Fig. 3, the style fusion model (SFM) also has two components: a detail filter and a style filter, which take the outputs of the two parallel deep networks as their respective inputs.

Inputs and outputs. For the DR network, the inputs are one photograph as the content image \(I_\mathrm{c}\) and one random white noise image \(X_\mathrm{DR}\) of the same size as \(I_\mathrm{c}\), and the output is one image \(O_\mathrm{c}\). For the ST network, the inputs are one photograph as the content image \(I_\mathrm{c}\), one random white noise image \(X_\mathrm{ST}\) of the same size as \(I_\mathrm{c}\) and one photograph as the style image \(I_\mathrm{s}\); the output is one image \(O_\mathrm{s}\). Both \(X_\mathrm{DR}\) and \(X_\mathrm{ST}\) are initialized from the random white noise image X. For the detail filter, the input is the output \(O_\mathrm{c}\) of the DR network, and the input of the style filter is the output \(O_\mathrm{s}\) of the ST network. The output of the entire SFM is one image \(O_\mathrm{fusion}\).

Additional layers. There are three different layers in total: the content layer, the style layer and the similarity layer. The content and similarity layers carry loss functions that preserve content features from \(I_\mathrm{c}\) onto \(O_\mathrm{c}\), and the style layers hold the loss functions that transfer stylistic features from \(I_\mathrm{s}\) to \(O_\mathrm{s}\).

3.2 Loss functions

In general, we define three different loss terms for two purposes: (1) to preserve the content feature information F as structural details and reconstruct them on \(X_\mathrm{DR}\); (2) to learn the reference style features and correctly match them to \(X_\mathrm{ST}\).

Layers in a convolutional neural network define nonlinear filter banks that encode the input image. Hence, the feature representations in a neural network are actually the filter responses to the input image [18]. We assume that a layer has D different filters, and each filter has a size M, where M is height times width. For feature reconstruction, let \(F_i\) be the feature representation captured at the ith activation layer of the DR network when \(I_\mathrm{c}\) is processed. Then, \(F_i\) is a feature map of size \( D_i \times M_i \). The feature reconstruction loss is the squared and normalized Euclidean distance between the feature representations of X and the target \(I_\mathrm{c}\):

$$\begin{aligned} L_\mathrm{feat}(X,I_\mathrm{c}) = \sum _{i \in L} \frac{1}{D_i \times M_i} \Vert F_i(X) - F_i(I_\mathrm{c})\Vert _2^2 \end{aligned}$$
(1)

where L denotes the set of activation layers containing the feature loss. This term helps to minimize the visual distinguishability between the random image X and the target image \(I_\mathrm{c}\). However, as this reconstruction comes from high layers [18], the rough spatial structure of the content image is preserved, but details, especially the exact shapes of structures, are lost.
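A minimal sketch of Eq. (1), assuming the feature maps have been extracted as in the snippet above (shape \(1 \times D_i \times H_i \times W_i\)); the normalization by \(D_i \times M_i\) follows the equation.

```python
def feature_loss(feats_x, feats_c, layers=("relu3_3",)):
    """Squared, normalized Euclidean distance between activations of X and I_c (Eq. 1)."""
    loss = 0.0
    for name in layers:
        fx, fc = feats_x[name], feats_c[name]          # 1 x D_i x H_i x W_i
        d_i = fx.shape[1]
        m_i = fx.shape[2] * fx.shape[3]                # M_i = height x width
        loss = loss + ((fx - fc) ** 2).sum() / (d_i * m_i)
    return loss
```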

For the same convolutional neural network architecture, Zhao et al. [35] demonstrate that using an L1-norm loss as the spatial constraint preserves spatial structures better than using the L2-norm. Hence, we introduce another similarity preserved loss \(L_\mathrm{simi}\), based on the mean absolute error (L1-norm), into the loss network. We found that an L1-norm loss employed outside the network makes the style transformation output lose the colour information of the style image. Hence, we add the L1-norm loss inside the network. Let MAE be the mean absolute error between the feature representations of X and \(I_\mathrm{c}\) at the jth activation layer of the loss network; the similarity preserved loss is then defined as:

$$\begin{aligned} L_\mathrm{simi}(X,I_\mathrm{c}) = \sum _{j \in L} \text {MAE}(F_j(X), F_j(I_\mathrm{c})) \end{aligned}$$
(2)

where L denotes the set of activation layers added as similarity layers. This loss term measures how much information of the target \(I_\mathrm{c}\) is lost in X; minimizing it helps reconstruct as many exact pixels of \(I_\mathrm{c}\) in X as possible.
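A corresponding sketch of Eq. (2); torch.nn.functional.l1_loss with its default mean reduction is exactly the per-layer MAE.

```python
import torch.nn.functional as F

def similarity_loss(feats_x, feats_c, layers=("relu1_2", "relu2_2", "relu3_3")):
    """Mean absolute error between activations of X and I_c at the similarity layers (Eq. 2)."""
    loss = 0.0
    for name in layers:
        loss = loss + F.l1_loss(feats_x[name], feats_c[name])
    return loss
```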

As mentioned above, reconstructing content features with only \(L_\mathrm{feat}\) is not enough to preserve precise details, especially the exact edges inside structures. Figures 4 and  5 demonstrate the effect of \(L_\mathrm{simi}\).

For the transformation of style, we need to obtain an effective representation of the style in the reference image. Following [6], we use the correlations of the feature space as the representation of style, and these feature correlations are given by the Gram matrix. Let \(G_k\) be the Gram matrix of the vectorized feature map \(F_k\) at the kth activation layer of the ST network when the input x is processed, where the vectorized feature map \(F_k\) is reshaped to \(D_k \times H_kW_k\). We define the Gram matrix as:

$$\begin{aligned} G_k(x) = \frac{1}{N}F_k(x) \cdot F_k(x)^T \end{aligned}$$
(3)

where N is the total number of pixels of \(F_k(x)\). The Gram matrix contains the inner products between the feature maps at the kth activation layer, which give the feature correlations. The style loss is then the squared Frobenius norm of the difference between the Gram matrices of the random image \(X_\mathrm{ST}\) and the target \(I_\mathrm{s}\):

$$\begin{aligned} L_\mathrm{style}(X_\mathrm{ST}, I_\mathrm{s}) = \sum _{k \in L}\Vert G_k(X_\mathrm{ST})- G_k(I_\mathrm{s})\Vert _F^2 \end{aligned}$$
(4)
Fig. 4

The similarity function for reconstructing finer content details. Left: the input content image. a and c are the reconstructed results through our DR network without and with similarity loss, respectively. b shows two insets of a and c (in that order), respectively. We may notice that c preserves more precise context of input than a

where L denotes the set of activation layers holding the style loss. The style loss is well defined even for different sizes of \(X_\mathrm{ST}\) and \(I_\mathrm{s}\), since \(G_k(x)\) always has size \(D_k \times D_k\). As demonstrated in [6], the generated output only preserves the stylistic features of the target image, which means that the spatial structure of the target image cannot be preserved by minimizing the style loss.
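A sketch of Eqs. (3)-(4), again building on the feature-extraction snippet above. The normalizer N is taken here as \(H_k W_k\), the number of spatial positions in \(F_k(x)\); this is an assumption, and if "total number of pixels" instead means \(D_k H_k W_k\), only this constant changes.

```python
def gram_matrix(feat):
    """Gram matrix of a vectorized feature map (Eq. 3). feat: 1 x D_k x H_k x W_k."""
    _, d, h, w = feat.shape
    f = feat.view(d, h * w)              # reshape to D_k x (H_k W_k)
    return (f @ f.t()) / (h * w)         # (1/N) F_k F_k^T  ->  D_k x D_k

def style_loss(feats_x, feats_s, layers=("relu1_2", "relu2_2", "relu3_3", "relu4_3")):
    """Squared Frobenius norm between Gram matrices of X_ST and I_s (Eq. 4)."""
    loss = 0.0
    for name in layers:
        diff = gram_matrix(feats_x[name]) - gram_matrix(feats_s[name])
        loss = loss + (diff ** 2).sum()
    return loss
```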

In this paper, \(L_\mathrm{feat}\) and \(L_\mathrm{simi}\) are used to constrain the detail reconstruction procedure, which preserves the spatial structures and exact details, such as shapes and edges, of the content image in the output \(O_\mathrm{c}\) [shown as (c) in Fig. 4]. These two loss terms form \(L_\mathrm{DR}\), the joint loss of the DR network. The \(L_\mathrm{style}\), \(L_\mathrm{feat}\) and \(L_\mathrm{simi}\) constrain the style transformation procedure, which generates the output \(O_\mathrm{s}\) with stylistic features, mainly colour information, from the reference image and detailed features from the content image. The combination of the three loss terms forms \(L_\mathrm{ST}\), the joint loss of the ST network. Therefore, the two final joint loss terms are defined as:

$$\begin{aligned} L_\mathrm{DR} = \alpha _f L_\mathrm{feat} + \alpha _d L_\mathrm{simi} \end{aligned}$$
(5)

and

$$\begin{aligned} L_\mathrm{ST} = \beta _f L_\mathrm{feat} + \beta _d L_\mathrm{simi} + \beta _\mathrm{s} L_\mathrm{style} \end{aligned}$$
(6)

where \(\alpha _f\) and \(\alpha _d\) denote the weights of the content layers and similarity layers in the DR network, and \(\beta _f\), \(\beta _d\) and \(\beta _\mathrm{s}\) denote the weights of the three corresponding layers in the ST network. The implementation details of these parameters are given in Sect. 4.
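Putting Eqs. (5) and (6) together, reusing the feature_loss, similarity_loss and style_loss helpers sketched above and the weights reported in Sect. 4 (a hedged composition, not the authors' code):

```python
def loss_dr(feats_x, feats_c, alpha_f=5.0, alpha_d=1e3):
    """Joint loss of the DR network (Eq. 5)."""
    return alpha_f * feature_loss(feats_x, feats_c) + alpha_d * similarity_loss(feats_x, feats_c)

def loss_st(feats_x, feats_c, feats_s, beta_f=5.0, beta_d=10.0, beta_s=100.0):
    """Joint loss of the ST network (Eq. 6)."""
    return (beta_f * feature_loss(feats_x, feats_c)
            + beta_d * similarity_loss(feats_x, feats_c)
            + beta_s * style_loss(feats_x, feats_s))
```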

In previous research [12, 22], the output of the transfer process contains stylistic features from the reference style that are distributed according to the semantic structures of the content input. Hence, the style transformation procedure in our ST network learns stylistic features and also distributes them into the semantic structures, which requires both the style loss term and the detail reconstruction loss terms. One example result from the ST network is shown as (c) in Fig. 6.

3.3 Style fusion model

In Sect. 1, we mention that distortions are introduced by both the detail preservation and style transformation procedures. We use \(L_\mathrm{simi}\) to prevent geometric mismatching; however, the output of the ST network may still exhibit distortion and noise artefacts due to the content-style trade-off (shown in Fig. 8). To reduce these artefacts, we apply a refinement technique, the style fusion model (SFM), in our approach. The edge-preserving filter (recursive filter) proposed by Gastal et al. [5] effectively smooths away noise and textures while retaining sharp edges, which makes it a suitable technique for reducing artefacts. We thus use this edge-preserving filter [5] to smooth both output images \(O_\mathrm{c}\) and \(O_\mathrm{s}\) with guidance \(O_\mathrm{c}\). In this paper, we refer to the smoothing of \(O_\mathrm{c}\) and \( O_\mathrm{s} \) as the detail filter and style filter, respectively. The final result \(O_\mathrm{fusion}\) is defined as:

$$\begin{aligned} O_\mathrm{fusion} = (O_\mathrm{c} - \mathrm{RF}(O_\mathrm{c}, \sigma _\mathrm{s}, \sigma _\mathrm{r}, O_\mathrm{c}) )+ \mathrm{RF}(O_\mathrm{s}, \sigma _\mathrm{s}, \sigma _\mathrm{r}, O_\mathrm{c}) \end{aligned}$$
(7)

where \(\sigma _\mathrm{s}\) denotes the spatial standard deviation and \(\sigma _\mathrm{r}\) denotes the range standard deviation of the edge-preserving filter [5]. As shown in (e) of Fig. 6, the clean stylized result \(O_\mathrm{fusion}\) obtained by our SFM is free of artefacts.
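A hedged sketch of Eq. (7). We assume the recursive edge-preserving filter of Gastal et al. [5] can be stood in for by OpenCV-contrib's domain-transform filter (cv2.ximgproc.dtFilter in DTF_RF mode); this substitution and the [0, 1] float image convention are our assumptions, not the authors' implementation.

```python
import cv2
import numpy as np

def style_fusion(o_c, o_s, sigma_s=60.0, sigma_r=1.0):
    """Eq. (7): keep the details of O_c and the smoothed colour of O_s, guided by O_c.
    o_c, o_s: float32 RGB images in [0, 1] with identical shape."""
    guide = o_c
    rf_c = cv2.ximgproc.dtFilter(guide, o_c, sigma_s, sigma_r, cv2.ximgproc.DTF_RF)
    rf_s = cv2.ximgproc.dtFilter(guide, o_s, sigma_s, sigma_r, cv2.ximgproc.DTF_RF)
    detail = o_c - rf_c          # details without colour from the DR output
    colour = rf_s                # colour without details from the ST output
    return np.clip(detail + colour, 0.0, 1.0)
```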

Fig. 5

The similarity function for preventing geometric mismatching problem. a is the stylized result without similarity loss, and b is the stylized result with similarity loss. Note that the zoom-in regions show that the similarity loss effectively prevents the unexpected geometric matching

4 Implementation details

Fig. 6

The style fusion model for reducing noise artefacts and avoiding distortions. a is the reconstructed content output of our DR network, and b is the extracted details (white points) of the content without colour from a. c is the stylized output of our ST network, and d is the extracted colour without details from c. e is the fused stylized result from the SFM. We may notice that c still exhibits noise (red rectangles) and distortion (green rectangles) artefacts due to the content-style trade-off (please refer to Fig. 8). However, the final stylized result (e) is free of noise and distortion artefacts. We recommend readers to view the electronic version

This section describes the implementation details of our approach. We choose the pre-trained VGG-16 network [25] as the basic architecture of our DR network and ST network. The content layer with \(L_\mathrm{feat}\) is added at the relu3_3 activation layer, and the style layers with \(L_\mathrm{style}\) are added at the relu1_2, relu2_2, relu3_3 and relu4_3 activation layers. The similarity layers are added at the relu1_2, relu2_2 and relu3_3 activation layers. For the DR network, we add content and similarity layers to the pre-trained VGG-16 network and choose the parameters \(\alpha _f = 5\) and \(\alpha _d = 10^3\) for detail reconstruction. For the ST network, we add content, similarity and style layers to the pre-trained VGG-16 network and choose \(\beta _f = 5\), \(\beta _d = 10\) and \(\beta _\mathrm{s} = 100\) for the style transformation. We use \(\sigma _\mathrm{s}=60\) (the default in the public source code) and \(\sigma _\mathrm{r}=1\) for the edge-preserving filter [5] in the SFM. The effects of the parameters \(\alpha _d\), \(\beta _d\) and \(\sigma _\mathrm{r}\) are illustrated in Figs. 7, 8 and 9, respectively.
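For reference, these placements and weights can be collected in a single configuration; the dictionary below is merely a summary of the choices listed above, written in our notation.

```python
CONFIG = {
    "content_layers":    ["relu3_3"],                                   # carry L_feat
    "style_layers":      ["relu1_2", "relu2_2", "relu3_3", "relu4_3"],  # carry L_style
    "similarity_layers": ["relu1_2", "relu2_2", "relu3_3"],             # carry L_simi
    "dr_weights":        {"alpha_f": 5.0, "alpha_d": 1e3},              # Eq. (5)
    "st_weights":        {"beta_f": 5.0, "beta_d": 10.0, "beta_s": 100.0},  # Eq. (6)
    "sfm":               {"sigma_s": 60.0, "sigma_r": 1.0},             # Eq. (7)
}
```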

Fig. 7

The effect of the parameter \(\alpha _d\) on our DR network. Note that the reconstructed content result achieves the highest PSNR at \(\alpha _d=10^3\); both lower and larger values decrease the accuracy of the reconstructed result. Hence, we choose the best parameter \(\alpha _d=10^3\) for our DR network and use it to produce all the other results in this paper

Fig. 8

The effect of parameter \(\beta _d\) for content-style trade-off. A lower \(\beta _d\) value cannot prevent unexpected geometric matching. For example, the regions of tower tops (green rectangles) in a and b. A larger \(\beta _d\) value loses the style of reference image. For example, the buildings (red rectangles) in d and e have undesired dark colour style, which should be in the golden light style. Note that the stylized result at \(\beta _d=1\times 10^{1}\) still exhibits some distortion and noise artefacts but they will be eliminated by SFM. We thus choose \(\beta _d=1\times 10^{1}\) to produce our style transformation result of the ST network and all the other results in this paper. We recommend readers to view the electronic version

Fig. 9

The effect of parameter \(\sigma _\mathrm{r}\) for SFM. Note that a lower \(\sigma _\mathrm{r}\) value cannot prevent noise artefacts, for example, red rectangles in a and b, and a larger \(\sigma _\mathrm{r}\) value suppresses the transferred style, for instance, green rectangles in d and e. We found the best parameter \(\sigma _\mathrm{r}=1\) to produce our result and all the other results in our paper

Fig. 10

Placements for similarity layers in DR network. a–d show the reconstructed content results with similarity layers at different places in our DR network. Note that the reconstructed result achieves the highest PSNR score at relu1_2, relu2_2, relu3_3. Hence, we place similarity layers at relu1_2, relu2_2, relu3_3 in our DR network for all the experiments in this paper

Fig. 11

Placements for similarity layers in ST network. a–c show the stylized results with similarity layers at different places in our ST network. Note that a presents a worse stylized result than b and c, as the centre area of the blanket and the walls above are not in the golden style colour. It is difficult to tell whether b or c achieves a better style transformation, as they produce very similar results (we conducted a series of further experiments; please refer to our supplemental materials for more details). We thus choose to place similarity layers at relu1_2, relu2_2, relu3_3 in our ST network, which keeps the same placements as the DR network

Table 1 Additional layers and VGG-16 network

We use a random white noise image X (\( X_\mathrm{DR} \) and \( X_\mathrm{ST} \) represent X for the DR network and ST network, respectively) with the same size as the content image as our initialized input and use the Adam [14] optimization algorithm with a learning rate of 1 and 1000 iterations in the optimization process for all experiments in this paper. All the inputs, including \(I_\mathrm{c}\), \(I_\mathrm{s}\) and X, are scaled to \(width = 512\) if their widths exceed 512; otherwise, they keep their original resolution. The dual-stream convolution networks run the optimization process at the same time, and the optimization time is around 2.5 min on our GPU (NVIDIA GeForce GTX 1060, 6 GB GDDR5). The whole optimization process only needs one content image and one reference style image, without any limitation on resolution.
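A hedged sketch of this optimization setup (input rescaling plus the Adam loop over a white-noise image), reusing the helpers from Sect. 3; the function names here are ours and are only illustrative.

```python
import torch
import torch.nn.functional as F

def resize_to_max_width(img, max_w=512):
    """img: 1 x 3 x H x W tensor; downscale to width 512 only if wider, keeping aspect ratio."""
    h, w = img.shape[-2:]
    if w <= max_w:
        return img
    new_h = int(round(h * max_w / w))
    return F.interpolate(img, size=(new_h, max_w), mode="bilinear", align_corners=False)

def optimize_image(joint_loss, content_like, iters=1000, lr=1.0):
    """Optimize a random white-noise image X under a joint loss (e.g. L_DR or L_ST)."""
    x = torch.rand_like(content_like).requires_grad_(True)   # white-noise initialization
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        joint_loss(x).backward()
        opt.step()
    return x.detach()
```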

5 Results

This section discusses the selection of hyperparameters, the placement of similarity layers, and comparisons between our method and state-of-the-art methods in terms of global and local colour transfer.

5.1 The effect of hyperparameters

Figures 7 and 8 demonstrate the effects of the parameters \(\alpha _d\) and \(\beta _d\), respectively. As shown in Fig. 7, the content reconstruction result achieves the highest PSNR (peak signal-to-noise ratio) value when \(\alpha _d = 10^3\). We thus choose \(\alpha _d = 10^3\) to reconstruct content details in our DR network. In Fig. 8, a lower \(\beta _d\) value still produces a stylized result with the geometric mismatching problem. Conversely, an overly large \(\beta _d\) value produces a less stylized result. Hence, we find the best value \(\beta _d=10\) to produce our stylized result and all the other results in this paper. Figures 10 and 11 illustrate the choices of similarity layers in our DR network and ST network, respectively. For the DR network, we place similarity layers at relu1_2, relu2_2, relu3_3, as this achieves the highest PSNR score. For the ST network, the stylized results (b) and (c) have a very similar style transformation appearance, and we thus place similarity layers at relu1_2, relu2_2, relu3_3 in our ST network as well, which keeps the same placements as the DR network. The implementation details of our networks are described in Tables 1, 2, and 3.

Table 2 Implementation details of DR network
Table 3 Implementation details of ST network
Fig. 12

Comparison between Gatys et al. [7], Ghiasi et al. [8] and ours. Gatys et al. [7] and Ghiasi et al. [8] produce a larger amount of distortions in their results, while ours are free of distortions. The stylized results of Ghiasi et al.'s [8] method use an interpolation weight of 0.8 and the other default parameter values in their paper

5.2 Comparisons

This section presents several comparisons between state-of-the-art methods and ours.

Comparison between representative artistic style transfer methods and ours. In Fig. 12, we compare Gatys et al. [7] and Ghiasi et al. [8] with our method across content images with large differences. Our results preserve content structures with more precise details than the prior artistic methods. For example, our results contain all details of the ceiling lamp, frescoes, carpets and railings, which are not reconstructed well by Gatys et al. [7] and Ghiasi et al. [8]. To illustrate the ability to preserve precise details, in the third row we compare a content and reference style image pair with rich details against the prior artistic style transfer methods. Our method reconstructs almost every detail in the content image and transfers the colour style faithfully, while Gatys et al. [7] and Ghiasi et al. [8] lose many details. The detail representations in the other examples also show our strong ability to reduce distortions and preserve content spatial structures.

Comparison between representative global colour transfer methods and ours. In Fig. 13, we compare our method with representative global colour transfer algorithms such as Reinhard et al. [23] and Pitié et al. [22]. Both apply a global colour mapping technique to match the colour statistics of the content input and reference style image. However, they cannot obtain faithful colour transformation results when the inputs have spatially varying content, which limits their applications. For example, in the second row of Fig. 13, the methods of Reinhard et al. and Pitié et al. cannot transfer the light style of the reference style image to the buildings.

Fig. 13

Comparison between representative global colour transfer methods Reinhard et al. [23], Pitié et al. [22] and ours

Fig. 14

Comparison between Luan et al. [17], Liao et al. [16] and ours. All examples from Luan et al. [17] dataset

Fig. 15

Comparison between Luan et al. [17] and ours ([17]+[5]). Our method effectively handles the posterization effect of Luan et al. [17]. All examples from Luan et al. [17] dataset. We recommend readers to view the electronic version

Fig. 16

Comparison between Luan et al. [17], Liao et al. [16] and ours ([17]+[5]). Our method preserves finer content details than Luan et al. [17] and transfers style more faithfully than Liao et al. [16]. All examples from Luan et al. [17] dataset. We recommend readers to view the electronic version

Fig. 17

Comparison between Mechrez et al. [19] and ours ([17]+[5]). The zoom-ins show the insets of Luan et al.’s first stage output, Mechrez et al. [19] and ours ([17]+[5]) (in that order). We recommend readers to view the electronic version

Fig. 18

Inputs: Content image (upper) and Reference Style image 1 (lower). a represents the stylized result combining context of content image and style of Reference Style image 1. b is the Reference Style image 2. c is the stylized result transferring style of Reference Style image 2 to context of a

Fig. 19

Some failure cases

Fig. 20

User study results for photorealism and style faithfulness

Comparison between representative local photographic style transfer methods and ours. In Fig. 14, we compare our method ([7]+ours) with the state-of-the-art methods of Luan et al. [17] and Liao et al. [16]. The approaches proposed by Luan et al. [17] and Liao et al. [16] are the latest methods that effectively avoid the distortion problem. Our method preserves more precise content details than Luan et al., for example, the plants in the first row, the characters on the postcard in the third row and the windows in the bottom row. Our method may not obtain more faithful colour transformation results, but it achieves the highest score on photorealism. Please refer to the user study in Sect. 5.3 for more details and to our supplemental materials for more scores. All the stylized results (including the user study) of Luan et al. [17] use manually created semantic segmentation masks provided by the authors and the parameter \(\lambda =10^4\) (the default value in Luan et al.'s paper). We further compare our method with Luan et al. using different \(\lambda \) values on the images in Fig. 12; please refer to our supplemental materials for more details.

Luan et al. [17] propose a two-stage photographic style transfer method that extends Gatys et al.'s artistic style transfer method. Their first stage integrates semantic segmentation into the neural style method [7] for object-to-object colour transfer, and their second stage applies a post-processing step to improve the photorealism of the stylized result obtained from the first stage. In terms of local object-to-object colour transfer, our similarity loss function may not transfer colour between objects as faithfully as manual semantic segmentation. However, the edge-preserving filter [5] used in our SFM can help Luan et al.'s results avoid posterization artefacts. In Fig. 15, we show stylized results obtained by applying the edge-preserving filter [5] to the outputs of Luan et al.'s first stage. For example, our method effectively prevents the posterization artefacts on the buildings in the first row, the water in the second row and the forehead in the third row.

In Fig. 16, we compare our method ([17]+[5]) with the state-of-the-art photographic style transfer methods of Luan et al. [17] and Liao et al. [16]. Note that our method ([17]+[5]) preserves more precise content details than Luan et al. [17] while transferring style more faithfully than Liao et al. [16]. Please see our supplemental materials for more details.

In Fig. 18, we demonstrate that our method is robust in preserving content spatial details and achieving faithful style transformation results. Note that (c) still preserves the details of the first content input well even after two style transformation processes with different reference style images, and its photorealism is also well preserved.

Limitation

5.3 User study

We conduct a user survey to evaluate several colour transfer methods in terms of photorealism and style faithfulness. Six different methods are considered in this survey: Reinhard et al. [23], Pitié et al. [22], Luan et al. [17], Liao et al. [16] and our methods ([7]+ours, [17]+[5]). We ask 26 human participants to score stylized results on a 1-to-4 scale. For photorealism, the score ranges from “1: definitely not photorealistic” to “4: definitely photorealistic”. For style faithfulness, the score ranges from “1: definitely not faithful to the reference style” to “4: definitely faithful to the reference style”. Each participant is asked to score the stylized results of the 6 methods, presented in random order. In total, 44 different scenes (excluding unrealistic and repeated scenes) are selected from the Luan et al. [17] dataset.

In Fig. 20, we show the average score and standard deviation of each method. For photorealism, our method ([7]+ours) and Liao et al. [16] rank 1st and 2nd, respectively. Luan et al. [17] and Pitié et al. [22] have the worst performance regarding photorealism, as their results exhibit some artefacts. For style faithfulness, Luan et al. [17] and our method ([17]+[5]) rank 1st and 2nd, respectively. The edge-preserving filter [5] used in our SFM slightly lowers the style faithfulness score of Luan et al. [17], but it still achieves a higher score than Liao et al. [16]; moreover, it significantly improves the photorealism score of Luan et al.'s results. Reinhard et al. [23] and Pitié et al. [22] perform the worst in style faithfulness owing to their limitations on sophisticated images.

6 Conclusions

We investigate why the photorealism of stylized results is lost even when photographic images are input to Gatys et al.'s method [7], and we find that both the content preservation and style transformation stages distort images and destroy the photorealistic attribute. Hence, we propose a photographic style transfer method that constrains the detail reconstruction and style transformation processes by introducing a similarity loss function. This similarity loss function not only preserves the exact details and structures of the content image but also prevents the mismatch of texture patches between the reference style and content images. The qualitative evaluation on Luan et al.'s [17] dataset shows that our proposed approach effectively prevents distortions and also obtains faithful stylized results.