1 Introduction

Image style transfer has shown great promise for new forms of image manipulation. The neural artistic style transfer method proposed by Gatys et al. [7] achieved great success with convolutional neural networks and has recently been followed by many works [2, 3, 8, 11, 15, 29, 30, 31, 34]. These methods produce convincing visual results by transferring artistic features from a reference painting onto a content photograph. However, they suffer from a visual distortion problem, even when both the content and reference style images are photographic: the stylized results contain visually intricate distortions that give them a painting-like appearance. Luan et al. [17] point out that the distortions appear only during the style transformation process, and they therefore propose a photorealism regularization term based on locally affine colour transformations to reconstruct fine content details. To avoid unexpected geometric matching, Luan et al. [17] also integrate semantic segmentation masks into Gatys et al.'s method [7]. Although the content spatial structures are preserved in many situations, details and exact shapes of structures are erased when the semantic segmentation is inaccurate or contains overlapping areas. Moreover, computing the matting Laplacian matrix and the semantic segmentation consumes considerable extra time for high-quality output. After investigating the style transformation procedure, we discover that the distortions occur at two stages: the spatial structures of the content image may be lost during the content-preserving process, and unexpected geometric matching can be introduced during the style transformation process. Figures 1 and 2 illustrate that distortions occur at both the content-preserving and style transformation processes. For example, as shown in zoom-ins (c-ii), the buildings of the content image are clearly distorted by the content-preserving process. Moreover, as shown in zoom-ins (c-iii), the buildings are also distorted after the style transformation process. However, the buildings in (c-iii) have different shapes and edges from those in (c-ii), which means the zoomed-in buildings are distorted twice.

To improve photorealism, this paper introduces an additional similarity layer with a corresponding loss function that constrains both the content preservation and style transformation processes. This similarity layer is added at several places in the convolutional neural network to prevent distortions by minimizing a similarity loss function together with the other loss functions proposed in the fast neural style algorithm [13].

The proposed method consists of two stages: a detail reconstruction process and a style transfer process. Our system has two key components: a dual-stream deep convolution network serving as the loss network and edge-preserving filters forming the style fusion model (SFM). The edge-preserving filter extracts details and colour information from the outputs generated by the loss network, so that our scheme combines the details without colour from the content with the colour without details from the reference style. During the optimization process, the content and style features are first captured by the additional layers in the loss network, and then a random white noise image X is passed through both the detail reconstruction and style transfer networks. The final output of the SFM is the stylized result.

The main contributions of this paper are as follows: we investigate the problem of Gatys et al.'s method and find that the loss of photorealism in stylized results is caused by distortions occurring at both the content preservation and style transformation stages; we propose a photographic style transfer method that improves the photorealism of stylized results. A similarity loss function based on the L1-norm is applied to reconstruct finer content details and prevent the geometric mismatching problem, and a style fusion model using an edge-preserving filter is utilized to reduce artefacts.

2 Related work

Global colour transfer methods. Global colour transfer methods tend to utilize spatially invariant objective functions to transfer images. Input images with simple styles can be processed well by these algorithms [9, 12, 22, 23]. For example, the colour shift technique proposed by Reinhard et al. [23] extracts global features in a decorrelated colour space from the reference style image and transfers them onto the content input. Pitié et al. [22] propose an approach that also achieves global style transfer by matching full 3D colour histograms between images with a series of 1D histogram transformations. Although these methods can handle several simple situations like tone curves (e.g., low or high contrast), they are limited in their ability to match complex areas with corresponding colour styles.

Local colour transfer methods. Local colour transfer research proposes spatial colour mapping techniques such as semantic segmentation [10, 16, 17, 21, 27, 28, 32] to handle various applications such as semantic colour gradient transfer (dark and bright) [10, 21, 27, 32], transfer of artistic edits [1, 24, 26, 33], and painting of stylistic features [3, 4, 7, 13, 15, 31, 34]. Many of these works [7, 10, 13, 15, 17, 28, 31, 34] use convolutional neural networks to achieve this goal. Gatys et al. [7] achieve groundbreaking performance in painterly style transfer [15, 34] by using the responses of activation layers to represent features of the input images. This work focuses mainly on photographic style transfer, especially the preservation of the photorealistic attribute, which distinguishes it from their painting-like style transformation [3, 7, 13, 15]. The artistic stylized results are compelling; however, because of the distortion problem, photorealism is lost when these artistic style methods are naively applied to photographic style transfer. To improve photorealism, Luan et al. [17] recently proposed a photographic style transfer method that uses semantic segmentation and a post-processing step to solve the distortion problem. Mechrez et al. [19] propose to use the Screened Poisson Equation to replace Luan et al.'s post-processing step and preserve more precise content details than Luan et al.'s results. Liao et al. [16] propose a photorealistic style transfer method for sophisticated images, which is based on finding the nearest-neighbour field on deep features extracted from a CNN. Our work follows the neural style algorithm [7] and presents better results than the aforementioned methods.

Fig. 1

Given a reference style image and a content image as inputs, photographic style transfer seeks to generate an output with a photorealistic attribute, which should preserve both the context of the content and the style of the reference. Gatys et al. [7] succeed in transferring the style colour but introduce distortions into the context of the output. In comparison, our method transfers faithful style colour while also preserving the photorealistic attribute

Fig. 2

Distortions occur at both content-preserving and style transformation processes. c contains the zoom-in insets of input content, a and b. (c-ii) shows that a introduces distortions into reconstructed content details, and (c-iii) shows that b distorts details of a

Fig. 3

Framework overview. We use the loss network to preserve content and transfer style from inputs to outputs. The loss functions are added to the pre-trained VGG-16 network [25], computed at certain layers and backpropagated through the loss network during the optimization process. For example, \(L_\mathrm{style}^\mathrm{relu1\_2}\) computes the feature representation differences between the random white noise image X and the style image \(I_\mathrm{s}\), where relu1_2 denotes the placement of the style layer in the VGG-16 network. The derivative of \(L_\mathrm{style}^\mathrm{relu1\_2}\) is then propagated back to the ST network

3 Method

This section presents the architecture of our approach and the key loss functions to constrain both detail reconstruction and style transfer processes.

3.1 Architecture

Gatys et al. [7] propose an image transformation network built on convolutional neural networks to transform an input image into an output image. Their network architecture includes a pre-trained VGG-19 network [25] and two loss layers. The layers learn feature representations of the input images and compute the representation differences between a generated image and the inputs. Their algorithm adds two additional layers, a content layer and a style layer, which capture and store the feature representations of the inputs. Then, a random white noise image initialized to the same size as the content input is fed into the network. The loss functions compute the distance between the feature representations of the generated image and those of the content and reference style inputs separately. The derivatives of the loss terms are propagated back through the loss network for the next iteration until the maximum iteration number is reached. Similar to this optimization-based approach, our basic network uses the pre-trained VGG-16 network [25] as the loss network. The content loss function and perceptual loss functions in [13] are used in our network. In addition, we add another layer with a pixel-level loss function to our network, and we add a style fusion model as a post-processing step to reduce artefacts. Our network is an optimization-based approach designed for arbitrary style and content image pairs, so it does not need a training process.
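As a concrete illustration of this optimization-based setup, the following PyTorch sketch (our own hedged illustration, not the authors' code) shows how a pre-trained VGG-16 can serve as a fixed loss network whose activations at named layers are collected for the loss terms defined below; the layer indices are an assumption based on torchvision's standard vgg16.features layout.

```python
import torch
import torchvision

# Assumed indices of relu1_2, relu2_2, relu3_3 and relu4_3 in torchvision's vgg16.features.
VGG16_RELU = {"relu1_2": 3, "relu2_2": 8, "relu3_3": 15, "relu4_3": 22}

vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)   # the loss network is fixed; only the image X is optimized

def extract_features(img, layer_names):
    """Run img (1 x 3 x H x W, ImageNet-normalized) through VGG-16 and collect
    the activations at the requested layers."""
    wanted = {VGG16_RELU[n]: n for n in layer_names}
    feats, x = {}, img
    for idx, module in enumerate(vgg):
        x = module(x)
        if idx in wanted:
            feats[wanted[idx]] = x
        if len(feats) == len(wanted):
            break
    return feats
```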

As shown in Fig. 3, our framework consists of two components: a dual-stream convolution network serving as the loss network, and a style fusion model. The loss network is composed of two parallel deep convolution networks and several additional layers. A scalar value \(L^i(y,y_t)\) of the loss function at layer i is computed to measure the Euclidean distance between the output image y and the target image \(y_t\) (\(y_t\) can be the content image or the reference style image). Within the dual-stream loss network, we refer to the upper deep convolution network as the detail reconstruction network (DR network), which is designed to preserve content details. The lower convolution network is referred to as the style transfer network (ST network), which aims to transfer style information, mainly colour, from the reference style image to the content input. As shown on the right side of Fig. 3, the style fusion model (SFM) also has two components: a detail filter and a style filter, which take the outputs of the two parallel deep networks as their respective inputs.

Inputs and outputs. For the DR network, the inputs are one photograph as the content image \(I_\mathrm{c}\) and one random white noise image \(X_\mathrm{DR}\) of the same size as \(I_\mathrm{c}\), and the output is one image \(O_\mathrm{c}\). For the ST network, the inputs are one photograph as the content image \(I_\mathrm{c}\), one random white noise image \(X_\mathrm{ST}\) of the same size as \(I_\mathrm{c}\) and one photograph as the style image \(I_\mathrm{s}\); the output is one image \(O_\mathrm{s}\). Both \(X_\mathrm{DR}\) and \(X_\mathrm{ST}\) are initialized from the random white noise image X. For the detail filter, the input is the output \(O_\mathrm{c}\) of the DR network, and the input of the style filter is the output \(O_\mathrm{s}\) of the ST network. The output of the entire SFM is one image \(O_\mathrm{fusion}\).

Additional layers. There are three different layers in total: the content layer, the style layer and the similarity layer. The content and similarity layers carry loss functions that preserve content features from \(I_\mathrm{c}\) onto \(O_\mathrm{c}\), and the style layers hold the loss functions that transfer stylistic features from \(I_\mathrm{s}\) to \(O_\mathrm{s}\).

3.2 Loss functions

In general, we define three different loss terms for two purposes: (1) to preserve the content feature information F as structural details and reconstruct them on \(X_\mathrm{DR}\); (2) to learn the reference style features and correctly match them to \(X_\mathrm{ST}\).

Layers in a convolutional neural network define nonlinear filter banks that encode the input image. Hence, the feature representations in a neural network are actually the filter responses to the input image [18]. We assume that a layer has D different filters, and each filter has a size M, where M is height times width. For feature reconstruction, let \(F_i\) be the feature representation captured at the ith activation layer of the DR network when \(I_\mathrm{c}\) is processed. Then, \(F_i\) is a feature map of size \( D_i \times M_i \). The feature reconstruction loss is the squared and normalized Euclidean distance between the feature representations of X and the target \(I_\mathrm{c}\):

$$\begin{aligned} L_\mathrm{feat}(X,I_\mathrm{c}) = \sum _{i \in L} \frac{1}{D_i \times M_i} \Vert F_i(X) - F_i(I_\mathrm{c})\Vert _2^2 \end{aligned}$$
(1)

where L denotes the set of activation layers containing the feature loss. This term helps to minimize the visual distinguishability between the random image X and the target image \(I_\mathrm{c}\). However, as this reconstruction comes from high layers [18], the rough spatial structure of the content image is preserved, but details, especially the exact shapes of structures, are lost.
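A minimal sketch of Eq. (1), assuming the feature maps have been extracted as in the snippet above (shape \(1 \times D_i \times H_i \times W_i\)); the normalization by \(D_i \times M_i\) follows the equation.

```python
def feature_loss(feats_x, feats_c, layers=("relu3_3",)):
    """Squared, normalized Euclidean distance between activations of X and I_c (Eq. 1)."""
    loss = 0.0
    for name in layers:
        fx, fc = feats_x[name], feats_c[name]          # 1 x D_i x H_i x W_i
        d_i = fx.shape[1]
        m_i = fx.shape[2] * fx.shape[3]                # M_i = height x width
        loss = loss + ((fx - fc) ** 2).sum() / (d_i * m_i)
    return loss
```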

For the same convolutional neural network architecture, Zhao et al. [35] demonstrate that using an L1-norm loss as the spatial constraint preserves spatial structures better than using the L2-norm. Hence, we introduce another similarity preserved loss \(L_\mathrm{simi}\), based on the mean absolute error (L1-norm), into the loss network. We found that an L1-norm loss employed outside the network makes the style transformation output lose the colour information of the style image. Hence, we add the L1-norm loss inside the network. Let MAE be the mean absolute error between the feature representations of X and \(I_\mathrm{c}\) at the jth activation layer of the loss network; the similarity preserved loss is then defined as:

$$\begin{aligned} L_\mathrm{simi}(X,I_\mathrm{c}) = \sum _{j \in L} \text {MAE}(F_j(X), F_j(I_\mathrm{c})) \end{aligned}$$
(2)

where L denotes the set of activation layers added as similarity layers. This loss term measures how much information of the target \(I_\mathrm{c}\) is lost in X; minimizing it helps reconstruct as many exact pixels of \(I_\mathrm{c}\) in X as possible.
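A corresponding sketch of Eq. (2); torch.nn.functional.l1_loss with its default mean reduction is exactly the per-layer MAE.

```python
import torch.nn.functional as F

def similarity_loss(feats_x, feats_c, layers=("relu1_2", "relu2_2", "relu3_3")):
    """Mean absolute error between activations of X and I_c at the similarity layers (Eq. 2)."""
    loss = 0.0
    for name in layers:
        loss = loss + F.l1_loss(feats_x[name], feats_c[name])
    return loss
```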

As mentioned above, reconstructing content features with only \(L_\mathrm{feat}\) is not enough to preserve precise details, especially the exact edges inside structures. Figures 4 and  5 demonstrate the effect of \(L_\mathrm{simi}\).

For the transformation of style, we need to obtain an effective representation of the style in the reference image. Following [6], we use the correlations of the feature space as the representation of style, and these feature correlations are given by the Gram matrix. Let \(G_k\) be the Gram matrix of the vectorized feature map \(F_k\) at the kth activation layer of the ST network when the input x is processed, where the vectorized feature map \(F_k\) is reshaped to \(D_k \times H_kW_k\). We define the Gram matrix as:

$$\begin{aligned} G_k(x) = \frac{1}{N}F_k(x) \cdot F_k(x)^T \end{aligned}$$
(3)

where N is the total number of pixels of \(F_k(x)\). The Gram matrix contains the inner products between the feature maps at the kth activation layer, which give the feature correlations. The style loss is then the squared Frobenius norm of the difference between the Gram matrices of the random image \(X_\mathrm{ST}\) and the target \(I_\mathrm{s}\):

$$\begin{aligned} L_\mathrm{style}(X_\mathrm{ST}, I_\mathrm{s}) = \sum _{k \in L}\Vert G_k(X_\mathrm{ST})- G_k(I_\mathrm{s})\Vert _F^2 \end{aligned}$$
(4)
Fig. 4

The similarity function for reconstructing finer content details. Left: the input content image. a and c are the reconstructed results through our DR network without and with similarity loss, respectively. b shows two insets of a and c (in that order), respectively. We may notice that c preserves more precise context of input than a

where L denotes the set of activation layers holding the style loss. The style loss is well defined even for different sizes of \(X_\mathrm{ST}\) and \(I_\mathrm{s}\), since \(G_k(x)\) always has size \(D_k \times D_k\). As demonstrated in [6], the generated output only preserves the stylistic features of the target image, which means that the spatial structure of the target image cannot be preserved by minimizing the style loss.
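A sketch of Eqs. (3)-(4), again building on the feature-extraction snippet above. The normalizer N is taken here as \(H_k W_k\), the number of spatial positions in \(F_k(x)\); this is an assumption, and if "total number of pixels" instead means \(D_k H_k W_k\), only this constant changes.

```python
def gram_matrix(feat):
    """Gram matrix of a vectorized feature map (Eq. 3). feat: 1 x D_k x H_k x W_k."""
    _, d, h, w = feat.shape
    f = feat.view(d, h * w)              # reshape to D_k x (H_k W_k)
    return (f @ f.t()) / (h * w)         # (1/N) F_k F_k^T  ->  D_k x D_k

def style_loss(feats_x, feats_s, layers=("relu1_2", "relu2_2", "relu3_3", "relu4_3")):
    """Squared Frobenius norm between Gram matrices of X_ST and I_s (Eq. 4)."""
    loss = 0.0
    for name in layers:
        diff = gram_matrix(feats_x[name]) - gram_matrix(feats_s[name])
        loss = loss + (diff ** 2).sum()
    return loss
```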

In this paper, \(L_\mathrm{feat}\) and \(L_\mathrm{simi}\) are used to constrain the detail reconstruction procedure, which preserves the spatial structures and exact details, such as shapes and edges, of the content image in the output \(O_\mathrm{c}\) [shown as (c) in Fig. 4]. These two loss terms form \(L_\mathrm{DR}\), the joint loss of the DR network. The \(L_\mathrm{style}\), \(L_\mathrm{feat}\) and \(L_\mathrm{simi}\) constrain the style transformation procedure, which generates the output \(O_\mathrm{s}\) with stylistic features, mainly colour information, from the reference image and detailed features from the content image. The combination of the three loss terms forms \(L_\mathrm{ST}\), the joint loss of the ST network. Therefore, the two final joint loss terms are defined as:

$$\begin{aligned} L_\mathrm{DR} = \alpha _f L_\mathrm{feat} + \alpha _d L_\mathrm{simi} \end{aligned}$$
(5)

and

$$\begin{aligned} L_\mathrm{ST} = \beta _f L_\mathrm{feat} + \beta _d L_\mathrm{simi} + \beta _\mathrm{s} L_\mathrm{style} \end{aligned}$$
(6)

where \(\alpha _f\) and \(\alpha _d\) denote the weights of the content layers and similarity layers in the DR network, and \(\beta _f\), \(\beta _d\) and \(\beta _\mathrm{s}\) denote the weights of the three corresponding layers in the ST network. The implementation details of these parameters are given in Sect. 4.
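Putting Eqs. (5) and (6) together, reusing the feature_loss, similarity_loss and style_loss helpers sketched above and the weights reported in Sect. 4 (a hedged composition, not the authors' code):

```python
def loss_dr(feats_x, feats_c, alpha_f=5.0, alpha_d=1e3):
    """Joint loss of the DR network (Eq. 5)."""
    return alpha_f * feature_loss(feats_x, feats_c) + alpha_d * similarity_loss(feats_x, feats_c)

def loss_st(feats_x, feats_c, feats_s, beta_f=5.0, beta_d=10.0, beta_s=100.0):
    """Joint loss of the ST network (Eq. 6)."""
    return (beta_f * feature_loss(feats_x, feats_c)
            + beta_d * similarity_loss(feats_x, feats_c)
            + beta_s * style_loss(feats_x, feats_s))
```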

In previous research [12, 22], the output of the transfer process contains stylistic features from the reference style that are distributed according to the semantic structures of the content input. Hence, the style transformation procedure in our ST network learns stylistic features and also distributes them into the semantic structures, which requires both the style loss term and the detail reconstruction loss terms. One example result from the ST network is shown as (c) in Fig. 6.

3.3 Style fusion model

In Sect. 1, we mention that distortions are introduced by both the detail preservation and style transformation procedures. We use \(L_\mathrm{simi}\) to prevent geometric mismatching; however, the output of the ST network may still exhibit distortion and noise artefacts due to the content-style trade-off (shown in Fig. 8). To reduce these artefacts, we apply a refinement technique, the style fusion model (SFM), in our approach. The edge-preserving filter (recursive filter) proposed by Gastal et al. [5] effectively smooths away noise and textures while retaining sharp edges, which makes it a suitable technique for reducing artefacts. We thus use this edge-preserving filter [5] to smooth both output images \(O_\mathrm{c}\) and \(O_\mathrm{s}\) with guidance \(O_\mathrm{c}\). In this paper, we refer to the smoothing of \(O_\mathrm{c}\) and \( O_\mathrm{s} \) as the detail filter and style filter, respectively. The final result \(O_\mathrm{fusion}\) is defined as:

$$\begin{aligned} O_\mathrm{fusion} = (O_\mathrm{c} - \mathrm{RF}(O_\mathrm{c}, \sigma _\mathrm{s}, \sigma _\mathrm{r}, O_\mathrm{c}) )+ \mathrm{RF}(O_\mathrm{s}, \sigma _\mathrm{s}, \sigma _\mathrm{r}, O_\mathrm{c}) \end{aligned}$$
(7)

where \(\sigma _\mathrm{s}\) denotes the spatial standard deviation and \(\sigma _\mathrm{r}\) denotes the range standard deviation of the edge-preserving filter [5]. As shown in (e) of Fig. 6, the clean stylized result \(O_\mathrm{fusion}\) obtained by our SFM is free of artefacts.
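A hedged sketch of Eq. (7). We assume the recursive edge-preserving filter of Gastal et al. [5] can be stood in for by OpenCV-contrib's domain-transform filter (cv2.ximgproc.dtFilter in DTF_RF mode); this substitution and the [0, 1] float image convention are our assumptions, not the authors' implementation.

```python
import cv2
import numpy as np

def style_fusion(o_c, o_s, sigma_s=60.0, sigma_r=1.0):
    """Eq. (7): keep the details of O_c and the smoothed colour of O_s, guided by O_c.
    o_c, o_s: float32 RGB images in [0, 1] with identical shape."""
    guide = o_c
    rf_c = cv2.ximgproc.dtFilter(guide, o_c, sigma_s, sigma_r, cv2.ximgproc.DTF_RF)
    rf_s = cv2.ximgproc.dtFilter(guide, o_s, sigma_s, sigma_r, cv2.ximgproc.DTF_RF)
    detail = o_c - rf_c          # details without colour from the DR output
    colour = rf_s                # colour without details from the ST output
    return np.clip(detail + colour, 0.0, 1.0)
```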

Fig. 5

The similarity function for preventing geometric mismatching problem. a is the stylized result without similarity loss, and b is the stylized result with similarity loss. Note that the zoom-in regions show that the similarity loss effectively prevents the unexpected geometric matching

4 Implementation details

Fig. 6

The style fusion model for reducing noise artefacts and avoiding distortions. a is the reconstructed content output of our DR network, and b is the extracted details (white points) of the content without colour from a. c is the stylized output of our ST network, and d is the extracted colour without details from c. e is the fused stylized result from the SFM. We may notice that c still exhibits noise (red rectangles) and distortion (green rectangles) artefacts due to the content-style trade-off (please refer to Fig. 8). However, the final stylized result (e) is free of noise and distortion artefacts. We recommend readers to view the electronic version

This section describes the implementation details of our approach. We choose the pre-trained VGG-16 network [25] as the basic architecture of our DR network and ST network. The content layer with \(L_\mathrm{feat}\) is added at the relu3_3 activation layer, and the style layers with \(L_\mathrm{style}\) are added at the relu1_2, relu2_2, relu3_3 and relu4_3 activation layers. The similarity layers are added at the relu1_2, relu2_2 and relu3_3 activation layers. For the DR network, we add content and similarity layers to the pre-trained VGG-16 network and choose the parameters \(\alpha _f = 5\) and \(\alpha _d = 10^3\) for detail reconstruction. For the ST network, we add content, similarity and style layers to the pre-trained VGG-16 network and choose \(\beta _f = 5\), \(\beta _d = 10\) and \(\beta _\mathrm{s} = 100\) for the style transformation. We use \(\sigma _\mathrm{s}=60\) (the default in the public source code) and \(\sigma _\mathrm{r}=1\) for the edge-preserving filter [5] in the SFM. The effects of the parameters \(\alpha _d\), \(\beta _d\) and \(\sigma _\mathrm{r}\) are illustrated in Figs. 7, 8 and 9, respectively.
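For reference, these placements and weights can be collected in a single configuration; the dictionary below is merely a summary of the choices listed above, written in our notation.

```python
CONFIG = {
    "content_layers":    ["relu3_3"],                                   # carry L_feat
    "style_layers":      ["relu1_2", "relu2_2", "relu3_3", "relu4_3"],  # carry L_style
    "similarity_layers": ["relu1_2", "relu2_2", "relu3_3"],             # carry L_simi
    "dr_weights":        {"alpha_f": 5.0, "alpha_d": 1e3},              # Eq. (5)
    "st_weights":        {"beta_f": 5.0, "beta_d": 10.0, "beta_s": 100.0},  # Eq. (6)
    "sfm":               {"sigma_s": 60.0, "sigma_r": 1.0},             # Eq. (7)
}
```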

Fig. 7

The effect of the parameter \(\alpha _d\) on our DR network. Note that the reconstructed content result achieves the highest PSNR at \(\alpha _d=10^3\); both lower and larger values decrease the accuracy of the reconstructed result. Hence, we choose the best parameter \(\alpha _d=10^3\) for our DR network and use it to produce all the other results in this paper

Fig. 8

The effect of parameter \(\beta _d\) for content-style trade-off. A lower \(\beta _d\) value cannot prevent unexpected geometric matching. For example, the regions of tower tops (green rectangles) in a and b. A larger \(\beta _d\) value loses the style of reference image. For example, the buildings (red rectangles) in d and e have undesired dark colour style, which should be in the golden light style. Note that the stylized result at \(\beta _d=1\times 10^{1}\) still exhibits some distortion and noise artefacts but they will be eliminated by SFM. We thus choose \(\beta _d=1\times 10^{1}\) to produce our style transformation result of the ST network and all the other results in this paper. We recommend readers to view the electronic version

Fig. 9

The effect of parameter \(\sigma _\mathrm{r}\) for SFM. Note that a lower \(\sigma _\mathrm{r}\) value cannot prevent noise artefacts, for example, red rectangles in a and b, and a larger \(\sigma _\mathrm{r}\) value suppresses the transferred style, for instance, green rectangles in d and e. We found the best parameter \(\sigma _\mathrm{r}=1\) to produce our result and all the other results in our paper

Fig. 10

Placements for similarity layers in DR network. a–d show the reconstructed content results with similarity layers at different places in our DR network. Note that the reconstructed result achieves the highest PSNR score at relu1_2, relu2_2, relu3_3. Hence, we place similarity layers at relu1_2, relu2_2, relu3_3 in our DR network for all the experiments in this paper

Fig. 11

Placements for similarity layers in ST network. a–c show the stylized results with similarity layers at different places in our ST network. Note that a presents a worse stylized result than b and c, as the centre area of the blanket and the walls above are not in the golden style colour. It is difficult to tell whether b or c achieves a better style transformation, as they produce very similar results (we conducted a series of further experiments; please refer to our supplemental materials for more details). We thus choose to place similarity layers at relu1_2, relu2_2, relu3_3 in our ST network, which keeps the same placements as the DR network

Table 1 Additional layers and VGG-16 network

We use a random white noise image X (\( X_\mathrm{DR} \) and \( X_\mathrm{ST} \) represent X for the DR network and ST network, respectively) with the same size as the content image as our initialized input and use the Adam [14] optimization algorithm with a learning rate of 1 and 1000 iterations in the optimization process for all experiments in this paper. All the inputs, including \(I_\mathrm{c}\), \(I_\mathrm{s}\) and X, are scaled to \(width = 512\) if their widths exceed 512; otherwise, they keep their original resolution. The dual-stream convolution networks run the optimization process at the same time, and the optimization time is around 2.5 min on our GPU (NVIDIA GeForce GTX 1060, 6 GB GDDR5). The whole optimization process only needs one content image and one reference style image, without any limitation on resolution.
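A hedged sketch of this optimization setup (input rescaling plus the Adam loop over a white-noise image), reusing the helpers from Sect. 3; the function names here are ours and are only illustrative.

```python
import torch
import torch.nn.functional as F

def resize_to_max_width(img, max_w=512):
    """img: 1 x 3 x H x W tensor; downscale to width 512 only if wider, keeping aspect ratio."""
    h, w = img.shape[-2:]
    if w <= max_w:
        return img
    new_h = int(round(h * max_w / w))
    return F.interpolate(img, size=(new_h, max_w), mode="bilinear", align_corners=False)

def optimize_image(joint_loss, content_like, iters=1000, lr=1.0):
    """Optimize a random white-noise image X under a joint loss (e.g. L_DR or L_ST)."""
    x = torch.rand_like(content_like).requires_grad_(True)   # white-noise initialization
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        joint_loss(x).backward()
        opt.step()
    return x.detach()
```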

5 Results

This section discusses the selection of hyperparameters, the placement of similarity layers, and comparisons between our method and state-of-the-art methods in terms of global and local colour transfer.

5.1 The effect of hyperparameters

Figures 7 and 8 demonstrate the effects of the parameters \(\alpha _d\) and \(\beta _d\), respectively. As shown in Fig. 7, the content reconstruction result achieves the highest PSNR (peak signal-to-noise ratio) value when \(\alpha _d = 10^3\). We thus choose \(\alpha _d = 10^3\) to reconstruct content details in our DR network. In Fig. 8, a lower \(\beta _d\) value still produces a stylized result with the geometric mismatching problem. Conversely, an overly large \(\beta _d\) value produces a less stylized result. Hence, we find the best value \(\beta _d=10\) to produce our stylized result and all the other results in this paper. Figures 10 and 11 illustrate the choices of similarity layers in our DR network and ST network, respectively. For the DR network, we place similarity layers at relu1_2, relu2_2, relu3_3, as this achieves the highest PSNR score. For the ST network, the stylized results (b) and (c) have a very similar style transformation appearance, and we thus place similarity layers at relu1_2, relu2_2, relu3_3 in our ST network as well, which keeps the same placements as the DR network. The implementation details of our networks are described in Tables 1, 2, and 3.

Table 2 Implementation details of DR network
Table 3 Implementation details of ST network
Fig. 12

Comparison between Gatys et al. [7], Ghiasi et al. [8] and ours. Gatys et al. [7] and Ghiasi et al. [8] produce a larger amount of distortions in their results, while ours are free of distortions. The stylized results of Ghiasi et al.'s [8] method use an interpolation weight of 0.8 and the other default parameter values in their paper

5.2 Comparisons

This section presents several comparisons between state-of-the-art methods and ours.

Comparison between representative artistic style transfer methods and ours. In Fig. 12, we compare Gatys et al. [7] and Ghiasi et al. [8] with our method across content images with large differences. Our results preserve content structures with more precise details than the prior artistic methods. For example, our results contain all details of the ceiling lamp, frescoes, carpets and railings, which are not reconstructed well by Gatys et al. [7] and Ghiasi et al. [8]. To illustrate the ability to preserve precise details, in the third row we compare a content and reference style image pair with rich details against the prior artistic style transfer methods. Our method reconstructs almost every detail in the content image and transfers the colour style faithfully, while Gatys et al. [7] and Ghiasi et al. [8] lose many details. The detail representations in the other examples also show our strong ability to reduce distortions and preserve content spatial structures.

Comparison between representative global colour transfer methods and ours. In Fig. 13, we compare our method with representative global colour transfer algorithms such as Reinhard et al. [23] and Pitié et al. [22]. Both apply a global colour mapping technique to match the colour statistics of the content input and reference style image. However, they cannot obtain faithful colour transformation results when the inputs have spatially varying content, which limits their applications. For example, in the second row of Fig. 13, the methods of Reinhard et al. and Pitié et al. cannot transfer the light style of the reference style image to the buildings.

Fig. 13

Comparison between representative global colour transfer methods Reinhard et al. [23], Pitié et al. [22] and ours

Fig. 14

Comparison between Luan et al. [17], Liao et al. [16] and ours. All examples from Luan et al. [17] dataset

Fig. 15

Comparison between Luan et al. [17] and ours ([17]+[5]). Our method effectively handles the posterization effect of Luan et al. [17]. All examples from Luan et al. [17] dataset. We recommend readers to view the electronic version

Fig. 16

Comparison between Luan et al. [17], Liao et al. [16] and ours ([17]+[5]). Our method preserves finer content details than Luan et al. [17] and transfers style more faithfully than Liao et al. [16]. All examples from Luan et al. [17] dataset. We recommend readers to view the electronic version

Fig. 17

Comparison between Mechrez et al. [19] and ours ([17]+[5]). The zoom-ins show the insets of Luan et al.’s first stage output, Mechrez et al. [19] and ours ([17]+[5]) (in that order). We recommend readers to view the electronic version

Fig. 18

Inputs: Content image (upper) and Reference Style image 1 (lower). a represents the stylized result combining context of content image and style of Reference Style image 1. b is the Reference Style image 2. c is the stylized result transferring style of Reference Style image 2 to context of a

Fig. 19

Some failure cases

Fig. 20

User study results for photorealism and style faithfulness

Comparison between representative local photographic style transfer methods and ours. In Fig. 14, we compare our method ([7]+ours) with the state-of-the-art methods of Luan et al. [17] and Liao et al. [16]. The approaches proposed by Luan et al. [17] and Liao et al. [16] are the latest methods that effectively avoid the distortion problem. Our method preserves more precise content details than Luan et al., for example, the plants in the first row, the characters on the postcard in the third row and the windows in the bottom row. Our method may not obtain more faithful colour transformation results, but it achieves the highest score on photorealism. Please refer to the user study in Sect. 5.3 for more details and to our supplemental materials for more scores. All the stylized results (including the user study) of Luan et al. [17] use manually created semantic segmentation masks provided by the authors and the parameter \(\lambda =10^4\) (the default value in Luan et al.'s paper). We further compare our method with Luan et al. using different \(\lambda \) values on the images in Fig. 12; please refer to our supplemental materials for more details.

Luan et al. [17] propose a two-stage photographic style transfer method that extends Gatys et al.'s artistic style transfer method. Their first stage integrates semantic segmentation into the neural style method [7] for object-to-object colour transfer, and their second stage applies a post-processing step to improve the photorealism of the stylized result obtained from the first stage. In terms of local object-to-object colour transfer, our similarity loss function may not transfer colour between objects as faithfully as manual semantic segmentation. However, the edge-preserving filter [5] used in our SFM can help Luan et al.'s results avoid posterization artefacts. In Fig. 15, we show stylized results obtained by applying the edge-preserving filter [5] to the outputs of Luan et al.'s first stage. For example, our method effectively prevents the posterization artefacts on the buildings in the first row, the water in the second row and the forehead in the third row.

In Fig. 16, we compare our method ([17]+[5]) with the state-of-the-art photographic style transfer methods of Luan et al. [17] and Liao et al. [16]. Note that our method ([17]+[5]) preserves more precise content details than Luan et al. [17] while transferring style more faithfully than Liao et al. [16]. Please see our supplemental materials for more details.

In Fig. 18, we demonstrate that our method is robust in preserving content spatial details and achieving faithful style transformation results. Note that (c) still preserves the details of the first content input well even after two style transformation processes with different reference style images, and its photorealism is also well preserved.

Limitation

5.3 User study

We conduct a user survey to evaluate several colour transfer methods in terms of photorealism and style faithfulness. Six different methods are considered in this survey: Reinhard et al. [23], Pitié et al. [22], Luan et al. [17], Liao et al. [16] and our methods ([7]+ours, [17]+[5]). We ask 26 human participants to score stylized results on a 1-to-4 scale. For photorealism, the score ranges from “1: definitely not photorealistic” to “4: definitely photorealistic”. For style faithfulness, the score ranges from “1: definitely not faithful to the reference style” to “4: definitely faithful to the reference style”. Each participant is asked to score the stylized results of the 6 methods, presented in random order. In total, 44 different scenes (excluding unrealistic and repeated scenes) are selected from the Luan et al. [17] dataset.

In Fig. 20, we show the average score and standard deviation of each method. For photorealism, our method ([7]+ours) and Liao et al. [16] rank 1st and 2nd, respectively. Luan et al. [17] and Pitié et al. [22] have the worst performance regarding photorealism, as their results exhibit some artefacts. For style faithfulness, Luan et al. [17] and our method ([17]+[5]) rank 1st and 2nd, respectively. The edge-preserving filter [5] used in our SFM slightly lowers the style faithfulness score of Luan et al. [17], but it still achieves a higher score than Liao et al. [16]; moreover, it significantly improves the photorealism score of Luan et al.'s results. Reinhard et al. [23] and Pitié et al. [22] perform the worst in style faithfulness owing to their limitations on sophisticated images.

6 Conclusions

We investigate why the photorealism of stylized results is lost even when photographic images are input to Gatys et al.'s method [7], and we find that both the content preservation and style transformation stages distort images and destroy the photorealistic attribute. Hence, we propose a photographic style transfer method that constrains the detail reconstruction and style transformation processes by introducing a similarity loss function. This similarity loss function not only preserves the exact details and structures of the content image but also prevents the mismatch of texture patches between the reference style and content images. The qualitative evaluation on Luan et al.'s [17] dataset shows that our proposed approach effectively prevents distortions and also obtains faithful stylized results.