
Abstract

Manually re-drawing an image in a certain artistic style takes a professional artist a long time; doing this for an entire video sequence by hand is practically infeasible. We present two computational approaches that transfer the style from one image (for example, a painting) to a whole video sequence. In our first approach, we adapt the original image style transfer technique by Gatys et al., based on energy minimization, to videos. We introduce new ways of initialization and new loss functions to generate consistent and stable stylized video sequences even in cases with large motion and strong occlusion. Our second approach formulates video stylization as a learning problem. We propose a deep network architecture and training procedures that allow us to stylize arbitrary-length videos in a consistent and stable way, and nearly in real time. We show that the proposed methods clearly outperform simpler baselines both qualitatively and quantitatively. Finally, we propose a way to adapt these approaches to 360\(^\circ \) images and videos as they emerge with recent virtual reality hardware.


Notes

  1. https://github.com/manuelruder/artistic-videos.

  2. https://github.com/jcjohnson/neural-style.

  3. https://github.com/manuelruder/fast-artistic-videos.

  4. https://github.com/jcjohnson/fast-neural-style.

  5. In the pretrained VGG models, which we use for our perceptual loss, values are centered around the mean pixel of the ImageNet dataset.

References

  • Butler, D. J., Wulff, J., Stanley, G. B., & Black, M. J. (2012). A naturalistic open source movie for optical flow evaluation. In ECCV (pp. 611–625).

  • Chen, D., Liao, J., Yuan, L., Yu, N., & Hua, G. (2017). Coherent online video style transfer. In ICCV (pp. 1114–1123).

  • Collobert, R., Kavukcuoglu, K., & Farabet, C. (2011). Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop.

  • Gatys, L. A., Ecker, A. S., & Bethge, M. (2015). Texture synthesis using convolutional neural networks. In NIPS (pp. 262–270).

  • Gatys, L. A., Ecker, A. S., & Bethge, M. (2016). Image style transfer using convolutional neural networks. In CVPR (pp. 2414–2423).

  • Ghiasi, G., Lee, H., Kudlur, M., Dumoulin, V., & Shlens, J. (2017). Exploring the structure of a real-time, arbitrary neural artistic stylization network. In BMVC.

  • Gupta, A., Johnson, J., Alahi, A., & Fei-Fei, L. (2017). Characterizing and improving stability in neural style transfer. In ICCV (pp. 4087–4096).

  • Hays, J., & Essa, I. (2004). Image and video based painterly animation. In Proceedings of the 3rd international symposium on non-photorealistic animation and rendering, NPAR (pp. 113–120).

  • Huang, H., Wang, H., Luo, W., Ma, L., Jiang, W., Zhu, X., Li, Z., & Liu, W. (2017). Real-time neural style transfer for videos. In CVPR (pp. 7044–7052).

  • Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., & Brox, T. (2017). FlowNet 2.0: Evolution of optical flow estimation with deep networks. In CVPR.

  • Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML.

  • Johnson, J., Alahi, A., & Fei-Fei, L. (2016). Perceptual losses for real-time style transfer and super-resolution. In ECCV (pp. 694–711).

  • Li, C., & Wand, M. (2016a). Combining Markov random fields and convolutional neural networks for image synthesis. In CVPR (pp. 2479–2486).

  • Li, C., & Wand, M. (2016b). Precomputed real-time texture synthesis with Markovian generative adversarial networks. In ECCV (pp. 702–716).

  • Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In ECCV.

  • Litwinowicz, P. (1997). Processing images and video for an impressionist effect. In Proceedings of the 24th annual conference on computer graphics and interactive techniques, SIGGRAPH (pp. 407–414).

  • Luan, F., Paris, S., Shechtman, E., & Bala, K. (2017). Deep photo style transfer. arXiv:1703.07511.

  • Marszalek, M., Laptev, I., & Schmid, C. (2009). Actions in context. In CVPR (pp. 2929–2936).

  • Nikulin, Y., & Novak, R. (2016). Exploring the neural algorithm of artistic style. arXiv:1602.07188.

  • O’Donovan, P., & Hertzmann, A. (2012). Anipaint: Interactive painterly animation from video. Transactions on Visualization and Computer Graphics, 18(3), 475–487.


  • Revaud, J., Weinzaepfel, P., Harchaoui, Z., & Schmid, C. (2015). EpicFlow: Edge-preserving interpolation of correspondences for optical flow. In CVPR (pp. 1164–1172).

  • Ruder, M., Dosovitskiy, A., & Brox, T. (2016). Artistic style transfer for videos. In GCPR (pp. 26–36).

  • Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In ICLR.

  • Sundaram, N., Brox, T., & Keutzer, K. (2010). Dense point trajectories by GPU-accelerated large displacement optical flow. In ECCV (pp. 438–451).

  • Ulyanov, D., Lebedev, V., Vedaldi, A., & Lempitsky, V. S. (2016). Texture networks: Feed-forward synthesis of textures and stylized images. In ICML (pp. 1349–1357).

  • Ulyanov, D., Vedaldi, A., & Lempitsky, V. S. (2016). Instance normalization: The missing ingredient for fast stylization. arXiv:1607.08022.

  • Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013). DeepFlow: Large displacement optical flow with deep matching. In ICCV (pp. 1385–1392).

  • Yu, F., & Koltun, V. (2016). Multi-scale context aggregation by dilated convolutions. In ICLR.

  • Zhang, H., & Dana, K. J. (2017). Multi-style generative network for real-time transfer. arXiv:1703.06953.

Author information

Correspondence to Manuel Ruder.

Additional information

Communicated by Patrick Perez.

This study was partially supported by the Excellence Initiative of the German Federal and State Governments EXC 294.

A Appendix

1.1 A.1 Supplementary Videos

A supplementary video, available at https://youtu.be/2C3sxtnxpRE, shows moving sequences corresponding to figures from this paper, plus a number of additional results:

  • Results of the optimization-based algorithm on different sequences, including a comparison of the basic algorithm and the multi-pass and long-term algorithm

  • Comparison of “naive” (\(\mathbf {c}\)) and “advanced” (\(\mathbf {c}_{long}\)) weighting schemes for long-term consistency

  • Results of the feed-forward algorithm on different sequences, including results from different techniques to reduce the propagation of errors

  • Comparison between optimization-based and fast style transfer

  • A demonstration of our panoramic video algorithm

Another video, showing a full panoramic video in \(360^{\circ }\), can be found at https://youtu.be/pkgMUfNeUCQ.

1.2 A.2 Style Images and Parameter Configuration

1.2.1 A.2.1 Optimization-Based Approach

Figure 21 shows the style images chosen to evaluate the optimization-based approach and to perform the user study (except Composition VII), inspired by the selection of styles by Gatys et al. (2016).

Fig. 21

Styles used for experiments on Sintel. Left to right, top to bottom: “Composition VII” by Wassily Kandinsky (1913), “Self-Portrait” by Pablo Picasso (1907), “Seated female nude” by Pablo Picasso (1910), “Woman with a Hat” by Henri Matisse (1905), “The Scream” by Edvard Munch (1893), “Shipwreck” by William Turner (1805)

1.2.2 A.2.2 Network-Based Approach

Figure 22 shows the style images used for the detailed analysis of the network-based approach and for spherical image and video evaluation, inspired by the selection of styles by Johnson et al. (2016). In Table 9, the parameters for training individual models are shown.

Fig. 22

Style images used for the evaluation. From left to right, top to bottom: Woman with a Hat by Henri Matisse (1905), Self-Portrait by Pablo Picasso (1907), The Scream by Edvard Munch (1893), a collage of the painting June Tree by Natasha Wescoat (referred to as Candy by Johnson et al.), and a glass painting referred to as Mosaic by Johnson et al.

1.3 A.3 Complementary Qualitative and Quantitative Comparisons

1.3.1 A.3.1 Effect of Errors in Optical Flow Estimation

Table 9 The individual training parameters for the style images
Fig. 23

Scene from the Sintel video showing how the algorithm deals with optical flow errors (red rectangle) and disocclusions (blue circle). Both artifacts are partially repaired during the optimization due to the exclusion of uncertain areas from our temporal constraint. Still, optical flow errors lead to imperfect results. The third image shows the uncertainty of the flow field in black and motion boundaries in gray (Color figure online)

The quality of results produced by our algorithm strongly depends on the quality of optical flow estimation. This is illustrated in Fig. 23. When the optical flow is correct (top right region of the image), the method manages to repair the artifacts introduced by warping in the disoccluded region. However, erroneous optical flow (tip of the sword in the bottom right) leads to degraded performance. The optimization process partially compensates for the errors (sword edges get sharp), but cannot fully recover.

1.3.2 A.3.2 Temporal Loss on Individual Style Images

Table 10 shows the temporal error on individual style images, complementing the evaluation results shown in Tables 5 and 6 in the main part of the paper.

Table 10 Comparison of the temporal loss (mean squared error) on individual style images and the Sintel dataset, using our multi-frame, mixed training approach. This complements Table 6
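For reference, the sketch below shows one way such a temporal error can be measured: the mean squared difference between the current stylized frame and the previous stylized frame warped by the optical flow, restricted to pixels where the flow is reliable. The function and argument names are illustrative and not taken from our evaluation code.

```python
import numpy as np

def temporal_error(stylized_cur, stylized_prev_warped, reliable):
    """Mean squared difference between the current stylized frame and the
    previous stylized frame warped into it by optical flow, evaluated only
    at pixels where the flow is reliable (reliable == 1 outside
    disocclusions and motion boundaries). Frames: (H, W, 3), mask: (H, W).
    """
    diff = (stylized_cur - stylized_prev_warped) ** 2
    weighted = reliable[..., None] * diff        # broadcast mask over color channels
    return weighted.sum() / (reliable.sum() * stylized_cur.shape[-1])
```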

1.3.3 A.3.3 Convergence of Our Style Transfer Network

Our network converges without overfitting, as seen in Fig. 24. For the validation loss, we always use the average loss of 5 consecutive frames, processed recursively, so that the validation objective stays constant during the training of the multi-frame model. We trained for 120,000 iterations for the sake of accuracy, but Fig. 24 also indicates that training can be stopped earlier in practice.
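A rough sketch of this recursive validation procedure is given below. The callables stylize, warp and frame_loss are hypothetical stand-ins (the network applied to one frame given the warped previous output, warping by optical flow, and the per-frame objective); they are illustrative only and do not reflect the actual training code.

```python
def validation_loss(frames, stylize, warp, frame_loss, n=5):
    """Average loss over n consecutive frames, each stylized recursively
    from the previous stylized output (stylize, warp and frame_loss are
    illustrative callables)."""
    prev_out = None
    total = 0.0
    for i in range(n):
        # Warp the previous stylized output from frame i-1 into frame i.
        warped_prev = warp(prev_out, frames[i - 1], frames[i]) if prev_out is not None else None
        out = stylize(frames[i], warped_prev)
        total += frame_loss(out, frames[i], warped_prev)
        prev_out = out
    return total / n
```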

Fig. 24

Training loss (blue) and validation loss (red) for a multi-frame training on a logarithmic scale. Length of the frame sequence used for training, depending on the iteration number: 0–50k: 2; 50k–60k: 3; 60k–120k: 5. The validation loss is computed every 10k iterations (starting from iteration 10k) and is always the average loss for 5 consecutive frames processed recursively. Therefore the validation loss is larger than the training loss in the beginning, but decreases once our multi-frame training begins (Color figure online)

1.3.4 A.3.4 Comparison of Different Methods to Reduce the Propagation of Error in the Network Approach

Figure 25 shows a comparison using different style images demonstrating the effectiveness of different methods to reduce the propagation of error in the network-based approach.

Fig. 25

Comparison of quality: even for scenes with fast motion, our advanced approaches retain visual quality and produce a result similar to Johnson et al. applied per frame. The straightforward two-frame training, by contrast, suffers from a degradation in quality

1.4 A.4 Additional Method Description Details

1.4.1 A.4.1 Forward-Backward Consistency Check for Optical Flow

Let \(\varvec{\omega }= (u, v)\) be the optical flow in forward direction and \(\hat{\varvec{\omega }} =(\hat{u}, \hat{v})\) the flow in backward direction. Denote by \(\widetilde{\varvec{\omega }}\) the forward flow warped to the second image:

$$\begin{aligned} \widetilde{\varvec{\omega }} (x,y) = \varvec{\omega }((x,y) + \hat{\varvec{\omega }}(x,y)). \end{aligned}$$
(15)

In areas without disocclusion, this warped flow should be approximately the opposite of the backward flow. Therefore we mark as disocclusions those areas where the following inequality holds:

$$\begin{aligned} |\widetilde{\varvec{\omega }}+ \hat{\varvec{\omega }}|^2 > 0.01 (|\widetilde{\varvec{\omega }}|^2 + |\hat{\varvec{\omega }}|^2) + 0.5 \end{aligned}$$
(16)

Motion boundaries are detected using the following inequality:

$$\begin{aligned} |\nabla \hat{u}|^2 + |\nabla \hat{v}|^2 > 0.01\, |\hat{\varvec{\omega }}|^2 + 0.002 \end{aligned}$$
(17)

Coefficients in inequalities (16) and (17) are taken from Sundaram et al. (2010).
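
The following NumPy sketch illustrates how these two masks could be computed from dense flow fields stored as (H, W, 2) arrays. The function name and the nearest-neighbor warping are simplifications for illustration and are not taken from the released implementation.

```python
import numpy as np

def consistency_masks(flow_fwd, flow_bwd):
    """Compute disocclusion and motion-boundary masks from dense forward
    and backward optical flow, following Eqs. (16) and (17).

    flow_fwd, flow_bwd: float arrays of shape (H, W, 2) holding (u, v).
    Returns two boolean masks of shape (H, W).
    """
    H, W, _ = flow_fwd.shape
    ys, xs = np.mgrid[0:H, 0:W]

    # Warp the forward flow into the second frame: w~(x, y) = w((x, y) + w_hat(x, y)).
    # Nearest-neighbor lookup is used here for brevity instead of bilinear interpolation.
    xt = np.clip(np.round(xs + flow_bwd[..., 0]).astype(int), 0, W - 1)
    yt = np.clip(np.round(ys + flow_bwd[..., 1]).astype(int), 0, H - 1)
    flow_fwd_warped = flow_fwd[yt, xt]

    sq_norm = lambda f: (f ** 2).sum(axis=-1)

    # Eq. (16): disocclusions are pixels where the warped forward flow is not
    # approximately the inverse of the backward flow.
    disocclusion = sq_norm(flow_fwd_warped + flow_bwd) > \
        0.01 * (sq_norm(flow_fwd_warped) + sq_norm(flow_bwd)) + 0.5

    # Eq. (17): motion boundaries are pixels where the backward flow has a large gradient.
    du_dy, du_dx = np.gradient(flow_bwd[..., 0])
    dv_dy, dv_dx = np.gradient(flow_bwd[..., 1])
    grad_sq = du_dx ** 2 + du_dy ** 2 + dv_dx ** 2 + dv_dy ** 2
    motion_boundary = grad_sq > 0.01 * sq_norm(flow_bwd) + 0.002

    return disocclusion, motion_boundary
```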

1.4.2 A.4.2 Batch and Instance Normalization

Batch normalization (Ioffe and Szegedy 2015) normalizes the feature activations for individual feature maps in a mini-batch after each layer of a neural network. This has been found to be advantageous especially for very deep neural networks, where the variance over the output is likely to shift during training. Let \(x \in \mathbb {R}^{B \times C \times H \times W}\) be a tensor with batch size B, C channels and spatial dimensions \(H \times W\), and let \(x_{bchw}\) be the bchw-th element in this tensor. Then, the mean and the variance are calculated as \(\mu _{c} = \frac{1}{BHW} \sum _{b=1}^{B} \sum _{h=1}^{H} \sum _{w=1}^{W} x_{bchw}\) and \(\sigma _{c}^2 = \frac{1}{BHW} \sum _{b=1}^{B} \sum _{h=1}^{H} \sum _{w=1}^{W} (x_{bchw} - \mu _{c})^2\).

The batch normalization layer performs the following operation to compute the output tensor y:

$$\begin{aligned} y_{bchw} = \gamma \frac{x_{bchw} - \mu _{c}}{\sqrt{\sigma _{c}^2 + \epsilon }} + \beta , \end{aligned}$$
(18)

with learnable scale and shift parameters \(\gamma \) and \(\beta \).

In contrast, instance normalization normalizes each instance in a batch separately; in effect, contrast normalization is performed for each individual instance. The mean and the variance are therefore computed per instance and per feature map: \(\mu _{bc} = \frac{1}{HW} \sum _{h=1}^{H}\sum _{w=1}^{W} x_{bchw}\) and \(\sigma _{bc}^2 = \frac{1}{HW}\sum _{h=1}^{H}\sum _{w=1}^{W} (x_{bchw} - \mu _{bc})^2\).

The instance normalization layer performs the following operation to compute the output tensor y:

$$\begin{aligned} y_{bchw} = \frac{x_{bchw} - \mu _{bc}}{\sqrt{\sigma _{bc}^2 + \epsilon }}. \end{aligned}$$
(19)
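
As a concrete illustration of Eqs. (18) and (19), the following NumPy sketch computes both normalizations for a tensor of shape (B, C, H, W). The epsilon default and function names are assumptions for illustration, not the layers used in our network.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: (B, C, H, W). Per-channel statistics over batch and spatial dims, Eq. (18).
    # gamma, beta: learnable scale and shift (scalar or broadcastable per channel).
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def instance_norm(x, eps=1e-5):
    # x: (B, C, H, W). Per-instance, per-channel statistics over spatial dims, Eq. (19).
    mu = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)
```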

1.4.3 A.4.3 Reprojection for Border Consistency in Spherical Images

For perspective transformation of a border region we virtually organize the cube faces in a three dimensional space, so that we can use 3D projection techniques such as the pinhole camera model. According to the pinhole camera model, for a point \((x_1, x_2, x_3)\) in a three-dimensional Cartesian space, the projection \((y_1, y_2)\) on the target plane is calculated as:

$$\begin{aligned} \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = -\frac{d}{x_3} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \end{aligned}$$
(20)

where d is the distance of the projection plane from the origin. We arrange the already stylized cube face at an angle of \(90^{\circ }\) to the projection plane in order to project its border into the plane of another cube face.
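
A minimal sketch of the projection in Eq. (20), assuming the points are given in the camera coordinate frame; the function name and array layout are illustrative.

```python
import numpy as np

def pinhole_project(points, d):
    """Project 3D points onto the image plane according to Eq. (20).

    points: array of shape (N, 3) in the camera coordinate frame.
    d: distance of the projection plane from the origin.
    Returns the projected coordinates of shape (N, 2).
    """
    points = np.asarray(points, dtype=float)
    return -d / points[:, 2:3] * points[:, :2]
```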

1.4.4 A.4.4 Evaluation Metric for Spherical Images

To evaluate whether a given region contains unusually high gradients compared to the rest of the image, we calculate the ratio of gradient magnitudes in that region to gradient magnitudes in the overall image. We take the maximum color gradient per pixel, i.e., we define the image gradient in x direction as \(G_x(\mathbf {p}) = {\text {cmax}}\Bigl (\mathbf {p}[red]_x, \mathbf {p}[green]_x, \mathbf {p}[blue]_x\Bigr )\). On that basis, we define the ratio \(r_x =\dfrac{|| \ \mathbf {s}_{x} \circ G_{x} ||_1}{|| \mathbf {s}_{x} ||_1}\Big / \dfrac{|| G_{x} ||_1}{D}\), where D is the dimensionality of the image and \(\mathbf {s}_{x}\) is a binary vector that is 1 in the region tested for unusual gradients (in x direction) and 0 everywhere else. By \(\circ \) we denote element-wise multiplication.

The error metric \(E_{gradient}\), a weighted average of the horizontal and vertical gradient ratios, is then calculated as follows:

$$\begin{aligned} E_{gradient} = \frac{ ||\mathbf {s}_{x}||_1 \cdot r_x + ||\mathbf {s}_{y}||_1 \cdot r_y }{||\mathbf {s}_{x}||_1 + ||\mathbf {s}_{y}||_1} \end{aligned}$$
(21)
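
The sketch below computes this metric for an RGB image, using simple forward differences as the gradient operator. The gradient discretization and function names are assumptions for illustration and are not taken from our evaluation code.

```python
import numpy as np

def gradient_error(image, mask_x, mask_y):
    """Border-artifact metric of Eq. (21) for an RGB image.

    image: float array of shape (H, W, 3).
    mask_x, mask_y: boolean masks (H, W) marking the regions tested for
    unusual horizontal / vertical gradients.
    """
    # Per-pixel maximum color gradient (cmax over channels), via forward differences.
    gx = np.abs(np.diff(image, axis=1, append=image[:, -1:])).max(axis=-1)
    gy = np.abs(np.diff(image, axis=0, append=image[-1:])).max(axis=-1)

    D = gx.size  # dimensionality of the image

    def ratio(G, mask):
        # Mean gradient magnitude inside the region relative to the whole image.
        return (G[mask].sum() / mask.sum()) / (G.sum() / D)

    r_x, r_y = ratio(gx, mask_x), ratio(gy, mask_y)
    n_x, n_y = mask_x.sum(), mask_y.sum()
    return (n_x * r_x + n_y * r_y) / (n_x + n_y)
```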

Cite this article

Ruder, M., Dosovitskiy, A. & Brox, T. Artistic Style Transfer for Videos and Spherical Images. Int J Comput Vis 126, 1199–1219 (2018). https://doi.org/10.1007/s11263-018-1089-z
