Abstract
Automatically generating a high-quality video from a single image remains a challenging task despite recent advances in deep generative models. This paper proposes a method that creates a high-resolution, long-term animation from a single landscape image using convolutional neural networks (CNNs), where we mainly focus on skies and water. Our key observation is that the motion (e.g., moving clouds) and appearance (e.g., time-varying colors in the sky) in natural scenes have different time scales. We thus learn them separately and predict them with decoupled control, while handling future uncertainty in both predictions by introducing latent codes. Unlike previous methods that infer output frames directly, our CNNs predict spatially smooth intermediate data, i.e., flow fields for warping in the case of motion and color transfer maps in the case of appearance, via self-supervised learning, i.e., without explicitly provided ground truth. These intermediate data are applied not to each previous output frame but to the input image, only once per output frame. This design is crucial for alleviating error accumulation in long-term prediction, which is the essential problem of previous recurrent approaches. The output frames can be looped like a cinemagraph, and can be controlled directly by specifying latent codes or indirectly via visual annotations. We demonstrate the effectiveness of our method through comparisons with state-of-the-art methods on video prediction as well as appearance manipulation. The resulting videos, code, and datasets will be available at http://www.cgg.cs.tsukuba.ac.jp/~endo/projects/AnimatingLandscape.
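To make the key design concrete, below is a minimal PyTorch sketch, not the authors' implementation: `motion_net` and `appearance_net` are hypothetical stand-ins for the trained predictors, and the backward-flow and additive color-transfer conventions are assumptions. It illustrates how per-step flow fields can be chained so that every output frame is produced by warping the input image exactly once, instead of recurrently warping previous outputs.

```python
import torch
import torch.nn.functional as F

def warp(image, flow):
    """Backward-warp a (1, C, H, W) tensor by a (1, 2, H, W) pixel flow:
    out(p) = image(p + flow(p))."""
    _, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32),
                            indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0) + flow.permute(0, 2, 3, 1)
    # Normalize sampling positions to [-1, 1] as grid_sample expects.
    grid[..., 0] = 2 * grid[..., 0] / (w - 1) - 1
    grid[..., 1] = 2 * grid[..., 1] / (h - 1) - 1
    return F.grid_sample(image, grid, align_corners=True)

def compose_flow(total_flow, step_flow):
    """Chain backward flows so `total_flow` always maps the input image
    directly to the current frame:
    f_{0->t}(p) = f_{t-1->t}(p) + f_{0->t-1}(p + f_{t-1->t}(p))."""
    return step_flow + warp(total_flow, step_flow)

# Hypothetical stand-ins for the trained predictors described in the paper.
def motion_net(frame, z):
    """Predicts the next-step backward flow field, shape (1, 2, H, W)."""
    return torch.zeros(1, 2, frame.shape[2], frame.shape[3])

def appearance_net(frame, z, t):
    """Predicts a per-pixel color transfer map, same shape as `frame`."""
    return torch.zeros_like(frame)

def animate(image, z_motion, z_appearance, num_frames):
    """Every frame is produced from the INPUT image with one warp plus one
    color transfer; only the accumulated flow evolves across time."""
    total_flow = torch.zeros(1, 2, image.shape[2], image.shape[3])
    frames = []
    for t in range(num_frames):
        current = warp(image, total_flow)           # latest frame, from input
        step_flow = motion_net(current, z_motion)   # flow for the next step
        total_flow = compose_flow(total_flow, step_flow)
        warped = warp(image, total_flow)            # single warp of the input
        frames.append(warped + appearance_net(warped, z_appearance, t))
    return frames

frames = animate(torch.rand(1, 3, 64, 64), None, None, num_frames=8)
```

Because `warp` always samples from the original image, interpolation blur and warping drift do not compound from frame to frame as they would in a recurrent frame-to-frame pipeline; any error is bounded by the accuracy of the single accumulated flow field.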