Research article

Animating landscape: self-supervised learning of decoupled motion and appearance for single-image video synthesis

Published: 08 November 2019

Abstract

Automatically generating a high-quality video from a single image remains challenging despite recent advances in deep generative models. This paper proposes a method that creates a high-resolution, long-term animation from a single landscape image using convolutional neural networks (CNNs), focusing mainly on skies and water. Our key observation is that motion (e.g., moving clouds) and appearance (e.g., time-varying colors in the sky) in natural scenes evolve on different time scales. We therefore learn them separately and predict them with decoupled control, handling the uncertainty of the future in both predictions by introducing latent codes. Unlike previous methods that infer output frames directly, our CNNs predict spatially smooth intermediate data via self-supervised learning, i.e., without explicitly provided ground truth: flow fields for warping (motion) and color transfer maps (appearance). These intermediate data are applied not to each previous output frame but to the input image, only once per output frame. This design is crucial for alleviating the error accumulation in long-term prediction that plagues previous recurrent approaches. The output frames can be looped like a cinemagraph, and can be controlled either directly, by specifying latent codes, or indirectly, via visual annotations. We demonstrate the effectiveness of our method through comparisons with state-of-the-art methods on video prediction as well as appearance manipulation. The resulting videos, code, and datasets are available at http://www.cgg.cs.tsukuba.ac.jp/~endo/projects/AnimatingLandscape.
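To make the "warp the input image, not the previous frame" design concrete, the following is a minimal PyTorch sketch of one plausible reading of the pipeline described above: per-frame flow fields are composed so that every output frame is produced by warping the input image exactly once, followed by a color transfer. The networks `motion_net` and `appearance_net`, their interfaces, the normalized time input, and the multiplicative transfer map are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of decoupled motion/appearance synthesis from one image.
# `motion_net` and `appearance_net` are hypothetical stand-ins for the
# paper's CNNs; only the warp-the-input-once structure follows the text.
import torch
import torch.nn.functional as F

def backward_warp(img, flow):
    """Sample img (N,C,H,W) at pixel positions displaced by flow (N,2,H,W)."""
    _, _, h, w = img.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=img.device),
        torch.arange(w, device=img.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)  # (1,2,H,W)
    coords = base + flow
    # Normalize sampling positions to [-1, 1] as grid_sample expects.
    grid_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    grid_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)              # (N,H,W,2)
    return F.grid_sample(img, grid, align_corners=True)

def animate(input_image, motion_net, appearance_net,
            z_motion, z_appearance, num_frames):
    """Generate num_frames frames from a single image (N,3,H,W)."""
    n, _, h, w = input_image.shape
    cum_flow = torch.zeros(n, 2, h, w, device=input_image.device)
    frames = []
    for t in range(num_frames):
        # Predict the next inter-frame flow from the current (warped)
        # frame; the latent code z_motion resolves future uncertainty.
        current = backward_warp(input_image, cum_flow)
        step_flow = motion_net(current, z_motion)             # (N,2,H,W)
        # Compose flows so each frame is one warp of the *input* image,
        # never a warp of the previous output frame.
        cum_flow = step_flow + backward_warp(cum_flow, step_flow)
        frame = backward_warp(input_image, cum_flow)
        # Appearance evolves on a slower time scale: predict a smooth
        # color transfer map for normalized time tau and apply it once.
        tau = t / max(num_frames - 1, 1)
        transfer = appearance_net(input_image, z_appearance, tau)
        frames.append(frame * transfer)  # multiplicative map: an assumption
    return frames
```

Because each frame is a single warp of the original image plus a single color transfer, per-frame prediction errors are not compounded by repeated resampling, which is exactly the error-accumulation problem the abstract attributes to frame-recurrent approaches.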


Supplemental Material

a175-endo.mp4 (MP4, 334.3 MB)



Published in

ACM Transactions on Graphics, Volume 38, Issue 6
December 2019, 1292 pages
ISSN: 0730-0301
EISSN: 1557-7368
DOI: 10.1145/3355089

      Copyright © 2019 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

Association for Computing Machinery, New York, NY, United States

