Abstract
Automatically generating a high-quality video from a single image remains a challenging task despite recent advances in deep generative models. This paper proposes a method that creates a high-resolution, long-term animation from a single landscape image using convolutional neural networks (CNNs), where we mainly focus on skies and water. Our key observation is that the motion (e.g., moving clouds) and appearance (e.g., time-varying colors in the sky) in natural scenes have different time scales. We thus learn them separately and predict them with decoupled control, while handling future uncertainty in both predictions by introducing latent codes. Unlike previous methods that infer output frames directly, our CNNs predict spatially smooth intermediate data, i.e., flow fields for warping in the case of motion and color transfer maps in the case of appearance, via self-supervised learning, i.e., without explicitly provided ground truth. These intermediate data are applied not to each previous output frame but to the input image, only once per output frame. This design is crucial for alleviating error accumulation in long-term prediction, which is the essential problem of previous recurrent approaches. The output frames can be looped like a cinemagraph, and can be controlled directly by specifying latent codes or indirectly via visual annotations. We demonstrate the effectiveness of our method through comparisons with state-of-the-art methods on video prediction as well as appearance manipulation. The resulting videos, code, and datasets will be available at http://www.cgg.cs.tsukuba.ac.jp/~endo/projects/AnimatingLandscape.
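To make the key design concrete, below is a minimal PyTorch sketch, not the authors' implementation: `motion_net` and `appearance_net` are hypothetical stand-ins for the trained predictors, and the backward-flow and additive color-transfer conventions are assumptions. It illustrates how per-step flow fields can be chained so that every output frame is produced by warping the input image exactly once, instead of recurrently warping previous outputs.

```python
import torch
import torch.nn.functional as F

def warp(image, flow):
    """Backward-warp a (1, C, H, W) tensor by a (1, 2, H, W) pixel flow:
    out(p) = image(p + flow(p))."""
    _, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32),
                            indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0) + flow.permute(0, 2, 3, 1)
    # Normalize sampling positions to [-1, 1] as grid_sample expects.
    grid[..., 0] = 2 * grid[..., 0] / (w - 1) - 1
    grid[..., 1] = 2 * grid[..., 1] / (h - 1) - 1
    return F.grid_sample(image, grid, align_corners=True)

def compose_flow(total_flow, step_flow):
    """Chain backward flows so `total_flow` always maps the input image
    directly to the current frame:
    f_{0->t}(p) = f_{t-1->t}(p) + f_{0->t-1}(p + f_{t-1->t}(p))."""
    return step_flow + warp(total_flow, step_flow)

# Hypothetical stand-ins for the trained predictors described in the paper.
def motion_net(frame, z):
    """Predicts the next-step backward flow field, shape (1, 2, H, W)."""
    return torch.zeros(1, 2, frame.shape[2], frame.shape[3])

def appearance_net(frame, z, t):
    """Predicts a per-pixel color transfer map, same shape as `frame`."""
    return torch.zeros_like(frame)

def animate(image, z_motion, z_appearance, num_frames):
    """Every frame is produced from the INPUT image with one warp plus one
    color transfer; only the accumulated flow evolves across time."""
    total_flow = torch.zeros(1, 2, image.shape[2], image.shape[3])
    frames = []
    for t in range(num_frames):
        current = warp(image, total_flow)           # latest frame, from input
        step_flow = motion_net(current, z_motion)   # flow for the next step
        total_flow = compose_flow(total_flow, step_flow)
        warped = warp(image, total_flow)            # single warp of the input
        frames.append(warped + appearance_net(warped, z_appearance, t))
    return frames

frames = animate(torch.rand(1, 3, 64, 64), None, None, num_frames=8)
```

Because `warp` always samples from the original image, interpolation blur and warping drift do not compound from frame to frame as they would in a recurrent frame-to-frame pipeline; any error is bounded by the accuracy of the single accumulated flow field.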