
Conditional Temporal Variational AutoEncoder for Action Video Prediction

Published in the International Journal of Computer Vision

Abstract

To synthesize a realistic action sequence from a single human image, it is crucial to model both the motion patterns and the diversity in the action video. This paper proposes an Action Conditional Temporal Variational AutoEncoder (ACT-VAE) to improve motion prediction accuracy and capture movement diversity. ACT-VAE predicts the pose sequence of an action clip from a single input image. It is implemented as a deep generative model that maintains temporal coherence according to the action category through novel temporal modeling of the latent space. Furthermore, ACT-VAE is a general action sequence prediction framework: when connected with a plug-and-play Pose-to-Image network, it can synthesize image sequences. Extensive experiments show that our approach predicts accurate poses and synthesizes realistic image sequences, surpassing state-of-the-art approaches. Compared with existing methods, ACT-VAE improves accuracy while preserving diversity.
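To make the idea concrete, the sketch below shows a minimal action-conditional temporal VAE in the spirit of the abstract: the latent code at each time step is chained to the previous one and conditioned on the action label, which is what keeps the predicted pose sequence temporally coherent. This is an illustrative simplification under assumed names and sizes (pose_dim, latent_dim, the GRU encoder, the MLP prior and decoder), not the authors' implementation; in practice, the single input image would first be converted to an initial pose with an off-the-shelf pose estimator.

```python
import torch
import torch.nn as nn


class ActConditionalTemporalVAE(nn.Module):
    """Minimal sketch of an action-conditional temporal VAE for pose
    sequences. Hypothetical simplification: layer sizes, names, and the
    exact latent transition are illustrative, not the paper's design."""

    def __init__(self, pose_dim=26, num_actions=15, latent_dim=32, hidden_dim=128):
        super().__init__()
        self.action_emb = nn.Embedding(num_actions, hidden_dim)
        # Posterior q(z_t | pose_{<=t}, action), summarized by a GRU state.
        self.encoder = nn.GRUCell(pose_dim + hidden_dim, hidden_dim)
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)
        # Prior p(z_t | z_{t-1}, action): chaining latents over time is the
        # temporal modeling that keeps consecutive poses coherent.
        self.prior = nn.Sequential(
            nn.Linear(latent_dim + hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 2 * latent_dim))
        # Decoder p(pose_t | z_t, action).
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, pose_dim))

    def forward(self, poses, action):
        """poses: (B, T, pose_dim) keypoint coordinates; action: (B,) labels."""
        B, T, _ = poses.shape
        a = self.action_emb(action)
        h = poses.new_zeros(B, self.encoder.hidden_size)
        z_prev = poses.new_zeros(B, self.to_mu.out_features)
        recon, kl = [], poses.new_zeros(())
        for t in range(T):
            h = self.encoder(torch.cat([poses[:, t], a], dim=-1), h)
            mu, logvar = self.to_mu(h), self.to_logvar(h)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
            p_mu, p_logvar = self.prior(torch.cat([z_prev, a], dim=-1)).chunk(2, dim=-1)
            # KL(q || p) between diagonal Gaussians, accumulated over time.
            kl = kl + 0.5 * (p_logvar - logvar
                             + (logvar.exp() + (mu - p_mu) ** 2) / p_logvar.exp()
                             - 1).sum(dim=-1).mean()
            recon.append(self.decoder(torch.cat([z, a], dim=-1)))
            z_prev = z
        return torch.stack(recon, dim=1), kl  # reconstructed poses, KL loss
```

Training such a model would minimize a pose reconstruction loss plus a weighted KL term between the posterior and the chained prior; at inference, sampling z_t from the prior rolls out diverse pose sequences for the chosen action, which a separate pose-to-image network could then render into frames.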


Data Availability

The data that support the results and analysis of this study are publicly available. The Penn Action dataset is available at http://dreamdragon.github.io/PennAction. The Human3.6M dataset is available at http://vision.imar.ro/human3.6m/description.php. The NTU RGB+D dataset is available at https://rose1.ntu.edu.sg/dataset/actionRecognition.


Acknowledgements

This work is supported by the Key Research Project of Zhejiang Lab (No. K2022PG1BB01) and the Research Project of Zhejiang Lab (No. 2022PD0AC02).

Author information


Corresponding author

Correspondence to Xiaogang Xu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (PDF 9480 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Xu, X., Wang, Y., Wang, L. et al. Conditional Temporal Variational AutoEncoder for Action Video Prediction. Int J Comput Vis 131, 2699–2722 (2023). https://doi.org/10.1007/s11263-023-01832-8

