Stagemix video generation using face and body keypoints detection

Jung, Minjoon; Lee, Seunghyun; Sim, Eun Seon; Jo, Min Ho; Lee, Yu Jin; Choi, Hye Bin; Kwon, Junseok

doi:10.1007/s11042-022-13103-8

Published: 25 April 2022

Volume 81, pages 38531–38542, (2022)

Multimedia Tools and Applications

Minjoon Jung¹,
Seunghyun Lee²,
Eun Seon Sim³,
Min Ho Jo⁴,
Yu Jin Lee⁵,
Hye Bin Choi⁵ &
Junseok Kwon⁶ (ORCID: orcid.org/0000-0001-9526-7549)

Abstract

A Stagemix video plays multiple stage videos of a particular singer as if they were a single video. Video consumption has increased recently, and so has the demand for video editing. Stagemix videos have gained popularity in various communities, and a growing number of YouTubers upload videos with cross-cuts. In this work, we introduce a novel task, Stagemix video generation. Producing a Stagemix video manually requires considerable time and editing skill. To address this, we propose methods that generate a Stagemix video automatically and improve performance by using face or body keypoints extracted by a CNN-based extractor. Quantitative differences between frames and creation time show that our methods effectively produce a natural video.
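
The abstract does not specify how keypoints are used to join the stage videos, so the following is only a minimal sketch of one plausible reading: among time-aligned frames of two performances, cut where the detected keypoints differ the least. The names `keypoint_distance` and `best_cut_index`, the mean Euclidean distance measure, and the assumption that keypoints arrive as ordered (x, y) pairs from some external detector are all hypothetical and are not taken from the article.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]
Frame = Sequence[Point]  # keypoints of one frame, in a fixed order (assumed)


def keypoint_distance(a: Frame, b: Frame) -> float:
    """Mean Euclidean distance between corresponding keypoints of two frames."""
    if len(a) != len(b) or not a:
        raise ValueError("frames must have the same non-zero number of keypoints")
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)


def best_cut_index(video_a: Sequence[Frame], video_b: Sequence[Frame]) -> int:
    """Index of the time-aligned frame pair whose keypoints differ the least.

    Hypothetical selection rule: a small keypoint difference is taken as a
    proxy for a visually smooth transition from video_a to video_b.
    """
    n = min(len(video_a), len(video_b))
    if n == 0:
        raise ValueError("both videos need at least one frame")
    return min(range(n), key=lambda i: keypoint_distance(video_a[i], video_b[i]))
```

In such a sketch, the same per-frame distance could also serve as a rough "difference between frames" score for a finished cut, although the measure actually reported in the article may differ.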

Author information

Authors and Affiliations

Seoul National University, Seoul, South Korea
Minjoon Jung
University of Seoul, Seoul, South Korea
Seunghyun Lee
Konkuk University, Seoul, South Korea
Eun Seon Sim
Sogang University, Seoul, South Korea
Min Ho Jo
Ewha Womans University, Seoul, South Korea
Yu Jin Lee & Hye Bin Choi
School of Computer Science and Engineering, Chung-Ang University, Seoul, Korea
Junseok Kwon

Corresponding author

Correspondence to Junseok Kwon.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Jung, M., Lee, S., Sim, E.S. et al. Stagemix video generation using face and body keypoints detection. Multimed Tools Appl 81, 38531–38542 (2022). https://doi.org/10.1007/s11042-022-13103-8

Received: 15 March 2021
Revised: 20 May 2021
Accepted: 04 April 2022
Published: 25 April 2022
Issue Date: November 2022
DOI: https://doi.org/10.1007/s11042-022-13103-8
