Free-Viewpoint RGB-D Human Performance Capture and Rendering

  • Conference paper

Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13676)
Abstract

Capturing and faithfully rendering photorealistic humans from novel views is a fundamental problem for AR/VR applications. While prior work has shown impressive performance capture results in laboratory settings, it is non-trivial to achieve casual free-viewpoint human capture and rendering for unseen identities with high fidelity, especially for facial expressions, hands, and clothes. To tackle these challenges, we introduce a novel view synthesis framework that generates realistic renders from unseen views of any human captured by a single, sparse RGB-D sensor, similar to a low-cost depth camera, and without actor-specific models. We propose an architecture that creates dense feature maps in novel views via sphere-based neural rendering and produces complete renders using a global context inpainting model. Additionally, an enhancer network improves the overall fidelity, even in areas occluded in the original view, producing crisp renders with fine details. We show that our method generates high-quality novel views of synthetic and real human actors given a single-stream, sparse RGB-D input, generalizes to unseen identities and new poses, and faithfully reconstructs facial expressions. Our approach outperforms prior view synthesis methods and is robust to different levels of depth sparsity.

P. Nguyen-Ha—This work was conducted during an internship at Meta Reality Labs Research.
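
The abstract outlines a three-stage pipeline: features lifted from the sparse RGB-D input are rendered into the novel view with a sphere-based neural renderer, the incomplete render is completed by a global context inpainting model, and an enhancer network sharpens the result. The sketch below is a minimal, hypothetical illustration of that data flow in PyTorch; the unprojection, the nearest-pixel splatting used as a stand-in for the sphere renderer, and all module names, channel sizes, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the pipeline described in the abstract (assumptions only):
# (1) lift sparse depth to a 3D point cloud, (2) splat per-point features into
# the target view (a crude stand-in for sphere-based neural rendering),
# (3) complete the render with an inpainting network, (4) refine with an enhancer.
import torch
import torch.nn as nn


def unproject(depth, K):
    """Lift a sparse depth map (H, W) to camera-space 3D points (N, 3)."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    valid = depth > 0                                   # pixels that have a depth sample
    z = depth[valid]
    x = (u[valid].float() - K[0, 2]) * z / K[0, 0]
    y = (v[valid].float() - K[1, 2]) * z / K[1, 1]
    return torch.stack([x, y, z], dim=-1), valid


def splat_features(points, feats, K, T, H, W, C):
    """Project points into the target view T (4x4) and scatter their features
    into a (C, H, W) map; empty pixels stay zero and are left to inpainting."""
    homo = torch.cat([points, torch.ones(points.shape[0], 1)], dim=-1)
    cam = (T @ homo.t()).t()[:, :3]                     # points in target camera frame
    u = (cam[:, 0] / cam[:, 2] * K[0, 0] + K[0, 2]).round().long()
    v = (cam[:, 1] / cam[:, 2] * K[1, 1] + K[1, 2]).round().long()
    keep = (u >= 0) & (u < W) & (v >= 0) & (v < H) & (cam[:, 2] > 0)
    out = torch.zeros(C, H, W)
    out[:, v[keep], u[keep]] = feats[keep].t()          # last write wins (no z-buffer)
    return out


class InpaintAndEnhance(nn.Module):
    """Toy stand-ins for the global context inpainting model and the enhancer."""

    def __init__(self, feat_ch=16):
        super().__init__()
        self.inpaint = nn.Sequential(
            nn.Conv2d(feat_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )
        self.enhance = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, feat_map):
        coarse = self.inpaint(feat_map.unsqueeze(0))    # complete the sparse render
        return self.enhance(coarse)                     # refine fine details


if __name__ == "__main__":
    H, W, C = 64, 64, 16
    K = torch.tensor([[60.0, 0.0, W / 2], [0.0, 60.0, H / 2], [0.0, 0.0, 1.0]])
    depth = torch.zeros(H, W)
    depth[::4, ::4] = 2.0                               # synthetic sparse depth samples
    points, _ = unproject(depth, K)
    feats = torch.rand(points.shape[0], C)              # per-point features (learned in practice)
    T = torch.eye(4)                                    # novel-view pose (identity here)
    feat_map = splat_features(points, feats, K, T, H, W, C)
    render = InpaintAndEnhance(C)(feat_map)
    print(render.shape)                                 # torch.Size([1, 3, 64, 64])
```

In the actual system, the splatting stand-in corresponds to a differentiable sphere-based renderer, and the two toy CNNs correspond to the proposed global context inpainting and enhancer networks.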



Acknowledgements

The authors would like to thank Albert Para Pozzo, Sam Johnson and Ronald Mallet for the initial discussions related to the project.

Author information

Corresponding author

Correspondence to Phong Nguyen-Ha.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 4398 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Nguyen-Ha, P., Sarafianos, N., Lassner, C., Heikkilä, J., Tung, T. (2022). Free-Viewpoint RGB-D Human Performance Capture and Rendering. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13676. Springer, Cham. https://doi.org/10.1007/978-3-031-19787-1_27

  • DOI: https://doi.org/10.1007/978-3-031-19787-1_27

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19786-4

  • Online ISBN: 978-3-031-19787-1

  • eBook Packages: Computer Science, Computer Science (R0)
