Free-Viewpoint RGB-D Human Performance Capture and Rendering

  • Conference paper

Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13676)
Abstract

Capturing and faithfully rendering photorealistic humans from novel views is a fundamental problem for AR/VR applications. While prior work has shown impressive performance capture results in laboratory settings, it is non-trivial to achieve casual free-viewpoint human capture and rendering for unseen identities with high fidelity, especially for facial expressions, hands, and clothes. To tackle these challenges, we introduce a novel view synthesis framework that generates realistic renders from unseen views of any human captured by a single, sparse RGB-D sensor, similar to a low-cost depth camera, and without actor-specific models. We propose an architecture that creates dense feature maps in novel views via sphere-based neural rendering and produces complete renders using a global context inpainting model. Additionally, an enhancer network improves the overall fidelity, even in areas occluded in the original view, producing crisp renders with fine details. We show that our method generates high-quality novel views of synthetic and real human actors given a single-stream, sparse RGB-D input, generalizes to unseen identities and new poses, and faithfully reconstructs facial expressions. Our approach outperforms prior view synthesis methods and is robust to different levels of depth sparsity.

P. Nguyen-Ha—This work was conducted during an internship at Meta Reality Labs Research.
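
The abstract outlines a three-stage pipeline: features lifted from the sparse RGB-D input are rendered into the novel view with a sphere-based neural renderer, the incomplete render is completed by a global context inpainting model, and an enhancer network sharpens the result. The sketch below is a minimal, hypothetical illustration of that data flow in PyTorch; the unprojection, the nearest-pixel splatting used as a stand-in for the sphere renderer, and all module names, channel sizes, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the pipeline described in the abstract (assumptions only):
# (1) lift sparse depth to a 3D point cloud, (2) splat per-point features into
# the target view (a crude stand-in for sphere-based neural rendering),
# (3) complete the render with an inpainting network, (4) refine with an enhancer.
import torch
import torch.nn as nn


def unproject(depth, K):
    """Lift a sparse depth map (H, W) to camera-space 3D points (N, 3)."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    valid = depth > 0                                   # pixels that have a depth sample
    z = depth[valid]
    x = (u[valid].float() - K[0, 2]) * z / K[0, 0]
    y = (v[valid].float() - K[1, 2]) * z / K[1, 1]
    return torch.stack([x, y, z], dim=-1), valid


def splat_features(points, feats, K, T, H, W, C):
    """Project points into the target view T (4x4) and scatter their features
    into a (C, H, W) map; empty pixels stay zero and are left to inpainting."""
    homo = torch.cat([points, torch.ones(points.shape[0], 1)], dim=-1)
    cam = (T @ homo.t()).t()[:, :3]                     # points in target camera frame
    u = (cam[:, 0] / cam[:, 2] * K[0, 0] + K[0, 2]).round().long()
    v = (cam[:, 1] / cam[:, 2] * K[1, 1] + K[1, 2]).round().long()
    keep = (u >= 0) & (u < W) & (v >= 0) & (v < H) & (cam[:, 2] > 0)
    out = torch.zeros(C, H, W)
    out[:, v[keep], u[keep]] = feats[keep].t()          # last write wins (no z-buffer)
    return out


class InpaintAndEnhance(nn.Module):
    """Toy stand-ins for the global context inpainting model and the enhancer."""

    def __init__(self, feat_ch=16):
        super().__init__()
        self.inpaint = nn.Sequential(
            nn.Conv2d(feat_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )
        self.enhance = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, feat_map):
        coarse = self.inpaint(feat_map.unsqueeze(0))    # complete the sparse render
        return self.enhance(coarse)                     # refine fine details


if __name__ == "__main__":
    H, W, C = 64, 64, 16
    K = torch.tensor([[60.0, 0.0, W / 2], [0.0, 60.0, H / 2], [0.0, 0.0, 1.0]])
    depth = torch.zeros(H, W)
    depth[::4, ::4] = 2.0                               # synthetic sparse depth samples
    points, _ = unproject(depth, K)
    feats = torch.rand(points.shape[0], C)              # per-point features (learned in practice)
    T = torch.eye(4)                                    # novel-view pose (identity here)
    feat_map = splat_features(points, feats, K, T, H, W, C)
    render = InpaintAndEnhance(C)(feat_map)
    print(render.shape)                                 # torch.Size([1, 3, 64, 64])
```

In the actual system, the splatting stand-in corresponds to a differentiable sphere-based renderer, and the two toy CNNs correspond to the proposed global context inpainting and enhancer networks.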



Acknowledgements

The authors would like to thank Albert Para Pozzo, Sam Johnson and Ronald Mallet for the initial discussions related to the project.

Author information

Corresponding author

Correspondence to Phong Nguyen-Ha.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 4398 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Nguyen-Ha, P., Sarafianos, N., Lassner, C., Heikkilä, J., Tung, T. (2022). Free-Viewpoint RGB-D Human Performance Capture and Rendering. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13676. Springer, Cham. https://doi.org/10.1007/978-3-031-19787-1_27

  • DOI: https://doi.org/10.1007/978-3-031-19787-1_27

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19786-4

  • Online ISBN: 978-3-031-19787-1

  • eBook Packages: Computer Science, Computer Science (R0)
