Abstract
Pixel-aligned implicit functions (IFs) enable the reconstruction of 3D humans with complete, detailed clothing from a single RGB image. To improve robustness to pose, existing works introduce a parametric body model as a prior, but this limits the recovery of geometric detail and makes loose clothing difficult to handle. Our goal is to reconstruct both clothing and pose so that they closely align with the input image, even for unusual poses and complex clothing. To this end, we propose a multi-scale-feature implicit method, called RICH, which combines the flexibility of implicit functions with the strong prior of a parametric body model. RICH introduces a 3D human body model as prior knowledge and adopts local features to constrain body generation. Furthermore, RICH employs a pretrained image encoder to extract a global pixel-aligned feature, which enables high-precision, complete reconstruction of clothing geometry and of external appearance such as hair and accessories. In addition, by establishing connections with the joints of the body model, RICH uses an attention mechanism to construct a relative spatial feature, thereby increasing robustness to pose. Finally, RICH feeds the local, relative, and global features to the IF to query occupancy, and the clothed human is represented as the 0.5 iso-surface of the resulting 3D occupancy field. Quantitative and qualitative evaluations on the THuman2.0 and CAPE datasets show that RICH outperforms state-of-the-art methods. In particular, RICH generalizes well to in-the-wild images, even under challenging poses and complex clothing. The code and supplementary material will be available at https://github.com/lyk412/RICH.
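The final stage described above can be sketched as follows: an implicit function maps a 3D query point, together with its fused local, relative, and global features, to an occupancy value, and the clothed surface is the 0.5 iso-surface of the resulting occupancy field (extracted in practice with Marching Cubes). The toy single-layer "network", feature dimensions, and all names below are illustrative assumptions, not RICH's actual implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def implicit_function(point, local_feat, relative_feat, global_feat, W, b):
    """Toy IF: fuse the 3D query point with the three per-point feature
    vectors and map them to an occupancy probability in (0, 1).
    A single linear layer stands in for the real MLP."""
    z = np.concatenate([point, local_feat, relative_feat, global_feat])
    return sigmoid(W @ z + b)

rng = np.random.default_rng(0)
d = 4  # illustrative feature dimension for each of the three cues

# Random stand-ins for learned weights and extracted features.
W = rng.normal(size=3 + 3 * d)
b = 0.0
local_f, relative_f, global_f = (rng.normal(size=d) for _ in range(3))

p = np.array([0.1, -0.2, 0.3])  # one 3D query point
occ = implicit_function(p, local_f, relative_f, global_f, W, b)

# Points with occupancy above 0.5 lie inside the reconstructed surface;
# evaluating occ on a dense grid and thresholding at 0.5 defines the
# iso-surface that Marching Cubes would extract as a mesh.
inside = occ > 0.5
```

In a real pipeline this query runs over a dense voxel grid, and a Marching Cubes implementation (e.g. `skimage.measure.marching_cubes` with `level=0.5`) converts the occupancy volume into the final clothed-human mesh.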
Acknowledgments
This work was supported in part by the Shenzhen Key Laboratory of Next Generation Interactive Media Innovative Technology (No. ZDSYS20210623092001004), in part by the China Postdoctoral Science Foundation (No. 2023M731957), and in part by the National Natural Science Foundation of China under Grant 62306165.
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Lin, Y., Li, R., Lyu, K., Zhang, Y., Li, X. (2024). RICH: Robust Implicit Clothed Humans Reconstruction from Multi-scale Spatial Cues. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14426. Springer, Singapore. https://doi.org/10.1007/978-981-99-8432-9_16
DOI: https://doi.org/10.1007/978-981-99-8432-9_16
Print ISBN: 978-981-99-8431-2
Online ISBN: 978-981-99-8432-9