High-Detail Animation of Human Body Shape and Pose From High-Resolution 4D Scans Using Iterative Closest Point and Shape Maps

Abstract: In this article, we present a method of analysis for 3D scanning sequences of human bodies in motion that allows us to obtain a computer animation of a virtual character containing both skeleton motion and high-detail deformations of the body surface geometry, resulting from muscle activity, the dynamics of the motion, and tissue inertia. The developed algorithm operates on a sequence of 3D scans with high spatial and temporal resolution. The presented method can be applied to scans in the form of both triangle meshes and 3D point clouds. One of the contributions of this work is the use of the Iterative Closest Point algorithm with motion constraints for pose tracking, which has been problematic so far. We also introduce shape maps as a tool to represent local body segment deformations. An important feature of our method is the possibility to change the topology and resolution of the output mesh and the topology of the animation skeleton in individual sequences, without requiring time-consuming retraining of the model. Compared to the state-of-the-art Skinned Multi-Person Linear (SMPL) method, the proposed algorithm yields almost twofold better accuracy in shape mapping.


Introduction
Measurement and modeling of human body movement is a research field with many visual, medical, and monitoring applications. The data source initially used by scientists was video material. Many methods of pose tracking have been developed on the basis of image sequences, but a lack of information about 3D geometry has been a significant limitation. With the emergence of cheap RGBD sensors, there has been growing interest among scientists in the analysis of unidirectional 3D scans [1,2]. A number of publications have focused on the analysis of unidirectional RGBD data, but the related methods still yield low-resolution output data and suffer from a number of problems, particularly with respect to position estimation from partial views, due to sensor noise and the problem of occlusion [2,3]. The use of 4D scanners (3D scanners capable of capturing geometry multiple times per second) is cost-intensive and, so, has remained reserved for a small group of researchers who have access to the necessary equipment. With the emergence of public scanning datasets, the topic of high-resolution reconstruction of motion and deformation of the human body has gained popularity. Reconstructing the movement and shape of a body on the basis of a sequence of 3D scans is a challenging task, due to the deformations of body shape deviating from rigid body dynamics and due to the amount and nature of the input data. The measurement data take the form of a series of point clouds or triangle meshes, where the vertices are not correlated with each other between individual moments in time (i.e., the number of vertices/points can change from frame to frame). In computer graphics applications, a homogeneous topology of geometry in consecutive sequence frames is a basic requirement, in view of rendering performance. Character animation also requires the definition of an animation skeleton for skinning the mesh.
The aim of this work is to transfer changes in both the body shape and the pose of the animation skeleton on the basis of 4D scans with high spatial and temporal resolution.
The main contributions of this work are as follows:
• The development of a method to transfer human body surface deformations with geometrical accuracy better than current solutions (one of which is taken as a reference method);
• The introduction of shape maps as a tool to transfer local deformations;
• The incorporation of the Iterative Closest Point method (in the Point-to-Plane form [4,5], with rejection and optimization using the Levenberg-Marquardt solver [6,7]) with motion constraints for pose tracking in 4D scan sequences with high temporal and spatial resolution;
• Enabling the selection of a different mesh topology and resolution, as well as a skeleton, for each measurement sequence, without needing to retrain the model;
• Higher-resolution output meshes than state-of-the-art methods.
The remainder of the article is organized as follows: in Section 2, we give an overview of research on 4D data analysis and related topics. In Section 3.1, we provide details about the scanning data used in this work; in Section 3.2, we present the general outline of our algorithm; and, in Sections 3.3 and 3.4, we describe the skeleton tracking and shape transfer methods, respectively, in detail. In Section 4, we present the results of an evaluation of the reconstruction quality of our method, compared to the reference method. The article concludes with a summary and description of future work, in Section 5.

Related Work
The appearance of low-cost RGBD cameras on the market (e.g., Kinect) has contributed to a growing interest in the subject of 4D data analysis by researchers worldwide [1][2][3][8][9][10][11][12][13][14]. The data obtained by such sensors are heavily noisy, but their low price and the additional depth information correlated with the RGB image have created new opportunities. The authors of Reference [3] used three such sensors, together with pressure sensors placed in shoes, for pose estimation and registration of the triangle mesh. Barros et al. [13] presented a method for pose estimation based on scans of a human body, using two opposite RGBD cameras and a pre-defined skeleton model. Once the skeleton base point is initialized with Principal Component Analysis (PCA), the individual scan parts are iteratively segmented and fitted based on Expectation Maximization (EM). In the skeleton model used, the geometric relationships between individual skeleton nodes are strictly defined. Some works have also focused on hand movement tracking [9,15,16]. Tsoli and Argyros [15] proposed a joint optimization method through energy minimization to track the motion of a hand in contact with a cloth surface. They track both the deformations of the object and the pose of the hand interacting with it, based on data from a Kinect2 sensor. One must also note the research on robotic reconstruction of environment geometry [17][18][19][20]. One of the best-known groups of methods in this field is called Structure from Motion (SfM) [21,22], where the geometry of the environment is computed from a series of 2D images taken from different viewpoints. To this group belongs the algorithm developed by Giannarou and Yang [17], which focuses on the reconstruction of a surgical environment observed with an endoscopic camera. The authors incorporated an Unscented Kalman Filter, along with data from an Inertial Measurement Unit (IMU), to achieve deformable Structure from Motion. A work by Gotardo et al. [18] solves a rigid SfM problem by estimating the smooth time-trajectory of a camera moving around an object. The introduction of a parametrization in the Discrete Cosine Transform domain, along with the assumption of a smooth camera trajectory, enabled the researchers to perform non-rigid SfM in the presence of occlusions.
To date, few works have focused on high-resolution data, due to the high cost of obtaining such a system. One of the first collections of this type, available for a fee, is the CAESAR dataset (Civilian American and European Surface Anthropometry Resource Project) [23], containing about 4400 scans of different people. The pioneering work in the analysis of high-resolution 3D scans includes the SCAPE method (Shape Completion and Animation of People) [24], which is based on a model of body shape deformation as a function of the basic body shape of a given person and their pose at a specific moment. The authors presented the use of their method for the completion of unidirectional scans, as well as for the generation of mesh deformations in a sequence, based on a static scan of a person and a sequence of marker motions obtained using a Motion Capture system. Another dataset of scans acquired with the use of a 3DMD scanner has been made publicly available, under the name Dynamic FAUST (DFAUST) [25], which has provided important motivation for research development in the field of high-resolution 4D scan analysis techniques [23,[25][26][27][28][29][30][31][32][33]. This dataset consists of thousands of 3D scans of various people in different poses. The authors of the dataset developed it using the Skinned Multi-Person Linear (SMPL) model [34], which is a trained model of various body shapes and their deformations with pose changes. Next, the authors used this model for pose and shape estimation from 2D images through the incorporation of a Convolutional Neural Network (CNN) [27,28]. The researchers also managed to develop an analogous model for infants (SMIL, Skinned Multi-Infant Linear model) using information from RGBD sequences [35,36]. Dyna [37] is a model which describes soft-tissue deformations, depending on the body shape and motion dynamics.
A linear PCA sub-space is used for the representation of soft-tissue deformations, although (as pointed out in [38]) the model has difficulty reflecting highly non-linear deformations. DMPL (Dynamic-SMPL) [34] is based on a similar concept to Dyna, increasing the reconstruction quality by defining the model with vertices instead of triangles (as was the case in the Dyna model). SoftSMPL [38] improved on these models by introducing a non-linear deformation model and a new motion descriptor.
Further studies have involved deep learning for pose and shape estimation in 4D scans [38][39][40][41][42][43][44]. One of the directions in this area is the use of a Variational Autoencoder (VAE) to encode the 3D scan shape [40][41][42][43][44][45][46]. Litany et al. [42] used a VAE for body shape completion based on a partial view. In the aforementioned work, a neural network operates directly on the vertex positions in 3D space, instead of manually selected features (as in [44]). Jiang et al. [45,46] proposed a VAE-like network for learning the shape and pose of a human body from a sequence of triangle meshes with the same topology. For this purpose, they used measurement datasets (i.e., SCAPE, FAUST, MANO (hand Model with Articulated and Non-rigid defOrmations), Dyna, DFAUST, and CAESAR), applying to them a novel homogeneous mesh registration method based on as-consistent-as-possible (ACAP) representations, proposed originally by Gao et al. [47]. Another work [48] extended the SMPL model with skeleton information, determining point features in the input data using PointNet++ and mapping them to the joints. Owing to this operation, the authors improved the process of learning parameters in the SMPL model.
In summary, the existing methods for pose and shape tracking in human 3D scan sequences generally focus on the concept presented by the SMPL method, which uses a model trained on an extensive dataset of sample measurements. The works extending this model, such as SoftSMPL or the work of Jiang et al., have introduced improvements in model training, but still yield low-resolution output meshes. In these methods, it is also impossible to easily change the topology of the output mesh and the skeleton from sequence to sequence, as it is necessary to retrain the model on at least several thousand scans.

Materials
We used data from the Dynamic FAUST dataset [25], which provides sequences of scans of the human body surface in motion, in the form of a sequence of triangle meshes recorded at a frequency of 60 fps. The meshes are not correlated to each other in consecutive frames (i.e., the number of triangles and vertices changes from frame to frame). To evaluate the reconstruction quality, two sequences of 100 frames were used (Figures 1 and 2).

Method Overview
The presented method uses a sequence of 3D scans of the human body in motion and translates this information into a computer animation of a virtual character. In the measurement dataset used, each measurement frame contains data in the form of a triangle mesh; however, in the proposed process, data in the form of 3D point clouds can be used as well.
The proposed method consists of three stages:
• Pose tracking based on the displacement of individual segments of an input scan, using a variant of the Iterative Closest Point method (hereafter referred to as tracking);
• Mapping of the scan shape in a sequence, using shape maps (hereafter referred to as mapping);
• Morphing (registration) of a uniform template mesh to the generated shape maps (hereafter called morphing).
The input data for the algorithm are as follows:
• A sequence of 3D scans of the human body in motion, in the form of either meshes or point clouds;
• An animation skeleton pose for the first frame;
• A template mesh skinned to the animation skeleton for the first frame (we used automatic weights generated by the Heat Bone Weighting method [49] implemented in Blender).
The general outline of the proposed method is as follows (Figure 3): the tracking algorithm, based on two measurement frames, (t) and (t + 1), and the skeleton for frame (t), calculates the segmentation of scan (t) for the skeleton and the pose for frame (t + 1). The scan segmentation in frame (t), together with the scan, is then used by the mapping and morphing algorithms to transfer the shape to the template mesh. After obtaining the shape maps for all the segments in a given frame, the template mesh is posed according to the skeleton in frame (t) and morphed based on the shape maps, resulting in the final triangle mesh for frame (t). This process is repeated for subsequent pairs of frames (t + 1, t + 2) (note that tracking produced the skeleton for frame (t + 1)), until the end of the measurement sequence is reached. After processing the entire scanning sequence, a series of triangle meshes with the same topology as the template mesh, together with a series of skeleton poses, is obtained as the output of the algorithm. The shapes of the meshes reflect the scan geometry in individual frames, while the skeleton matches the body pose change in subsequent frames of the sequence.
Figure 3. From left to right: sequence of 3D scans (gray) and skeleton for the first frame; pose tracking step; resulting skeleton for frame (t + 1) and segmentation (multicolor) for frame (t); shape maps computation step; template mesh (blue); template mesh deformation step; and final reconstructed mesh for frame (t).
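The per-frame pipeline described above can be sketched as a short driver loop. This is a minimal sketch, not the authors' implementation: `track_pose`, `compute_shape_maps`, and `morph_template` are hypothetical stand-ins for the three stages, passed in as callables; their internals are described in the following sections.

```python
def process_sequence(scans, skeleton_t0, template_mesh,
                     track_pose, compute_shape_maps, morph_template):
    """Sketch of the per-frame pipeline.

    scans         : sequence of per-frame 3D scans (meshes or point clouds)
    skeleton_t0   : animation skeleton pose for the first frame
    template_mesh : template mesh skinned to the skeleton
    The three stage callables are illustrative placeholders.
    """
    skeleton = skeleton_t0
    meshes, poses = [], []
    # Tracking works on frame pairs, so the sketch stops one frame early.
    for t in range(len(scans) - 1):
        # Tracking: segment scan (t) and fit segments to scan (t + 1),
        # yielding the segmentation for (t) and the skeleton for (t + 1).
        segmentation, skeleton_next = track_pose(scans[t], scans[t + 1], skeleton)
        # Mapping: build one shape map per segment from scan (t).
        shape_maps = compute_shape_maps(scans[t], segmentation, skeleton)
        # Morphing: pose the template to skeleton (t) and deform it.
        meshes.append(morph_template(template_mesh, skeleton, shape_maps))
        poses.append(skeleton)
        skeleton = skeleton_next
    return meshes, poses
```

The output is a series of meshes sharing the template topology plus the tracked skeleton poses, matching the description above.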

Skeleton Tracking
Skeleton tracking (Figure 4) operates on a sequence of 3D scans and a skeleton for the first frame. Starting from the first frame, the current frame (t) is segmented based on the skeleton pose for that frame. Then, each of the generated segments is fitted to the point cloud of the next frame (t + 1) using the Iterative Closest Point (ICP) method. On the grounds of the obtained transform change of the segment, the transform of the relevant bone is updated by applying motion constraints specific to that bone. After analyzing all the segments (bones) in this way, the pose of the next frame (t + 1) is obtained. This procedure is iteratively repeated for successive frames of the sequence.
Figure 4. Bone update procedure. From left to right: input scan for frame (t) with skeleton (bottom) and scan for frame (t + 1) (upper); extracted segment for bone (B) (green) in frame (t); Iterative Closest Point (ICP) step; segment for bone (B) fitted to frame (t + 1), using ICP; constraints step; obtained transformation change of the bone (B) in frame (t + 1); and resulting skeleton for frame (t + 1), along with the scan for frame (t + 1).
The purpose of segmentation (Figure 5) is to assign each point from the input scan to a skeleton bone. In the devised method, a point is assigned to the nearest bone which meets the cutoff plane, normal vector compatibility, and coplanarity conditions. Filtration with the cutoff plane consists of defining, for each bone (B_n) in the skeleton, a cutoff plane (Z_n) anchored at the end of the preceding bone (B_n-1), whose normal vector (N_Zn) is directed according to the sum of the vectors of the given bone (B_n) and the preceding bone (B_n-1) (see Figure 6). A point (P) cannot be assigned to the bone (B) if it lies behind this plane. The normal vector parallelism condition allows us to exclude incorrect assignment of points from large segments (e.g., torso) to the bones of smaller segments (e.g., hands). According to this filtering, a point (P) may be assigned to bone (B) if the scalar product of its normal vector (N) and the normalized radius vector (R) from the bone to the point is less than a defined threshold (Figure 7, Equation (1)):

(R · N) / |R| < α, (1)

where R is the radius vector from the bone to point (P), N is the normal vector in point (P), and α is the threshold. The next condition, coplanarity, is defined as follows (Equation (2)):

|(R · B) / |R|| < β, (2)

where R is the radius vector from the bone to point (P), N is the normal vector in point (P), B is the binormal vector in point (P), and β is the threshold. Among the bones in the skeleton that pass the three above tests against the point (P), the nearest one is selected and assigned to P. After completing the operation for all points, the values for unassigned points are populated using a median filter. Finally, a median filter with a fixed minimum number of neighbors, equal to half the number of vertices in the neighborhood, is applied. With this last filtration, small groups of incorrectly assigned vertices are fixed.
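The three per-point conditions can be sketched as a single filter function. This is a minimal sketch under stated assumptions: the function name, the perpendicular-foot construction of the radius vector, and the exact normalization and sign conventions for the thresholds α and β are illustrative choices, not taken from the article.

```python
import numpy as np

def passes_bone_filters(p, n, b, bone_start, bone_dir,
                        plane_point, plane_normal, alpha, beta):
    """Check the three segmentation conditions for point p against a bone.

    p, n, b      : 3D point, its unit normal, and its unit binormal
    bone_start   : bone origin; bone_dir: unit bone direction
    plane_point, plane_normal : cutoff plane Z for this bone
    alpha, beta  : thresholds of the parallelism and coplanarity tests
    Sign conventions here are assumptions; the point is assumed off-axis.
    """
    # Cutoff plane condition: reject points lying behind the plane Z.
    if np.dot(p - plane_point, plane_normal) < 0.0:
        return False
    # Radius vector R from the bone axis to the point (perpendicular foot).
    t = np.dot(p - bone_start, bone_dir)
    foot = bone_start + t * bone_dir
    r = p - foot
    r_hat = r / np.linalg.norm(r)
    # Normal compatibility: scalar product below the threshold alpha.
    if np.dot(r_hat, n) >= alpha:
        return False
    # Coplanarity: R roughly perpendicular to the binormal (below beta).
    if abs(np.dot(r_hat, b)) >= beta:
        return False
    return True
```

Among all bones for which the filter passes, the nearest one would then be selected, as described above.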
On the grounds of the above segmentation, the scan in frame (t) is divided into fragments corresponding to individual bones. In order to track the bone transform in the next frame, the segment for the bone (B) is fitted to the scan in frame (t + 1) using the Iterative Closest Point method. Among the many variants of this algorithm, the Point-to-Plane [4,5] version with rejection and optimization using the Levenberg-Marquardt solver [6,7] was chosen. To estimate the correspondence between the set of segment points and the next frame's scan, a random subset of the segment's points is used. After establishing pairs of corresponding points in both clouds, pairs farther apart than a defined threshold are rejected. The choice of the Point-to-Plane objective function, as pointed out by Rusinkiewicz [5], ensures faster convergence than the standard Point-to-Point algorithm.
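A single Point-to-Plane update with pair rejection can be sketched as follows. This is a simplified illustration, not the article's implementation: correspondences are assumed precomputed, and a plain linearized least-squares solve (small-angle approximation) stands in for the Levenberg-Marquardt solver used in the article.

```python
import numpy as np

def point_to_plane_step(src, dst, dst_normals, reject_dist):
    """One linearized point-to-plane ICP update.

    src         : (N, 3) segment points from frame (t)
    dst         : (N, 3) corresponding points in the frame (t + 1) scan
    dst_normals : (N, 3) unit normals at the dst points
    reject_dist : pairs farther apart than this are discarded
    Returns a 4x4 rigid transform (small-angle rotation approximation).
    """
    # Rejection stage: drop correspondences with excessive distance.
    keep = np.linalg.norm(src - dst, axis=1) < reject_dist
    p, q, n = src[keep], dst[keep], dst_normals[keep]
    # Point-to-plane residuals: ((p - q) . n) should go to zero.
    b = -np.einsum('ij,ij->i', p - q, n)
    # Jacobian: rotation part cross(p, n), translation part n.
    A = np.hstack([np.cross(p, n), n])
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    ax, ay, az, tx, ty, tz = x
    # Compose small-angle rotation and translation into a 4x4 transform.
    T = np.eye(4)
    T[:3, :3] = np.array([[1.0, -az, ay],
                          [az, 1.0, -ax],
                          [-ay, ax, 1.0]])
    T[:3, 3] = [tx, ty, tz]
    return T
```

In practice this step would be iterated, re-establishing nearest-neighbour correspondences on a random subset of the segment's points each iteration, as described above.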
After fitting the segment to the scan in the next frame, the bone position is updated based on the input and output transforms of the segment from ICP, taking into account the motion constraints of the given bone. We consider three types of constraints: no constraint, translation constraints, and kinematic chain constraints. The first type is applicable in the case of the hip segment, from which the entire skeleton tree begins. The skeleton root can move freely, reflecting the change in position and orientation of the scanned subject. The second type, translation constraints, applies to all bones located lower in the skeleton tree, as the continuity of the bone chain must be maintained. The last type of constraint finds application in the case of the leg and arm helper bones, which do not take part in the segmentation process.
In this case, position changes of the bone start point are allowed, but must maintain the distance to the preceding spine bone. In this way, the bone moves on the surface of a sphere with radius equal to the helper bone length and the helper bone transform is adjusted to close the skeleton chain.
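The sphere constraint for helper bones reduces to projecting the proposed bone start point back onto a sphere centred at the preceding bone's endpoint. A minimal sketch, with an illustrative function name and signature not taken from the article:

```python
import numpy as np

def constrain_to_sphere(proposed_start, center, radius):
    """Project a proposed helper-bone start point onto the sphere of the
    given radius around the preceding spine bone's endpoint, preserving
    the distance to that bone as the kinematic-chain constraint requires.
    """
    d = proposed_start - center
    norm = np.linalg.norm(d)
    if norm == 0.0:
        # Degenerate case: pick an arbitrary point on the sphere.
        return center + np.array([radius, 0.0, 0.0])
    # Keep the direction from the center, rescale to the bone length.
    return center + d * (radius / norm)
```

The helper bone's transform would then be adjusted so that the skeleton chain closes through the projected point.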

Shape Mapping and Morphing
The transfer of deformations relies on shape maps, whose values are computed based on the scan; the template mesh is then deformed according to these maps. The shape map construction schema is presented in Figure 8. The shape map of a given segment is defined by a parametric mapping in the local co-ordinate system of the segment (e.g., a spherical mapping or capsule mapping [50][51][52]). The parametric mapping converts the 3D co-ordinates of the segment measurement points into 2D co-ordinates on the shape map and the map value at these points. The value of the shape map at a given point corresponds to the third co-ordinate of the parametric mapping (i.e., the distance to the center of the mapping; Equation (3)):

(u, v, r) = f(x, y, z, C), (3)

where u, v are the shape map co-ordinates, r is the value of the shape map, f is the parametric mapping, x, y, z are the 3D co-ordinates of the segment point, and C is the mapping center. The direct meaning of the u, v co-ordinates depends on the parametric mapping of choice. In the example of spherical mapping as a parametric mapping, the (u, v) co-ordinates denote the polar angle and the azimuthal angle, respectively (Figure 9). In this case, the function f takes the following form (Equation (4)):

r = |(x, y, z) - C|, u = arccos((z - C_z) / r), v = atan2(y - C_y, x - C_x), (4)

After mapping all segment points from frame (t), the map values in the entire domain are computed. The shape map is divided into a grid with a predefined resolution and, for each cell, the map value is established by averaging the values from points falling into this cell. Next, in order to fill missing values in cells without measurement points, a mipmapping technique is used, which assigns such cells the value of a lower-resolution mipmap. This helps to avoid artifacts and holes in the morphed mesh, while simultaneously maintaining the high resolution of the basis shape map and, thus, keeping the reconstruction error low. Finally, a Gaussian filter is applied to the shape map, in order to achieve smoother transitions between grid cells.
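The spherical variant of the parametric mapping and the grid-averaging step can be sketched as follows. This is a minimal sketch: the grid-index convention is an assumption, and the mipmap fill and Gaussian smoothing described above are omitted for brevity (empty cells are left as NaN).

```python
import numpy as np

def spherical_map(points, center):
    """Spherical parametric mapping: 3D points -> (u, v, r), with u the
    polar angle, v the azimuthal angle, and r the distance to center."""
    d = points - center
    r = np.linalg.norm(d, axis=1)
    u = np.arccos(np.clip(d[:, 2] / r, -1.0, 1.0))  # polar angle [0, pi]
    v = np.arctan2(d[:, 1], d[:, 0])                # azimuth (-pi, pi]
    return u, v, r

def build_shape_map(u, v, r, res_u, res_v):
    """Average r over a res_u x res_v grid; cells without samples stay
    NaN (filled by mipmapping in the article, omitted here)."""
    iu = np.clip((u / np.pi * res_u).astype(int), 0, res_u - 1)
    iv = np.clip(((v + np.pi) / (2 * np.pi) * res_v).astype(int), 0, res_v - 1)
    acc = np.zeros((res_u, res_v))
    cnt = np.zeros((res_u, res_v))
    np.add.at(acc, (iu, iv), r)   # unbuffered accumulation per cell
    np.add.at(cnt, (iu, iv), 1)
    return np.where(cnt > 0, acc / np.maximum(cnt, 1), np.nan)
```

A capsule mapping, as used for some segments in the article, would replace `spherical_map` with a mapping adapted to the elongated segment shape.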
Figure 10 presents a probed shape map for the arm segment in the time frame t_0 of the "jumping jacks" sequence. Left: mesh generated with parameter space sampling; right: shape map shown directly, with color-coded values. A blue rectangular area in the bottom right part of the shape map corresponds to the upper left part of the mesh: no points were present in this area of the arm segment, so the shape map values there are zero. The shape map used in this example has a resolution of 100 × 120. The segment was mapped using capsule mapping.
In order to obtain, as output, a mesh with homogeneous topology across the entire sequence, we used a template character mesh and morphed it according to the shape maps. The template mesh has a skin correlated with the skeleton from the tracking algorithm. The skin is defined using default values for the bones, in compliance with the Heat Bone Weighting method [49] (implemented in the open-source 3D-creation software Blender). First, the segmentation of the mesh to the skeleton in pose t_0 is performed, in order to know which shape map should be used to morph each vertex of the template. Next, the mesh is posed based on the pose in frame (t), using the skinning method. The initially fitted mesh prepared in this way is morphed according to the values of the shape map of the given segment (Figure 11). The vertices of the template mesh are mapped into the parametric space, a new value of the distance at each point is read, and the reverse mapping is applied (i.e., from the parametric to the 3D space). In order to reduce visible faults on the output mesh in the transition areas between individual grid cells of the shape map, the map value at point (u, v) is established through bilinear interpolation of the values of the cells adjacent to (u, v). It should be noted that the resolution of the template mesh is independent of the resolution of the shape maps, such that it is possible to obtain preview results on a lower-resolution mesh, with significantly smaller calculation time.
Figure 11. From left to right: template mesh segment for torso, template after registration, and probed shape map. The map was sampled with resolution 650 × 600 in parameter space.
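The per-vertex morphing step (forward mapping, bilinearly interpolated map lookup, inverse mapping) can be sketched for the spherical case. A minimal sketch under stated assumptions: the fractional grid-coordinate convention and function names are illustrative, and the capsule-mapping variant is not shown.

```python
import numpy as np

def bilinear_sample(shape_map, u, v):
    """Bilinear interpolation of the shape map at continuous grid
    coordinates (u, v), given in fractional cell units."""
    H, W = shape_map.shape
    u0 = int(np.clip(np.floor(u), 0, H - 2))
    v0 = int(np.clip(np.floor(v), 0, W - 2))
    du, dv = u - u0, v - v0
    return ((1 - du) * (1 - dv) * shape_map[u0, v0]
            + (1 - du) * dv * shape_map[u0, v0 + 1]
            + du * (1 - dv) * shape_map[u0 + 1, v0]
            + du * dv * shape_map[u0 + 1, v0 + 1])

def morph_vertex(vertex, center, shape_map, res_u, res_v):
    """Map a template vertex to (u, v), read the interpolated radius,
    and apply the inverse spherical mapping: the vertex keeps its
    direction from the center but takes the radius stored in the map."""
    d = vertex - center
    r = np.linalg.norm(d)
    u = np.arccos(np.clip(d[2] / r, -1.0, 1.0)) / np.pi * (res_u - 1)
    v = (np.arctan2(d[1], d[0]) + np.pi) / (2 * np.pi) * (res_v - 1)
    new_r = bilinear_sample(shape_map, u, v)
    return center + d / r * new_r
```

Because only the radius is replaced, the morph is independent of the template resolution, which is what allows low-resolution preview meshes as noted above.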
The process above is repeated for each frame, resulting in a series of morphed meshes with topology and vertex count congruent to the template mesh. Finally, these meshes can be applied as shape keys to the template mesh. In Figure 12, we present reconstructions of frames selected from the test sequence.

Results and Discussion
An evaluation of the fitting error was carried out, in order to determine the reconstruction quality between the output mesh and the input scan. The fitting error was defined as the distance from a given vertex of the input scan to the nearest vertex in the reconstructed mesh. For each frame of the analyzed sequence, the reconstruction error for all vertices in the scan was computed (Table 1; a full listing is included in Appendix A) and, based on those data, the average and median error for each frame was also calculated. Furthermore, in order to compare the results of our method with a state-of-the-art reference method, we carried out the analogous procedure for the reconstructed meshes shared by the authors of the Skinned Multi-Person Linear (SMPL) model method. In Figures 13 and 14, we present the average and median reconstruction errors for both methods in subsequent frames of the test sequence. Figure 15 depicts reconstruction error histograms for both methods in chosen frames of both sequences. Additionally, illustrations allowing for visual assessment of the reconstruction quality in chosen frames were prepared (compare Figures 12 and 16a). Figure 16b shows the influence of the template mesh resolution on the resulting reconstruction. On the grounds of the obtained error statistics, it can be seen that the proposed method features better reconstruction quality than the reference method. The reconstruction errors of the proposed method were fundamentally lower than those of SMPL, and the error distribution was shifted towards smaller values, indicating better reconstruction over a wider part of the mesh. The average error in the "jumping jacks" sequence increased for frames 7065-7090 (Figure 13), where the motion was significantly faster and the arms were raised almost vertically (compare to Figure 1), increasing the reconstruction difficulty in terms of arm, shoulder, and head occlusions.
The median error did not change much, confirming the locally concentrated character of the increase in the reconstruction error in the mentioned frames. Figure 17 shows that our algorithm performs well with different or non-ideal skinning weights.
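The fitting error defined above is straightforward to compute; a minimal sketch (brute-force nearest-vertex search, where a k-d tree would be preferable for full-resolution scans):

```python
import numpy as np

def reconstruction_error(scan_vertices, mesh_vertices):
    """Per-vertex fitting error as defined above: for each scan vertex,
    the distance to the nearest vertex of the reconstructed mesh.
    Returns the per-vertex errors plus their mean and median."""
    # Pairwise distances (N_scan x N_mesh); brute force for clarity.
    d = np.linalg.norm(
        scan_vertices[:, None, :] - mesh_vertices[None, :, :], axis=2)
    dists = d.min(axis=1)
    return dists, dists.mean(), np.median(dists)
```

Running this per frame and aggregating the mean and median yields curves of the kind shown in Figures 13 and 14.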
The weaknesses of the proposed method include the uneven distribution of the resulting vertices in the end areas of the parametric mapping for given segments (Figure 18a). The reason for this lies in the mapping function, where the co-ordinates thicken around the poles, thus bringing more vertices into a similar area on the shape map. Moreover, in some frames, the output mesh was distorted due to data loss in the shape map (Figure 18b). Reconstruction based on such an erroneous map introduced artifacts into the resulting mesh. This problem was minimized by including a small fragment of the adjacent segments in the shape map calculation; however, this solution is limited by the ability of the parametric mapping to map complex geometry into the 2D parametric space without overlapping. Regarding pose tracking, one must admit that the use of the ICP method entails limited robustness to the input measurement frequency. For input data captured at a much lower frequency than 60 fps, combined with fast movement of the scanned person, the pose tracking algorithm may return an inadequate pose. However, in this paper, we focus on high-resolution data and reconstruction accuracy, rather than robustness to low-frequency data.

Conclusions and Future Work
In summary, in this paper, we have presented a method of analysis for human body 3D scan sequences that allows for the generation of skeletal animations, along with body shape deformation animations. The algorithm consists of three stages: pose tracking, shape mapping, and template mesh morphing. We performed quality tests on the obtained morphed meshes for two hundred frames from two sequences of the Dynamic FAUST dataset. We compared our method to the SMPL method by performing a similar test. The quality of our method was demonstrated by its almost twofold lower reconstruction error, compared to that of the state-of-the-art SMPL method. Moreover, an important advantage of our method is the possibility to use a template with any mesh topology and resolution, thanks to which one can generate an efficient preview animation with less detail. Furthermore, the skeleton used can also be customized.
Directions for future improvement include the transition areas between adjacent segments, where artifacts may appear, caused by the loss of data in the shape map calculation (only points from a given segment and a small surrounding area participate in the calculation) and by vertex squeezing after the parametric mapping of the shape map. Furthermore, with small shape map resolutions, minor faults appear on the surface of the reconstructed mesh, caused by value averaging in the shape map cells.
In the future, we would like to achieve even better reconstruction quality by replacing averaging and mipmapping in the shape maps with polynomial approximation or another mathematical model. This may result in a smoother surface of the reconstructed mesh and better handling of areas without measurement points. The second field for future development is the template mesh morphing algorithm. Our goal will be to achieve a smoother output mesh through the application of a more advanced method for fitting the template mesh to the shape map. Last but not least, we would like to pursue more tests on the robustness of the pose tracking algorithm with lower-frequency capture.
Declarations: Source code will not be available; data will not be available.