Fusion of Multiple Lidars and Inertial Sensors for the Real-Time Pose Tracking of Human Motion

Today, enhancements in sensing technology enable the use of multiple sensors to track human motion/activity precisely. Tracking human motion has various applications, such as fitness training, healthcare, rehabilitation, human-computer interaction, virtual reality, and activity recognition. Therefore, the fusion of multiple sensors creates new opportunities to develop new systems and improve existing ones. This paper proposes a pose-tracking system that fuses multiple three-dimensional (3D) light detection and ranging (lidar) and inertial measurement unit (IMU) sensors. The initial step estimates the human skeletal parameters proportional to the target user's height by extracting the point cloud from the lidars. Next, IMUs are used to capture the orientation of each skeleton segment and estimate the respective joint positions. In the final stage, the displacement drift in the position is corrected by fusing the data from both sensors in real time. The installation setup is relatively effortless, flexible in terms of sensor locations, and delivers results comparable to state-of-the-art pose-tracking systems. We evaluated the proposed system regarding its accuracy in the user's height estimation, full-body joint position estimation, and reconstruction of the 3D avatar. We used a publicly available dataset for the experimental evaluation wherever possible. The results reveal that the accuracy of the height and position estimation is well within an acceptable range of ±3-5 cm. The reconstruction of the motion based on the publicly available dataset and our data is precise and realistic.


Introduction
Understanding human motion is key for intelligent systems to coexist and interact with humans. Motion tracking is a technique to track and localize the three-dimensional (3D) orientation of human body joints [1]. Human motion tracking is widely used for activity analysis in many areas and remains an active research topic due to advancements in micro-electro-mechanical system (MEMS) sensors combined with wireless communication technologies [2]. Human motion tracking and recognition is a challenging problem because the human body is very flexible and has 244 kinematic degrees of freedom [3].
In recent years, scientific research has placed significant emphasis on pose tracking, motion capture, activity recognition, and the reconstruction of human motion. Recreating full-body human motion accurately on a 3D stick/avatar model is a challenging task. Several techniques have been proposed to capture data that can be reconstructed and recognized accurately. Pose tracking is classified into two main categories: marker-based and marker-less systems. Marker-based pose tracking is a traditional method in which the angle between markers placed near the joints provides the orientation and positional details of the person. A marker-based system is confined to the geographical range of the tracking device's field of view (FoV); therefore, this method is only applicable in an indoor environment. Additionally, besides the prolonged setup time for placing the markers on the body (palpation error [4]), the markers may move due to skin stretching and suit displacement [5], contributing to errors in the readings.
Marker-less systems are emerging as more feasible and are becoming increasingly pervasive in applications that span health, arts, digital entertainment, and sports [6]. Marker-less motion capture systems (MMSs), such as depth-sensing sensors [7], are widely used for human motion tracking and reconstruction. These kinds of MMSs have disadvantages similar to those of marker-based systems, such as limited FoV and depth. Depth-sensing sensors are also limited by the size of the tracking volume. Due to this limitation, single-sensor approaches have mostly been constrained to tracking body posture, physical therapy, rehabilitation [8], physical fitness in elderly individuals [9], ergonomics [10], anthropometry [11], and so on. Some researchers [12-15] have proposed setups with multiple depth-sensing sensors to cover a larger area. For example, Müller et al. [16] used six depth cameras to achieve a tracking distance of 9 m. However, a need remains for precise position tracking that is easy to set up, covers an extended range, and is flexible to various environmental conditions.
The proposed system estimates human body joint orientations using inertial measurement units (IMUs) and uses light detection and ranging (lidar) data for 3D position tracking. The fused sensor data are reconstructed on a virtual 3D avatar, as depicted in Figure 1. The current proposed work is validated in an indoor environment. The paper is structured as follows: Section 2 discusses related work on various advancements in MMSs. Section 3 details the proposed system for position and orientation estimation. Next, the system implementation and experimental results are presented in Sections 4 and 5. Finally, the work is concluded with a discussion of future work.

Related Work
Several methods exist to capture and recognize human motion depending on the data capture equipment (depth cameras, IMUs, and lidars). Depth cameras are widespread primarily due to their ease of use and the availability of open-source tools and communities [17] (e.g., the Microsoft Kinect depth camera). Depth cameras convert depth data into RGBZ data, which helps detect human joints [18] and extract rotational information from the skeletal structure. However, these methods suffer from occlusion [19]. Multiple depth sensors strategically positioned in the environment [20] can reduce the body occlusion issue but do not fully compensate for it. In [21], the accuracy of the Kinect was evaluated in terms of detecting the human body center of mass using the lengths of body links in the Kinect skeleton model.
The Kinect is inaccurate in recognizing joint centers when measuring short body links, such as the foot [22], and [23] assessed the accuracy of the Kinect in lower-extremity gait analysis. In these studies, the accuracy of the Kinect was assessed using a commercial MMS as the gold standard. They reported considerable errors in tracking ankle joint angles using both versions of the Kinect, which indicates some inherent challenges of this sensor. Some recent studies [24-26] used machine learning-based pose estimation methods to track human pose. These methods use two-dimensional RGB cameras to recognize human motion.
IMU sensors offer an accurate orientation of a rigid body in the form of quaternions, Euler angles, or axis angles. Quaternions are a gimbal-lock-free representation, unlike Euler angles [27]; therefore, most MMSs use IMUs that capture data in the form of quaternions. A human body comprises various interconnected bones and joints, so it is imperative to understand and set up a hierarchical and kinematic model of the human body before attaching IMUs to a person. Thus, most motion databases include hierarchical information along with rotational data [28], which avoids the body occlusion problem.
However, IMU-based pose tracking is not mature enough to detect accurate positional data for individual joints [29] and is mainly used for motion analysis in rehabilitation and physiotherapy [30,31]. To counter this, fusing IMU data with depth cameras has been attempted [32-34]. In [32], the fusion of sensors is adopted to validate the acquired movement data in two steps (generative and discriminative): in the generative process, the sensor provides human pose data, whereas the discriminative process validates the data. In other research [33], the purpose of sensor fusion is for the sensors to complement each other for accurate results. In that lidar-IMU fusion experiment, the IMU sensor provides orientation information, whereas the lidar is used to filter the data. A similar approach was proposed by [34], where the IMU sensor detects human rotation and a laser sensor detects the human body position to correct the drift over time. That approach presents only the trajectory of the human motion in an outdoor environment; the full skeleton pose is not described.
The current work focuses on full-body tracking with an easy multi-sensor setup (lidars and IMUs) that enables the estimation of joint positions and bone segment orientations and their reconstruction on a 3D avatar in real time.

Method Overview
In this section, we detail the lidar- and IMU-based sensor fusion system for position and orientation estimation and for reconstructing the motion on the 3D avatar. In the proposed approach, the process of human body tracking includes the following steps, as depicted in Figure 2: (1) initial data retrieval (reference and full-body clouds), (2) base pose detection, (3) skeleton construction, (4) real-time pose tracking, and (5) reconstruction using the avatar. In the proposed method, two 3D lidar sensors were used to track the position of the human body, and IMU sensors were used to estimate the orientation and position of each joint during human body activity in real time. Figure 3 illustrates the complete setup of multiple lidars and IMUs (the laser ray depicted in the figure is only in the y direction of the lidar, i.e., the vertical FoV). The 3D lidar-based human body tracking process includes the segmentation of raw data and the classification of objects of interest. The lidar sensor used in this system has a range of up to 100 m and an accuracy of ±3 cm. For human body tracking, the maximum range is 14 to 17 m [35], which is well within the 100 m range. Therefore, in the current work, two lidars (L1 and L2) were used within an operating range of an 8 × 4 × 3 m indoor environment (Figure 3).

Initial Data and Pose Extraction
To track the user in real time, two separate sets of point cloud data (P(i){x, y, z}, where i = 0 to n indexes the points) were initially acquired in the calibration step. One set, acquired with the user in the FoV (P_f(i){x, y, z}; the full-body cloud), is primarily used to compute the actual height of the user and construct the skeleton structure. The second set, acquired without the user in the FoV (P_r(i){x, y, z}; the reference cloud), is used to filter the user from the background data. The reference cloud holds information about the environment in which human motion is detected. To compute the actual height and construct a skeleton structure, the user must stand at an optimal distance so that both lidars cover the full body (Figure 3) within their collective FoVs.
Thus, the acquired full-body cloud was compared against the reference cloud to extract the position of the user point cloud in the real environment (x, y, and z axes). An octree-based change detection algorithm [36] was adopted to filter out the user point cloud (P_t(i){x, y, z}) from the full-body cloud, as depicted in the second step of Figure 2. The main aim of this step is to extract the point cloud corresponding to the user; the accuracy of this extraction directly affects the subsequent processing and the correctness of the result.
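As a minimal sketch of this user-extraction step, the code below uses the Point Cloud Library's octree-based change detector to keep only the points of the full-body cloud that do not appear in the reference cloud; the 5 cm voxel resolution is an illustrative value and not necessarily the one used in the original implementation.

```cpp
#include <vector>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <pcl/octree/octree.h>

// Extract the points present in the full-body cloud but absent from the
// reference (background) cloud, following the octree change detection idea [36].
pcl::PointCloud<pcl::PointXYZ>::Ptr
extractUserCloud(const pcl::PointCloud<pcl::PointXYZ>::Ptr& reference,
                 const pcl::PointCloud<pcl::PointXYZ>::Ptr& fullBody)
{
    const float resolution = 0.05f;  // octree voxel size in metres (assumed value)
    pcl::octree::OctreePointCloudChangeDetector<pcl::PointXYZ> octree(resolution);

    // First buffer: reference cloud without the user.
    octree.setInputCloud(reference);
    octree.addPointsFromInputCloud();
    octree.switchBuffers();

    // Second buffer: cloud containing the user.
    octree.setInputCloud(fullBody);
    octree.addPointsFromInputCloud();

    // Indices of points falling into voxels that are new with respect to the reference.
    std::vector<int> newPointIdx;
    octree.getPointIndicesFromNewVoxels(newPointIdx);

    pcl::PointCloud<pcl::PointXYZ>::Ptr userCloud(new pcl::PointCloud<pcl::PointXYZ>);
    for (int idx : newPointIdx)
        userCloud->push_back(fullBody->points[idx]);
    return userCloud;
}
```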
The ground point g{x, y, z} is the actual floor location from L1, taken as the maximum point along the y-axis, as illustrated in Figure 3. A slight inclination occurs in L1 due to its mounting; therefore, the resulting point cloud has an inherent slope (m). Considering the actual floor at g, the slope m is given by Equation (1):

m = (max_y - g_y) / (max_x - g_x),    (1)

where g_x and g_y are the x and y components of g, and max_x and max_y are the maximum x and y components of P_t.
The user may be located at any point on this slanted floor. Therefore, the slope of the floor due to the inclination of L1 is factored into the computation of the actual height (A_h) of the user, as indicated in Equation (2):

A_h = (g_y + m (c_x - g_x)) - min_y,    (2)

where min_y is the minimum y component of P_t, and c_x is the x component of the centroid of P_t.
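The following is a small sketch of this slope-corrected height computation, under the reading of Equations (1) and (2) given above (the exact form of the original equations is an assumption, so treat the formula as illustrative):

```cpp
#include <algorithm>
#include <vector>

struct Point3 { double x, y, z; };

// Slope-corrected user height from the extracted user cloud P_t, assuming the
// floor line passes through the ground point g with slope m (Equations (1)-(2)).
double estimateHeight(const std::vector<Point3>& Pt, const Point3& g)
{
    double maxX = Pt.front().x, maxY = Pt.front().y, minY = Pt.front().y;
    double sumX = 0.0;
    for (const Point3& p : Pt) {
        maxX = std::max(maxX, p.x);
        maxY = std::max(maxY, p.y);
        minY = std::min(minY, p.y);
        sumX += p.x;
    }
    const double cX = sumX / Pt.size();              // x component of the centroid
    const double m  = (maxY - g.y) / (maxX - g.x);   // Equation (1): floor slope
    const double floorY = g.y + m * (cX - g.x);      // floor height below the user
    return floorY - minY;                            // Equation (2): actual height A_h
}
```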

Identifying the Human Skeletal Structure
The maximum (max_y) and minimum (min_y) values in P_t provide the actual height of the user with an accuracy of ±3-5 cm. The actual height (see Equation (2)) is the baseline for calculating the proportion of each body part to construct the skeleton of the user. An average person is generally 7.5 times the height of his or her head [37]. To construct each bone segment in the skeleton, we considered the head height (H_h) to be the standard measurement proportion (i.e., H_h = A_h/7.5), which is used to parameterize the length of each segment [38]. Figure 4 depicts the constructed skeleton from the point cloud with 15 segments (b_1 to b_15) and 16 connecting joints. At this step, we know the relative joint positions of the human skeleton, which aids in the estimation of real-time pose tracking, as illustrated in the sketch below.
Figure 4. Generated human skeleton using the three-dimensional lidar point cloud.
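The sketch below illustrates this proportional parameterization by deriving a few segment lengths from the head height. The per-segment multipliers are illustrative placeholders (classical figure proportions), since the exact values used in the paper (from [38]) are not reproduced here.

```cpp
// Segment lengths parameterized by head height H_h = A_h / 7.5.
// The multipliers are placeholder values for illustration only.
struct SkeletonProportions {
    double headHeight;
    explicit SkeletonProportions(double actualHeight) : headHeight(actualHeight / 7.5) {}

    double torso()    const { return 2.0  * headHeight; }  // placeholder multiplier
    double upperArm() const { return 1.5  * headHeight; }  // placeholder multiplier
    double forearm()  const { return 1.25 * headHeight; }  // placeholder multiplier
    double thigh()    const { return 1.75 * headHeight; }  // placeholder multiplier
    double shin()     const { return 1.75 * headHeight; }  // placeholder multiplier
};

// Usage example: SkeletonProportions prop(1.72);  // user height A_h in metres
```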

Real-Time Pose Tracking
We captured the initial position and generated the human skeleton in the previous subsections. In the same manner, the position of the user was acquired in real time as the person moved from the initial position. In the current work, the real-time full-body position and orientation were estimated using 10 IMU sensors attached to the human body (bone segments, Figure 3). Concurrently, the pose from the lidar data was estimated and fused with the IMU sensor data because the pose estimate from the IMU sensors is affected by displacement drift. In the following subsections, we discuss the position and orientation estimation in more detail.

Position and Orientation from Inertial Sensors
The IMU sensors were used to estimate body segment position and orientation changes in real time (segments are connected by joints), and the changes were updated on a biomechanical model of avatar segments. The IMU sensors used in our work output orientation data in the form of quaternions (q = (q_w, q_x, q_y, q_z)). Full-body motion was captured over time for 10 joint-bone segments. Moreover, the orientations of 5 segments (red in Figure 4) rely on the torso (i.e., b_3, b_4, and b_7) and pelvis (i.e., b_10 and b_13) bone joint sensors. All segments are hierarchically connected in the avatar, as presented in Figure 4. The joint position and orientation estimation process is illustrated in Figure 5.
The IMU sensors provide the orientation with respect to a global coordinate frame (x-axis pointing north, z-axis against gravity, and y-axis pointing west). For each bone segment, all kinematic parameters were expressed in a common global coordinate frame, which is the right-handed Cartesian coordinate system (Figure 6). The sensors were calibrated and aligned to the global frame to compute the rotation of the individual joint-bone segment, as given in Equation (3):

Aq_i = q_0^(-1) ⊗ q_i,    (3)

where q_i is the continuous stream of quaternion data from the IMU sensor, q_0^(-1) is the inverse of the first quaternion, ⊗ denotes quaternion multiplication, and Aq_i denotes the aligned quaternion data. After the alignment of the sensors to the global frame, the joint position and segment rotation were computed. We considered each joint position to be a unit vector in the direction parallel to the respective bone axis in the attention pose (Figure 6). For instance, if we consider the foot joint axis parallel to the z-axis, then a unit vector for the foot joint can be determined as v̂ = (0, 0, 1), which is represented as q_v = (0, 0, 0, 1) in quaternion form. The rotated joint vector is obtained by quaternion multiplication, as given in Equation (4):

R_v = Aq_i ⊗ q_v ⊗ Aq_i^(-1),    (4)

where R_v is the rotated joint vector in quaternion form. Next, we extracted the rotated vector from R_v = (q_w, q_x, q_y, q_z) (i.e., the joint vector Ĵ_v = (q_x, q_y, q_z)) and updated the respective joint position in the skeleton by considering the neighboring joint and scaling by the respective segment length, as given in Equation (5):

P_cJoint = P_nJoint + S_length · Ĵ_v,    (5)

where P_cJoint is the updated position of the current joint, P_nJoint is the position of the joint neighboring P_cJoint, and S_length is the length of the respective bone segment. Figure 6 illustrates an overview of the positional relation of the bone joints with the adjacent joint, denoted by a directional vector (describing the unit vector at that joint with its direction) with the length of the individual segments.
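A minimal sketch of the joint-position update described by Equations (3)-(5) is given below, assuming a simple quaternion struct; the function names (alignToGlobal, updateJoint) are illustrative and not taken from the original implementation.

```cpp
#include <array>

struct Quat { double w, x, y, z; };
using Vec3 = std::array<double, 3>;

// Hamilton product of two quaternions.
Quat mul(const Quat& a, const Quat& b) {
    return { a.w*b.w - a.x*b.x - a.y*b.y - a.z*b.z,
             a.w*b.x + a.x*b.w + a.y*b.z - a.z*b.y,
             a.w*b.y - a.x*b.z + a.y*b.w + a.z*b.x,
             a.w*b.z + a.x*b.y - a.y*b.x + a.z*b.w };
}

Quat conjugate(const Quat& q) { return { q.w, -q.x, -q.y, -q.z }; }

// Equation (3): align the current IMU sample against the first (calibration) sample.
// For unit quaternions the inverse equals the conjugate.
Quat alignToGlobal(const Quat& q0, const Quat& qi) {
    return mul(conjugate(q0), qi);                    // q0^(-1) * qi
}

// Equations (4)-(5): rotate the bone's unit vector and offset from the neighbor joint.
Vec3 updateJoint(const Quat& Aqi, const Vec3& boneAxis,
                 const Vec3& neighborJoint, double segmentLength) {
    Quat qv{0.0, boneAxis[0], boneAxis[1], boneAxis[2]};   // pure quaternion for the axis
    Quat Rv = mul(mul(Aqi, qv), conjugate(Aqi));           // Aq_i * q_v * Aq_i^(-1)
    return { neighborJoint[0] + segmentLength * Rv.x,
             neighborJoint[1] + segmentLength * Rv.y,
             neighborJoint[2] + segmentLength * Rv.z };
}
```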

Position Tracking from Lidars
With the base position efficiently extracted in the initial stage (Section 3.1), locating the real-time position using lidar data involves two simple steps. The first step is extracting the full-body cloud (P_tr) of the user in real time (similar to the procedure in Section 3.1). The second step is detecting all bone segments by their geometry using the particle filter [39] and tracking only the legs to locate the real-time foot positions. The detected foot positions are used to correct the displacement drift in the positions computed from the bone orientations.
In the P_t data (Section 3.1), the point clouds corresponding to the lower legs (P_t^leg) are clustered with the aid of the joint positions and bone segment lengths. Our approach employs a technique similar to that proposed by [34] to detect the leg position in the point clouds. In the current work, we used a particle filter [39] to track the lower leg-bone point cloud. The particle filter tracks the observation cloud (P_t^leg) within the measured point cloud (P_tr). Thus, the foot positions computed from the output of the particle filter are used to correct the displacement drift within a threshold distance (δ = 10 cm).
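Since the exact update rule is not spelled out above, the following is one plausible sketch of the threshold-based correction, assuming the lidar-derived foot position simply re-anchors the IMU-derived estimate once the two diverge by more than δ; this is an assumption for illustration, not the paper's exact rule.

```cpp
#include <array>
#include <cmath>

using Vec3 = std::array<double, 3>;

// Threshold-based drift correction of the foot position (sketch).
// delta corresponds to the 10 cm threshold mentioned in the text.
Vec3 correctDrift(const Vec3& imuFoot, const Vec3& lidarFoot, double delta = 0.10)
{
    const double dx = lidarFoot[0] - imuFoot[0];
    const double dy = lidarFoot[1] - imuFoot[1];
    const double dz = lidarFoot[2] - imuFoot[2];
    const double dist = std::sqrt(dx*dx + dy*dy + dz*dz);

    if (dist <= delta)
        return imuFoot;     // within tolerance: keep the orientation-based estimate
    return lidarFoot;       // beyond tolerance: re-anchor to the lidar-tracked foot
}
```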

Implementation Details
Our proposed system consists of the following sensory setup and calibration steps.
(1) Velodyne VLP-16 lidar: A Velodyne lidar is used to estimate the initial position and height of the subject and to track the real-time position. It offers 16-channel lidar scanning with a 360° horizontal and ±15° vertical FoV, as illustrated in Figure 3. The sensor has low power consumption, scans the environment in three dimensions at a frequency of 10 to 20 Hz, and generates 600,000 points per second with a maximum range of 100 m and a claimed accuracy of ±3 cm. Due to the frequency difference between the lidars and the IMUs, we adopted a linear interpolation of the lidar positional data to match the IMU body orientation data (a minimal interpolation sketch follows the sensor descriptions below).
To obtain a dense point cloud, two lidars are perpendicularly positioned, as presented in Figure 3. The lidar at the top (L1) is used to track the person from the top view (which primarily aids in tracking the position when the person is posed parallel to the ground, i.e., lying down), and is also used to estimate the height of the person (with an error within ±3 to 5 cm) and the ground position (floor). The other lidar, located on the front side of the user, is used to create dense point data. To integrate the data from multiple lidars into a single frame, we followed the procedure in the work by [40]. The normal distributions transform (NDT) algorithm [39] is used for point cloud registration.
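A minimal sketch of this registration step using PCL's NDT implementation is shown below; the parameter values (resolution, step size, epsilon, iterations) are illustrative defaults rather than the settings used in the original system.

```cpp
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <pcl/registration/ndt.h>

// Register the cloud of the second lidar (L2) against the first (L1) using the
// normal distributions transform so that both clouds share a single frame.
Eigen::Matrix4f registerClouds(const pcl::PointCloud<pcl::PointXYZ>::Ptr& targetL1,
                               const pcl::PointCloud<pcl::PointXYZ>::Ptr& sourceL2)
{
    pcl::NormalDistributionsTransform<pcl::PointXYZ, pcl::PointXYZ> ndt;
    ndt.setTransformationEpsilon(0.01);   // convergence criterion (assumed value)
    ndt.setStepSize(0.1);                 // maximum line-search step length (assumed value)
    ndt.setResolution(1.0);               // NDT grid voxel size in metres (assumed value)
    ndt.setMaximumIterations(35);

    ndt.setInputSource(sourceL2);
    ndt.setInputTarget(targetL1);

    pcl::PointCloud<pcl::PointXYZ> aligned;
    ndt.align(aligned);                   // optionally pass an initial guess transform
    return ndt.getFinalTransformation();  // rigid transform mapping L2 into the L1 frame
}
```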
(2) Xsens IMU: The MTw motion tracking system is a miniature IMU [41] (Figure 5). It is a small, lightweight, wireless inertial sensor-based 3D motion tracker manufactured using MEMS technology. This sensor returns 3D orientation, acceleration, angular velocity, static pressure, and earth-magnetic field intensity at a frequency of 60 Hz. Only the 3D orientation is considered in the proposed work. The real-time full-body position and orientation are estimated using 10 IMU sensors attached to the human body segments, except b_3, b_4, b_7, b_10, and b_13, shown in the skeleton structure in maroon. Before tracking and capturing the data, the sensors must be calibrated to avoid an incorrect estimation of the base position and to reduce sensor drift. These issues lead to misalignment of the bone segments, which results in a mismatch between the avatar and the user in real time. The calibration routine has one step with an attention pose.
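As mentioned under item (1), the lidar positions are resampled to the IMU timestamps; below is a minimal linear-interpolation sketch of this resampling (function and variable names are illustrative).

```cpp
#include <array>

using Vec3 = std::array<double, 3>;

// Linearly interpolate a lidar-derived position to an IMU timestamp tImu that
// falls between two consecutive lidar samples (t0, p0) and (t1, p1).
Vec3 interpolatePosition(double tImu,
                         double t0, const Vec3& p0,
                         double t1, const Vec3& p1)
{
    const double a = (tImu - t0) / (t1 - t0);   // blend factor in [0, 1]
    return { p0[0] + a * (p1[0] - p0[0]),
             p0[1] + a * (p1[1] - p0[1]),
             p0[2] + a * (p1[2] - p0[2]) };
}
```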

Experiments
We experimentally evaluated the proposed fusion system by considering various poses (involving changes in full-body joint positions and segment orientations) and performing a statistical analysis of the acquired real-time data. Multiple key poses were considered that affect multiple joint segments in both position and orientation. The first objective is to investigate the accuracy of the proposed fusion system regarding position estimation against the ground truth. The second objective is to compare the proposed system against a publicly available 3D pose estimation dataset, the TotalCapture dataset [42].

Height Accuracy
The accuracy of the joint positions depends highly on the lengths of the bone segments, which are derived from the user's height computed from the lidar data, as described in Section 3.1. The estimated heights of seven different users are compared against their known actual heights (ground truth). Considering the inherent error in the lidar and the error due to mounting, an error of ±3 cm in the calculated height is shown in Table 1. As H_h is the standard measurement proportion for skeleton construction, the error in the length of individual segments trickles down to less than 1 cm. Therefore, this difference in height is insignificant for the construction of the skeleton and has a minimal effect on the position estimation of the joints.

Table 1. Accuracy of user height estimated from the lidar data against the ground truth (in cm).

Orientation Accuracy
To validate the orientation accuracy of the motion reconstruction on the avatar, we compare against ground-truth angle data. To formulate the ground-truth angles, we selected physically measurable angles between two bone segments, measured with a goniometer [43], as highlighted in Figure 8. A few common poses were chosen for different bone segments and manually noted as ground-truth angle data. Simultaneously, IMU sensors were attached, and orientation data (quaternions) were recorded from the respective bone segments. The angle between two bone segments was estimated (i.e., the inverse cosine of the dot product of the two quaternions) and compared against the ground-truth values, as shown in Figure 8. The estimated mean error of the measured angle is within ±5° for the proposed system.
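The following is a small sketch of this angle computation, following the inverse-cosine-of-the-dot-product formulation described above; the quaternions are assumed to be unit length, and the clamping is added only to guard against numerical round-off.

```cpp
#include <algorithm>
#include <cmath>

struct Quat { double w, x, y, z; };

// Angle between two bone segments from their (unit) orientation quaternions,
// computed as the inverse cosine of their dot product and reported in degrees.
double segmentAngleDeg(const Quat& q1, const Quat& q2)
{
    double dot = q1.w*q2.w + q1.x*q2.x + q1.y*q2.y + q1.z*q2.z;
    dot = std::max(-1.0, std::min(1.0, dot));  // clamp to the valid acos domain
    return std::acos(dot) * 180.0 / M_PI;
}
```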

Full-Body Position Accuracy
To validate the position accuracy, 14 different poses that affect all 16 joint positions were considered, as indicated in Figure 9. The point cloud data captured from the lidars contain positions corresponding to the different joints. The data were manually annotated for the 14 poses using the CloudCompare tool (3D point picking list feature) [44], as depicted in Figure 10. The labeled data are used as the ground truth for measuring the accuracy of the estimated joint positions. Figure 9 presents a visual comparison of the reconstructed poses against the ground truth, showing a reasonably realistic reconstruction. Figure 11 covers the 14 different poses captured at 60 fps, with a total of 4480 frames. The standard error in the position for individual joints and the error in the position with respect to the ground truth were both well within 5 cm, as depicted in Figure 11a. Figure 11b displays the average positional error over all joints over time.
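A brief sketch of how such per-joint positional error statistics can be computed against the annotated ground truth is given below; the function name and structure are illustrative, not taken from the original evaluation scripts.

```cpp
#include <array>
#include <cmath>
#include <vector>

struct ErrorStats { double mean, stdErr; };

// Mean Euclidean error and standard error of a joint's estimated positions
// with respect to the manually annotated ground-truth positions.
ErrorStats jointPositionError(const std::vector<std::array<double, 3>>& estimated,
                              const std::vector<std::array<double, 3>>& groundTruth)
{
    std::vector<double> errors;
    for (std::size_t i = 0; i < estimated.size(); ++i) {
        const double dx = estimated[i][0] - groundTruth[i][0];
        const double dy = estimated[i][1] - groundTruth[i][1];
        const double dz = estimated[i][2] - groundTruth[i][2];
        errors.push_back(std::sqrt(dx*dx + dy*dy + dz*dz));
    }

    double mean = 0.0;
    for (double e : errors) mean += e;
    mean /= errors.size();

    double var = 0.0;
    for (double e : errors) var += (e - mean) * (e - mean);
    var /= errors.size();

    return { mean, std::sqrt(var) / std::sqrt(static_cast<double>(errors.size())) };
}
```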

Position Estimation Using the TotalCapture Dataset
The TotalCapture dataset [42] contains orientation information acquired from multiple Xsens IMU sensors attached to bone segments. Joint position data were acquired from multiple-viewpoint video. Various motions, such as walking, acting, freestyle, and range of motion (ROM), are available as part of the dataset. For the current study, we considered multiple movements within the orientation data that affect all joints. The positions of the joints were estimated using the proposed method and compared against the position data in the TotalCapture dataset. Table 2 lists six different motion types, the respective observed joints, the standard deviation, and the mean difference from the ground truth. The results reveal that the estimated positions have an average standard deviation of 0.24 cm and an average mean difference of 0.86 cm.

Accuracy of Reconstruction on the Avatar
The system estimates bone segment orientations in 3D and full-body joint positions using IMU and lidar sensor data fusion. This enables users to track their pose while performing motion in real time. In this section, we validate the accuracy of our 3D model for motion reconstruction. The 3D avatar model was developed using a visualization toolkit (in C++) [45]. Section 3.3.1 details how the model is updated. The TotalCapture dataset contains various ROM sequences, which were applied directly to the 3D avatar to validate the reconstruction accuracy. Figure 13a presents a few selected reconstructed poses from the TotalCapture dataset against their ground-truth images. Figure 13b shows multiple poses reconstructed on the same model using our data against the ground-truth images. The results reveal that the reconstruction is reasonably accurate.

Discussion
The results in the previous section demonstrate that the pose tracking of human motion, with the estimation of orientation and position, is reasonably accurate and within the range of ±3-5 cm. The position estimation of the pelvis using the lower-body orientation and the estimation of the full-body joint positions is an effective approach. The reconstruction of the motion on the 3D avatar is realistic and delivers results comparable to state-of-the-art pose-tracking systems, such as TotalCapture [42]. The foot position, estimated from the lower-body orientation, is continuously corrected for displacement drift. Furthermore, the approach uses fewer sensors with a relatively easier installation setup and has minimal environmental dependencies. We use a simple calibration where the user starts in an attention position. The proposed system can be adopted for real-time pose-tracking applications, such as rehabilitation, athletic performance analysis, surveillance, human-machine interfaces, human activity tracking and recognition, kinesiology, physical fitness training and therapy, human-computer interaction, virtual reality, and so on.
Nevertheless, during the bottom-up update, while estimating the pelvis position from the fixed foot, the right and left legs were translated to the ground before computing the pelvis position. As the foot positions are fixed at every step on the ground and the right and left legs are independently considered, human activities involving jumping, running, and locomotion, such as hand walking, cannot be reconstructed realistically on the 3D avatar. During such activities, the avatar suffers occlusion with the ground. To counter such issues, multiple kinematics and rigid body constraints can be applied to the model, and acceleration from the IMU sensors could be used to estimate the position of the joints to increase the efficiency and accuracy of the system.

Conclusions
The results of our experimental evaluation demonstrated that the overall lidar and IMU fusion-based system estimates joint positions and bone segment orientations with good accuracy. The experimental setup of the proposed system is relatively easy to install and flexible with respect to sensor locations.
The proposed method provides an efficient and accurate human pose-tracking system by fusing lidar and IMU sensors. The system estimates body joint orientations and positions in 3D using IMU sensors and uses lidars to compensate for the displacement drift. The lidar data are also instrumental during the initial calibration and user height estimation for skeleton construction.
The TotalCapture dataset was used wherever possible to validate the proposed approach and the accuracy of the reconstruction on the 3D model. Multiple experiments were conducted to validate the proposed system against the ground truth. All results indicated that the proposed system could be used in the real-time applications stated above. Future work involves the consideration of complex human activities, such as running, jumping, hand walking, and dancing, that involve more spatio-temporal changes in the orientation and position of the bone segments and joints.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:
lidar  Light Detection and Ranging
IMU    Inertial Measurement Unit
FoV    Field of View
MMS    Marker-less Motion Capture System
ROM    Range of Motion