A Novel Method of Human Joint Prediction in an Occlusion Scene by Using Low-Cost Motion Capture Technique

Microsoft Kinect, a low-cost motion capture device, has huge potential in applications that require machine vision, such as human-robot interactions, home-based rehabilitation and clinical assessments. The Kinect sensor can track 25 key three-dimensional (3D) “skeleton” joints on the human body at 30 frames per second, and the skeleton data often have acceptable accuracy. However, the skeleton data obtained from the sensor sometimes exhibit a high level of jitter due to noise and estimation error. This jitter is worse when there is occlusion or a subject moves slightly out of the field of view of the sensor for a short period of time. Therefore, this paper proposed a novel approach to simultaneously handle the noise and error in the skeleton data derived from Kinect. Initially, we adopted classification processing to divide the skeleton data into noise data and erroneous data. Furthermore, we used a Kalman filter to smooth the noise data and correct erroneous data. We performed an occlusion experiment to prove the effectiveness of our algorithm. The proposed method outperforms existing techniques, such as the moving mean filter and traditional Kalman filter. The experimental results show an improvement of accuracy of at least 58.7%, 47.5% and 22.5% compared to the original Kinect data, moving mean filter and traditional Kalman filter, respectively. Our method provides a new perspective for Kinect data processing and a solid data foundation for subsequent research that utilizes Kinect.


Introduction
The development of robotics technology is driving the application of robots from industrial production to the military, medical, and service fields [1][2][3]. In industrial production lines, industrial robots can replace workers in various tasks, such as assembly, handling, pick-up and welding, which can greatly improve work efficiency [4]. In the military, robots can be operated to perform dangerous tasks, such as bomb and mine defusing [5]. However, in the service field, robots are often used to handle more complex tasks that require people's involvement [6]. Therefore, the combination of robot control technology and human-computer interaction technology can effectively improve the working ability and intelligence of civil robots [7,8].
At present, the control of civilian robots has been transformed from the traditional manual control mode, such as remote control and operation handling, to the vision-based robot control mode [9]. Vision-based robotic somatosensory control methods are finding increasing application, such as in the treatment of children with autism, robot classroom teaching, and assistive robots [10][11][12]. This type of control mode is simple to operate, more in line with the human mindset and easy to perform by even

The angle between consecutive displacement vectors is described in terms of $d_{\min}$, the minimum distance value of an acceptable displacement vector. $d_{\min}$ is used to avoid a large change in angle caused by small changes when the joint position is basically stationary. In our experiment, the distance value of a displacement vector is approximately 0.01 m when the joint position is basically stable. By contrast, when the joint position is unstable, the distance of the displacement vector increases. Therefore, the $d_{\min}$ value is set to 0.02 m in our experiment. The vibration degree reliability is defined in terms of $\theta_{\max}$ and $\theta_{\min}$, the extremities of human body movement. $\theta_{\min}$ is the lower limit of the angle change when there is jitter between frames, and $\theta_{\max}$ is the upper limit of the angle change that we consider. Based on Morasso [34], which is concerned with kinesiology, we set $\theta_{\min} = 45^\circ$ and $\theta_{\max} = 135^\circ$. However, these threshold values are determined empirically, and this limitation is expected to be overcome in our future research.
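The vibration-degree idea can be illustrated with a minimal Python sketch. The constants 0.02 m, 45° and 135° come from the text; everything else is an assumption made for illustration. In particular, treating a too-short displacement as a zero angle and mapping the angle to a reliability by a linear ramp between $\theta_{\min}$ and $\theta_{\max}$ are hypothetical choices, and the function names are invented here.

```python
import math

D_MIN = 0.02        # metres, minimum acceptable displacement (from the text)
THETA_MIN = 45.0    # degrees (from the text)
THETA_MAX = 135.0   # degrees (from the text)


def _displacement(p, q):
    """Vector from joint position p to joint position q."""
    return tuple(b - a for a, b in zip(p, q))


def _norm(v):
    return math.sqrt(sum(c * c for c in v))


def turn_angle_deg(p_prev, p_curr, p_next, d_min=D_MIN):
    """Angle in degrees between two consecutive displacement vectors.

    Assumption: if either displacement is shorter than d_min, the joint is
    treated as stationary and the angle is reported as 0.
    """
    d1 = _displacement(p_prev, p_curr)
    d2 = _displacement(p_curr, p_next)
    n1, n2 = _norm(d1), _norm(d2)
    if n1 < d_min or n2 < d_min:
        return 0.0
    cos_t = sum(a * b for a, b in zip(d1, d2)) / (n1 * n2)
    cos_t = max(-1.0, min(1.0, cos_t))  # guard against rounding error
    return math.degrees(math.acos(cos_t))


def reliability(theta_deg, theta_min=THETA_MIN, theta_max=THETA_MAX):
    """Hypothetical linear ramp: 1 at or below theta_min, 0 at or above theta_max."""
    if theta_deg <= theta_min:
        return 1.0
    if theta_deg >= theta_max:
        return 0.0
    return (theta_max - theta_deg) / (theta_max - theta_min)
```

Under these assumptions, a joint that keeps moving in a straight line gets reliability 1, while a joint that reverses direction between frames gets reliability 0.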

Reliability Threshold
The main advantages of Kinect sensors are their low price, ease of use and adaptability to the environment. However, all sensors produce measurement errors and noise when measuring physical quantities, and the Kinect sensor is no exception: it provides joint measurement data with certain measurement errors and noise [35]. These errors and noise are generated by various factors, which can be classified into two main types. The first type is missing joint position information caused by occlusion or by part of the human body leaving the measurement range. The Kinect sensor then estimates the missing joint with its estimation algorithm, which can yield erroneous data. The second type is the systematic error introduced by quantization noise and sensor stability. The first type of error may cause joint data to deviate significantly from the true value, which affects the accuracy of the joint data; the second type of error has a small amplitude but appears more frequently, which results in uneven joint data [36]. Therefore, this paper classifies the two types of joint data and performs the corresponding processing after classification. We applied the vibration degree introduced in Section 2.1 to evaluate the reliability of each joint and determined a reliability threshold to divide the two types of joint data. Joint data with reliability lower than the threshold are recognized as abnormal data and are called erroneous data, and data with reliability higher than the threshold are identified as data to be optimized and are called noise data. This paper used the common approximation method in mathematics to obtain the joint reliability threshold, as follows.
First, an occlusion marker was artificially placed in the experimental scene. Second, we obtained motion data passing through the occlusion from the Kinect and calculated the joint reliability. Third, we simultaneously collected color images of the human motion to manually mark wrong joint data frame by frame, as shown in Figure 1. Finally, we used the approximation idea to determine the joint reliability threshold. When the number of frames that the current threshold identifies as erroneous is lower than the number obtained by manual labeling, the current threshold becomes the lower threshold. When that number is higher than the number obtained by manual labeling, the current threshold becomes the upper threshold. The approximation algorithm stops when the threshold judgment and the manual labeling are within 10 percent of each other.

In the present paper, wrist joint motion data of five subjects were collected. Each subject repeated the experiment five times, and each experiment collected 150 frames of data. We manually marked the frames with a wrong joint and used the approximation algorithm to determine the reliability threshold. The results are shown in Table 1. Generally speaking, the differences in the data are not large, which may raise doubts about the rationality of the classification. However, in our opinion, the difference is a relative concept: whether it is significant depends on the specific application. For example, if the proposed method is applied to the simulation of physical exercise such as table tennis, the difference is not large, since the amplitude of the player's arm motion is quite large. In contrast, if the proposed method is applied to the simulation of rehabilitation training of patients with Parkinson's, the difference is very large, since the amplitude of the patient's arm motion is quite small. Therefore, based on the results in Table 1, we set the reliability threshold to 0.70, the average value over the 25 groups of experimental data. We defined erroneous data as joint data with reliability below 0.70 and noise data as joint data with reliability above 0.70.
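The approximation procedure described above reads like a bisection search on the threshold. The following Python sketch shows one way it could look. The function names, the initial bounds of 0 and 1, and the reading of "within 10 percent" as a relative difference with respect to the manually labeled count are assumptions, not details confirmed by the text.

```python
def count_flagged(reliabilities, threshold):
    """Number of frames whose reliability falls below the threshold."""
    return sum(1 for r in reliabilities if r < threshold)


def find_reliability_threshold(reliabilities, manual_count, tol=0.10, max_iter=50):
    """Bisection-style search for a reliability threshold.

    reliabilities: per-frame joint reliability values in [0, 1]
    manual_count:  number of frames manually labeled as erroneous
    tol:           assumed stopping rule, relative to manual_count
    """
    lower, upper = 0.0, 1.0              # assumed initial bounds
    threshold = 0.5 * (lower + upper)
    for _ in range(max_iter):
        flagged = count_flagged(reliabilities, threshold)
        if abs(flagged - manual_count) <= tol * max(manual_count, 1):
            break
        if flagged < manual_count:
            lower = threshold            # too few frames flagged: raise the threshold
        else:
            upper = threshold            # too many frames flagged: lower the threshold
        threshold = 0.5 * (lower + upper)
    return threshold
```

In the paper's setting this search would be run once per experiment, and the resulting thresholds averaged over the 25 groups.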

Algorithm to Handle Noise Data
Joint data with reliability above 0.70 are defined as noise data, and a Kalman filter is used to smooth them. In addition to obtaining each joint coordinate separately, we used Kinect to collect the sound source angle of the subject. Therefore, the state vector is taken to be the true 3D coordinates of the skeleton joint and their velocities and is written as $X = [x, y, z, \dot{x}, \dot{y}, \dot{z}]^T$. The measurement vector is taken to be the measured 3D coordinates of the skeleton joint and the sound source angle and is written as $Y = [x, y, z, \arctan(x/z)]^T$. The state transition process is modeled as a linear dynamic system, and the measurement is modeled as a nonlinear system, where the next state at time instant $k+1$ is expressed in terms of the previous state at the $k$th instant and mathematically represented as:

$$X_{k+1} = F X_k + Q_k, \qquad Y_k = h(X_k) + R_k$$

where $X_k$ and $Y_k$ are the state vector and measurement vector, respectively, at time instant $k$; $Q_k$ and $R_k$ are the process noise and measurement noise, respectively; $F$ is the state transition matrix; and $h$ is the state transformation function. Matrix $F$ is given in block form by:

$$F = \begin{bmatrix} I_3 & \Delta t \, I_3 \\ 0_3 & I_3 \end{bmatrix}$$

where $I_3$ and $0_3$ are the $3 \times 3$ identity and zero matrices and $\Delta t$ is the time between frames. For the state transformation function $h$, we adopted the extended Kalman filter to linearize $h$ and replace matrix $H$ in the filter with the Jacobian of $h$, which is evaluated at the current state estimate as:

$$H_k = \left. \frac{\partial h}{\partial X} \right|_{X = \hat{X}_k^-}$$

The Kalman filter estimates $\hat{X}_k$ of $X_k$ with the knowledge of measurement vector $Y_k$ in two steps: prediction and update. The standard Kalman filtering prediction step can be written as:

$$\hat{X}_k^- = F \hat{X}_{k-1}$$

where $P_k^-$ is the covariance matrix associated with prediction $\hat{X}_k^-$ for an unknown true state $X_k$ and is expressed as:

$$P_k^- = F P_{k-1} F^T + \mathrm{Cov}(Q_k)$$

The updated state based on the measurement is expressed as:

$$\hat{X}_k = \hat{X}_k^- + K_k \left( Y_k - h(\hat{X}_k^-) \right)$$

where $K_k$ is the Kalman gain matrix. The Kalman filter minimizes the mean square error between the estimated $\hat{X}_k$ and the true $X_k$, providing smoother coordinates.
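To make the nonlinear measurement model concrete, the sketch below implements the measurement function $h$, its Jacobian and a constant-velocity state prediction in plain Python. It is an illustration under assumptions rather than the authors' implementation: the function names are invented here, the constant-velocity form of the prediction is assumed, and the joint is assumed to lie in front of the sensor ($z > 0$) so that $\arctan(x/z)$ is well defined.

```python
import math


def h(state):
    """Measurement function: joint position plus the angle arctan(x / z).

    state = [x, y, z, vx, vy, vz]; assumes z > 0 (joint in front of the sensor).
    """
    x, y, z = state[0], state[1], state[2]
    return [x, y, z, math.atan(x / z)]


def jacobian_h(state):
    """4 x 6 Jacobian of h with respect to the state, evaluated at `state`."""
    x, z = state[0], state[2]
    r2 = x * x + z * z
    return [
        [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
        # d/dx arctan(x/z) = z / (x^2 + z^2),  d/dz arctan(x/z) = -x / (x^2 + z^2)
        [z / r2, 0.0, -x / r2, 0.0, 0.0, 0.0],
    ]


def predict_state(state, dt):
    """Assumed constant-velocity prediction: position advances by velocity * dt."""
    x, y, z, vx, vy, vz = state
    return [x + dt * vx, y + dt * vy, z + dt * vz, vx, vy, vz]
```

At the Kinect frame rate of 30 Hz, `dt` would be 1/30 s. The Jacobian rows for x, y and z are trivial; only the angle row depends on the current estimate, which is why the filter has to be re-linearized at every frame.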

Algorithm to Handle Erroneous Data
We define erroneous data as joint data with reliability below 0.70, and a Kalman filter with human model constraints is used to correct the error of the data. To illustrate the algorithm to handle erroneous data, we assume that the wrong joint is wrist joint B at the $k$th frame and that its parent joint is elbow joint $A(X_1, Y_1, Z_1)$, as shown in Figure 2.

First, the Kalman filter algorithm was used to estimate the motion trend between frames to obtain the position estimate of the erroneous joint, $P(\tilde{X}, \tilde{Y}, \tilde{Z})$. Then, we established the constraint equation. Since the length of a human skeleton segment is constant, the erroneous joint should lie on the spherical surface of radius $l_{AB}$ centered at the parent node. The constraint equation is as follows:

$$(X - X_1)^2 + (Y - Y_1)^2 + (Z - Z_1)^2 = l_{AB}^2$$

Finally, the estimated joint position $(\tilde{X}, \tilde{Y}, \tilde{Z})$ is optimized. By establishing the equation of the spatial line through $P(\tilde{X}, \tilde{Y}, \tilde{Z})$ and $A(X_1, Y_1, Z_1)$, we can acquire the optimized joint position $B(\hat{X}, \hat{Y}, \hat{Z})$, which satisfies the constraint equation and is closest to the estimated joint position $P(\tilde{X}, \tilde{Y}, \tilde{Z})$, as shown in Figure 3.
The line intersects the constraint sphere at two points. The intersection closest to the estimated joint position P is selected as the final optimized joint position.
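Geometrically, choosing the intersection nearest to P amounts to moving P along the line through A until it sits at distance $l_{AB}$ from A. A minimal Python sketch of that projection follows; the function name and the error handling for the degenerate case P = A are assumptions added for illustration.

```python
import math


def project_onto_bone_sphere(parent, estimate, bone_length):
    """Return the point on the sphere (centre `parent`, radius `bone_length`)
    that lies on the line through `parent` and `estimate` and is closest
    to `estimate`.

    parent, estimate: (X, Y, Z) tuples in metres.
    """
    diff = [e - p for p, e in zip(parent, estimate)]
    dist = math.sqrt(sum(d * d for d in diff))
    if dist == 0.0:
        # The line direction is undefined when the estimate coincides with the parent.
        raise ValueError("estimate coincides with parent joint")
    scale = bone_length / dist
    return tuple(p + scale * d for p, d in zip(parent, diff))
```

The second intersection lies on the opposite side of A, at `parent - scale * diff`, and is always farther from P, so it never needs to be computed explicitly.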

Experimental Setup
Our experiment is based on Kinect version 2.0, which provides pose estimations for 25 "skeleton" joints at 30 Hz and enables the tracking of a user's skeleton on a subset of joints [21]. A schematic of the Kinect, its sensor locations and its right-handed coordinate system is shown in Figure 4. The Kinect base sits parallel to the (x, z) plane, and the origin of the coordinate system is at the center of the infrared camera. The X-axis runs parallel to the video and audio sensor arrays, the Y-axis runs perpendicular to the Kinect base, and the Z-axis defines the illumination direction. The coordinate unit is the meter (m).

In general, the accuracy of Kinect is evaluated by comparing data collected by Kinect with data acquired by optical motion capture devices (such as VICON). However, as described in Section 2.2, all sensors produce some measurement error when measuring physical quantities. Thus, we may not be able to obtain the most accurate joint position trajectory. Since the precise trajectory is difficult to measure, this paper did not use an optical motion capture instrument to obtain human skeleton joint positions as the ground truth. Instead, we adopted the trajectory acquisition method presented in [19], which first sets a fixed point on the ground as the center of a special motion trajectory in the (X, Z) plane. The present paper selected a quarter circular trajectory. First, we determined a point as the center of the quarter circular trajectory, which implies that we fixed the Y-direction coordinate of the human joint position. Then, we took a piece of a tape measure and attached it to the fixed point. Finally, we instructed the subject to face the Kinect at all times and move along the quarter circular path while holding the other end of the tape at the skeleton wrist joint. The obtained quarter circular trajectory of the wrist joint is considered to be the ground truth.

Unlike [19], we added an obstruction to the joint trajectory to generate incorrect data. In this experiment, upper limb movement data were collected from a total of five subjects, and the experiment was repeated five times for each subject with 120 frames of experimental data. The experimental scene is shown in Figure 5.
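Because the tape fixes the distance between the wrist and the center point, the ground truth is a quarter circle in the (X, Z) plane. The Python sketch below generates such a reference trajectory. The center, radius and the uniform spacing of frames along the arc are hypothetical values chosen for illustration; the text does not state them.

```python
import math


def quarter_circle_reference(center_x, center_z, radius, n_frames):
    """Sample n_frames points on a quarter circle in the (X, Z) plane.

    Assumption: frames are spaced uniformly in angle from 0 to pi/2.
    Returns a list of (x, z) tuples in metres.
    """
    if n_frames < 2:
        raise ValueError("need at least two frames")
    points = []
    for i in range(n_frames):
        angle = (math.pi / 2.0) * i / (n_frames - 1)
        points.append((center_x + radius * math.cos(angle),
                       center_z + radius * math.sin(angle)))
    return points


# Hypothetical example: centre 2 m in front of the sensor, 0.5 m tape, 120 frames.
reference = quarter_circle_reference(0.0, 2.0, 0.5, 120)
```

Every point of `reference` lies at the same distance from the center, which is the property the tape measure enforces in the physical experiment.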
Figure 6 below shows the performance of tracking the wrist trajectory using the original Kinect, the moving mean filter algorithm, the traditional Kalman filter algorithm and our method compared to the ground truth. The trajectory shown in the black circle in Figure 6 is erroneous data caused by occlusion.

Results and Discussion
From Figure 6, it is observed that the algorithm proposed in this paper is superior to the other algorithms. The idea of our method is to separate erroneous data from noise data, perform targeted processing of the identified erroneous data, discard the original erroneous data and estimate the new joint position by combining the human constraint and filtering prediction as the current joint position. Therefore, the algorithm presented in this paper is less affected by external measurement data and maintains a similar trend to the real trajectory near the erroneous data. The ordinary filtering method applies the erroneous data to the smoothing process, which is greatly affected by the external measurement data. Therefore, the movement trend of the measurement data will remain in the vicinity of the erroneous data, and the deviation is large.

To measure accuracy, the average error between the estimated joint position and the true trajectory was calculated using the following formula:

$$e = \frac{1}{N} \sum_{i=1}^{N} \sqrt{\left( x(i) - x_0(i) \right)^2 + \left( z(i) - z_0(i) \right)^2}$$

where $N$ is the number of frames; $x(i)$ and $z(i)$ are the x and z components of the human joint position coordinate of the $i$th frame processed by different algorithms, respectively; and $x_0(i)$ and $z_0(i)$ are the x and z components of the human joint position coordinate of the $i$th frame in the true trajectory, respectively.

Table 2 shows the error of the original joint movement trajectory acquired by Kinect; the joint trajectories processed by the moving mean filter algorithm, the traditional Kalman filter algorithm and the algorithm proposed in this paper; and the true geometric trajectory. Table 2 shows that the joint data processing algorithm proposed in this paper is superior to the other algorithms in regard to the overall average error comparison. Relative to the original Kinect data, the data accuracy was improved by 21.3% after the moving mean filter, by 46.7% after the traditional Kalman filter, and by 58.7% after the algorithm proposed in this paper.
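The Python sketch below shows how an average planar error of this kind can be computed and how the improvement figures relate to one another. It assumes the error is the mean Euclidean distance in the (x, z) plane and that "accuracy improved by p%" means the average error dropped by p% relative to the original Kinect data; both readings and the function names are assumptions.

```python
import math


def mean_planar_error(estimated, truth):
    """Mean Euclidean distance in the (x, z) plane between two trajectories.

    estimated, truth: equal-length lists of (x, z) tuples, one per frame.
    """
    if len(estimated) != len(truth) or not estimated:
        raise ValueError("trajectories must be non-empty and of equal length")
    total = sum(math.hypot(x - x0, z - z0)
                for (x, z), (x0, z0) in zip(estimated, truth))
    return total / len(estimated)


def relative_improvement(gain_a, gain_b):
    """Improvement of method B over method A, given each method's fractional
    error reduction with respect to the same baseline."""
    return 1.0 - (1.0 - gain_b) / (1.0 - gain_a)


# Proposed method (58.7%) versus moving mean (21.3%) and traditional Kalman (46.7%).
vs_moving_mean = round(100 * relative_improvement(0.213, 0.587), 1)
vs_kalman = round(100 * relative_improvement(0.467, 0.587), 1)
print(vs_moving_mean, vs_kalman)  # prints 47.5 22.5
```

Under this reading, the 47.5% and 22.5% improvements quoted against the moving mean filter and the traditional Kalman filter follow directly from the three figures reported relative to the original Kinect data.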
As for computational efficiency, although the complexity of our method is higher than that of other algorithms such as the moving mean filter, the difference in running time is not pronounced. Since the moving mean filter algorithm is simple, it takes roughly 0.975 s to process one experiment sample, whereas the traditional Kalman filter takes roughly 1.248 s to process one sample. Our proposed method adds a classification algorithm before the extended Kalman filter, so it takes roughly 1.592 s to process one sample. All the algorithms were run on the MATLAB 2016b platform with a 3.1 GHz Intel Core i5 processor.

Conclusions and Future Work
Few studies have focused on improving the inherent skeleton tracking accuracy of Kinect. Existing studies mostly intended to show that applications based on Kinect could be significantly improved by applying optimal techniques. In this process, the researchers ignored the generalization of the methods used to improve the accuracy of Kinect. As a result, a method may be suitable for posture assessment but not for home rehabilitation. We proposed a novel algorithm to improve the inherent accuracy of Kinect skeletal joint coordinates. Our method introduced a skeletal joint data classification algorithm to separate noise data from erroneous data. Furthermore, we proposed two different algorithms to smooth noise data and correct erroneous data in order to accurately track the dynamic trajectory of the joint center location over time. Our method can potentially broaden the way Kinect data are processed in Kinect-based applications because we separate Kinect data processing from the applications. Thus, our method is suitable for most applications related to Kinect.
The present paper evaluated the algorithm in an occlusion experiment. The results show that the algorithm substantially smooths the skeleton joint position estimates of Kinect; more importantly, the experiments demonstrate that the tracking accuracy is significantly increased. In this study, we compared the results of our method with the original Kinect data, the moving mean filter algorithm and the traditional Kalman filter algorithm and obtained accuracy improvements of 58.7%, 47.5% and 22.5%, respectively. As a result, using the skeletal joint data classification algorithm and two different data-processing algorithms to smooth noise data and correct erroneous data reduces the average estimation error for tracking human dynamic skeleton joints.
However, there are limitations to this study. Our proposed algorithm for Kinect skeletal joint data classification only considers the vibration between frames, whereas the coordinates of the joint points within the same frame are also constrained by one another. For future work, we plan to enrich the skeletal joint data classification algorithm by incorporating this constraint relationship. Furthermore, the empirical setting of values such as the reliability threshold is a shortcoming. We should consult experts in physiology, rehabilitation or even neuroscience to determine the reference threshold, and we should have conducted some preliminary experiments to verify the rationality of the data difference. In addition, we only considered the tracking of the (x, z) coordinates of the wrist joint of a subject following a known quarter circular trajectory. The results must be verified on more complex motions. However, the true trajectories of such motions are difficult to measure and may require more sophisticated and expensive equipment; this verification is left to possible future research.

Data Availability:
The data used to support the findings of this study are included within this paper. They are also available from the corresponding author upon request.