Comparative Study of 3D Reconstruction Methods from 2D Sequential Images in Sports

3D reconstruction is a fundamental problem in computer vision. In recent research it has been successfully addressed by motion capture systems that use body-worn markers and multiple cameras. Recovering a full-body 3D human pose from a single camera, however, remains challenging because of noisy backgrounds, variation in human appearance and self-occlusion, among other factors. This thesis investigated methods of 3D reconstruction from monocular image sequences in dynamic activities such as sports. Six recent methods were selected because they focus on fully automated estimation of 3D human pose from 2D joint locations and propose algorithms that address this ill-posed problem. The evaluation was divided into two stages. In the first stage, a theoretical and comparative study of each method identified the technique used, the problems addressed and the results achieved. The advantages and disadvantages of each method were then listed, and factors such as accuracy and self-occlusion were compared across the methods. In the second stage, based on the advantages found in the first stage, three methods were chosen for evaluation on a specific data set. First, the code of the three methods was run on the PennAction dataset (tennis) and their 3D reconstruction performance is shown. Then, the methods were tested on a mixed-activity sequence from the CMU motion capture database. The novelty of this study is the evaluation of recent methods based on their accuracy on a specific dataset of tennis players. We also propose a technique that combines specific advantages of each method to create a more efficient method for 3D reconstruction from 2D sequential images in the context of outdoor activities.


INTRODUCTION
Multimedia equipment can capture video or multiple photographs in real time during a sporting activity, and the footage can be replayed to an athlete after the game to identify and rectify faults in technique. Although this technique is flexible, the images provide only a single perspective (a single camera view), which considerably reduces the ability to conduct an in-depth analysis (Thompson et al. 2014). To address this issue, multiple cameras can be used to capture the player's performance simultaneously, but this is costly and complicated. It also requires post-processing and thus limits the time available for motion capture. On the other hand, 3D reconstruction of the human body from sequential images must overcome multiple challenges.
In this article, several considerations are taken into account when analyzing the different methods in order to determine the one most suitable for tennis. First, a "realistic human body" is targeted, which is complex to model because of variations in individual body shape and clothing. Second, accurate recognition of self-occlusion is studied, where some limbs block other body parts in images taken by a stationary camera. Third, finding proper image descriptors can help resolve many pose ambiguities, and usually requires a trial-and-evaluation procedure to determine the most competitive representations. Finally, special attention is given to real-world conditions such as cluttered backgrounds, uncontrolled scenes, noisy data and the speed of the moving person across sequential frames (Saima et al. 2018). This research therefore identifies the method that best addresses the ill-posed problem and handles outdoor conditions, so that it can be implemented in the tennis court environment. This paper has three objectives. First, we evaluate different methods for 3D reconstruction of the human body from a sequence of monocular images to determine which performs most efficiently under occlusion and noise on real-world data. Second, we compare the implemented 3D reconstruction methods and identify the advantages and drawbacks of each. Third, we propose a new technique that combines the advantages of the methods studied and is aimed at higher accuracy in a particular application (tennis) with a fixed, ordinary camera.

LITERATURE REVIEW
Considerable research has addressed the challenge of human motion capture from imagery. Gotardo & Martinez (2011), Ramakrishna et al. (2012) and Wandt et al. (2016), for example, reconstruct 3D human motion from feature tracks in monocular image sequences, combining arbitrary camera motion with previously trained base poses. They also handle any motion, periodic or non-periodic.
The review covered the methods proposed by Ramakrishna et al. (2012), Atul (2014), Akhter et al. (2015), Xiaowei et al. (2016), Wandt et al. (2016) and Du et al. (2016). These six recent 3D reconstruction algorithms were selected for analysis based on the performance and results described in their publications. The theoretical approach of each method is discussed and its mathematical model is examined in detail. Ramakrishna et al. (2012) offered an activity-independent model to recover the 3D configuration of a human figure from the 2D locations of anatomical landmarks in a single image, leveraging a large motion capture corpus as a substitute for visual memory. Atul (2014) developed three principled approaches to enhance particle filtering by integrating bottom-up information, either as a proposal density for obtaining more diverse particles or as complementary cues to improve likelihood computation during the correction step. He also demonstrated that a feedback mechanism from top-down modeling can further adapt and enhance the bottom-up predictors to improve tracking performance. Akhter et al. (2015) modeled how joint limits vary with pose in order to obtain valid poses. They collected a motion capture dataset that explores a wide range of human poses and developed a pose-dependent model of joint limits that forms their prior. Xiaowei et al. (2016) proposed integrating a sparsity-driven 3D geometric prior with temporal smoothness, both when the image locations of the human joints are provided and when they are unknown. In the latter case, the image locations of the joints are treated as latent variables to account for ambiguities in the 2D joint locations.
The approach of Wandt et al. (2016) addresses the problem of estimating non-rigid human 3D shape and motion from image sequences captured by uncalibrated cameras. Like other state-of-the-art solutions, they factorize the 2D observations into camera parameters, base poses and mixing coefficients. The novelty of this method is that it can handle arbitrary camera motion without a predefined skeleton or anthropometric constraints, whereas other methods require sufficient camera motion during the sequence to obtain a proper 3D reconstruction. Du et al. (2016) aimed to make 3D motion reconstruction more accurate by introducing additional built-in knowledge, such as a height-map, into the algorithmic scheme for reconstructing 3D pose and motion from a single-view calibrated camera.
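As an illustration of the factorization idea described above, the following minimal sketch builds a 3D pose as a weighted sum of base poses and projects it to 2D. It is not the implementation of Wandt et al. (2016): the function names `combine_base_poses` and `project_orthographic`, the orthographic camera with unit scale, and the toy two-joint skeleton are all assumptions made for the example.

```python
def combine_base_poses(base_poses, coeffs):
    """Return the 3D pose given by a linear combination of base poses.

    base_poses: list of poses, each a list of (x, y, z) joint tuples.
    coeffs: one mixing coefficient per base pose.
    """
    n_joints = len(base_poses[0])
    pose = []
    for j in range(n_joints):
        joint = tuple(
            sum(c * b[j][k] for c, b in zip(coeffs, base_poses))
            for k in range(3)
        )
        pose.append(joint)
    return pose


def project_orthographic(pose, rotation):
    """Rotate each joint with a 3x3 matrix and drop the depth coordinate.

    Assumption: an orthographic camera with unit scale and no translation.
    """
    projected = []
    for p in pose:
        x = sum(rotation[0][k] * p[k] for k in range(3))
        y = sum(rotation[1][k] * p[k] for k in range(3))
        projected.append((x, y))
    return projected


# Toy example: two base poses of a two-joint "skeleton" (hypothetical data).
base_a = [(0.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
base_b = [(0.0, 0.0, 0.0), (2.0, 0.0, 4.0)]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

pose_3d = combine_base_poses([base_a, base_b], [0.5, 0.5])
pose_2d = project_orthographic(pose_3d, identity)
```

Reconstruction runs this model in reverse: given the observed 2D joints, the camera parameters and mixing coefficients are the unknowns to be estimated.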
Finally, our approach is a comparative study of methods for 3D reconstruction of the human body from 2D image sequences of a tennis player. We focused on evaluating methods that have been studied on sports poses by analyzing factors such as accuracy of human pose estimation, self-occlusion and noisy background, which are still not fully resolved. We ran the code of these methods in MATLAB on the PennAction dataset to obtain 3D reconstruction results. After collecting and comparing all the results, we propose a new technique that combines elements of three methods: the Xiaowei, Wandt and Du methods. The proposed method is intended to improve 2D joint localization and occlusion handling so that 2D images can be reconstructed in 3D with realistic results and minimal requirements.

METHODOLOGY
The methodology used in this research consisted of four phases, described in Figure 1. Phase one consisted of the analysis of multiple recently published methods for 3D reconstruction in order to identify the six most relevant to the purpose of this research. Phase two consisted of comparing the experimental results reported by the authors of each method on several factors, namely projection, camera, realistic reconstruction, self-occlusion, accuracy, noisy background and processing speed, in order to shorten the list to three methods. In phase three, these three methods were evaluated on a specific data set of sequential images of a tennis player and the results were compared. Finally, phase four consisted of proposing a new, improved method for 3D reconstruction from 2D sequential images that combines the strengths of the methods evaluated.
Figure 2 displays the pathway followed to evaluate the performance of the three selected methods. The first step was to analyze the mathematics described for each method. Following this, the code was implemented in MATLAB and each method was run on the dataset used by its authors to verify that the code worked without error. When the code was not provided, additional work was required: the method was programmed from the mathematical description given by the authors. Specific factors for these methods were then evaluated on our particular data set to assess their performance and compare their accuracy in 3D reconstruction (Li et al. 2013). The outputs of the codes were compared on the tennis player dataset (PennAction). Finally, the methods were evaluated on the CMU dataset to measure their 3D reconstruction error and accuracy percentage and to compare the results.
The final stage of this study consisted of compiling the advantages found in the evaluated methods. Specific advantages were integrated into the core method (i.e. the method that showed the best performance on the proposed data set) to overcome its disadvantages and to improve the efficacy of 3D reconstruction. The final proposed method includes these highlights and provides a novel approach for 3D reconstruction from 2D sequential images in tennis.

RESULTS
The review of each selected method is presented in Table 1, which describes the advantages and disadvantages of each technique. After the weaknesses and strengths of each algorithm were identified, their experimental results for different parameters are presented in Table 2. Three methods were found suitable for the proposed application of this study, as described below. Two of them (Xiaowei and Wandt) showed successful outcomes for noisy background, self-occlusion and realistic reconstruction, which makes them good candidates for further evaluation. The Du method was also selected because of its outstanding results on the parameters required for the analysis of the chosen dataset. The other methods showed no significant advantages and were discarded because they lacked background-noise handling, realistic reconstruction, or both. These three methods were therefore the most suitable to be analyzed on the tennis player dataset.
This section demonstrates the application of the selected approaches to pose estimation on in-the-wild image sequences. Results are presented for an action from the PennAction dataset. The "tennis forehand" was selected for evaluation because it is not a simple pose and presents challenges such as large pose variability, self-occlusion and image blur caused by fast motion. We selected six frames (2, 8, 14, 20, 25, 30) from the 31-image sequence, which allowed us to evaluate the main factors. Tables 3 to 8 illustrate the 3D results of each method on these frames.
In frame #8, the Du method did not give an accurate result for the right leg (violet), because the right leg is occluded by the left leg (self-occlusion). In frame #20, the Xiaowei method gave the best result for this viewing angle, with the least missing data. In frame #25, the Wandt method gave a poor result for the arms and shoulders (yellow) but a better result for the legs than the Du method.

[Frame #30: input image and 3D reconstructions by the Xiaowei, Wandt and Du methods]
In frame #30, the Wandt method is sensitive to the viewing angle of the image and could not reconstruct the pose accurately. Table 9 summarizes the results on the tennis player dataset. These results show that the methods of Xiaowei et al. (2016) and Du et al. (2016) produce similar 3D results, but the Xiaowei algorithm is more robust to noise and is able to handle occlusions and reconstruct the occluded body parts correctly. The Wandt method, however, performed better in reconstructing the legs. Table 10 shows the processing time of the three methods. The method proposed by Xiaowei is the fastest, given the specifications mentioned in the computing section. The algorithms usually converge in 20 iterations, with an average CPU time below 150 s for a sequence of 31 frames.
These methods were also evaluated on a sequence of mixed activities from the CMU motion capture database. Care was taken to ensure that the motion capture frames were not those used to train the shape bases. The reconstruction results for the jumping sequences were inferior to those for the other sequences. This was because the jumping motions of different individuals varied much more than their running motions. As a result, a new, untrained jumping motion was insufficiently explained by the base poses, whereas each new running pattern closely matched those in the training data. In addition, the evaluation of 3D motion recovery was carried out with the ground-truth 2D joint locations. The 3D reconstruction errors in millimeters are reported in Table 11. The standard per-joint error (mm) in 3D was computed between the reconstructed pose and the ground truth in the camera frame, with their root locations aligned.
Table 11 shows that the 3D reconstructions of the Xiaowei method are highly realistic, as indicated by the 3D error.
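The per-joint error reported in Table 11 can be sketched as follows. This is a minimal illustration, assuming that the root is joint index 0 and that the error is averaged over all joints of a single frame. The function name `mean_per_joint_error` and the sample coordinates are hypothetical and are not taken from the evaluated code.

```python
import math


def mean_per_joint_error(predicted, ground_truth, root=0):
    """Mean Euclidean distance per joint after aligning the root joints.

    predicted, ground_truth: lists of (x, y, z) tuples in millimeters,
    expressed in the camera frame, with joints in the same order.
    root: index of the root joint (assumed to be 0 here).
    """
    pred_root = predicted[root]
    true_root = ground_truth[root]
    total = 0.0
    for p, t in zip(predicted, ground_truth):
        # Express both joints relative to their own root before comparing.
        p_rel = [p[k] - pred_root[k] for k in range(3)]
        t_rel = [t[k] - true_root[k] for k in range(3)]
        total += math.dist(p_rel, t_rel)
    return total / len(predicted)


# Hypothetical two-joint example: the second joint is off by 5 mm
# once the 10 mm root offset has been removed.
predicted = [(10.0, 0.0, 0.0), (10.0, 3.0, 4.0)]
ground_truth = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
error_mm = mean_per_joint_error(predicted, ground_truth)
```

Aligning the roots removes any global translation between the two poses, so the metric reflects only the error in body configuration.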

DISCUSSION
The evaluation indicated that the method proposed by Xiaowei et al. (2016) is highly recommended for 3D reconstruction from images of tennis players. In this method, 2D joint heat maps capturing positional uncertainty are generated with a deep fully convolutional neural network (CNN). These heat maps are combined with a sparse model of 3D human pose within an Expectation-Maximization framework that estimates the 3D parameters over the entire sequence. This method provides a solution for most of the challenges in 3D reconstruction, such as large pose variability, self-occlusion and image blur caused by fast motion. However, it needs manually labeled 2D joint locations, which reduces accuracy. To address this issue, we propose a 3D human pose estimation framework based on the method presented by Wandt et al. (2016). It consists of a synthesis between discriminative image-based and 3D reconstruction approaches. It treats the 2D joint locations as latent variables whose uncertainty distributions are given by a deep fully convolutional neural network. The unknown 3D poses are modeled by a sparse representation and the 3D parameters are estimated with an Expectation-Maximization algorithm, in which the 2D joint location uncertainties can be conveniently marginalized out during inference. Further, to improve robustness against occlusion and reconstruction ambiguity, a 3D temporal smoothness prior is imposed on the 3D pose and viewpoint parameters, as considered in the method of Du et al. (2016). Therefore, using the method of Xiaowei et al. (2016) as the core and integrating the advantages described for the methods of Wandt et al. (2016) and Du et al. (2016) might provide an effective method for 3D reconstruction from image sequences on a specific dataset.
The proposed method might improve 2D joint localization and occlusion handling, so that 2D images can be reconstructed in 3D with realistic results and minimal requirements.
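To make the role of a temporal smoothness prior concrete, the sketch below computes a generic first-order smoothness penalty over a sequence of 3D poses. It is only an illustration of the idea: the squared-difference form, the function name `temporal_smoothness_penalty` and the sample poses are assumptions, not the exact prior used in any of the evaluated methods.

```python
def temporal_smoothness_penalty(poses):
    """Sum of squared joint displacements between consecutive frames.

    poses: list of frames, each a list of (x, y, z) joint tuples.
    A smaller value means the reconstructed motion changes more gradually.
    """
    total = 0.0
    for previous, current in zip(poses, poses[1:]):
        for p, c in zip(previous, current):
            total += sum((c[k] - p[k]) ** 2 for k in range(3))
    return total


# Hypothetical one-joint sequence over three frames.
sequence = [
    [(0.0, 0.0, 0.0)],
    [(1.0, 0.0, 0.0)],
    [(1.0, 2.0, 0.0)],
]
penalty = temporal_smoothness_penalty(sequence)
```

Adding such a term to the reconstruction objective discourages abrupt frame-to-frame jumps, which is one way an occluded joint can be constrained by its position in neighboring frames.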

CONCLUSION
This paper is a comparative study of methods for 3D reconstruction of the human body from 2D image sequences of a tennis player. We chose tennis because it is played at very high speed and requires well-developed technical skills. The sport also presents a challenge for 3D reconstruction because of factors such as self-occlusion and occlusion that occur during play. Many technologies have tried to help raise the level of athletic technique and to reduce arbitration errors and physical injury, but they face problems such as high cost, long processing time and heavy equipment (Norshaliza, 2016). We believe that simulating a tennis player's movement, captured by an arbitrary camera, through 3D reconstruction from sequential images might be economically viable and save time compared with traditional technologies. Moreover, this method can help players and coaches to improve skills. The increasing demand for 3D reconstruction, especially of the human body, also opens additional applications in areas such as movies, gaming and medicine, so the outcome of this research can help other industries as well. In particular, generating 3D poses from a sequence of images is much cheaper than marker-based technologies.
Modeling the 3D human body from image sequences is a challenging problem and has been a research topic for many years (Ashraf et al. 2014). Important theoretical and algorithmic results have been achieved that make it possible to extract even complex poses of the human body. Research on human pose has approached the problem from many different angles in an attempt to implement a robust, accurate and automatic full-body system.
In this paper we focused on evaluating methods that have been studied on sports poses by analyzing several factors that are still not fully resolved in this area. For instance, background clutter in realistic scenes, variation in human appearance and self-occlusion are challenges that require in-depth investigation. We also identified the most suitable method, one that addresses the ill-posed problem and can handle outdoor conditions so that it can be implemented in the tennis court environment with high processing speed (Shingade and Ghotkar, 2014). To reach this goal, the evaluation was carried out in two steps. First, we chose six recent methods based on their focus on features such as image sequences, camera setup and sports poses in the real world. These methods address several challenges of older methods and offer recommendations for future work. The advantages and disadvantages of the methods of Ramakrishna et al. (2012), Atul (2014), Akhter et al. (2015), Xiaowei et al. (2016), Wandt et al. (2016) and Du et al. (2016) were discussed and compared theoretically. Factors such as accuracy of human pose estimation, self-occlusion and noisy background were analyzed in their experimental results.
In the next step of the evaluation, the three top methods were selected for further, deeper analysis. We ran the code of their algorithms in MATLAB on the PennAction dataset to obtain 3D reconstruction results. The code for the methods of Xiaowei et al. (2016) and Du et al. (2016) was obtained from the Internet, and we implemented the method of Wandt et al. (2016) ourselves. To obtain final results, we also tested these methods on the CMU MoCap database. We concluded that, among them, the method proposed by Xiaowei might be the most suitable for 3D reconstruction applied to tennis. This method proved to be faster than the other methods evaluated and produced outstanding results in terms of accuracy. The method proposed by Wandt provided better accuracy when dealing with self-occlusion. Finally, the method proposed by Du showed the lowest accuracy and poor performance when occlusion was involved.
Finally, we proposed a new technique that combines elements of three methods: the Xiaowei, Wandt and Du methods. It is a 3D human pose estimation framework for monocular images that consists of a novel synthesis between a deep learning-based 2D part regressor, the sparsity-driven 3D reconstruction approach of the Wandt method, and the 3D temporal smoothness prior of the Du method. This joint consideration combines the discriminative power of state-of-the-art 2D part detectors, the expressiveness of 3D pose models, and regularization by aggregating information over time, so it can go directly from 2D appearance to 3D geometry. The proposed method can improve 2D joint localization for tennis player poses in outdoor conditions from image sequences taken by an arbitrary camera.