Design of Smart Human Following on Rail Inspection Using Human Pose Estimation Marker-Less Motion Capture Based on BlazePose

Advances in artificial intelligence (AI) now have a significant impact on many aspects of human life. One example is robotics, which has reached the point where robots can follow human movement. To achieve this, the system uses image recognition through computer vision and human pose estimation with the BlazePose library, which recognizes 33 keypoints of the human body. This research aims to develop an automatic control system for inspection carts that enables them to follow a person walking. The results showed a detection accuracy of 84.82%, an optimal detection distance of 4 to 8 meters from the camera, and an average detection accuracy of 89.862%. For motor control, the system turns the motor off when the distance between the device and the object is 1-2 meters, and turns it on at a distance of 3-12 meters. The accuracy achieved is strongly affected by the color segmentation capabilities of the software, the lighting conditions in the environment, and the resolution of the camera used.


INTRODUCTION
Motion capture has been a topic of interest in several studies. One application of motion capture techniques is human pose estimation. Before marker-less approaches were widely adopted for human pose estimation, many studies relied on markers to generate human skeletons.
Optical motion capture is one of the tools used to generate human skeleton structures, and its results have a high level of accuracy. However, it has limitations in terms of practicality, as it requires the user to wear clothing fitted with markers.
In addition to optical motion capture, some studies have used Kinect devices. However, the Kinect device sometimes faces challenges in obtaining the human skeleton structure (Liu et al., 2016). These arise due to self-occlusions (concealment of joints by other body parts) and errors from the Kinect sensor itself: because Kinect devices are intended for consumers, their accuracy and reliability tend to be lower.
Lately, an increasing number of researchers have been focusing on marker-less motion capture from an algorithmic standpoint. Marker-less motion capture can be divided into two primary categories: discriminative approaches and generative approaches (Chao, 2016).
Discriminative approaches use data-driven machine learning techniques to turn the motion capture problem into a regression or pose classification task (Hong, 2016), making them suitable for human-computer interaction applications where efficiency outweighs precision. Generative approaches, on the other hand, aim to determine the body's pose and shape by fitting a model to information extracted from images. These methods can generate a series of model parameters such as body shape, bone lengths, and joint angles. In contrast to discriminative approaches, generative methods typically rely on temporal data and address a tracking problem. This trend is being developed continuously. Human pose estimation provides human joint information, where each keypoint can be used for interaction between robots and humans.
Cheng (2021) built a modular interactive framework based on RGB images, which aims to overcome the high dependence on depth cameras and the limited distance adaptation of existing human-robot interaction frameworks. However, most existing mainstream methods still rely on depth cameras to obtain human joint information. Existing interaction frameworks are affected by the infrared detection distance and thus cannot properly adapt to a variety of interaction distances.
As human pose estimation continues to evolve, its scope is expanding to include data-driven approaches. According to Ming-Hwa et al. (2023), deep learning has undergone rapid development. Deep learning is used in various fields, one of which is human pose estimation. The need to improve accuracy in human pose estimation is also growing, and there are several challenges to overcome. The first is how to obtain a correct pose estimate given the variations in clothing, body shape, and pose that may occur. In addition, there is a demand for effective human pose estimation even when it is applied to many individuals at once.
BlazePose is a real-time human pose detection method capable of recognizing human poses in images or videos. It operates in a single-person setup, detecting one human pose at a time. In simple terms, BlazePose is a deep learning model that estimates human pose by identifying body segments such as elbows, hips, wrists, knees, and ankles. These segments are connected to form a skeletal structure that depicts the pose. The model is designed to be efficient, using depth-wise separable convolution to increase network depth, reduce parameters and computational load, and improve accuracy. BlazePose provides a set of 33 keypoints, covering the body from the nose to the foot indices.
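The 33-keypoint topology can be written down as a simple lookup table. The sketch below is illustrative only: the keypoint names follow the MediaPipe Pose landmark naming, which is an assumption here, since the text above only states the count and the extent of the set. The helper `keypoint_index` is a hypothetical name.

```python
# BlazePose full-body topology: 33 keypoints, indexed 0..32.
# Names follow MediaPipe Pose landmark naming (an assumption; the paper
# only states that there are 33 keypoints).
BLAZEPOSE_KEYPOINTS = (
    "nose",
    "left_eye_inner", "left_eye", "left_eye_outer",
    "right_eye_inner", "right_eye", "right_eye_outer",
    "left_ear", "right_ear",
    "mouth_left", "mouth_right",
    "left_shoulder", "right_shoulder",
    "left_elbow", "right_elbow",
    "left_wrist", "right_wrist",
    "left_pinky", "right_pinky",
    "left_index", "right_index",
    "left_thumb", "right_thumb",
    "left_hip", "right_hip",
    "left_knee", "right_knee",
    "left_ankle", "right_ankle",
    "left_heel", "right_heel",
    "left_foot_index", "right_foot_index",
)


def keypoint_index(name):
    """Return the integer index of a keypoint name (hypothetical helper)."""
    return BLAZEPOSE_KEYPOINTS.index(name)
```

A lookup such as `keypoint_index("left_hip")` is convenient when only a few joints, for example the hips, are needed downstream.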
Building on existing research, the input of this research is a single view from a fixed camera position that captures the object. The camera view feeds the 2D joint location detection process. Camera calibration is needed to obtain the intrinsic and extrinsic parameters of the camera. With these parameters and the 2D joint locations, we obtain a 3D human pose estimate using the BlazePose approach.
Design of Smart Human Following on Rail. JOGE, Vol. 2, No. 2, November 2023: 106-118
In this research, a human pose estimation system is integrated into a transport system that supports the inspection process on the railway, commonly known as an inspection train. This interaction mimics the interaction between humans and robots, but at the scale of a transport vehicle. The designed system follows human movement along the railway track based on human pose estimation.

RESEARCH METHOD
The pipeline of this BlazePose-based human pose estimation system mainly includes four stages, as shown in Figure 1. The first stage is motion capture using a camera, which acts as the input and output of the BlazePose-based human pose estimation. The second is the use of machine learning, which leads to the architecture of BlazePose itself. BlazePose uses two types of machine learning models, namely a detector and an estimator. The third stage is joint detection based on 33 predefined keypoints, obtained by marker-less motion capture and referring to the Vitruvian Man model.

A. Human Model
Our human body model consists of kinematic 'skeletons' of articulated joints controlled by angular joint parameters $x_a$, covered by 'flesh' built from superquadric ellipsoids with additional tapering and bending parameters. A typical model has around 33 joint parameters, plus internal proportion parameters $x_i$ encoding the positions of the hip, clavicle and skull tip joints, plus 9 deformable shape parameters for each body part, gathered into a vector $x_d$. The state of a complete model is thus given as a single parameter vector $x = (x_a, x_d, x_i)$. We note, however, that only the joint parameters are typically estimated during object localization and tracking, with the other parameters remaining fixed.

B. Parameter Estimation
We aim towards a probabilistic interpretation and optimal estimates of the model parameters by maximizing the total probability according to Bayes' rule:

$$p(x \mid \bar{r}) \propto p(\bar{r} \mid x)\, p(x)$$

This formulation reflects a Bayesian approach in which our prior knowledge of the human pose parameters, $p(x)$, is updated with the observed data $\bar{r}$ to obtain a more accurate estimate, $p(x \mid \bar{r})$. It provides a probabilistic framework for human pose estimation that maximizes the probability of the parameters given the acquired data.
$p(\bar{r} \mid x)$ is the probability distribution of the observed data about the human pose, $\bar{r}$, given the parameters $x$ used to describe the pose. It measures the degree to which those parameters match the observed data.
In this context, the two energy terms can be interpreted concretely as an activation energy and an error energy. The activation energy reflects the complexity or cost of determining the human pose parameters, while the error energy measures the degree to which the parameters fail to match the observed data. In this framework, we optimize the total of the two energies so that the parameters match the available data as closely as possible.
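As a worked step (a sketch added for illustration, not taken from the paper's derivation), taking the negative logarithm of Bayes' rule turns the product of probabilities into a sum, which is why such estimation problems are often phrased as minimizing a total energy:

$$-\log p(x \mid \bar{r}) = -\log p(\bar{r} \mid x) - \log p(x) + \log p(\bar{r})$$

The last term does not depend on $x$, so maximizing the posterior is equivalent to minimizing the sum of the first two terms. Under this reading, which is an assumption here, the likelihood term $-\log p(\bar{r} \mid x)$ plays the role of an error energy, since it grows as the parameters fit the observations less well.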

C. Observation Maximum Likelihood Estimation
In the context of parameter estimation, the likelihood is naturally viewed as a function of the parameters $\theta$. It is the joint probability of a set of observations $y$, conditioned on a choice for $\theta$:

$$\mathrm{Lik}(\theta; y) \equiv P(y \mid \theta)$$

Since good predictions are better, a natural approach to parameter estimation is to choose the set of parameter values that yields the best predictions, that is, the parameter values that maximize the likelihood of the observed data. This value is called the Maximum Likelihood Estimate (MLE), defined formally as:

$$\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \mathrm{Lik}(\theta; y)$$

In nearly all cases, the MLE is consistent (Cramer, 1964) and gives intuitive results. In many common cases, it is also unbiased. For the estimation of multinomial probabilities, the MLE also turns out to be the relative-frequency estimate. Figure 2 visualizes an example of this. The MLE is also an intuitive and unbiased estimator for the means of normal and Poisson distributions.

The input is a real-time video capture, specifically the region of the video frame in which a person has been detected. It is represented as a 256×256×3 array containing the aligned whole human body, centred on the middle of the hips in a vertical body pose, with a rotation within (-10, 10). The channel order is RGB, with values in [0.0, 1.0]. The breakdown of the process can be seen in Figure 3. The outputs include a 33×5 tensor corresponding to the screen-projected keypoints (x, y, z, visibility, presence), a 33×3 tensor corresponding to the metric-scale coordinates in the 3D world (world x, world y, world z), and a scalar in the range [0.0, 1.0] indicating the probability that a person is present in the passed image.
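The input format described above can be illustrated with a minimal preprocessing sketch. It is not the authors' implementation: it uses nearest-neighbour resizing in pure Python as a stand-in for a proper image library, it skips the body alignment and rotation step, and the function name `preprocess_crop` is hypothetical.

```python
INPUT_SIZE = 256  # the estimator expects a 256 x 256 x 3 RGB input


def preprocess_crop(crop):
    """Resize an RGB person crop to 256 x 256 and scale values to [0.0, 1.0].

    crop: list of rows, each row a list of (r, g, b) integer tuples in 0..255.
    Nearest-neighbour sampling is an assumption made to keep the sketch
    dependency-free; alignment and rotation are not handled.
    """
    height = len(crop)
    width = len(crop[0])
    resized = []
    for i in range(INPUT_SIZE):
        src_row = crop[i * height // INPUT_SIZE]
        row = []
        for j in range(INPUT_SIZE):
            r, g, b = src_row[j * width // INPUT_SIZE]
            row.append((r / 255.0, g / 255.0, b / 255.0))
        resized.append(row)
    return resized
```

For example, a 480 x 320 crop from the detector would be reduced to a 256 x 256 grid of normalized RGB triples before being passed to the estimator.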

E. BlazePose Architecture
BlazePose consists of two machine learning models, a detector and an estimator. The detector cuts out the human region from the input image, while the estimator takes a 256×256 resolution image of the detected person as input and outputs the keypoints. The detector is a Single-Shot Detector (SSD) based architecture. There are two ways to use the detector. In box mode, the bounding box is determined from its position (x, y) and size (w, h). In alignment mode, the scale and angle are determined from (x1, y1) and (x2, y2), and a bounding box including rotation can be predicted. The estimator uses heatmaps for training, but computes keypoints directly without heatmaps for faster inference. The z-values are relative to the person's hips: a keypoint is between the hips and the camera when the value is negative, and behind the hips when the value is positive.
The visibility and presence are stored in the range [min_float, max_float] and are converted to probabilities by applying a sigmoid function. The visibility gives the probability that a keypoint exists in the frame.
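The decoding of the estimator output described above can be sketched as follows. This is a hedged illustration, not the library's own code: it assumes the landmarks arrive as a flat sequence of floats in which the first 165 values hold (x, y, z, visibility, presence) for each of the 33 keypoints, as stated in the caption of Figure 4. The names `decode_landmarks` and `in_front_of_hips` are hypothetical.

```python
import math

NUM_KEYPOINTS = 33
VALUES_PER_KEYPOINT = 5  # x, y, z, visibility, presence


def sigmoid(value):
    """Numerically stable logistic function mapping a raw score to (0, 1)."""
    if value >= 0:
        return 1.0 / (1.0 + math.exp(-value))
    e = math.exp(value)
    return e / (1.0 + e)


def decode_landmarks(flat):
    """Turn a flat landmark vector into a list of 33 keypoint dictionaries.

    Only the first 33 * 5 = 165 values are read; any remaining values are
    ignored (an assumption based on the Figure 4 caption).
    """
    needed = NUM_KEYPOINTS * VALUES_PER_KEYPOINT
    if len(flat) < needed:
        raise ValueError("expected at least %d values" % needed)
    keypoints = []
    for k in range(NUM_KEYPOINTS):
        start = k * VALUES_PER_KEYPOINT
        x, y, z, vis, pres = flat[start:start + VALUES_PER_KEYPOINT]
        keypoints.append({
            "x": x,
            "y": y,
            "z": z,
            "visibility": sigmoid(vis),   # raw score -> probability
            "presence": sigmoid(pres),    # raw score -> probability
            "in_front_of_hips": z < 0,    # negative z: between hips and camera
        })
    return keypoints
```

A downstream rule could then, for instance, treat a pose as detected only when the visibility of the hip and shoulder keypoints exceeds a chosen threshold; that threshold would be a tuning parameter, not a value given in this paper.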

F. Rail Inspection System
The system combines distance detection and human pose estimation from images captured by a webcam. The images are processed with the BlazePose method, and the result is forwarded to the ESP32 microcontroller, which is connected to the PLC using Modbus RS485 communication as TX and RX inputs. The rotary encoder, which serves as the speed sensor, is connected to the PLC at input address S8, which has a High-Speed Counter function.
The two inputs are processed by a PID controller, whose output is a PWM signal. The PWM signal issued by the PLC at R7 is converted by the PWM-to-voltage module and forwarded to the BLDC motor controller. The controller processes this voltage and drives the BLDC motor to adjust its speed. A 48 V 10 Ah lithium-ion battery, plugged into the battery socket on the BLDC motor controller, is the main power source for both the controller and the motor. The system can be seen in Figure 5.
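In the described system the PID loop runs on the PLC. The Python sketch below only illustrates the control law that turns a speed error into a PWM duty cycle. The gains, the 0-100% duty range, and the choice of a target speed as setpoint with the encoder speed as measurement are all assumptions made for illustration.

```python
class PID:
    """Textbook PID controller with output clamping (no anti-windup)."""

    def __init__(self, kp, ki, kd, out_min=0.0, out_max=100.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.out_min, self.out_max = out_min, out_max
        self.integral = 0.0
        self.prev_error = None

    def update(self, setpoint, measured, dt):
        """Return a PWM duty cycle in percent for one control step of dt seconds."""
        if dt <= 0:
            raise ValueError("dt must be positive")
        error = setpoint - measured
        self.integral += error * dt
        if self.prev_error is None:
            derivative = 0.0  # no previous sample on the first step
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        raw = self.kp * error + self.ki * self.integral + self.kd * derivative
        # Clamp to the valid duty-cycle range before it is sent to the driver.
        return min(max(raw, self.out_min), self.out_max)
```

Clamping keeps the command inside the range that a PWM-to-voltage stage can represent. A production controller would usually also limit the integral term, which this sketch leaves out for brevity.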

RESULTS AND DISCUSSION
This research produces estimates of the human body using BlazePose. The results are divided into two categories. The first is the result of human pose estimation using BlazePose, obtained by validating the keypoint values. The second is based on the integration of the human pose estimation system into rail inspection.
To evaluate the model, the researchers checked the coordinates (x, y, z) of the 33 keypoints. The result of the 3D projection can be seen in Table 1. The coordinates from the 3D projection are then validated to determine whether each point in the human pose estimation corresponds to the correct joint keypoint. This can be seen in Figure 6, where the keypoints include several points in the face and chest area. The results of the mapping can be seen in Table 2. These results are in accordance with the reference keypoint indices.

After the design system has been matched with the keypoints, the human pose estimation is integrated with the rail inspection system through the motor, which further regulates the distance range so that the system obtains information on the optimum joint distance. The system decides in which direction the rail inspection moves based on the human movement on the rail track. If a good estimate of the human body is detected (in a standing condition), the motor is on and the landmarks are detected. If the inspection direction is backward, the direction is reverse. Conversely, if the inspection direction is forward, the direction is forward.

The results of the integration provide information on the optimal coverage distance for rail inspection of objects (human poses). In this condition, the researchers use a distance of 1-12 meters, which is the optimal range based on triangulation (image projection). This distance can also be used to estimate the object detection range, which is useful for activating the motor. Table 4 shows that the ideal distances for the system to detect the person and set the motor ON are 4, 5, 7, and 8 meters. The graphs for these four distance conditions show a good initialisation between frame count and accuracy compared to the other distances. However, in the initial condition, at distances of 1 meter and 2 meters, the motor status is OFF. This happens because the person is too close to the camera, and the camera's field of view does not yet cover the position of the rail. The accuracy graphs for distances from 1 meter to 12 meters can be seen in the following figure.

(e) Accuracy graph at a distance of 5 meters.
(f) Accuracy graph at a distance of 6 meters.
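The motor rule reported in this section (OFF at 1-2 meters, ON at 3-12 meters, with a FORWARD or REVERSE direction) can be summarized in a small decision function. This is a sketch under stated assumptions: the function name `motor_command` is hypothetical, and the behaviour between 2 and 3 meters and beyond 12 meters is not specified in the paper, so the sketch keeps the motor OFF in both cases.

```python
RUN_MIN_M = 3.0   # motor may run from this distance (from the paper)
RUN_MAX_M = 12.0  # upper end of the tested range (from the paper)


def motor_command(distance_m, landmarks_detected, direction):
    """Return (motor_on, direction) for one frame.

    distance_m: estimated distance between camera and person, in meters.
    landmarks_detected: True when the pose landmarks were detected.
    direction: "FORWARD" or "REVERSE", taken from the person's movement.
    """
    if direction not in ("FORWARD", "REVERSE"):
        raise ValueError("direction must be 'FORWARD' or 'REVERSE'")
    # Assumption: anything outside 3-12 m keeps the motor OFF.
    in_run_range = RUN_MIN_M <= distance_m <= RUN_MAX_M
    motor_on = bool(landmarks_detected) and in_run_range
    return motor_on, direction
```

With this rule, a person detected at 5 meters moving in reverse yields an ON state with REVERSE direction, while a person at 1 meter yields OFF regardless of detection, matching the states shown in Figures 7 to 10.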

Figure 1. The pipeline in human pose estimation.

Figure 2. The likelihood function for the binomial parameter  for observed data where  = 10 and  = 10. The MLE is the Relative Frequency Estimate (RFE) for the binomial distribution. Note that this graph is not a probability density and the area under the curve is much less than 1.

Figure 4. Tracking network architecture of regression with heatmap supervision. The first output of the estimator is (1, 195) landmarks, and the second output is (1, 1) flags. The landmarks use 165 elements: (x, y, z, visibility, presence) for each of the 33 keypoints.

Figure 5. Integrating system to rail inspection.

Figure 9. Motor condition is ON, overall landmark is DETECTED, direction is FORWARD.

Figure 10. Motor condition is OFF, overall landmark is UNDETECTED, direction is FORWARD.
(a) Accuracy graph at a distance of 1 meter.
(b) Accuracy graph at a distance of 2 meters.
(c) Accuracy graph at a distance of 3 meters.
(d) Accuracy graph at a distance of 4 meters.

Table 3. Integrating motor to system based on poses.
Figure 7. Motor condition is ON, overall landmark is DETECTED, direction is REVERSE.
Figure 8. Motor condition is OFF, overall landmark is UNDETECTED, direction is REVERSE.