Dual-Hand Motion Capture by Using Biological Inspiration for Bionic Bimanual Robot Teleoperation

Bionic bimanual robot teleoperation can transfer the grasping and manipulation skills of human dual hands to bionic bimanual robots to realize natural and flexible manipulation. The motion capture of dual hands plays an important role in the teleoperation. The motion information of the dual hands can be captured through hand detection, localization, and pose estimation and mapped to the bionic bimanual robots to realize the teleoperation. However, although motion capture technology has advanced greatly in recent years, visual dual-hand motion capture is still a great challenge. Therefore, this work proposes a dual-hand detection method and a 3-dimensional (3D) hand pose estimation method based on body and hand biological inspiration to achieve convenient and accurate monocular dual-hand motion capture and bionic bimanual robot teleoperation. First, a dual-hand detection method based on body structure constraints is proposed, which uses a parallel structure to combine hand and body relationship features. Second, a 3D hand pose estimation method with a bone-constraint loss from single RGB images is proposed. Then, a bionic bimanual robot teleoperation method is designed by using the proposed hand detection and pose estimation methods. Experimental results on public hand datasets show that the proposed hand detection and 3D hand pose estimation outperform state-of-the-art methods. Experimental results on a bionic bimanual robot teleoperation platform show the effectiveness of the proposed teleoperation method.


Introduction
With the rapid development of robotics in recent years, robots can replace or assist humans to complete specific tasks in the fields of industry, medical care, aerospace, and educational services [1]. Since the structures of bionic bimanual robots [2][3][4] are similar to those of the human dual arms and dual hands, they can replace humans to complete more complex tasks. The movement of a bionic bimanual robot requires a high degree of coordination and complex manipulation. Bionic bimanual robot teleoperation [5] can transfer the operation skills of the human dual hands to the robots to increase the success rate and reliability of manipulation, which makes it the first choice as a smart programming solution when dealing with fast decisions and corner cases.
Currently, teleoperation methods for high-degree-of-actuation robots such as bionic bimanual robots require precise motion capture of the human dual hands. The captured motion information is then mapped to the motion of the robots to realize the teleoperation. The hand motion capture methods mainly include those based on data gloves [6], surface electromyography wristbands [7], and optical markers [8]. These methods are very expensive and require long and difficult calibration work. At present, there are already several commercialized marker-free tools for motion capture, such as Kinect, a powerful marker-free tool developed by Microsoft several years ago, and, more recently, numerous motion capture apps for smartphones [9]. However, these tools have some disadvantages. For example, Kinect only captures the motion of the human body and does not include fine hand motion capture. Most apps for smartphones can only estimate 2-dimensional (2D) hand poses. These methods are therefore not suitable for the teleoperation of bionic bimanual robots. Based on these issues, this work provides a convenient, marker-free, and low-cost teleoperation strategy for bionic bimanual robots. The pipeline of the teleoperation method is shown in Fig. 1. The positions and poses of the pilot's dual hands are captured by visual sensors. Then, the dual-hand motion information is mapped to the movement of the bionic bimanual robot. In this system, a visual dual-hand motion capture method is proposed, which includes a DuHandLocNet and a 3DHandPoseNet. The DuHandLocNet is used to detect and localize the dual hands to obtain the hand presence, handedness, and hand positions, which can be mapped to the end positions of the bionic bimanual robot. The 3DHandPoseNet is used to estimate the 3-dimensional (3D) poses of the dual hands from the located hand RGB images, and these 3D hand poses can be mapped to the gestures of the 5-finger dexterous manipulators.
The dual-hand motion capture plays an important role in the bionic bimanual robot teleoperation. However, it is still a great challenge in computer vision. For one thing, the detection and distinction of dual hands are affected by the similarity of the left and right hands, the variety of gestures, and the small sizes of hands, so it is difficult to obtain high accuracies. For another, 3D hand pose estimation from single RGB images is also difficult due to self-similarity, self-occlusion, and the lack of depth information. To deal with these issues, this work proposes a biologically inspired dual-hand motion capture method, which includes a dual-hand detection network (DuHandLocNet) and a 3D hand pose estimation network (3DHandPoseNet). The DuHandLocNet uses a parallel network structure to combine the features of the hands with the relationship between the dual hands and the body; the biological constraints of this relationship are introduced into the loss function to increase the hand detection accuracy and distinguish the left and right hands. The 3DHandPoseNet adopts a cascaded structure to extract the joint point features of each hand part and introduces hand bone constraints into the loss function to improve the 3D hand pose estimation.
The contributions and innovations of this work are summarized as follows.
1. A DuHandLocNet is proposed to deal with dual-hand detection and localization, which introduces the biological constraint of the structural relationship between the dual hands and the body by using a parallel network.
2. A 3DHandPoseNet is proposed to deal with 3D hand pose estimation from single RGB images, which is based on the hand biological inspiration by using a cascaded structure and a bone-constraint loss.
3. A visual dual-hand-based teleoperation system for bionic bimanual robots is designed by using the proposed DuHandLocNet and 3DHandPoseNet. It can realize free and convenient teleoperation.
The remainder of this paper is organized as follows. Related work introduces related works. Dual-hand detection network and 3D hand pose network introduce the proposed dual-hand detection method (DuHandLocNet) and the 3D hand pose estimation method (3DHandPoseNet), respectively. Bionic bimanual robot teleoperation system introduces the bionic bimanual robot teleoperation system. Results and discussion presents the experimental results, and Conclusion concludes this paper.

Related work
In the following, the works of visual hand-based teleoperation, dual-hand detection, and 3D hand pose estimation are discussed, which are closely related to our work.

Visual hand-based teleoperation
Compared with the hand teleoperation methods based on data gloves [6], surface electromyography wristbands [7], and optical markers [8], visual hand teleoperation is still immature. However, due to its huge potential, some works have made progress. Li et al. [10] proposed an end-to-end neural network (TeachNet), which used a consistent error formula to control a Shadow hand, a 5-finger manipulator. Li et al. [11] also proposed a multimodal mobile robot arm teleoperation system, which consists of a novel vision-based gesture regression network (Transteleop) and an inertial measurement unit-based arm tracking method. Handa et al. [12] designed a low-cost, vision-based teleoperation system (DexPilot), which can achieve full control of a 23-degree-of-actuation robot system by observing the bare hands. Gomez-Donoso et al. [13] designed a HandLocNet for hand detection and localization on RGB images and a HandPoseNet for 3D hand pose estimation; they also conducted teleoperation experiments for AR10 and Shadow hands in a virtual environment. Sivakumar et al. [14] designed Robotic Telekinesis for robotic hand teleoperation by watching humans on YouTube. These teleoperation methods are all used for single-arm/hand robots and are not suitable for bionic bimanual robots.

Visual dual-hand detection and distinction
The visual dual-hand detection and distinction is a great challenge in computer vision due to the similarity of the left and right hands, the variety of gestures, and the small sizes of hands. Some traditional methods use hand color, optical flow, and shape features [15]; they rely heavily on limited conditions and lack generalization in practical applications [16]. With the development of deep learning in object detection, deep learning networks like Faster R-CNN [17], SSD [18], and YOLO [19] have been introduced to deal with hand detection and dual-hand distinction. For example, Hoang Ngan Le et al. [20] proposed Multiple Scale Region-based Fully Convolutional Networks to realize robust hand detection in vehicles. However, this method is only suitable for a few hand gestures, such as driving gestures. Gao et al. [21] improved the SSD to achieve robust real-time hand detection and localization. However, this method did not discuss the distinction of the left and right hands. Gao et al. [22] also proposed a dual-hand detection method using a parallel network. This method requires postprocessing and cannot be used in a real-time system.

3D hand pose estimation from single RGB images
The 3D hand pose estimation from single RGB images is also a great challenge in computer vision due to the hand self-similarity, self-occlusion, and lack of depth information.
To deal with this issue, Ge et al. [23] proposed a 3D hand pose estimation method using 3 CNN subnetworks for hand segmentation, 2D hand pose detection, and 2D-to-3D hand pose derivation, respectively. Ge et al. [24] also proposed an end-to-end trainable hand pose and mesh generation approach based on Graph CNN [25]. Lin et al. [26] proposed a MEsh TRansfOrmer method to reconstruct 3D hand pose and mesh vertices from a single image by using the transformer [27]. These methods are all directly transferred from 3D body pose estimation methods. Although they have achieved high accuracy, they do not consider the structural features of the hand.

Dual-hand detection network
The DuHandLocNet is designed for dual-hand detection and distinction on RGB images. The pipeline of the DuHandLocNet is shown in Fig. 2. It uses a parallel network, where one subnetwork is designed for hand detection and the other subnetwork is designed for body pose estimation, which outputs the estimated dual-hand positions and handedness by using a body forward kinematic (FK) tree [22]. The estimated dual-hand positions help to increase the hand detection accuracy when introduced into the loss function. The handedness helps to distinguish the left and right hands in the hand detection results.

Hand detection subnetwork
The hand detection subnetwork adopts the anchor-free structure [28,29]. Compared with the anchor-based structure [17][18][19], there is no need to preset anchors; the network only regresses the center point, width, and height of objects at different scales, which greatly reduces computation. In addition, the structure draws on the keypoint-based design of CenterNet [29], which improves the detection accuracy for hands with small sizes. The deep layer aggregation network (DLA-34) [30], one of the most advanced backbones for image semantic segmentation, is adopted as the backbone. It utilizes separable convolution blocks to accelerate computation while preserving image semantic information. To apply it to hand detection, we add additional skip connections from the bottom layers to the original DLA-34 structure and upgrade the convolutional layers in the upsampling stage. Then, 3 modules are attached as the heads: a center detection module (CeDM), an offset detection module (ODM), and a size detection module (SDM). For an input image I, the CeDM predicts the hand center coordinates (x, y), the SDM predicts the hand size (w, h), and the ODM predicts the hand localization offset (o_x, o_y). As a result, the bounding box of the detected hand B_h can be formulated as

B_h = (x + o_x − w/2, y + o_y − h/2, x + o_x + w/2, y + o_y + h/2)

For training, the CeDM uses the focal loss as its loss function L_Ce:

L_Ce = −(1/N) Σ_{i,j} { (1 − Y_{i,j})^α log(Y_{i,j}),                 if Ŷ_{i,j} = 1
                        (1 − Ŷ_{i,j})^β (Y_{i,j})^α log(1 − Y_{i,j}),  otherwise }    (1)

where Y_{i,j} is the score at point (i, j) in the heat map of the predicted hand center point, Ŷ_{i,j} is the corresponding ground truth, N is the number of hands in image I, and α and β are focal-loss hyperparameters.
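As a minimal sketch of how the three head outputs combine into a box (assuming a stride-4 heat map as in CenterNet and sizes predicted in input-image pixels; the function name is ours, not from the paper):

```python
def decode_hand_bbox(x, y, ox, oy, w, h, stride=4):
    """Combine the CeDM center (x, y), ODM offset (ox, oy), and SDM size
    (w, h) into an image-space bounding box (x1, y1, x2, y2).

    The center is predicted on the downsampled heat map, so it is refined
    by the sub-pixel offset and scaled back to the input resolution.
    """
    cx = (x + ox) * stride  # refined center x in image coordinates
    cy = (y + oy) * stride  # refined center y in image coordinates
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```

For a center detected at heat-map cell (10, 10) with offset (0.5, 0.5) and a 20 × 20 px size, this yields the box (32, 32, 52, 52).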
The SDM uses the L1 regression loss as its loss function L_S:

L_S = (1/N) Σ_{i,j} ( |ω_{i,j} − ω̂_{i,j}| + |h_{i,j} − ĥ_{i,j}| )

where ω_{i,j} and h_{i,j} are the width and height of the predicted hand bounding box, and ω̂_{i,j} and ĥ_{i,j} are the corresponding ground truths.
The ODM also uses the L1 regression loss as its loss function L_O:

L_O = (1/N) Σ_{i,j} | O_{i,j} − ( ĉ/S − ⌊ĉ/S⌋ ) |

where c is the predicted hand center point, ĉ is the corresponding ground truth, S = 4 is the downsampling factor, and O_{i,j} is the predicted hand localization offset.
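The head losses above translate almost directly into code. The following NumPy sketch transcribes the CenterNet-style formulas (it is an illustration, not the authors' implementation):

```python
import numpy as np

def center_focal_loss(Y, Y_hat, alpha=2.0, beta=4.0):
    """Focal loss L_Ce over the center heat map.

    Y: predicted scores in (0, 1); Y_hat: ground-truth heat map with 1 at
    hand centers (and Gaussian-splatted values elsewhere).
    """
    pos = Y_hat == 1
    n = max(int(pos.sum()), 1)  # number of hands N in the image
    pos_loss = ((1 - Y[pos]) ** alpha * np.log(Y[pos])).sum()
    neg_loss = ((1 - Y_hat[~pos]) ** beta * Y[~pos] ** alpha
                * np.log(1 - Y[~pos])).sum()
    return -(pos_loss + neg_loss) / n

def l1_loss(pred, gt):
    """L1 regression loss used by both the SDM (L_S) and the ODM (L_O)."""
    return float(np.abs(pred - gt).mean())
```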

Body pose estimation subnetwork
The structure of the proposed body pose estimation network draws on the structure in [31], which introduces 3 deconvolution layers behind the C5 layer of the ResNet [32]. These deconvolution layers all use ReLU activation and batch normalization, and each layer has a 4 × 4 convolution kernel and 256 filters. Then, a 1 × 1 convolution layer is introduced at the end of the network to generate the predicted heat maps h_1, h_2, …, h_k for the k body joint points. The heat-map loss L_hm uses the mean squared error:

L_hm = (1/k) Σ_{j=1}^{k} || h_j − ĥ_j ||²

where j ∈ [1, k] denotes the jth joint point, and h_j and ĥ_j are the heat maps of the predicted jth joint point and the corresponding ground truth, respectively.
As shown in Fig. 2, since the body keypoint topology has no dual-hand joint points, the body FK tree is used to estimate the coordinates of the left and right hands according to the body skeletal structure. As shown in Fig. 3, first, the body upper-limb model is made up of the upper-limb joint points from the body keypoint topology plus the defined left- and right-hand joint points (17 and 18). Then, a dependency graph including parent and child nodes is built according to the upper-limb model, so that the dual-hand joint points P_17 and P_18 can be obtained from the dependency relationship between the parent and child nodes.
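The FK extrapolation of a hand keypoint from its parent nodes can be sketched as follows (the elbow/wrist interpretation of the parent joints and the scale s are our reading of the FK tree, not confirmed by the source):

```python
import numpy as np

def estimate_hand_point(p_elbow, p_wrist, s=0.25):
    """Extrapolate a hand keypoint along the forearm using the FK tree.

    Following the parent-child dependency (elbow -> wrist -> hand), the
    hand point is placed at P = s * h + P_wrist, where h is the bone
    vector from the elbow to the wrist. The scale s is illustrative.
    """
    h = np.asarray(p_wrist, dtype=float) - np.asarray(p_elbow, dtype=float)
    return np.asarray(p_wrist, dtype=float) + s * h
```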

Outputs
From the hand detection subnetwork, the hand center point coordinate P_h can be obtained. From the body pose estimation subnetwork, the dual-hand joint coordinates and their handedness (P_l = P_17, P_r = P_18) can be obtained, where the left-hand point is extrapolated along the forearm as

P_17 = s · h_{7,9} + P_9

with h_{7,9} the bone vector between upper-limb joints 7 and 9, s a scaling factor, and P_18 obtained symmetrically. Then, a biological constraint loss L_b is introduced. The L_b represents the biological structure relationship between the dual hands and the body: it penalizes the deviation between the detected hand center points and the FK-estimated hand positions.

Therefore, the total loss function L is defined as

L = L_Ce + L_S + L_O + L_hm + L_b

Then, the distinction between the left and right hands is determined by the position relationship between the detected hand center point and the estimated dual-hand positions: the detected hand is labeled as the left hand if its center is closer to P_l than to P_r, and as the right hand otherwise.
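A minimal sketch of this distinction rule (nearest-neighbour labeling of each detected center against the FK-estimated hand positions; the paper's exact formula may differ):

```python
import numpy as np

def assign_handedness(p_h, p_left, p_right):
    """Label a detected hand center P_h as 'left' or 'right' by its
    distance to the FK-estimated left (P_l) and right (P_r) positions."""
    d_l = np.linalg.norm(np.asarray(p_h, dtype=float) - np.asarray(p_left, dtype=float))
    d_r = np.linalg.norm(np.asarray(p_h, dtype=float) - np.asarray(p_right, dtype=float))
    return "left" if d_l <= d_r else "right"
```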

3D Hand pose network
After the dual-hand detection, the located hand RGB image is input to the proposed 3DHandPoseNet to obtain the 3D hand pose. To improve the 3D hand pose estimation performance, a cascaded structure and a bone-constraint loss are introduced into the network. The structure of the network is shown in Fig. 4; it includes a feature extraction module (FEM), a cascade module (CaM), and a 3D pose regression module (PRM). The FEM is used to extract the 2D hand joint point features, the CaM is used to fine-tune the 2D hand joint point features, and the PRM is used to regress the 3D hand pose. The detailed introduction is given as follows.

Feature extraction module
The FEM uses 2 stacked hourglass modules [33] as its backbone to extract the 2D hand joint (2D hand pose) features. The 2D heat maps of the 21 hand joints are used as the output of the FEM, which not only improves the generalization ability but also reduces the learning capacity. The heat-map loss L_FE is utilized for training:

L_FE = Σ_{n=1}^{A} Σ_{(u,v)} || H_n^a(u, v) − Ĥ_n^a(u, v) ||²

where A is the number of hand joint points, and H_n^a and Ĥ_n^a are the estimated and ground-truth 2D heat maps of the nth hand joint point, respectively. The 2D heat-map resolution is 64 × 64 pixels, and (u, v) is the coordinate in the image coordinate system. After the feature extraction, the extracted feature maps are concatenated with the heat maps, and the concatenated feature maps are output to the CaM.
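The heat-map loss over all 21 joints can be written directly as a sum of per-pixel squared errors (a NumPy illustration of L_FE, not the authors' implementation):

```python
import numpy as np

def heatmap_loss(H, H_hat):
    """Sum of per-pixel squared errors over all A = 21 joint heat maps.

    H, H_hat: predicted and ground-truth heat maps of shape (21, 64, 64).
    """
    return float(((H - H_hat) ** 2).sum())
```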

Cascade module
By analyzing the structure of the hand, it can be found that the motions of joint points in different fingers and the palm are independent, while the motions of joint points in the same finger or the palm are dependent. Therefore, as shown in Fig. 4, the cascade structure is proposed to divide the feature extraction into multiple subnetworks: palm, thumb, index finger, middle finger, ring finger, and little finger subnetworks. By using this cascade structure, more appropriate weight parameters for the hand joint feature maps can be obtained. In each subnetwork, a residual block with a max-pooling layer is used to refine the feature maps. The cascade loss function L_cs (s = p, f1, f2, …, f5) is formulated as

L_cs = Σ_{n=1}^{S} Σ_{(u,v)} || H_n^s(u, v) − Ĥ_n^s(u, v) ||²

where S is the number of joint points of the palm (P) or one of the 5 fingers (F1 to F5). As shown in Fig. 4, P contains 6 joint points, and F1 to F5 contain 3 joint points each. H_n^s and Ĥ_n^s are the estimated and ground-truth heat maps, respectively, and (u, v) is the coordinate in the image coordinate system.
Therefore, the total loss function L_Ca in the CaM is formulated as

L_Ca = L_cp + Σ_{i=1}^{5} L_cfi

3D Pose regression module
The refined feature maps from the CaM are input to the PRM to regress the 3D hand joint points (3D hand pose). The PRM also uses the cascade structure to regress the finger and palm joint points in 6 subnetworks. As shown in Fig. 4, each subnetwork uses 2 fully connected layers with dropout. In addition, the feature maps from all subnetworks are concatenated and input to another fully connected layer with dropout to obtain the 3D pose of the entire hand. The regression loss function L_rs (s = a, p, f1, …, f5) is formulated as

L_rs = Σ_{n=1}^{S} || J_n^s − Ĵ_n^s ||²

where S is the number of joint points of the entire hand (A), the palm (P), or one of the 5 fingers (F1 to F5). As shown in Fig. 4, A contains 21 joint points, P contains 6 joint points, and F1 to F5 contain 3 joint points each. J_n^s and Ĵ_n^s are the estimated and ground-truth 3D hand joint coordinates. Therefore, the sum of these regression loss functions L_PR is formulated as

L_PR = L_ra + L_rp + Σ_{i=1}^{5} L_rfi

To improve the 3D hand pose estimation performance, bone-constraint losses are also introduced into the loss functions.
By analyzing the hand bone structure, it can be found that even when the errors of the hand joint points are small, the errors of the bone lengths and bone orientations can be large. Therefore, 2 bone-constraint losses are proposed for the training stage: a bone length loss and a bone orientation loss. The bone length loss function L_len imposes translation constraints on the joints to provide a more rigid and natural hand skeletal structure, and is formulated as

L_len = Σ_{(i,j)∈B} | ||b_{i,j}|| − ||b̂_{i,j}|| |

where B is the set of hand bones, b_{i,j} = J_i − J_j is the estimated bone vector between joints i and j, and b̂_{i,j} is the corresponding ground-truth bone vector. The bone orientation loss imposes rotational constraints on the joints so that the estimated hand pose looks undistorted. Its loss function L_dir is formulated as

L_dir = Σ_{(i,j)∈B} || b_{i,j} / ||b_{i,j}|| − b̂_{i,j} / ||b̂_{i,j}|| ||

As a result, the total loss function L_pose of the 3DHandPoseNet in the training stage is formulated as

L_pose = λ_hm (L_FE + L_Ca) + λ_pr L_PR + λ_len L_len + λ_dir L_dir

where λ_hm, λ_pr, λ_len, and λ_dir are factors for the trade-off between the above losses.
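The two bone-constraint losses translate almost directly into code (a NumPy sketch; the bone list below covers only the thumb chain for illustration, and the full 20-bone hand topology is analogous):

```python
import numpy as np

# (parent, child) joint-index pairs; only the thumb chain of the standard
# 21-joint hand model is listed here as an illustrative assumption.
BONES = [(0, 1), (1, 2), (2, 3), (3, 4)]

def bone_losses(J, J_hat, bones=BONES):
    """Bone-length (L_len) and bone-orientation (L_dir) constraint losses.

    J, J_hat: (21, 3) predicted and ground-truth 3D joint coordinates.
    """
    l_len = 0.0
    l_dir = 0.0
    for i, j in bones:
        b = J[j] - J[i]              # estimated bone vector b_{i,j}
        b_hat = J_hat[j] - J_hat[i]  # ground-truth bone vector
        l_len += abs(np.linalg.norm(b) - np.linalg.norm(b_hat))
        l_dir += np.linalg.norm(b / np.linalg.norm(b)
                                - b_hat / np.linalg.norm(b_hat))
    return l_len, l_dir
```

With identical predictions and ground truth both losses vanish; scaling all joints changes only the length term, matching the intent of the two constraints.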

Bionic bimanual robot teleoperation system
The structure of the bionic bimanual robot teleoperation system proposed in this work is shown in Fig. 1. It includes a pilot hand manipulation space, a dual-hand motion capture module, a mapping module, and a bionic bimanual robot movement space.

Pilot hand manipulation space
In the pilot hand manipulation space, a pilot operates with dual hands while observing the motion state of the bionic bimanual robot. An RGB-D camera is used to capture the RGB and depth frames of the pilot's dual hands.

Dual-hand motion capture module
Then, the RGB and depth dual-hand frames are input to the dual-hand motion capture module. The proposed DuHandLocNet and 3DHandPoseNet are applied to locate and distinguish the dual hands and estimate the 3D hand poses from the RGB frames. The depth frames are used to obtain the depth information for the 3D hand positions and to correct the depth of the 3D hand poses. In addition, the 3D hand orientations are also required for the bionic bimanual robot teleoperation; we follow the method in [34] to calculate them. As shown in Fig. 5, the palm normal vector Z is calculated as the cross product (×) of two vectors in the palm plane and normalized to a unit vector. The vector X is calculated as the connection vector from hand joint point 0 (P_0) to the mean position (P_M) of hand joint points 5, 9, and 13, i.e., X = P_M − P_0, and is also normalized to a unit vector. Finally, the third axis Y is calculated as the cross product of the normalized vectors Z and X.
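The hand frame construction can be sketched as follows (the exact pair of palm vectors used in [34] for Z is not given here, so the choice of 0→5 and 0→13 is an assumption):

```python
import numpy as np

def hand_axes(joints):
    """Build the hand coordinate frame from 21 3D joints.

    X: unit vector from the wrist (joint 0) to the mean of joints 5, 9, 13.
    Z: unit normal of the palm plane (cross product of the assumed palm
       vectors 0->5 and 0->13).
    Y: Z x X, completing a right-handed frame.
    """
    P = np.asarray(joints, dtype=float)
    p_m = P[[5, 9, 13]].mean(axis=0)         # palm mean point P_M
    x = p_m - P[0]
    x /= np.linalg.norm(x)
    z = np.cross(P[5] - P[0], P[13] - P[0])  # palm normal
    z /= np.linalg.norm(z)
    y = np.cross(z, x)
    return x, y, z
```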
After that, the hand orientation O_H is expressed as a quaternion (qω, qx, qy, qz) converted from the rotation matrix formed by the axes X, Y, and Z.

Mapping module
The mapping relationship between the pilot's dual hands and the bionic bimanual robot is shown in Fig. 6. Function f1 maps the pilot handedness to the bionic bimanual robot handedness. Function f2 maps the pilot hand position P_H = (x_h, y_h, z_h) to the bionic bimanual robot manipulator position P_R = (x_r, y_r, z_r) incrementally:

P_R^i = P_R^{i−1} + k_p (P_H^i − P_H^{i−1})

where (x_i, y_i, z_i) is the position in the ith frame, (x_{i−1}, y_{i−1}, z_{i−1}) is the position in the (i−1)th frame, and k_p is the position scaling factor. Function f3 maps the pilot hand orientation O_H = (qω_h, qx_h, qy_h, qz_h) to the bionic bimanual robot manipulator orientation O_R = (qω_r, qx_r, qy_r, qz_r) in the same incremental way, where (qω_i, qx_i, qy_i, qz_i) and (qω_{i−1}, qx_{i−1}, qy_{i−1}, qz_{i−1}) are the orientations in the ith and (i−1)th frames, respectively, and k_o is the orientation scaling factor. Function f4 maps the 3D hand pose Pose_H to the manipulator gesture G_R = (rθ_1, bθ_1, bθ_5, bθ_9, bθ_13, bθ_17), where rθ_1 is the rotational angle of the manipulator thumb, bθ_k are the bending angles of the 5 manipulator fingers, and k_g is the angle scaling factor.
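The incremental position mapping f2 can be sketched in a few lines (our reading of the frame-difference formulation; f3 applies the same pattern to the quaternion components with k_o):

```python
def map_position(p_robot_prev, p_hand_curr, p_hand_prev, k_p=1.0):
    """Incremental position mapping f2.

    The robot end-effector moves by the scaled frame-to-frame displacement
    of the pilot's hand; k_p is the position scaling factor.
    """
    return tuple(r + k_p * (c - p)
                 for r, c, p in zip(p_robot_prev, p_hand_curr, p_hand_prev))
```

A relative (incremental) mapping like this lets the pilot re-center the hands without moving the robot, which is a common design choice in visual teleoperation.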

Bionic bimanual robot movement space
The bionic bimanual robot movement space contains a bionic bimanual robot with two 5-finger dexterous manipulators, which is shown in Fig. 7. The Baxter robot is chosen as the bionic bimanual robot; it consists of 2 robotic arms with 7 degrees of freedom each. The Inspire dexterous hand is chosen as the 5-finger dexterous manipulator; it contains 6 degrees of freedom and 21 joints. The movement of the bionic bimanual robot is driven by the information output from the mapping module. In addition, the movement status of the bionic bimanual robot is fed back to the pilot space.

Results and Discussion
In this section, the experimental results and validation of the proposed DuHandLocNet and 3DHandPoseNet are provided in Validation of DuHandLocNet and Validation of 3DHandPoseNet. The experimental details of the dual-hand-based teleoperation for the bionic bimanual robot are given in Validation of bionic bimanual robot teleoperation.

Dataset description
For hand detection validation, the public hand datasets Egohands [35] and Oxford Hands [36] were chosen for training and validation. Egohands contains 4,784 hand RGB images of egocentric interactions with more than 15,000 pixel-level ground-truth hand annotations. Oxford Hands contains 5,628 daily-life images with 13,050 hand annotations drawn from public datasets such as Buffy Stickman and PASCAL VOC 2007; among them, 4,807 images were chosen for training and the remaining 821 for validation.
For dual-hand detection validation, a custom dual-hand detection dataset (DualHands) was used for training and validation. It contains 2,000 RGB images collected from 8 subjects, with 4,000 annotations of "left hand" and "right hand". These images contain human bodies or upper limbs with different sizes and hand gestures (American Sign Language gestures [37]). The training and validation splits each contain 1,000 images.

Implementation details
Our experiments were performed on a machine with an NVIDIA RTX 2080Ti GPU. The PyTorch platform [38] was used for all training and evaluation experiments. For the training parameters, the batch size was set to 32, the initial learning rate was set to 1.25e-4, the number of training epochs was set to 140, and the learning rate was reduced by a factor of 10 at epoch 90 and epoch 120. The accuracy with intersection over union (IoU) = [0.50:0.95] was used as the metric to evaluate the performance of hand detection.

Quantitative comparison for hand detection
The comparative experiments for hand detection were conducted on the Egohands [35] and Oxford Hands [36] datasets. The results of the DuHandLocNet were compared with those of some state-of-the-art methods, and the compared results are shown in Table 1. The accuracy with IoU = [0.50:0.95] is chosen as the metric.

It can be seen from Table 1 that the proposed DuHandLocNet achieves accuracies of 0.797 and 0.448 on the Egohands and Oxford Hands, respectively. These results are better than those of the other state-of-the-art methods, which proves that the DuHandLocNet achieves great hand detection performance. In addition, the hand detection subnetwork in the DuHandLocNet uses the CenterNet structure. Compared with the results of CenterNet (0.781 on the Egohands and 0.438 on the Oxford Hands), the DuHandLocNet improves the accuracies by 0.016 on the Egohands and 0.010 on the Oxford Hands. This proves that the proposed parallel network with biological inspiration effectively improves the accuracy of hand detection.

Quantitative comparison for dual-hand detection
The comparative experiments for dual-hand detection were conducted on the custom DualHands dataset. The results of the DuHandLocNet were compared with those of some state-of-the-art methods, and the compared results are shown in Table 2. The mean average precision with IoU = 0.5 is chosen as the metric. It can be seen from Table 2 that the proposed DuHandLocNet achieves accuracies of 0.9843, 0.9786, and 0.9863 for left-hand, right-hand, and overall hand detection, respectively. These results outperform those of [20,22], which proves that the DuHandLocNet can effectively distinguish between the left and right hands by introducing the biological constraint of the structural relationship between the dual hands and the body.

To better demonstrate the performance of the DuHandLocNet for dual-hand detection and distinction, some results on DualHands are visualized in Fig. 8. It can be seen from Fig. 8 that the DuHandLocNet performs excellently in detecting and distinguishing the left and right hands with different sizes and gestures.

Implementation details
For the training parameters, the batch size was set to 8, the initial learning rate was set to 1e-3, the number of training epochs was set to 100, and the learning rate was reduced by a factor of 10 at epoch 30 and epoch 60. The factors of the bone-constraint loss function λ_hm, λ_pr, λ_len, and λ_dir were set to 0.1, 1, 0.001, and 0.1, respectively. The percentage of correct 3D keypoints (PCK) and the area under the PCK curve (AUC) with thresholds ranging from 20 to 50 mm were used as the metrics to evaluate the performance of 3D hand pose estimation.

Quantitative comparison for 3D hand pose estimation
The comparison experiments for 3D hand pose estimation were evaluated on the RHD, DO, and ED datasets. The results of the proposed 3DHandPoseNet were compared with those of some state-of-the-art methods, as shown in Table 3. In Table 3, the AUC with thresholds ranging from 20 to 50 mm is chosen as the metric. "*" denotes that the model was trained on this dataset, and "-" denotes that no results were reported on this dataset. In addition, it should be noted that not all the compared methods were trained on the same datasets.

It can be seen from Table 3 that the proposed 3DHandPoseNet achieves AUCs of 0.952, 0.927, and 0.934 on the DO, ED, and RHD datasets, respectively. These results outperform those of the state-of-the-art methods, which proves that the 3DHandPoseNet can effectively estimate the 3D hand pose.

Ablation study for 3DHandPoseNet
The ablation experiments for 3D hand pose estimation were evaluated on the FreiHAND dataset. Four models, (a) to (d), were evaluated for comparison; their configurations are described below. It can be seen from Fig. 9 that the 3D PCK of the second model (0.929), the third model (0.938), and the 3DHandPoseNet (0.955) were improved by 0.036, 0.045, and 0.062, respectively, compared with that of the baseline (0.893). These results prove that the proposed cascade structure and bone-constraint loss functions can effectively improve the 3D hand pose estimation.
To better demonstrate the effectiveness of the proposed 3DHandPoseNet, some experiment results are visualized in Fig. 10.

Validation of bionic bimanual robot teleoperation
To verify the performance of the proposed bionic bimanual robot teleoperation, 2 types of physical tasks, box carrying and cup insertion, were conducted to test the bimanual teleoperation. For the box carrying task, the bimanual robot carried a box on the table from one place to another using its dual arms; this task poses a great challenge to the coordination between the 2 arms. For the cup insertion task, a smaller cup in the left hand was inserted into a larger cup in the right hand, which challenges the operation accuracy. Before this experiment, 2 pilots went through a warm-up training phase. Then, each pilot conducted 3 consecutive test trials for each task, as shown in Fig. 11. Neither task achieved a 100% success rate because the object was dropped in one of the experiments. The time spent on the box carrying task was within 100 s, while the cup insertion task was completed within 90 s. The teleoperation tracking error is about 9.3 mm, and the delay time is about 0.68 s. These results show that the proposed dual-hand-based bionic bimanual robot teleoperation method has good performance in the dexterity, accuracy, and coordination of operation.

Conclusion
In this paper, to deal with bionic bimanual robot teleoperation, a visual dual-hand motion capture method based on biological inspiration and a dual-hand-based bionic bimanual robot teleoperation method were proposed. The contributions and innovations are summarized as follows. (a) A DuHandLocNet was proposed to deal with dual-hand detection and localization, which introduced the biological constraint of the structural relationship between the dual hands and the body by using a parallel network. (b) A 3DHandPoseNet was proposed to deal with 3D hand pose estimation from single RGB images, which was based on the hand biological inspiration by using a cascaded structure and a bone-constraint loss. (c) A visual dual-hand-based teleoperation system for bionic bimanual robots was designed by using the proposed DuHandLocNet and 3DHandPoseNet. Experiments on public hand datasets demonstrated the effectiveness and superiority of the proposed DuHandLocNet and 3DHandPoseNet. Experiments on a bionic bimanual robot platform demonstrated that the proposed teleoperation method can realize free and convenient teleoperation.
In the future, (a) more robust and accurate dual-hand motion capture methods should be researched; video-based hand detection and hand pose estimation will be used to improve the accuracy and robustness of the hand motion capture. (b) Mixed reality and shared control methods will be applied to the bionic bimanual robot teleoperation to increase teleoperation immersion and efficiency.

Fig. 3 .
Fig. 3. Dependency graph. The left image is the body keypoint topology, which includes 10 keypoints. The middle image is the body upper-limb model, where the keypoints of the left hand (keypoint 17) and right hand (keypoint 18) are 2 added keypoints. The right image is the dependency graph obtained from the body upper-limb model.

Fig. 8 .
Fig. 8. The results of DuHandLocNet. From left to right: original images of pilots with different hand sizes and gestures, hand detection results from the hand detection subnetwork, body pose estimation results from the body pose estimation subnetwork and body FK tree, and dual-hand detection and distinction results from the DuHandLocNet.

Fig. 10 .
Fig. 10. The visualization of 3DHandPoseNet. From left to right: original images, output 2D hand poses, output 3D hand poses from the camera viewpoint, and output 3D hand poses from another viewpoint.
(a) The backbone of the first model used 2 stacked hourglass modules, followed by a residual block with max pooling to refine the features and 2 fully connected layers with dropout to regress the 3D hand pose. This model was set as the baseline. (b) In the second model, the proposed cascade structure was added to the baseline following Fig. 4. (c) In the third model, the proposed bone-constraint loss functions were added to the baseline. (d) In the fourth model, which is the 3DHandPoseNet, both the cascade structure and the bone-constraint loss functions were added. The 3D PCK is used as the metric. The results are shown in Fig. 9.

Table 1 .
Compared results for hand detection on the Egohands and Oxford Hands.

Table 2 .
Compared results for dual-hand detection on the DualHands.

Table 3 .
Compared experiment results for 3D hand pose estimation.