A Markerless 3D Tracking Framework for Continuum Surgical Tools Using a Surgical Tool Partial Pose Estimation Network Based on Domain Randomization

3D tracking of single-port continuum surgical tools is an essential step toward their closed-loop control in robot-assisted laparoscopy, since single-port tools possess multiple degrees-of-freedom (DoFs) without distal joint sensors and hence have lower motion precision compared to rigid straight-stemmed tools used in multi-port robotic laparoscopy. This work proposes a novel markerless 3D tracking framework for continuum surgical tools using a proposed surgical tool partial pose estimation network (STPPE-Net) based on U-Net and ResNet. The STPPE-Net estimates the segmentation and a 5-DoF pose of the tool end-effector. This network is entirely trained by a synthetic data generator based on domain randomization (DR) and requires zero manual annotation. The 5-DoF pose estimation from the STPPE-Net is combined with the surgical tool axial rotation from the robot control system. Then, the entire pose is further refined via a region-based optimization that maximizes the overlap between the tool end-effector segmentation from the STPPE-Net and its projection onto the image plane of the endoscopic camera. The segmentation accuracy and 6-DoF pose estimation precision of the proposed framework are validated on images captured from an endoscopic single-port system. The experimental results show the effectiveness and robustness of the proposed tracking framework for continuum surgical tools.


Introduction

Motivation
Robot-assisted minimally invasive surgery (RAMIS) is gradually becoming a standard surgical option owing to improved postoperative outcomes and shorter recovery times. [1] With the potential to further reduce trauma and improve recovery, robot-assisted single-port laparoscopy (SPL) has started to gain more interest.
Among various design approaches for robot-assisted SPL, the continuum surgical tool designs have consistently drawn attention on account of their design compactness, proximal actuation scheme, structural compliance, and distal dexterity. [2] An example of an SPL system based on continuum surgical tools is shown in Figure 1A. [3] A continuum surgical tool in the SPL setting usually possesses six degrees-of-freedom (DoFs), which include four bending DoFs from the two continuum segments with two bending DoFs each, a tool axis translation DoF, and a tool axis rotation DoF, as illustrated in Figure 2.
Although several modeling methods have been proposed, [4] the motion precision of a multi-joint 2-segment continuum surgical tool for SPL is not comparable to that of a rigid straight-stemmed tool for multi-port robotic laparoscopy, because SPL surgical tools do not possess distal joint sensors for closed-loop control. A key step toward closed-loop control is therefore to realize real-time 3D spatial pose tracking of the 2-segment continuum surgical tools for SPL, ideally by fully utilizing the robotic system's existing capability (the integrated stereo camera) instead of integrating additional hardware.
Integrating external sensors for 3D spatial pose tracking is possible. For example, thin fiber Bragg grating (FBG) sensors [5] can be used for shape sensing and hence pose calculation, and electromagnetic (EM) sensors [6] can also be used. However, both FBG sensing and EM tracking require additional hardware, which increases system complexity and cost. Moreover, EM tracking suffers from magnetic interference.
Due to the inconvenience of adopting external sensors, image-based approaches are becoming an increasingly popular solution to the tracking of medical devices. Different medical imaging modalities have been employed in these tracking methods. For instance, Wang et al. implemented a 2D ultrasound imaging system for tracking an anchoring robot. [7] The ultrasound Doppler imaging technique was used for the tracking and navigation of a magnetic microswarm. [8] Kim et al. also proposed to control a telerobotic neurointerventional platform under real-time fluoroscopic imaging. [9] As shown in Figure 1A, the SPL system involved in this work already contains a continuum endoscope. It is desired to fully utilize the existing imaging capability of the endoscopic camera. Under endoscopic imaging, using fiducial markers for surgical tool pose estimation has been realized. [10] However, installing markers hinders surgical procedures and adds sterilization issues. Therefore, this work focuses on developing an endoscopic-image-based markerless solution to the tracking of surgical tools.

Related Works
The 3D spatial pose estimation of surgical tools based on endoscopic images is the key step to enable markerless tracking. Existing endoscopic-image-based surgical tool tracking methods can be generally grouped into two categories: 1) maximizing the alignment between the known tool model and the tool information extracted from the image, and 2) using deep-learning methods to realize the tracking.
For the first category, an effective method is to extract 2D features from the images for 3D matching. Ye et al. generated online tool part templates based on their CAD models and robot kinematics for 2D template matching and used an efficient perspective-n-points algorithm for 3D pose estimation. [11] However, this method relies on truthful tool kinematics, while continuum tool kinematics is less accurate. Another method to realize 3D pose estimation is to use the segmented region in the image. Prisacariu et al. proposed a region-based method to solve the 6-DoF pose estimation by aligning the projection of a 3D model prior with the 2D segmented region. [12] This method cannot be directly applied to the tracking of surgical tools, since the projection of a slender surgical tool is found to be insensitive to its axis rotation. Allan et al. extended the region-based method to 3D pose estimation of surgical tools by involving random forests (RFs) and the optical flow. [13] However, the determination of keypoints for optical flow may fail due to possible motion blur or blood stains. Besides, the involved RF-based segmentation needs extra manual initialization.
As for the deep-learning-based methods, real-world training datasets with tool 3D pose ground truth are hard to obtain. Many tracking investigations for surgical tools are therefore limited to the 2D image plane. [14] In order to enable 3D tracking, Hasan et al. trained a convolutional neural network (CNN) with manually annotated geometric primitives (edge lines, mid-line, and tip point) of the surgical tool region, and estimated the 3D pose of surgical tools based on the geometric primitive outputs. [15] However, this method not only requires the tool shaft to be cylindrical, but also needs a sufficient amount of manually annotated training data. Toward mitigating the need for such a considerable number of manual annotations, Sestini et al. proposed a self-supervised way to directly estimate the joint values of two continuum surgical tools from an endoscopic image with an auto-encoder structure bottlenecked by the physical model of the tools and the endoscopic camera. [16] However, the tracking precision of this method was not validated on an actual multisegment tool. Yoshimura et al. created an automatically annotated training dataset of synthetic images with rendered shafts, and trained an improved single-shot detection (SSD) network for 5-DoF pose estimation. [17] However, this method achieved an intersection over union (IoU) of only 0.197 on real-world endoscopic images when only synthetic images were used for training. To reduce the domain gap, a semi-synthetic dataset was proposed, [18] which is composed of collected real-world tool foreground images and tool-free background images. However, it is still laborious to collect enough tool foreground images, and it is difficult to obtain the 3D pose ground truth of the foreground tools for training.
To improve the generalization of synthetic datasets to real-world images, the domain randomization (DR) method encourages focusing on the unaltered features of the objects rather than their appearance by generating synthetic data with sufficient variations. [19] Instead of maximizing the realism of the synthetic training data, DR aims at motivating the trained network to treat the real-world data as just another variation of the synthetic data. This technique has been tested in simple object position estimation [19] and car detection, [20] showing compelling performance. Moreover, the trained network can achieve even better performance after fine-tuning with real data than when trained with real data only. [20] Therefore, this work explores the possibility of applying DR to surgical scenarios, since the surface appearance of the surgical tools may vary because of blood or illumination, but the 3D shapes of the tools do not change.

Contributions
Referring to Figure 1, this article proposes a novel vision-based markerless 3D tracking framework to achieve robust 6-DoF tool pose estimation. The contributions of this article primarily lie in the proposal of the surgical tool partial pose estimation network (STPPE-Net) that simultaneously generates the tool end-effector segmentation and a 5-DoF pose estimation from a single red-green-blue (RGB) endoscopic image. A synthetic data generator S based on DR is proposed to automatically generate synthetic data to train the STPPE-Net. Compared to previous works, [13,15] the proposed framework requires zero manual annotation. What's more, the experimental results show that the proposed framework performs more robustly when facing tool motion blur, blood stain, and illumination variation. After combining the 5-DoF pose estimation of the surgical tool from the STPPE-Net with the tool axis rotation from the control system to form an entire 6-DoF pose estimation using a kinematics module K, a region-based optimization module ℛ is presented to refine the 6-DoF pose estimation by taking the kinematics and the structural constraints of a 2-segment continuum surgical tool into consideration, maximizing the overlap between the projection of the tool end-effector on the image plane and the segmentation from the STPPE-Net.
The surgical tool end-effector segmentation accuracy and 3D spatial pose estimation precision of the proposed framework are validated on real-world endoscopic images.The qualitative and quantitative experimental results verify the effectiveness and robustness of the proposed 3D tracking framework.
The remainder of this article is organized as follows. Section 2 introduces the nomenclatures and kinematics of the involved 2-segment continuum surgical tools. The proposed 3D tracking framework is introduced in Section 3. The experimental implementation details and results are presented in Section 4. Finally, a conclusion is summarized in Section 5.

Nomenclatures and Kinematics
The involved 2-segment continuum surgical tool, as Figure 2 shows, is composed of two continuum segments, a straight base stem, a straight rigid stem, and an end-effector. The involved nomenclatures in this work are shown in Table 1, and the frames are defined as follows. 1) The Base Ring Frame {ib} = {x_ib, y_ib, z_ib} is attached to the center of the base cross section of the i-th segment with z_ib perpendicular to the base cross section. 2) The End Ring Frame {ie} = {x_ie, y_ie, z_ie} is attached to the center of the end cross section of the i-th segment with z_ie perpendicular to the end cross section and x_ie pointing to the same segment surface region as x_ib. 3) The Base Frame {b} = {x_b, y_b, z_b} is attached to the trocar with z_b aligned with the tool base stem's axis. {1b} is obtained from {b} via a rotation angle φ about z_b. 4) The End-Effector Frame {e} = {x_e, y_e, z_e} is attached to the end-effector base. 5) The Endoscope Base Frame {cb} = {x_cb, y_cb, z_cb} is attached to the trocar with z_cb aligned with the endoscope base stem's axis. 6) The Endoscopic Camera Frame {c} = {x_c, y_c, z_c} is attached to the endoscopic camera.
In this work, the 2-segment continuum surgical tools are deployed in the 3rd configuration. [21] A 6-dimensional configuration vector ψ for the involved 2-segment continuum tool can be expressed as ψ = [L_1 φ θ_1 δ_1 θ_2 δ_2]^T, where L_1 represents the tool axis translation DoF and φ the tool axis rotation DoF, while θ_i and δ_i represent the two bending DoFs of the i-th segment. Based on the constant curvature assumption, the forward kinematics of the involved continuum surgical tool can be derived. [21]
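As a reference for how the bending DoFs map to a segment pose, a minimal sketch of single-segment constant-curvature kinematics is given below (Python/NumPy). It uses one common parameterization in which the bending-plane rotation δ is applied about the base z-axis; the exact sign and frame conventions of the tool in this work follow [21] and may differ from this sketch.

```python
import numpy as np

def segment_forward_kinematics(theta, delta, L):
    """Homogeneous transform from a segment's Base Ring Frame {ib} to its
    End Ring Frame {ie} under the constant curvature assumption.

    One common parameterization (bending angle theta, bending-plane angle
    delta, segment length L); signs/frames of the actual tool follow [21].
    """
    if abs(theta) < 1e-9:                      # straight segment: pure translation along z
        p = np.array([0.0, 0.0, L])
    else:
        r = L / theta                          # radius of the circular arc
        p = np.array([np.cos(delta) * r * (1 - np.cos(theta)),
                      np.sin(delta) * r * (1 - np.cos(theta)),
                      r * np.sin(theta)])
    cd, sd = np.cos(delta), np.sin(delta)
    ct, st = np.cos(theta), np.sin(theta)
    Rz_d  = np.array([[cd, -sd, 0], [sd, cd, 0], [0, 0, 1]])    # Rz(delta)
    Ry_t  = np.array([[ct, 0, st], [0, 1, 0], [-st, 0, ct]])    # Ry(theta)
    Rz_dn = np.array([[cd, sd, 0], [-sd, cd, 0], [0, 0, 1]])    # Rz(-delta)
    R = Rz_d @ Ry_t @ Rz_dn
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, p
    return T
```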

Methods
During RAMIS procedures, only part of the continuum tool can be seen in the endoscopic images. Therefore, this work focuses on tracking the end-effector.

Framework Overview
The proposed framework with its four main modules is shown in Figure 1. Module S automatically generates synthetic tool foreground images composited onto real-world backgrounds based on DR, and it simultaneously provides segmentation and 6-DoF pose annotations along with the generated images.
Trained entirely on the synthetic data from S, the STPPE-Net takes in RGB endoscopic images and outputs, in real time, the 5-DoF pose estimation of the tool end-effector with respect to frame {c} together with its segmentation, as illustrated in Figure 1B-D. As Figure 1E shows, module K solves for a 6-DoF pose estimation based on the 5-DoF pose estimation from the STPPE-Net and the tool axis rotation φ from the control system. Module ℛ is then employed to produce a refined 6-DoF pose estimation, as Figure 1F shows, by maximizing the overlap between the tool end-effector segmentation from the STPPE-Net and its projection onto the image plane of the endoscopic camera, using the end-effector opening angle α from the control system.
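The per-frame data flow can be summarized by the sketch below; the callables and their interfaces are placeholders for the STPPE-Net, module K, and module ℛ, not the actual implementation.

```python
def track_frame(image, phi, alpha, prev_config, stppe_net, kinematics_module, region_refine):
    """One tracking step mirroring Figure 1B-F.

    The three callables are placeholders for the STPPE-Net, module K, and
    module R described in the text; their exact interfaces are assumptions.
    """
    mask, p_c, z_c = stppe_net(image)                                # segmentation + 5-DoF pose in {c}
    T_c_e, config = kinematics_module(p_c, z_c, phi, prev_config)    # add the axial rotation phi -> 6-DoF
    T_c_e, config = region_refine(T_c_e, mask, alpha, config)        # region-based refinement
    return mask, T_c_e, config
```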

STPPE-Net
The illustration of the STPPE-Net is shown in Figure 3A.
This network takes in an RGB endoscopic image containing continuum surgical tools, and outputs an estimated segmentation mask m̂ for all tool end-effectors, as well as an estimated translation vector ^c p̂_e and a z-axis orientation ^c ẑ_e for each tool end-effector with respect to frame {c}.

Mask Generator
U-Net, [22] which possesses an encoder-decoder architecture and has been proven to be one of the leading approaches in medical semantic segmentation, is chosen to model ℳ and generate the estimated segmentation mask m̂ from x. If there exist multiple tools, ℳ outputs a single mask containing all tool end-effector masks.

Partial Pose Estimator
The detailed structure of the partial pose estimator is shown in Figure 3B.
First, a subnetwork P is composed of a feature extractor and a position regressor. Inspired by PoseNet, [23] which modifies its last layer for feature generalization, the last layer of a ResNet34 [24] pre-trained on ImageNet [25] is replaced by a fully connected layer of size 2048 followed by a ReLU layer to form a localized feature vector. Another fully connected layer with a 3-dimensional output is added as the final regressor. To ensure that the estimated position lies within the camera view and to accelerate the learning of P, this work adds a sigmoid layer after the last fully connected layer to output a normalized vector [μ_x μ_y μ_z]^T instead of directly outputting ^c p̂_e = [^c x̂_e ^c ŷ_e ^c ẑ_e]^T. The position encoding between [μ_x μ_y μ_z]^T and ^c p̂_e maps the normalized vector to a 3D position inside the camera frustum, where d_max and d_min represent the maximum and minimum depth of the end-effector base in frame {c}, γ is the diagonal field of view (FoV) of the endoscopic camera, and w and h are the width and height of the images. The encoding ensures that the estimated position is in the endoscopic camera view.
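Since the exact encoding equation is not reproduced here, the sketch below shows one plausible decoding from [μ_x μ_y μ_z]^T to ^c p̂_e that satisfies the stated constraints (depth bounded by d_min and d_max, lateral position bounded by the diagonal FoV γ); the mapping actually used in the paper may differ.

```python
import numpy as np

def decode_position(mu, d_min=0.02, d_max=0.10, gamma=np.deg2rad(90), w=480, h=270):
    """Map the network output mu = [mu_x, mu_y, mu_z] in [0, 1]^3 to a 3D
    position inside the camera frustum.

    Only a plausible decoding consistent with the constraints in the text;
    it is not claimed to be the paper's exact encoding.
    """
    mu_x, mu_y, mu_z = mu
    z = d_min + mu_z * (d_max - d_min)             # depth bounded by [d_min, d_max]
    r_max = z * np.tan(gamma / 2.0)                # half-extent of the image diagonal at depth z
    diag = np.hypot(w, h)
    x = (2.0 * mu_x - 1.0) * r_max * (w / diag)    # half-width of the view at depth z
    y = (2.0 * mu_y - 1.0) * r_max * (h / diag)    # half-height of the view at depth z
    return np.array([x, y, z])
```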
In contrast, a different subnetwork Z estimates the orientation of the tool end-effector. It was found difficult for the STPPE-Net to learn an accurate 3-DoF rotation of the end-effector in actual images, due to the network's insensitivity to the tool end-effector's axis rotation. Therefore, this work focuses on learning only the z-axis orientation ^c z_e of the end-effector w.r.t. frame {c}. The structure of Z is identical to the structure of P. However, the final 3-dimensional output is normalized to form an estimated z-axis unit vector ^c ẑ_e = [ẑ_ex ẑ_ey ẑ_ez]^T.

Training
The three parts ℳ, P, and Z of the STPPE-Net are trained separately. For a training data group {x, m, ^c T_e}, {x, m} is used to train ℳ, {x ⊙ m, ^c p_e} is used to train P, and {x ⊙ m, ^c z_e} is used to train Z, where ⊙ denotes the Hadamard matrix product, and ^c p_e and ^c z_e can be extracted from ^c T_e. To train ℳ, a common cross-entropy loss function is used; with a batch size of 16, ℳ is trained for five epochs. To train P and Z, this work employs the Euclidean loss; with a batch size of 64, P and Z are trained for 10 epochs each.
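A minimal PyTorch-style sketch of how the three parts could be supervised is shown below; ℳ, P, and Z are assumed to be standard nn.Module objects, and mse_loss stands in for the Euclidean loss.

```python
import torch
import torch.nn.functional as F

def training_losses(M, P, Z, x, m_gt, p_gt, z_gt):
    """One-step losses for the three STPPE-Net parts (illustrative sketch).

    M, P, Z are assumed to be torch.nn.Module instances for the mask
    generator, position estimator, and z-axis estimator; x is a batch of
    synthetic images, m_gt / p_gt / z_gt the generated annotations.
    """
    # Mask generator: per-pixel cross entropy on the predicted logits.
    loss_M = F.cross_entropy(M(x), m_gt)

    # P and Z see only the masked end-effector region (Hadamard product).
    x_masked = x * m_gt.unsqueeze(1).float()
    loss_P = F.mse_loss(P(x_masked), p_gt)   # squared-Euclidean stand-in for the Euclidean loss
    loss_Z = F.mse_loss(Z(x_masked), z_gt)
    return loss_M, loss_P, loss_Z
```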

Synthetic Data Generator
Normally, producing a manual, pixel-accurate segmentation m is tedious and time-consuming. Measuring the actual ^c T_e accurately is even more complicated, as it needs extra tracking hardware implemented on the surgical tool for the ground truth. In this work, a novel synthetic data generator based on DR is proposed to automatically provide synthetic training data groups {x, m, ^c T_e}.
The synthetic image generation is based on the Mitsuba renderer. [26] This work employs the CAD models of the tool end-effectors, sets the camera model in the renderer to match the calibrated endoscopic camera, and collects real-world tool-free endoscopic scenes as background images.
To realize DR, this work applies variations in 1) the textures, randomly chosen from 1200 selected images in the Describable Textures Dataset [27] and attached separately to different parts of the tool, 2) the background image, randomly chosen from 100 real-world endoscopic images, 3) the illuminance, 4) the 3D pose ^c T_e of the end-effector, and 5) the opening angle α of the end-effector. By introducing DR, the network is expected to concentrate on the 3D shape rather than the color of the end-effector, making it robust to changes in the appearance of the end-effector.
Module S is shown in Figure 4. For every generation process, a group of random {^c T_e, θ_1, δ_1, θ_2, δ_2, α} is given to the generator.
First, ^c T_e goes through a translation feasibility check and a rotation feasibility check, according to the inverse kinematics of the continuum surgical tool. Then, a 3D model of the 2-segment continuum surgical tool corresponding to {θ_1, δ_1, θ_2, δ_2, α} is generated with a continuum tool body and the CAD model of the end-effector.
With a feasible ^c T_e, the 3D model of the 2-segment continuum surgical tool, and a randomly set illuminance, the renderer provides a synthetic surgical tool foreground image x_f^1, the corresponding segmentation mask m of the end-effector, and a mask m_f of the whole foreground tool. After that, randomly chosen textures are attached to each part of the synthetic continuum surgical tool separately to form n_f different domain-randomized foreground images x_f^2. To help the network cope with real-world conditions, x_f^2 is then motion blurred into x_f^3 with a random level and moving direction. Finally, the semi-synthetic images x are generated by a composing procedure C, which blends the blurred foreground into a background image according to the foreground mask m_f, where the background image x_b is randomly chosen from the real-world endoscopic image set built from 100 collected ex-vivo phantom images and in-vivo scenes without surgical tools from actual RAMIS videos. In total, with n_0 groups of feasible {^c T_e, θ_1, δ_1, θ_2, δ_2, α}, n_f different textured foreground images x_f^2, and n_b background images x_b, the synthetic dataset has a size of n_0 × n_f × n_b. Because image composition is faster than surgical tool rendering in this work, n_b is set to be larger than n_f.
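The composition step C is, in essence, a mask-based blend of the blurred foreground into a background image. The following sketch illustrates the idea; it is not a reproduction of the paper's Equation (3).

```python
import numpy as np

def compose(x_f3, m_f, x_b):
    """Semi-synthetic image composition C (a sketch of the idea).

    x_f3 : rendered, textured, motion-blurred foreground image (H, W, 3)
    m_f  : foreground tool mask in [0, 1] (H, W)
    x_b  : real-world tool-free background image (H, W, 3)
    """
    alpha = m_f[..., None].astype(np.float32)     # broadcast mask over color channels
    return alpha * x_f3 + (1.0 - alpha) * x_b     # foreground where the tool is, background elsewhere
```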

Kinematics Module
As mentioned in Section 3.2, the STPPE-Net only estimates a 5-DoF pose of the continuum surgical tool. To estimate all 6 DoFs of the tool, this work employs the kinematics module K.
The end-effector pose w.r.t. frame {c} can be obtained from its pose w.r.t. frame {b} through the transformation chain

^c T_e = (^cb T_c)^{-1} ^cb T_b ^b T_e    (4)

where ^cb T_b is fixed by the structure of the trocar channels, and ^cb T_c, which is assumed to be accurate in this work, can be calculated from the forward kinematics of the relatively short and load-free continuum endoscope. Simplified from Equation (4), with ^b T_c = (^cb T_b)^{-1} ^cb T_c written as rotation ^b R_c and translation ^b p_c, the estimated ^b p̂_e and ^b ẑ_e can be written as

^b p̂_e = ^b R_c ^c p̂_e + ^b p_c,   ^b ẑ_e = ^b R_c ^c ẑ_e    (5)

where ^c p̂_e and ^c ẑ_e are the outputs from the STPPE-Net. Then, a kinematic component φ from the robot system is used to solve for a complete 6-DoF pose estimation ^b T̂_e from ^b p̂_e and ^b ẑ_e. Note that φ is assumed to be accurate in this work because it is directly controlled by a motor. For each frame, the instant φ is obtained from the robot control system and treated as an instantaneous constant. Therefore, the configuration vector ψ of the 2-segment continuum tool in Section 2 undergoes a dimensionality reduction to

ψ′ = [L_1 θ_1 δ_1 θ_2 δ_2]^T    (6)

and the instantaneous kinematics can be expressed as

ρ̇ = [v^T ω^T]^T = J ψ̇′,   ψ̇′ = J⁺ [v^T ω^T]^T    (7)

where ρ is the pose vector of the end-effector base w.r.t. frame {b}, v the translational velocity, ω the rotational velocity, and J the 6 × 5 dimension-reduced Jacobian matrix. J⁺ is the Moore-Penrose pseudoinverse of J with a singularity-robust implementation.
Therefore, an estimated dimension-reduced configuration vector ψ̂′ corresponding to ^b p̂_e and ^b ẑ_e can be calculated using the dimension-reduced resolved-rate inverse kinematics for the 2-segment continuum surgical tool [28] based on Equations (6) and (7). For the first frame, the initial configuration vector of the resolved-rate algorithm is set to ψ′ = 0_{5×1}; for follow-up frames, the initial ψ′ is set to the output ψ̂′ of the last frame from K.
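A minimal sketch of such a resolved-rate iteration is given below, assuming hypothetical fk and jacobian callables for the dimension-reduced forward kinematics and Jacobian; the error construction and singularity-robust scheme of [28] may differ.

```python
import numpy as np

def resolved_rate_ik(p_target, z_target, psi_init, fk, jacobian,
                     n_iter=100, gain=1.0, damping=1e-4):
    """Resolved-rate inverse kinematics toward a 5-DoF target (sketch).

    fk(psi) is assumed to return the end-effector position p and z-axis z in
    frame {b}; jacobian(psi) the 6x5 dimension-reduced Jacobian J.
    """
    psi = psi_init.copy()
    for _ in range(n_iter):
        p, z = fk(psi)
        e_p = p_target - p                          # translational error
        e_w = np.cross(z, z_target)                 # rotational error aligning the z-axes
        twist = gain * np.concatenate([e_p, e_w])   # desired instantaneous velocity
        J = jacobian(psi)                           # 6x5
        # damped (singularity-robust) pseudoinverse: J^T (J J^T + lambda I)^-1
        J_pinv = J.T @ np.linalg.inv(J @ J.T + damping * np.eye(6))
        psi = psi + J_pinv @ twist
    return psi
```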
Finally, K outputs the estimated configuration vector ψ̂ (combining ψ̂′ and φ) and the corresponding 6-DoF pose estimation ^c T̂_e of the end-effector w.r.t. frame {c} according to the robot forward kinematics and Equation (4).

Region-Based Optimization Module
Following the previous work, [13] this work presents a region-based optimization module ℛ for further refining the 6-DoF pose estimation.
With the output ^c T̂_e from module K as the initial pose, an energy function E [29] as in Equation (8) is employed to measure the overlap between the projection of the end-effector CAD model (based on ^c T̂_e and the end-effector opening angle α obtained from the robot control system) and the estimated segmentation mask m̂ from the STPPE-Net in the image plane.
where H stands for the Heaviside function, [30] Ω is the entire image region, and (u, v) is a pixel in Ω. Φ is the signed distance function (SDF) giving the Euclidean distance from pixels to the projected contour:

Φ(u, v) = −min_{(u,v)_c} |(u, v) − (u, v)_c|, ∀(u, v) ∈ Ω_f
Φ(u, v) =  min_{(u,v)_c} |(u, v) − (u, v)_c|, ∀(u, v) ∈ Ω_b    (9)

where Ω_f is the projected end-effector region as illustrated in Figure 5A, Ω_b the background region, and (u, v)_c the closest pixel coordinate on the contour of region Ω_f to (u, v). Note that Ω_f + Ω_b = Ω. P_f and P_b are the probabilities of (u, v) belonging to the end-effector segmentation region or the background region, respectively, and are defined in Equation (10) with respect to the segmented end-effector region Ω_e, which is designed as a filled area as illustrated in Figure 5B.
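Since Equation (8) is not reproduced here, the sketch below evaluates a region-based energy of the common form used in [29] (a smoothed Heaviside blending the foreground and background probabilities); the exact formulation in the paper may differ.

```python
import numpy as np

def region_energy(sdf, p_f, p_b, sigma=1.0):
    """Region-based energy measuring foreground/background agreement (sketch).

    sdf : signed distance field Phi over the image (negative inside Omega_f)
    p_f : per-pixel probability of belonging to the segmented end-effector
    p_b : per-pixel probability of belonging to the background
    """
    H = 0.5 * (1.0 - (2.0 / np.pi) * np.arctan(sdf / sigma))   # smoothed Heaviside, ~1 inside Omega_f
    per_pixel = H * p_f + (1.0 - H) * p_b
    return -np.sum(np.log(np.clip(per_pixel, 1e-12, None)))    # lower energy = better overlap
```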
To facilitate the minimization of E, the differentiation of E w.r.t. a quaternion q = [q_w q_x q_y q_z]^T and a translation t = [t_x t_y t_z]^T is first calculated, as given in Equation (11). Here, q and t describe the pose of the 3D end-effector model w.r.t. frame {c}.
In Equation (11), β_j is the j-th element of β = [t_x t_y t_z q_w q_x q_y q_z]^T, and ζ is the chain-rule term given in Equation (12), in which h denotes the derivative of H and only preserves the contour of Ω_f. The differentials ∂Φ/∂u and ∂Φ/∂v are calculated using centered finite differences. Then, for each pixel (u, v) on the contour of Ω_f, there exists one corresponding point (x, y, z) on the 3D end-effector model w.r.t. frame {c}. Thus, Equation (14) can be obtained using the pinhole camera model,
where f_x and f_y stand for the focal lengths of the endoscopic camera. Substituting [x y z 1]^T = ^c T_e [x_0 y_0 z_0 1]^T gives Equation (15),
where (x_0, y_0, z_0) is the corresponding point of (x, y, z) w.r.t. frame {e}.
^c T_e can be represented by q and t.
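Equations (14) and (15) are not reproduced here; under the standard pinhole model they presumably take the following form (c_x and c_y denote the principal point and are this sketch's notation, possibly absorbed elsewhere in the original):

```latex
% Pinhole projection of a model point (x, y, z) in frame {c} (cf. Equation (14)):
u = f_x \frac{x}{z} + c_x, \qquad v = f_y \frac{y}{z} + c_y
\;\;\Rightarrow\;\;
\frac{\partial u}{\partial (x,y,z)} = \Big[\tfrac{f_x}{z},\; 0,\; -\tfrac{f_x x}{z^{2}}\Big],
\qquad
\frac{\partial v}{\partial (x,y,z)} = \Big[0,\; \tfrac{f_y}{z},\; -\tfrac{f_y y}{z^{2}}\Big]

% Rigid transform from the end-effector frame {e} (cf. Equation (15)):
[x\; y\; z\; 1]^{T} = {}^{c}T_{e}\,[x_0\; y_0\; z_0\; 1]^{T}
```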
In the gradient step that produces dt and dq (Equation (16)), τ_E is a threshold value experimentally set to 2.5 × 10^5 in this article, k_E is a scale factor ranging in [0, 1], and Δt_max and Δq_max are experimentally set to 0.1 and 0.01, respectively, to limit the maximum increments. dt and dq change ^c T̂_e such that ^c T̂′_e = ^c T̂_e ΔT. As Equation (4) still holds, ^b T̂′_e corresponds to ^c T̂′_e, as in Equation (17).
The change from ^b T̂_e to ^b T̂′_e leads to an updated configuration estimate ψ̂′ via the forward kinematics, as in Equation (18),
where dω is the angle-axis representation of dq. Note that φ is still treated as an instantaneous constant in module ℛ. Repeating Equations (8)-(18), the iteration stops when the change of the energy E is less than a threshold E_threshold (5 × 10^3 in this work) or the number of iterations exceeds a preset value (20 in this work), and the estimated pose of the end-effector ^c T_e is then given by the last updated configuration vector and Equation (4).
The diagram of ℛ is shown in Figure 6. Referring to Figure 5D, ℛ may be caught in a local optimum when the end-effector is open. To address this problem, the end-effector segmentation region Ω_e and the end-effector CAD model projection region Ω_f are both replaced by their filled convex hulls in the first five iterations of the optimization for each frame. Please note that the structural constraints of the continuum surgical tool are also integrated, limiting the bending angles to their design ranges (θ_1 ∈ [0, π/2] and θ_2 ∈ [0, 2π/3]) to improve the estimation accuracy.

Multi-Tool Tracking Strategy
With the total tool number N and the tool type information from the robot control system, the proposed framework can be applied to tracking multiple tools simultaneously as follows (see the sketch below): 1) First, ℳ in the STPPE-Net outputs a segmentation mask m̂ containing all end-effector masks in the endoscopic camera view. Based on the rough positions of the surgical tools in the image plane from the forward kinematics, the masks of the end-effectors are labeled 1 to N according to their tool indices from the robot control system. 2) Then, different partial pose estimators P + Z in the STPPE-Net are trained separately for each tool end-effector type. For each labeled individual end-effector mask from m̂, P + Z is used to output its 5-DoF pose estimation. The correct trained parameters of P + Z are chosen based on the labeled tool index and the corresponding tool type information from the robot control system. Then, module K outputs its 6-DoF pose estimation, and ℛ refines the 6-DoF pose estimation based on the CAD model of the corresponding end-effector type.
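A compact sketch of this multi-tool strategy is given below; all helper names and interfaces are hypothetical placeholders for the modules described above.

```python
def track_multi_tool(image, tools, prev_configs, fw):
    """Multi-tool tracking strategy from Section 3.6 (illustrative sketch).

    `fw` bundles placeholder callables assumed here: fw["mask_generator"] (M),
    fw["pose_estimators"][tool_type] -> (P, Z), fw["assign_mask"],
    fw["kinematics"] (K), and fw["refine"] (R with a type-specific CAD model).
    `tools` is a list of dicts with 'type', 'phi', 'alpha', 'fk_position'.
    """
    full_mask = fw["mask_generator"](image)                          # one mask with all end-effectors
    results = []
    for i, tool in enumerate(tools):
        mask_i = fw["assign_mask"](full_mask, tool["fk_position"])   # label masks 1..N by rough FK position
        P, Z = fw["pose_estimators"][tool["type"]]                   # type-specific trained P + Z
        p_c, z_c = P(image * mask_i), Z(image * mask_i)
        T_c_e, cfg = fw["kinematics"](p_c, z_c, tool["phi"], prev_configs[i])
        T_c_e, cfg = fw["refine"](T_c_e, mask_i, tool["alpha"], cfg, tool["type"])
        results.append((mask_i, T_c_e, cfg))
    return results
```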

Experimental Implementation
The involved endoscope has a stereo camera with a resolution of 1920 × 1080 pixels and a 90° FoV. Only the left images are used in this study. The STPPE-Net uses resized images (480 × 270) as inputs for computational efficiency. Module ℛ uses full-resolution images (1920 × 1080) as inputs for better accuracy. d_max and d_min in Section 3.2 are set to 100 and 20 mm, respectively.
The STPPE-Net is implemented using PyTorch and trained on a PC with an NVIDIA GeForce RTX 4090 GPU, an Intel Core i9-13900K CPU, and 64 GB of RAM. Modules S, K, and ℛ are implemented on the same hardware.
Module S then generates 1 million groups of data for training the STPPE-Net, where n_0, n_f, and n_b in Section 3.3 are set to 5000, 5, and 40, respectively (5000 × 5 × 40 = 1 million). Each group of data contains 1) the composed synthetic image, 2) the masks for each end-effector and the overall mask containing all the end-effectors, and 3) the ground truth 6-DoF pose for each end-effector. Three types of end-effectors are involved in this work: the tissue forceps, the bipolar forceps, and the needle driver. Each composed synthetic image contains one or two of the three end-effector types. In half of the composed synthetic images, this work also adds a textured 3D model of the wrist marker [10] as a distractor. Based on the synthetic dataset, ℳ was trained, and three P + Z estimators were trained separately for the three types of tool end-effectors.

Segmentation Tests
To demonstrate the segmentation accuracy of the proposed framework, this work picks 100 frames captured using the involved endoscopic single-port system in an actual animal study. The animal studies were approved by the Shanghai Yinshe clinical center, which is qualified to issue ethics certification and to offer sites for animal studies. These images contain one or two continuum surgical tools with two types of end-effectors (tissue forceps and bipolar forceps), as in Figure 7. Motion blur of the tools and complex illumination (overexposure and dimness on the tools and the tissues) can be seen in the images.
For all 100 frames, this work manually annotates the end-effector region to validate the segmentation accuracy (IoU-sg) of the proposed framework. The segmentation results are shown in Figure 7. Validation results show that the proposed segmentation method achieves an IoU-sg of 0.801 ± 0.135 (mean ± std). Although trained entirely on synthetic images, the proposed segmentation method can handle real-world in-vivo endoscopic images with various illumination conditions and motion blur because of the employment of DR and motion blur in the training samples.

Ex-Vivo Tracking Tests
In in-vivo endoscopic images, it is hard to obtain the ground truth of the tool 3D pose for evaluation. To quantitatively validate the 3D tracking precision of the proposed framework, an ex-vivo test is implemented. The test contains 162 continuous frames captured with phantoms as the background and the 2-segment continuum surgical tool (needle driver). For all the frames, this work also manually annotates the segmentation of the end-effector for evaluation.
Two ways are used to validate the tracking precision: 1) reprojecting the CAD model of the end-effector on the image plane using the estimated 6-DoF pose and calculating the overlap between the reprojection and the manual annotation of the end-effector segmentation (IoU-rp), and 2) calculating the position error, z-axis orientation error, and orientation error of the estimated 6-DoF pose against the ground truth 6-DoF pose. A wrist marker [10] is used to provide the ground truth 6-DoF pose. The marker is mounted near the tool end-effector base without blocking the end-effector. What's more, to further test the robustness of the proposed framework when dealing with varying surgical tool end-effector appearance in real-world endoscopic scenarios, this work daubs the end-effector with red paint to simulate blood stain, as shown in Figure 8. The spatial trajectory of the tool end-effector base in frame {c} over the frame sequence is shown in the top-left corner of Figure 9.
Even with the simulated blood stain on the end-effector, the proposed segmentation method is still effective and achieves a mean IoU-sg of 0.929 ± 0.014. Using the 6-DoF pose information from the marker as the ground truth, this work obtains the pose errors of 1) the 6-DoF pose estimation from K and 2) the refined 6-DoF pose estimation from ℛ. Figure 9 shows the spatial trajectory of the end-effector base in frame {c}, the tracked position trajectory along the x, y, and z axes, the position error, and the orientation error of the proposed tracking framework.
The test results are summarized in Table 2. The proposed framework achieves a 6-DoF pose estimation with a mean position error of 1.63 mm (0.38, 0.58, and 1.37 mm in the x, y, and z axes, respectively), a mean z-axis error of 4.62°, and a mean orientation error of 5.02°. With ℛ, the 6-DoF pose estimation errors are refined to 1.24 mm (0.34, 0.42, and 1.02 mm in the x, y, and z axes, respectively), 2.75°, and 3.20°. The time consumption of each module in this test is shown in Table 3. Module K requires 20 ms in the first frame but only about 3 ms in follow-up frames because the initial configuration of the kinematics iteration in each follow-up frame is given by the configuration result of the previous frame. At present, ℛ is not real-time, but it can be computationally optimized.

In-Vivo Tracking Tests
Finally, to illustrate the potential of applying the proposed tracking framework to in-vivo multi-tool tracking, this work further uses an in-vivo video of 1000 frames with manual end-effector segmentation annotations and kinematic information recorded from the robot control system using the involved endoscopic single-port system. A tissue forceps (left) and a bipolar forceps (right) are included in this test. Based on the tracking strategy in Section 3.6, exemplary tool end-effector segmentation and 6-DoF pose estimation results are shown in Figure 10. The proposed framework achieves an average IoU-sg of 0.874 ± 0.047 and an average IoU-rp of 0.814 ± 0.029 in this test.

Table 3. Computational time for each frame (columns: Module, First frame [ms], Follow-up frames [ms]).

Discussion
This work is an important step toward the closed-loop control of continuum single-port surgical tools. Without using manual annotation, the proposed framework is robust to the varying surface appearance of the tool and can be applied to different types of end-effectors. The proposed framework achieves tracking precision comparable to that of the kinematic bottleneck approach, [16] which also targets continuum surgical tools but is only evaluated in joint values on synthetic images and in reprojection IoU on real images. It should be noted that, compared to the kinematic bottleneck approach, 1) the proposed framework does not require raw kinematic information from the robot system for training, 2) the pose estimation from the proposed framework is quantitatively evaluated on real-world images instead of synthetic images, and 3) this work uses endoscopic views more similar to an actual surgical setup (tracking the end-effector instead of the segment body).

Synthetic Training Dataset Size
In this work, DR is mainly realized by applying randomization in 1) the attached textures on the surgical tools and 2) the background images.
Besides, the robustness of the proposed tracking framework depends mainly on the segmentation accuracy of ℳ in the STPPE-Net because the partial pose estimator and the optimization ℛ both focus on the segmented tool end-effector region.
Therefore, to study the correlation between the degree of randomization and the robustness of the proposed framework, the authors explore the influence of random texture number n f and random background number n b in Section 3.3 upon the segmentation performance of the proposed framework on the in-vivo endoscopic dataset in Section 4.2.
The results are shown in Table 4. With n_0 = 5000 groups of feasible {^c T_e, θ_1, δ_1, θ_2, δ_2, α}, six synthetic training datasets of different sizes are established by varying n_f and n_b. It is evident that more randomization improves the segmentation performance. Considering both the segmentation performance and the time for dataset rendering and network training, dataset #5 with a size of 1 million is chosen in this work. The proposed tracking framework may be applied to more complicated surgical scenes if more randomization is added in the future.

Limitations
The proposed 3D tracking framework still has a few limitations to be addressed. 1) Although ℛ can generally refine the pose estimation (as shown in Figure 11A), in certain frames the error of the pose estimation from ℛ may be larger than that from K. This is mainly due to the error in the opening angle of the tool end-effector reported by the control system (as shown in Figure 11B), especially when the end-effector is in the process of opening or closing. Therefore, a strategy for deciding whether to apply the optimization ℛ should be established in the future. 2) There are external disturbances during actual surgical procedures that are not currently taken into account, including external forces applied to the surgical tools when contacting tissues and occlusion of the surgical tools by tissues or smoke.

Conclusion
This work proposes a vision-based markerless 3D tracking framework for 2-segment continuum surgical tools. The proposed framework contains four main modules: 1) the STPPE-Net, 2) the synthetic data generator S, 3) the kinematics module K, and 4) the region-based optimization module ℛ. The STPPE-Net is trained entirely on synthetic data, which are generated from S based on DR. A tool end-effector segmentation mask and a 5-DoF pose estimation are output by the STPPE-Net in real time. K outputs a 6-DoF pose estimation using the 5-DoF pose estimation from the STPPE-Net and the tool axis rotation from the robot control system. ℛ further optimizes the 6-DoF pose estimation from K by maximizing the overlap between the segmentation mask from the STPPE-Net and the tool end-effector projection on the endoscopic image plane. Experiments are conducted using real-world (in-vivo and ex-vivo) endoscopic images from a single-port endoscopic system with continuum surgical tools carrying three different kinds of end-effectors. Results show that the proposed framework can generate 1) an effective tool end-effector segmentation (IoU-sg of 0.801 ± 0.135 in the in-vivo segmentation test, 0.929 ± 0.014 in the ex-vivo tracking test, and 0.874 ± 0.047 in the in-vivo tracking test) regardless of motion blur, overexposure, or dimness on the tool end-effectors, and 2) a reliable 6-DoF tool pose estimation (1.24 mm in position error, 2.75° in z-axis orientation error, and 3.20° in orientation error). What's more, the proposed framework also performs well when dealing with an end-effector contaminated with simulated blood stain.
Future works mainly include: 1) establishing a strategy for whether to apply ℛ for each frame in the tracking procedure, 2) combining a kinematic-static model and the segmentation of the whole continuum surgical tool to take external forces into account, 3) additional data augmentation with partially occluded surgical tools and artificial smoke to enable tracking in complicated surgical environments, 4) generating new synthetic datasets to track surgical tools with other types of end-effectors, and 5) software optimization and hardware acceleration of ℛ to realize real-time computation of the whole tracking framework.

Figure 1 .Figure 2 .
Figure 1. The overall tracking framework. A) A single-port continuum surgical robot system. B) The STPPE-Net takes in an RGB endoscopic image and outputs C) the 5-DoF pose estimation with position (purple dot) and z-axis orientation (blue arrow) w.r.t. the endoscopic camera frame, and D) an estimated segmentation mask of the end-effector. The STPPE-Net is entirely trained by the synthetic data generator S. E) A kinematics module K takes in the STPPE-Net outputs along with the tool axis rotation φ from the robot control system and outputs the 6-DoF pose estimation (x, y, and z axes respectively in red, green, and blue). F) A region-based optimization module ℛ further optimizes the output from K with the end-effector opening angle α from the robot control system into the refined 6-DoF pose estimation with the end-effector computer-aided design (CAD) model reprojection overlay (yellow).


Figure 3 .
Figure 3. A) Illustration of the proposed STPPE-Net. This network mainly consists of a mask generator ℳ, a position estimator P, and a z-axis orientation estimator Z. ℳ generates the segmentation mask of the end-effector; P and Z together form a partial pose estimator to estimate the 5-DoF pose of the end-effector. The partial pose estimator is separated into P and Z because of the dimensional difference between position and orientation. The symbol ⊙ denotes the Hadamard matrix product. B) The detailed structure of the partial pose estimator.

Figure 4 .
Figure 4. Illustration of module S. This module takes in a randomly given group {^c T_e, θ_1, δ_1, θ_2, δ_2, α}, and outputs a rendered semi-synthetic image x along with the segmentation mask m of the synthetic end-effector. C stands for the semi-synthetic image composition.

Figure 5 .
Figure 5. Illustration of module ℛ. A) The end-effector CAD model projection region Ω_f (yellow). B) The end-effector segmentation region Ω_e (green). C) The alignment of Ω_e and Ω_f. The overlap region between Ω_e and Ω_f is shown in white. D) A possible failure case because of a local optimum when the end-effector is open.

Figure 6 .
Figure 6. The diagram of module ℛ.

Figure 7 .
Figure 7. Exemplary segmentation results of in-vivo endoscopic images. Input images are shown in the top row and their corresponding segmentation results are shown in the bottom row with true positive (white), true negative (black), false positive (red), and false negative (green).

Figure 8 .
Figure 8. A) Exemplary tracking results on ex-vivo phantom images. Segmentation results (green) are shown in the left column, 6-DoF pose estimation results from module K are shown in the middle column (x, y, and z axes respectively in red, green, and blue), and 6-DoF pose estimation results after module ℛ are shown in the right column with the end-effector CAD model reprojection overlay (yellow). IoU-sg and IoU-rp are shown with true positive (white), true negative (black), false positive (red), and false negative (green). B) The enlarged view of the end-effector contaminated with simulated blood stain and the wrist marker.

Figure 9 .
Figure 9. Results from the ex-vivo tracking test, including the uncompensated robot kinematics (green), the proposed 6-DoF pose estimation from K (orange), and the proposed 6-DoF pose estimation from ℛ (red).The pose from the wrist marker (blue) is treated as the ground truth.

Figure 10 .
Figure 10. Exemplary tracking results on in-vivo images (left) with two continuum tools. Segmentation results are shown in the middle and 6-DoF pose estimation results are shown on the right (x, y, and z axes respectively in red, green, and blue) with the end-effector CAD model reprojection overlay (yellow).

Table 1 .
Nomenclatures in this work.
L_1 | Inserted length of the 1st segment.
θ_i | The bending angle of the i-th segment.
δ_i | The bending direction of the i-th segment.
φ | The rotation directly provided by the actuation unit to the 1st segment.
α | The opening angle of the end-effector.
^m T_n, ^m R_n, ^m z_n, ^m p_n | The homogeneous transformation matrix, rotation matrix, z-axis unit vector, and translation vector of frame {n} w.r.t. frame {m}.
Accented forms of an item b | Its estimation, its synthetic counterpart, and its refined form after optimization, respectively.

Table 2 .
Mean (± std) position error, z-axis error, orientation error, and reprojection IoU-rp results from the ex-vivo tracking test, and the comparison of the results to several existing methods. The back slashes indicate unavailable values. Best results are shown in bold.
Method | Position error [mm] | Z-axis error [°] | Orientation error [°] | IoU-rp
Ye et al. [11] | 2.83 ± 2.19 | \ | 7.45 ± 5.73 | \
Allan et al. [13] | 3.85 ± 3.64 | \ | 24.64 ± 14.90 | \
Hasan et al. [15] | 2.59 | 5.94 | \ | \
Sestini et al. [16] | 1.46 a) | \ | 5.51 a) | 0.738
Yoshimura et al. [17] | \ | \ | \ | 0.494
a) The position error and the orientation error are calculated from the joint value errors of two 1-segment continuum surgical tools according to the constant curvature assumption. [16]

Table 4 .
The correlation between the degree of randomization and the segmentation performance on the in-vivo endoscopic images in IoU-sg (mean ± std).