Towards Safer Robot-Assisted Surgery: A Markerless Augmented Reality Framework

Robot-assisted surgery is developing rapidly in the medical field, and the integration of augmented reality shows the potential to improve surgeons' operative performance by providing more visual information. In this paper, we propose a markerless augmented reality framework to enhance safety by avoiding intra-operative bleeding, a high-risk event caused by collisions between the surgical instruments and the blood vessel. Advanced stereo reconstruction and segmentation networks are compared to find the best combination for reconstructing the intra-operative blood vessel in 3D space for registration with the pre-operative model, and minimum distance detection between the instruments and the blood vessel is implemented. A robot-assisted lymphadenectomy is simulated on the da Vinci Research Kit in a dry lab, and ten human subjects performed this operation to explore the usability of the proposed framework. The results show that the augmented reality framework can help users avoid dangerous collisions between the instruments and the blood vessel without introducing an extra load. It provides a flexible framework that integrates augmented reality into the medical robot platform to enhance safety during the operation.


I. INTRODUCTION
Robot-assisted surgery (RAS) has improved patient outcomes in both intra-operative operation and postoperative recovery compared to traditional open surgery, and it also provides the possibility to integrate artificial intelligence into this platform for autonomous and safe operation. An emerging technology, Augmented Reality (AR), which fuses virtual targets with real scenes, provides more visual information to users, and it has been introduced in the field of robotic surgery to enhance safety [1]-[3]. Generally, the surgeon needs to capture pre-operative images of the patient, such as Computed Tomography (CT) and Magnetic Resonance Imaging (MRI), for preliminary surgical planning. These CT/MRI slices can be segmented using software such as 3D Slicer to generate the pre-operative 3D model, which is then projected onto the intra-operative images to implement the AR effect. These augmented intra-operative images can guide the surgeon by providing more visual information and improve operative performance. One major challenge is how to localize the intra-operative soft tissues or organs so that the pre-operative model can be registered with the intra-operative target correctly and overlaid at the corresponding position of the intra-operative images.
In [4], [5], the authors implemented AR in robot-assisted radical prostatectomy by overlaying the pre-operative 3D model on the endoscopic images. They used the software vMIX (StudioCoast Pty Ltd, Australia) to manually align the position of the pre-operative model, which hinders real-time AR visualization during the operation. Similarly, a manual alignment between the pre-operative model and the intra-operative anatomy was used in some other operations, such as robotic thyroidectomy [6], transoral surgery [7] and partial nephrectomy [8]. In [9], the authors introduced three possible solutions to implement AR registration, namely landmarks, laparoscopic video and intra-operative ultrasound, to recognize the 3D position of surgical scenes, and they concluded that the laparoscopic video-based approach will become the mainstream since it does not require external hardware. Also, the authors in [10] introduced a registration process that uses the robotic instruments to point at pre-installed markers, or an extra projector-camera system. To implement automatic AR alignment using the laparoscopic video on the da Vinci Surgical System (dVSS), the authors in [11] proposed to track the soft tissue of interest and then recover its 3D position using stereo reconstruction. The experimental result showed that the AR effect can be implemented on the intra-operative images. Nevertheless, this approach relied on an external device to manually draw the boundary of the target at the beginning of the operation, and the pre-operative model was substituted with a simple ellipsoid that differs from the intra-operative tissue.
Many stereo reconstruction methods have been proposed to reconstruct the 3D scene by estimating the disparity map, from which the depth information can be recovered using the focal length and the baseline of the stereo endoscope. In the classic stereo reconstruction network [12], the authors adopted a Convolutional Neural Network (CNN) with an embedded Spatial Pyramid Pooling (SPP) module to extract high-level features of the stereo images, which were then concatenated into a 4D cost volume and fed into a stacked hourglass architecture for disparity estimation. To simplify this hourglass architecture, the authors in [13] adopted fewer skip connections in the decoder and designed a novel 4D cost volume by calculating group-wise correlation. Next, [14] was proposed to handle high-resolution image pairs: feature maps at different levels were employed to build multiple cost volumes, which were then gradually connected to estimate the disparity map in a coarse-to-fine manner. More methods [15]-[17] adopted different strategies to build the cost volume since it is a key factor for the final prediction. Different from the traditional CNN architecture, the attention-based transformer [18] provides a new network architecture, and the vision transformer [19] opened the path to using this module in the vision field by encoding the image into many tokens. The works in [20], [21] started to fuse the transformer and CNN for disparity estimation, and their results showed satisfactory performance when evaluated on a public stereo endoscopic dataset [22].
3D reconstruction recovers the depth information of the whole intra-operative scene, which raises another issue, i.e., keeping the region of interest while removing the background points for an accurate registration between the pre-operative model and the intra-operative target. Hence, a segmentation network needs to be integrated to distinguish the background and the target by predicting a binary mask. UNet [23] is the most representative model in medical image segmentation. It adopts a U-shaped architecture with downsampling and upsampling operations and fuses features from different levels using skip connections. Following a similar UNet architecture, the authors in [24] proposed a multi-path feature fusion strategy by combining two UNet networks, the authors in [25] integrated an attention gate module to enhance the learning of target structures, and the authors in [26] designed a nested UNet architecture by densely aggregating features at different scales. The emerging Transformer module has also been utilized in the segmentation field. In [27]-[29], the authors employed transformers as the encoder to extract features and then adopted a CNN architecture as the decoder to predict the segmentation mask. More recently, a model named Segment Anything (SAM) [30] was released, which was trained on 11 million images and showed a strong generalization ability in various segmentation tasks. With the explosion of big data, it can be foreseen that models of this kind will lead a new era because of their promising zero-shot performance.
Intra-operative bleeding is a risky situation that affects surgical quality and the post-operative recovery of patients, and it generally occurs due to unintended collisions between the surgical instruments and the delicate blood vessel (i.e., the artery and the vein) during the operation. Hence, we propose a markerless augmented reality framework to reduce the occurrence of this situation. Different from existing approaches such as manual localization or pre-installed markers, we combine stereo reconstruction and segmentation networks to locate the intra-operative soft tissue. The proposed framework, which integrates advanced neural networks, can visualize the pre-operative model, and it also performs minimum distance detection between the instruments and the blood vessel to avoid a dangerous collision. Furthermore, it does not rely on an extra external device, which means it generalizes well to other robotic systems and surgical tasks. The main contributions can be summarized as follows:
1) A markerless augmented reality framework is proposed to visualize the pre-operative model on intra-operative scenes, and it provides minimum distance detection between the surgical instruments and the delicate blood vessel for safety.
2) A comprehensive evaluation of advanced neural networks in the stereo reconstruction and segmentation fields is performed to find the best combination for recovering the 3D information of the intra-operative blood vessel accurately and quickly.
3) A user study, in which ten human subjects performed a robot-assisted lymphadenectomy on the da Vinci Research Kit in a dry lab, is conducted to explore the usability of the proposed AR framework compared to the standard setup.
The remainder of this paper is structured as follows. Section II describes the details of our proposed augmented reality framework. Section III presents the framework evaluation metrics and the experimental protocol of the usability study, and the results are given in Section IV. Section V discusses the findings and limitations of our work, and the conclusion and future work are given in Section VI.

II. METHODOLOGY
The proposed augmented reality framework is shown in Fig. 1, and it is integrated into the popular da Vinci Research Kit (dVRK, Intuitive Surgical Inc., US, and Johns Hopkins University). The stereo image pair is input into a stereo reconstruction network to estimate the left disparity map, and the left image is also input into a segmentation network to segment the blood vessel. Then, the intra-operative blood vessel can be generated in 3D space by combining the disparity map and the binary mask. The pre-operative model is registered with the intra-operative blood vessel so that it can be projected onto the corresponding position of the endoscopic image pairs to implement an AR effect. Furthermore, the minimum distances between the surgical instruments and the soft tissue are calculated based on the dVRK kinematics and the 3D position of the reconstructed blood vessel. The specific description of this framework is given below.

Fig. 1. The architecture of the markerless augmented reality framework. The image pair is fed into two different networks to estimate the disparity map and the binary mask. Then, the disparity pixels that belong to the blood vessel can be reprojected to generate the 3D intra-operative blood vessel. The pre-operative model can be overlaid on the corresponding region of the endoscopic images by registration, and the minimum distance between the instruments and the blood vessel can be detected based on the dVRK kinematics and the 3D reconstructed blood vessel.

A. dVRK system and calibrations
The dVRK system is an open-source robotic platform composed of hardware from the first-generation da Vinci surgical system together with customized software and electronics. It can be mainly divided into the leader side and the follower side. At the follower side, there are two Patient Side Manipulators (PSMs) that mount various surgical instruments such as the large needle driver, and a stereo endoscope (1920×1080 resolution) mounted on an Endoscopic Camera Manipulator (ECM) is used to capture in vivo surgical scenes. By adjusting the position of the Setup Joint (SUJ) connecting these robotic arms, the respective Remote Center of Motion (RCM) can be located at the skin entry point of the abdomen [11]. On the other hand, the surgeon at the leader side can observe the endoscopic scenes using a High Resolution Stereo Viewer (HRSV) and remotely control the movements of the PSMs and ECM by operating the Master Tool Manipulators (MTMs) as well as a foot pedal tray [31], [32].
Fig. 2 shows the details of our dVRK system, and it also presents the reference frame definition adopted in our framework. The Cartesian position of each instrument can be obtained from the dVRK kinematics, and a 9×6 chessboard with a square length of 1 cm is adopted for our calibrations. First, the camera calibration is done based on Zhang's calibration approach [33] to generate the intrinsic and extrinsic parameters for image rectification, undistortion and projection. Then, a hand-to-hand calibration is conducted to find the rigid transformation between the left and right end effectors based on Horn's method [34], since the position obtained from the direct kinematics is not accurate enough. By collecting 40 non-collinear points, we can generate the rigid transformation matrix $T^{EE2}_{EE1}$ and transfer the 3D points in {EE1} to {EE2}. Furthermore, a hand-eye calibration is performed to obtain the transformation between the left end effector {EE2} and the left camera {L_CAM}. We operate the left end effector to point at the 54 corner points of the chessboard to obtain the 3D coordinates, and the corresponding 2D coordinates on the left image are obtained using the cv.findChessboardCorners and cv.cornerSubPix (OpenCV) functions, so that the transformation can be calculated based on the Random Sample Consensus (RANSAC) scheme (cv.solvePnPRansac function) [35], [36]. Since our end effector is referenced to the frame {ECM}, the transformation matrix $T^{L\_CAM}_{ECM}$ is generated by the hand-eye calibration.

B. Intra-operative blood vessel reconstruction
The stereo image pair is first rectified to align the epipolar lines along the horizontal axis, and the resolution is resized from 1920×1080 to 640×360 to accelerate the framework. Then, the images are fed into a stereo reconstruction network to estimate the disparity map [14]. In this way, we can reproject the disparity map to generate the 3D intra-operative point cloud. The conversion between the estimated disparity value $\tilde{d}$ and the depth value $d$ is formulated as

$$d(i, j) = \frac{b \cdot f}{\tilde{d}(i, j)},$$

where $(i, j)$ is the pixel position on the 2D image, $b$ is the baseline of the stereo camera, and $f$ is the focal length. However, the reconstructed point cloud covers the whole intra-operative scene, and we need to extract the region of interest (i.e., the blood vessel in our case) for the registration. Considering that the estimated disparity map is referenced to the left rectified image, we propose to adopt a segmentation network to generate the binary mask of the region of interest. The left rectified image is fed into the segmentation network to estimate the indices of the blood vessel and the background. Next, we select the disparity pixels that belong to the blood vessel by referencing these indices and reproject them to 3D space to reconstruct the intra-operative blood vessel. The 3D position of the reconstructed blood vessel directly influences the quality of the following registration and the distance detection, so a comprehensive comparison study including the stereo reconstruction networks and segmentation networks is provided in Section III to determine the best combination in our framework.

Fig. 2. The presentation of the dVRK system in a dry lab. In (a), the user is operating the MTMs and observing the surgical scenes using the HRSV at the leader side, and the surgical instruments mounted on the PSMs are performing the operation following the remote control of the user at the follower side. (b) shows the three types of system calibrations, including camera calibration, hand-hand calibration and hand-eye calibration.

Finally, two post-processing approaches are adopted to improve the segmentation estimation, namely mask boundary erosion and small object removal. Mask boundary erosion can remove possibly misclassified pixels near the boundary of the blood vessel, and small object removal is used to remove possible outliers in other regions, which can refine the segmentation quality in some challenging cases such as blurred scenes caused by fast movement of the instruments.
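The masked reprojection described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the framework's implementation: the function name `reproject_masked_disparity`, the principal point `(cx, cy)` and the pinhole back-projection of the x and y coordinates are illustrative choices that the text does not specify, and the disparity map and mask are plain nested lists rather than network outputs.

```python
def reproject_masked_disparity(disparity, mask, focal_px, baseline_mm, cx, cy):
    """Back-project the disparity pixels selected by a binary mask to 3D points.

    disparity, mask: 2D lists with the same shape (rows of pixels).
    focal_px: focal length f in pixels; baseline_mm: stereo baseline b in mm.
    cx, cy: principal point in pixels (assumed, taken from camera calibration).
    Returns (x, y, z) tuples in mm in the rectified left camera frame.
    """
    points = []
    for row, (disp_row, mask_row) in enumerate(zip(disparity, mask)):
        for col, (disp, keep) in enumerate(zip(disp_row, mask_row)):
            # Skip background pixels and invalid (non-positive) disparities.
            if not keep or disp <= 0:
                continue
            z = baseline_mm * focal_px / disp   # depth = b * f / disparity
            x = (col - cx) * z / focal_px       # pinhole back-projection
            y = (row - cy) * z / focal_px
            points.append((x, y, z))
    return points


if __name__ == "__main__":
    disparity = [[10.0, 0.0], [20.0, 5.0]]
    vessel_mask = [[1, 1], [0, 1]]
    print(reproject_masked_disparity(disparity, vessel_mask, 100.0, 5.0, 0.0, 0.0))
```

Only pixels that are both inside the mask and have a valid disparity contribute points, which mirrors the idea of keeping the blood vessel and discarding the background before registration.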
The reconstructed 3D blood vessel is referenced to the rectified left camera coordinate system {Rec_L_CAM}, and we transform it to the reference {ECM} for the following distance calculation based on the equation

$$P^{ECM}_{recon} = \left(T^{L\_CAM}_{ECM}\right)^{-1} \left(T^{Rec\_L\_CAM}_{L\_CAM}\right)^{-1} P^{Rec\_L\_CAM}_{recon},$$

where $P^{Rec\_L\_CAM}_{recon}$ denotes the reconstructed 3D points of the intra-operative blood vessel referenced to the rectified left camera coordinate system, $T^{Rec\_L\_CAM}_{L\_CAM}$ is the transformation between the unrectified left camera system and the rectified left camera system obtained from the stereo image rectification, and $T^{L\_CAM}_{ECM}$ is obtained from the hand-eye calibration.
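The change of reference frame can be illustrated with plain 4×4 homogeneous matrices. The sketch below assumes that each transformation maps points from its subscript frame to its superscript frame (so the ECM-to-camera and camera-to-rectified matrices must be inverted to go from the rectified camera back to {ECM}) and that both are rigid. The helper names are hypothetical.

```python
def invert_rigid(T):
    """Invert a rigid 4x4 transform [R t; 0 1] as [R^T, -R^T t; 0 1]."""
    R_t = [[T[c][r] for c in range(3)] for r in range(3)]          # transpose of R
    t = [T[r][3] for r in range(3)]
    t_inv = [-sum(R_t[r][c] * t[c] for c in range(3)) for r in range(3)]
    return [R_t[r] + [t_inv[r]] for r in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def matmul4(A, B):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(A[r][k] * B[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]


def apply_transform(T, point):
    """Apply a 4x4 homogeneous transform to a 3D point."""
    x, y, z = point
    return tuple(T[r][0] * x + T[r][1] * y + T[r][2] * z + T[r][3] for r in range(3))


def rectified_camera_to_ecm(points, T_lcam_from_ecm, T_rec_from_lcam):
    """Map points from the rectified left camera frame to the ECM frame."""
    T_ecm_from_rec = matmul4(invert_rigid(T_lcam_from_ecm), invert_rigid(T_rec_from_lcam))
    return [apply_transform(T_ecm_from_rec, p) for p in points]


if __name__ == "__main__":
    identity = [[1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 0, 1.0, 0], [0, 0, 0, 1.0]]
    shift = [[1.0, 0, 0, 1.0], [0, 1.0, 0, 2.0], [0, 0, 1.0, 3.0], [0, 0, 0, 1.0]]
    print(rectified_camera_to_ecm([(1.0, 2.0, 3.0)], shift, identity))
```

Using the closed-form rigid inverse avoids a general matrix inversion, but it is only valid when the rotation block is orthonormal.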

C. Registration between pre-operative and intra-operative targets
The pre-operative 3D model is generally captured using CT or MRI in the hospital, and software such as 3D Slicer is then used to segment the region of interest and generate the 3D structure. In our case, we utilize a 3D modeling software, i.e., Blender, to create a pre-operative 3D model for simplification since we do not have an external device to capture these CT/MRI slices. Next, we perform the registration between the pre-operative model and the intra-operative reconstructed point cloud. Here, the pre-operative blood vessel is a mesh model while the intra-operative one is a point cloud, so we sample a large number of points from the mesh model for the registration process. Considering that there is an apparent position difference between the pre-operative model and the intra-operative blood vessel in the initial state, the RANSAC-based global registration algorithm [36] is adopted to compute the initial transformation, and it is executed only once at the beginning. Then, the local registration algorithm based on Iterative Closest Point (ICP) [37] is performed to fine-tune the position of the pre-operative model. We found that this registration strategy can register the models accurately and quickly in our experiment.

D. Distance detection and AR visualization
Our framework not only provides the augmented pre-operative model visualization on intra-operative images, but also detects the minimum distances between the surgical instruments and the blood vessel. After aligning the position of the pre-operative model with the intra-operative blood vessel, we can overlay the pre-operative model on the left intra-operative scenes, $P^{L\_img}_{pre\_obj}$, by the transformation

$$P^{L\_img}_{pre\_obj} = K_L \, T^{L\_CAM}_{ECM} \, T^{ECM}_{BL} \, P^{BL}_{pre\_obj},$$

where $P^{BL}_{pre\_obj}$ denotes the 3D pre-operative points referenced to the Blender coordinate system {BL}, $T^{ECM}_{BL}$ is the transformation matrix generated by the registration, $T^{L\_CAM}_{ECM}$ is the matrix obtained from the hand-eye calibration and $K_L$ contains the intrinsic and distortion matrices of the left camera obtained from the camera calibration. Similarly, we can use the following equation to project the pre-operative model on the right intra-operative scenes, $P^{R\_img}_{pre\_obj}$:

$$P^{R\_img}_{pre\_obj} = K_R \, T^{R\_CAM}_{L\_CAM} \, T^{L\_CAM}_{ECM} \, T^{ECM}_{BL} \, P^{BL}_{pre\_obj},$$

where $T^{R\_CAM}_{L\_CAM}$ is the transformation between the left and right camera coordinate systems, and $K_R$ contains the intrinsic and distortion matrices of the right camera. As a result, the pre-operative model is overlaid on the corresponding regions of the left and right images, respectively.
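As a rough illustration of the projection chain for the left image, the sketch below maps model points from {BL} to {ECM}, then to the left camera, and finally to pixels with a pinhole model. It is a simplification under stated assumptions: lens distortion is ignored, the intrinsics are passed as `fx, fy, cx, cy`, and the function names are hypothetical.

```python
def apply_transform(T, point):
    """Apply a 4x4 homogeneous transform (nested lists) to a 3D point."""
    x, y, z = point
    return tuple(T[r][0] * x + T[r][1] * y + T[r][2] * z + T[r][3] for r in range(3))


def project_model_points(points_bl, T_ecm_from_bl, T_cam_from_ecm, fx, fy, cx, cy):
    """Project pre-operative model points onto the image plane.

    points_bl: 3D points in the modeling (Blender) frame.
    T_ecm_from_bl: registration result; T_cam_from_ecm: hand-eye calibration.
    fx, fy, cx, cy: pinhole intrinsics in pixels (distortion ignored here).
    """
    pixels = []
    for p in points_bl:
        X, Y, Z = apply_transform(T_cam_from_ecm, apply_transform(T_ecm_from_bl, p))
        if Z <= 0:
            continue  # point lies behind the camera and cannot be drawn
        pixels.append((fx * X / Z + cx, fy * Y / Z + cy))
    return pixels


if __name__ == "__main__":
    identity = [[1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 0, 1.0, 0], [0, 0, 0, 1.0]]
    model = [(0.0, 0.0, 100.0), (10.0, -5.0, 100.0)]
    print(project_model_points(model, identity, identity, 1000.0, 1000.0, 320.0, 180.0))
```

For the right image, the same function could be reused by composing the left-to-right camera transformation into `T_cam_from_ecm` and passing the right camera intrinsics.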
Finally, we calculate the minimum distance between the surface of the instruments and the reconstructed blood vessel based on the fast k-nearest-neighbor search strategy [38], [39]. The Cartesian positions of the end effectors and the RCM points of the instruments are obtained from the robot kinematics, so that we can model the instruments as cylinders with a radius of 4 mm and sample them as point clouds (here, the position of PSM1 is aligned to PSM2 by the hand-hand rigid transformation $T^{EE2}_{EE1}$). Two gauges are provided in the upper-left and upper-right corners of the intra-operative images to visualize the respective minimum distances of the left and right instruments. Also, the color of the pre-operative model changes automatically according to the smaller of the left and right minimum distances to alert the surgeon during the operation.
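A simplified version of this distance check is sketched below. Note the substitution: instead of sampling the cylinder surface and running a k-nearest-neighbor search as in the framework, the sketch treats each instrument as a capsule around the RCM-to-tip axis and uses a brute-force point-to-segment distance, which is only practical for small point clouds. The function names and the `radius_mm` parameter default are illustrative.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b (all 3D tuples)."""
    ab = [b[k] - a[k] for k in range(3)]
    ap = [p[k] - a[k] for k in range(3)]
    denom = sum(c * c for c in ab)
    # Clamp the projection parameter so the closest point stays on the segment.
    t = 0.0 if denom == 0 else max(0.0, min(1.0, sum(ap[k] * ab[k] for k in range(3)) / denom))
    closest = [a[k] + t * ab[k] for k in range(3)]
    return math.dist(p, closest)


def min_instrument_distance(vessel_points, rcm, tip, radius_mm=4.0):
    """Minimum surface distance between a capsule-shaped instrument and a point cloud."""
    axis_distance = min(point_segment_distance(p, rcm, tip) for p in vessel_points)
    return max(0.0, axis_distance - radius_mm)  # zero means contact or penetration


if __name__ == "__main__":
    vessel = [(10.0, 0.0, 50.0), (0.0, 20.0, 0.0)]
    left = min_instrument_distance(vessel, (0.0, 0.0, 0.0), (0.0, 0.0, 100.0))
    right = min_instrument_distance(vessel, (50.0, 0.0, 0.0), (50.0, 0.0, 100.0))
    print(left, right, min(left, right))  # the smaller value would drive the model color
```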

III. EXPERIMENTAL PROTOCOL AND PERFORMANCE METRICS
A. Framework characterisation evaluation
1) Reconstruction and segmentation networks: The prerequisite for performing AR visualization and distance detection is that the intra-operative position of the soft tissue is accurately recovered, so we explored 14 state-of-the-art methods in the stereo reconstruction field to find the best model for use in medical scenes. Among them, ELAS [40] is an optimization-based method while the others [12]-[17], [20], [21], [41]-[45] utilize neural networks. The stereo endoscopic dataset SERV-CT [46] was adopted to conduct the quantitative evaluation of these methods. It contains sixteen image pairs captured from porcine samples with the dVSS and provides dense ground truth. To reconstruct the endoscopic scenes, we ran these models with their official weights without any task-specific fine-tuning in order to assess generalization [47]. A set of accuracy-related metrics was chosen to evaluate the reconstruction error by comparing the estimated depth with the provided ground truth, both expressed in millimeters. The metrics include Median Absolute Error (MeAE), Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Absolute Relative Error (Abs Rel), Squared Relative Error (Sq Rel), as well as the δ ratio [48], [49].
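$$\mathrm{MeAE} = \operatorname{median}_{(i,j) \in S} \left| d(i,j) - d'(i,j) \right|$$

$$\mathrm{MAE} = \frac{1}{|S|} \sum_{(i,j) \in S} \left| d(i,j) - d'(i,j) \right|$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{|S|} \sum_{(i,j) \in S} \left( d(i,j) - d'(i,j) \right)^2}$$

$$\mathrm{Abs\ Rel} = \frac{1}{|S|} \sum_{(i,j) \in S} \frac{\left| d(i,j) - d'(i,j) \right|}{d'(i,j)}$$

$$\mathrm{Sq\ Rel} = \frac{1}{|S|} \sum_{(i,j) \in S} \frac{\left( d(i,j) - d'(i,j) \right)^2}{d'(i,j)}$$

$$\delta = \frac{1}{|S|} \left| \left\{ (i,j) \in S : \max\left( \frac{d(i,j)}{d'(i,j)}, \frac{d'(i,j)}{d(i,j)} \right) < \tau \right\} \right|$$

where $S$ is the set of predicted depth values for each frame, $d(i, j)$ is the predicted depth value at pixel position $(i, j)$ and $d'(i, j)$ is the ground-truth depth value. The last metric evaluates the depth fluctuation error between the reconstructed points and the ground truth, and three different thresholds $\tau \in \{1.25^1, 1.25^2, 1.25^3\}$ were adopted. Unlike the other metrics, a higher δ ratio means a better reconstruction result. We also report the inference time for one frame when evaluating these models.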
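The depth metrics listed above follow widely used definitions and can be computed as in the sketch below. The function name `depth_metrics` and the decision to skip pixels with non-positive predicted or ground-truth depth are assumptions made for this illustration; the depth maps are flattened into plain lists.

```python
import math
from statistics import median


def depth_metrics(pred, gt, thresholds=(1.25, 1.25 ** 2, 1.25 ** 3)):
    """Compute MeAE, MAE, RMSE, Abs Rel, Sq Rel and delta ratios for depth lists (mm)."""
    # Keep only pixels where both depths are valid (assumed convention).
    pairs = [(d, g) for d, g in zip(pred, gt) if d > 0 and g > 0]
    if not pairs:
        raise ValueError("no valid depth pairs to evaluate")
    n = len(pairs)
    abs_err = [abs(d - g) for d, g in pairs]
    metrics = {
        "MeAE": median(abs_err),
        "MAE": sum(abs_err) / n,
        "RMSE": math.sqrt(sum(e * e for e in abs_err) / n),
        "AbsRel": sum(abs(d - g) / g for d, g in pairs) / n,
        "SqRel": sum((d - g) ** 2 / g for d, g in pairs) / n,
    }
    # Delta ratio: fraction of pixels whose depth ratio stays below each threshold.
    for k, tau in enumerate(thresholds, start=1):
        metrics[f"delta_{k}"] = sum(max(d / g, g / d) < tau for d, g in pairs) / n
    return metrics


if __name__ == "__main__":
    print(depth_metrics([10.0, 20.0, 40.0], [10.0, 25.0, 20.0]))
```

Lower values are better for the five error metrics, while the delta ratios are better when closer to 1.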
A segmentation network is also required to extract the soft tissue of interest, so 8 state-of-the-art segmentation methods [23]-[30] were evaluated for blood vessel segmentation. The recent segmentation network SAM [30] generalizes strongly across different fields, while the other neural networks need to be trained. Hence, we captured six endoscopic videos containing the 3D-printed blood vessel on the dVRK platform in our lab and extracted around 100 images from each video for manual annotation (551 frames in total). We performed the annotation using the Computer Vision Annotation Tool (CVAT) [50]. Six-fold cross validation was adopted to train and evaluate the models. During the training process, we cropped the images into 128×128 patches and trained the models for 100 epochs with a batch size of 64 (100000 patches in every epoch, and 10% of the training images were used for model validation). Two methods [27], [28] loaded pre-trained weights following the authors' original configuration, while the other models were trained from scratch. The images were also split into non-overlapping 128×128 patches during the test phase. Common evaluation metrics in the segmentation field were used for a comprehensive comparison, including the Dice coefficient ($\frac{2TP}{2TP+FP+FN}$), accuracy ($\frac{TP+TN}{TP+TN+FP+FN}$), specificity ($\frac{TN}{TN+FP}$), sensitivity ($\frac{TP}{TP+FN}$), precision ($\frac{TP}{TP+FP}$), the area of the precision-recall (PR) curve [51] as well as the inference time.
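The count-based segmentation metrics above can be computed directly from the confusion counts, as in the following sketch. The function names are illustrative, masks are flattened lists of 0/1 values, and returning NaN for an empty denominator is a choice made here rather than something stated in the text. The PR-curve area is omitted because it requires per-pixel probabilities.

```python
def confusion_counts(pred_mask, true_mask):
    """Return (TP, FP, TN, FN) for two flattened binary masks."""
    tp = fp = tn = fn = 0
    for p, t in zip(pred_mask, true_mask):
        if p and t:
            tp += 1
        elif p and not t:
            fp += 1
        elif not p and t:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def segmentation_metrics(tp, fp, tn, fn):
    """Dice, accuracy, specificity, sensitivity and precision from confusion counts."""
    def ratio(num, den):
        return num / den if den else float("nan")  # undefined when the denominator is 0

    return {
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "specificity": ratio(tn, tn + fp),
        "sensitivity": ratio(tp, tp + fn),
        "precision": ratio(tp, tp + fp),
    }


if __name__ == "__main__":
    counts = confusion_counts([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0])
    print(counts, segmentation_metrics(*counts))
```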
2) Other characterisation: The quantitative performance of this framework needs to be measured to assess its practicality, and the experimental platform is a local laptop with an NVIDIA RTX 3080 GPU. On the one hand, we report the running time of the proposed framework, calculated over 100 consecutive frames, together with the time distribution across the components of this framework. On the other hand, we calculated the error distribution across the components of this framework. The system calibrations on the dVRK introduce errors, and the respective performance metrics can be formulated as follows:
• Camera calibration error $E_{cam}$ considers the difference between the coordinates of the 3D points reprojected on the 2D plane and the actual coordinates of the chessboard corners by calculating the root mean square value,

$$E_{cam} = \sqrt{\frac{1}{N_{cam}} \sum_{i=1}^{N_{cam}} \left\| \varepsilon^{pro}_i - \varepsilon^{act}_i \right\|_2^2},$$

where $\varepsilon^{pro}_i$ denotes the 2D points reprojected using the camera calibration parameters, while $\varepsilon^{act}_i$ denotes the actual 2D coordinates of the chessboard corners. $N_{cam}$ is the number of points used for the camera calibration, and $\| \cdot \|_2$ represents the Euclidean norm.
• Hand-hand calibration error $E_{hand\_hand}$ was calculated based on the Cartesian position difference between the 3D points obtained from the left end effector and the same points transformed from the right end effector using the hand-to-hand matrix $T^{EE2}_{EE1}$,

$$E_{hand\_hand} = \sqrt{\frac{1}{N_{hh}} \sum_{i=1}^{N_{hh}} \left\| \rho^{L}_i - T^{EE2}_{EE1} \, \rho^{R}_i \right\|_2^2},$$

where $\rho^{L}_i$ and $\rho^{R}_i$ are the 3D Cartesian positions of the left and right end effectors, respectively. $N_{hh}$ is the number of 3D points used for the hand-hand calibration.
• Hand-eye calibration error $E_{hand\_eye}$ was evaluated based on the pixel position difference between the 2D coordinates $\gamma^{pro}_i$, reprojected from the 3D points of the left end effector using the hand-eye transformation matrix, and the actual 2D coordinates $\gamma^{act}_i$,

$$E_{hand\_eye} = \sqrt{\frac{1}{N_{he}} \sum_{i=1}^{N_{he}} \left\| \gamma^{pro}_i - \gamma^{act}_i \right\|_2^2},$$

where $N_{he}$ is the number of points adopted for the hand-eye transformation matrix.
Furthermore, the errors of the reconstruction and segmentation networks are covered by the metrics in the previous subsection, and the registration process also introduces errors. We calculated the Cartesian position error $E_{regis}$ between the pre-operative blood vessel after registration using the transformation matrix $T^{ECM}_{BL}$ and the reconstructed blood vessel,

$$E_{regis} = \sqrt{\frac{1}{N_{re}} \sum_{i=1}^{N_{re}} \left\| T^{ECM}_{BL} \, \varphi^{pre\text{-}op}_i - \varphi^{recon}_i \right\|_2^2},$$

where $\varphi^{pre\text{-}op}_i$ denotes the pre-operative 3D points, while $\varphi^{recon}_i$ denotes the intra-operative 3D reconstructed points referenced to the frame {ECM}. $N_{re}$ is the number of points used for the registration.
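The error measures above all compare two sets of corresponding points. A minimal sketch of the root-mean-square form stated for the camera calibration error is given below. Reusing the same form for the hand-hand, hand-eye and registration errors is an assumption of this illustration, and `rms_error` is a hypothetical helper name.

```python
import math


def rms_error(estimated, actual):
    """Root mean square of Euclidean distances between corresponding points.

    estimated, actual: equally long lists of 2D (pixel) or 3D (mm) tuples, e.g.
    reprojected versus detected chessboard corners for the camera calibration.
    """
    if len(estimated) != len(actual) or not estimated:
        raise ValueError("point lists must be non-empty and of equal length")
    squared = [math.dist(e, a) ** 2 for e, a in zip(estimated, actual)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    reprojected = [(100.0, 50.0), (203.0, 54.0)]
    detected = [(100.0, 50.0), (200.0, 50.0)]
    print(rms_error(reprojected, detected))  # pixel error for this toy pair of corners
```

The unit of the result follows the inputs: pixels for 2D image points and millimeters for 3D Cartesian points.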

B. Framework usability study
To explore the usability of our proposed framework, a common surgical operation, lymphadenectomy, was designed in a dry lab environment based on the proposal of an oncology surgeon. As shown in Fig. 3, a 3D-printed soft blood vessel and kidney were adopted to simulate the surgical scene, and the defined task is to remove the ten lymph nodes (the white soft objects) near the blood vessel one by one without touching it. Frame 1 shows the initial position of the instruments. The users then operate the left manipulator to grasp the six lymph nodes on the left and place them at the bottom-left corner, as shown in Frames 2 and 3. Next, the users operate the right manipulator to remove the four lymph nodes on the right and place them at the same corner, as shown in Frames 4 and 5. Finally, the users move the instruments back to the initial position in Frame 6. Frame 3 and Frame 5 are challenging cases since the instruments are more likely to collide with the blood vessel. Ten human subjects (6 males and 4 females, aged between 21 and 28) with a biomedical background were invited to join our experiment. Before the experiment, the users spent 10 minutes familiarizing themselves with the dVRK system. Then, the experiment was repeated for three rounds, and we analyzed the user data of the last round to avoid the possible influence of the learning curve. In each round, two different modalities were performed in a random sequence for each user: Control (the standard endoscopic scene without AR assistance) and Experiment (the endoscopic scene with AR assistance).
Our AR framework not only visualizes the corresponding pre-operative model on the intra-operative blood vessel, but it also visualizes the minimum distances between the instruments and the blood vessel. We added two gauges to the scenes to visualize the minimum distances (SA: 6cm means the safe area and RA: 3cm means the risk area), and the color of the pre-operative model changes according to the smaller of the left and right distances. Five performance metrics were utilized to observe whether the AR framework can help improve the surgical performance:
• Minimum distance $D_{min}$ between the instruments and the blood vessel:

$$D_{min} = \min\left( d^{1}_{L}, \dots, d^{M}_{L}, d^{1}_{R}, \dots, d^{M}_{R} \right),$$

where $d^{M}_{L}$ is the minimum distance of the left instrument in the M-th frame, while $d^{M}_{R}$ denotes the minimum distance of the right instrument in the M-th frame during the operation.
• Mean distance $D_{mean}$ when the instruments are in the risk area of 3cm:

$$D_{mean} = \frac{1}{M + N} \left( \sum_{i=1}^{M} d^{i}_{L} + \sum_{j=1}^{N} d^{j}_{R} \right),$$

where $M$ is the number of minimum distance points of the left instrument that are smaller than 3cm, while $N$ is the number of minimum distance points of the right instrument that are smaller than 3cm.
• Collision number $N_c$, counted when the distance points are smaller than the threshold $r$, where $r$ is defined as 0.5cm in our case. Here, we regard it as one collision if the points stay below 0.5cm for one consecutive second during the task.
• Overall movement path $S_p$ of the instruments during the operation:

$$S_p = \sum_{m} \left( \left\| C^{m+1}_{L} - C^{m}_{L} \right\|_2 + \left\| C^{m+1}_{R} - C^{m}_{R} \right\|_2 \right),$$

where $C^{m}_{L}$ is the 3D Cartesian coordinate of the left end effector in the m-th frame, while $C^{m}_{R}$ is the 3D coordinate of the right one in the m-th frame.
• Execution time $T_{exe}$ to perform the complete operation:

$$T_{exe} = T_{end} - T_{start},$$

where $T_{start}$ and $T_{end}$ are the start time and the end time of the task, respectively.
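The five metrics in the list above can be computed from logged per-frame data, as the sketch below illustrates. Several details are assumptions of this illustration: the function names, the use of explicit timestamps, and the reading of the collision rule as "one collision per uninterrupted run below the threshold that lasts at least one second".

```python
import math


def min_distance(d_left, d_right):
    """D_min: smallest per-frame minimum distance over both instruments (cm)."""
    return min(min(d_left), min(d_right))


def mean_risk_distance(d_left, d_right, risk_cm=3.0):
    """D_mean: mean of all distance samples that fall inside the risk area."""
    risky = [d for d in list(d_left) + list(d_right) if d < risk_cm]
    return sum(risky) / len(risky) if risky else float("nan")


def count_collisions(distances_cm, timestamps_s, r_cm=0.5, hold_s=1.0):
    """N_c: count runs that stay below r_cm for at least hold_s seconds (assumed rule)."""
    count = 0
    run_start = None   # time at which the current below-threshold run began
    counted = False    # whether the current run has already been counted
    for d, t in zip(distances_cm, timestamps_s):
        if d < r_cm:
            if run_start is None:
                run_start, counted = t, False
            if not counted and t - run_start >= hold_s:
                count += 1
                counted = True
        else:
            run_start = None
    return count


def path_length(positions):
    """S_p contribution of one end effector: sum of frame-to-frame displacements."""
    return sum(math.dist(a, b) for a, b in zip(positions, positions[1:]))


def execution_time(t_start, t_end):
    """T_exe: elapsed task time."""
    return t_end - t_start


if __name__ == "__main__":
    t = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
    d = [1.0, 0.4, 0.3, 0.4, 1.0, 0.2, 1.0]
    print(count_collisions(d, t), path_length([(0, 0, 0), (3, 4, 0), (3, 4, 12)]))
```

The overall path of both instruments is obtained by adding `path_length` for the left and right end effectors.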

IV. RESULTS
The quantitative segmentation results are provided in Table II, and the qualitative comparison is shown in Fig. 5. Here, we provided specific boxes as the prompt for the mask estimation of the SAM model, and the area of the PR curve is not applicable to this model since it predicts the pixel classification directly instead of a probability. We noticed that UNet could provide reliable segmentation quality in our phantom environment, so we adopted this model to estimate the mask in our experiment.
2) Other characterisation: Table III presents the running time distribution of this framework. Here, stereo image preprocessing includes image subscribing, rectification and resizing from 1920×1080 to 640×360 with a fixed scale factor. It can be noticed that three phases, i.e., the disparity map estimation, the binary mask estimation and the AR visualization, take most of the time in this framework, and processing one frame takes 0.1448±0.0079 s in total (6.91 FPS), which can provide smooth visual feedback during the operation. Also, Table IV shows the error distribution in this framework. It can be noted that these errors are small enough in our experiment.

B. Results on framework usability study
Fig. 6 presents the box plots of the user data, and the Wilcoxon signed-rank test (p<0.05) is conducted to explore if there is a significant difference between the control modality and the experiment modality.It shows that there are significant differences when referencing the minimum distance D min , the mean distance D mean and the collision number N c .Table V also provides the average values using the five metrics.With the AR assistance, the minimum distance increases from 0.0472cm to 0.1864cm, and the mean distance in the risk area  The statistical differences show that the AR assistance can help reduce the collision risk between the instruments and the blood vessel when operating.Furthermore, there is no statistical difference when referencing the overall path S p and execution time T exe , which indicates that the AR assistance does not introduce an extra load in operating the robot.The final result of the SUS questionnaire is given in Fig. 7.The average SUS score of the control modality is 66, while the experiment modality has a higher SUS score of 73.The statistical test presents there is a significant difference between the two modalities (p = 0.0104).

V. DISCUSSION
Augmented reality is a popular direction in various fields, including robotic surgery, since it offers a way to enhance safety during the operation, such as the distance visualization in our case or the visualization of tissues that are otherwise invisible during the operation. However, an unsolved challenge is to locate the region of interest so that the pre-operative model can be registered on the corresponding intra-operative soft tissues or organs. In this work, we proposed a vision-based markerless localization approach to perform the AR effect with the distance visualization on the intra-operative scenes, which removes the burden of manual alignment or landmarks. It can be integrated into other tasks and platforms because it does not depend on a specific device.
Advanced neural networks have been proposed for different vision tasks and show promising performance on public or self-made datasets, so it is worth integrating them into practical applications. Hence, we compared the state-of-the-art networks in the stereo reconstruction and segmentation fields to track the soft tissue in the 3D intra-operative space. We adopted the model HSM [14] in our case since it shows a reliable depth estimation in terms of both accuracy and inference time. Furthermore, we added a segmentation mask to extract the region of interest from the whole scene. The segmentation performance relies on the specific scenes, so we captured and annotated an endoscopic dataset based on the dVRK platform in our lab. Based on the segmentation results of 6-fold cross validation, we noticed that UNet [23] can provide a more reliable segmentation quality than the other models, so we adopted it in our framework. It should be noted, however, that the segmentation quality is influenced by the scene and the training strategy. For instance, the transformer-based networks [27]-[29] may produce a better estimation with a large number of annotated training images, which needs to be compared in the specific tasks. In particular, we evaluated the recently emerging large model SAM [30]. Although it does not outperform some of the other models, it should be pointed out that this model has a strong generalization ability in different scenes without fine-tuning. With the rise of large models and data, it can be foreseen that this kind of model may dominate vision tasks since it removes the burden of annotation and training.
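The combination of a disparity map and a binary mask described above can be sketched as follows: only pixels inside the mask are reprojected to 3D with the standard rectified-stereo relations Z = f·B/d, X = (u − c_x)·Z/f and Y = (v − c_y)·Z/f. This is a minimal illustration, not the framework's implementation; the function name, the camera parameters and the tiny input arrays are assumptions chosen for readability.

```python
def reproject_masked_disparity(disparity, mask, focal_px, baseline, cx, cy):
    """Return 3D points (X, Y, Z) for pixels inside the mask.

    disparity: rows of disparities in pixels (rectified stereo pair).
    mask:      rows of 0/1 flags with the same shape (1 = blood vessel).
    focal_px:  focal length in pixels; baseline: in the desired length unit.
    cx, cy:    principal point in pixels.
    Pixels with non-positive disparity are skipped as invalid.
    """
    points = []
    for v, (disp_row, mask_row) in enumerate(zip(disparity, mask)):
        for u, (d, m) in enumerate(zip(disp_row, mask_row)):
            if not m or d <= 0:
                continue
            z = focal_px * baseline / d   # depth from disparity
            x = (u - cx) * z / focal_px   # back-project column
            y = (v - cy) * z / focal_px   # back-project row
            points.append((x, y, z))
    return points


# Hypothetical 2x3 disparity map and mask, with made-up camera parameters.
disparity = [[10.0, 20.0, 0.0],
             [25.0, 50.0, 5.0]]
mask = [[0, 1, 1],
        [1, 1, 0]]

vessel_points = reproject_masked_disparity(
    disparity, mask, focal_px=500.0, baseline=0.5, cx=1.0, cy=0.5)

for point in vessel_points:
    print(point)
```

Note that if the images are resized before disparity estimation, the focal length and principal point expressed in pixels must be scaled by the same factor for these relations to hold.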
By simulating a robot-assisted lymphadenectomy based on the dVRK platform, we obtained surgical data from ten human subjects. It can be observed that the proposed AR framework can increase the distance and reduce the collisions between the instruments and the delicate blood vessel when operating. Moreover, it does not introduce extra physical or cognitive load, since the overall path and the completion time are similar in the two modalities. According to the feedback from the users, they sometimes could not grasp the lymph nodes well because of inaccurate depth perception during the operation. Under this circumstance, the color information of the pre-operative model can help them judge the proper moment to grasp the lymph nodes, especially the pink and red colors (the distances are within 1 cm and 0.5 cm, respectively), which can enhance their confidence and accuracy when grasping the objects. Finally, the average SUS score given by the users increases by 7 points (from 66 to 73) when adopting the AR assistance, which suggests that the proposed AR framework is user-friendly. Nevertheless, one limitation comes from the modeling of surgical instruments. There are many types of instruments in real surgery, such as monopolar scissors, bipolar forceps, Cadiere forceps, clip appliers and needle drivers, so the grippers of the instruments differ, and modeling them as a cylinder introduces slight errors. A possible solution is to add an extra neural network to recognize the tips of the instruments [53] so that the modeling can be more accurate, even though it may slow down the framework. Another limitation comes from the phantom environment used for the simulated surgical operation. In clinical practice, the in vivo environment of patients will be more complex and dynamic.
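The distance-based coloring and the cylinder model discussed above can be illustrated with a short sketch. It is written under stated assumptions: the instrument is approximated by its axis segment plus a radius (strictly a capsule rather than a flat-ended cylinder), the distance is taken to a reconstructed point cloud, and the "default" label for distances beyond 1 cm is a placeholder because only the pink and red thresholds are given in the text. All names and sample coordinates are hypothetical.

```python
import math


def point_to_segment_distance(p, a, b):
    """Euclidean distance from 3D point p to the segment from a to b."""
    ab = [b[i] - a[i] for i in range(3)]
    ap = [p[i] - a[i] for i in range(3)]
    ab_len2 = sum(c * c for c in ab)
    # Projection parameter along the segment, clamped to its end points.
    t = 0.0 if ab_len2 == 0 else sum(ap[i] * ab[i] for i in range(3)) / ab_len2
    t = max(0.0, min(1.0, t))
    closest = [a[i] + t * ab[i] for i in range(3)]
    return math.dist(p, closest)


def min_distance_to_vessel(vessel_points, axis_start, axis_end, radius):
    """Smallest surface-to-point distance; 0 means contact or penetration."""
    return min(
        max(0.0, point_to_segment_distance(p, axis_start, axis_end) - radius)
        for p in vessel_points
    )


def warning_color(distance_cm):
    """Map a distance in cm to a color label using the thresholds in the text."""
    if distance_cm <= 0.5:
        return "red"
    if distance_cm <= 1.0:
        return "pink"
    return "default"  # assumption: color beyond 1 cm is not specified here


# Hypothetical vessel point cloud and two instrument axes (units: cm).
vessel = [(0.0, 0.0, 10.0), (1.0, 0.0, 10.0), (2.0, 0.0, 10.0)]
d_left = min_distance_to_vessel(vessel, (0.0, 0.0, 8.0), (0.0, 0.0, 9.4), 0.3)
d_right = min_distance_to_vessel(vessel, (5.0, 0.0, 5.0), (5.0, 0.0, 7.0), 0.3)

# The overlay color follows the smaller of the two instrument distances.
color = warning_color(min(d_left, d_right))
print(f"left: {d_left:.3f} cm, right: {d_right:.3f} cm, color: {color}")
```

Replacing the capsule with a tip-specific shape would only change `min_distance_to_vessel`; the thresholding step stays the same.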

VI. CONCLUSION
This paper proposed a markerless augmented reality framework that visualizes a pre-operative structure on the intra-operative scenes and also provides minimum distance detection between the instruments and the delicate blood vessel for safety. It can be integrated into other existing robotic platforms and tasks because it does not rely on specific devices. Comprehensive comparison studies were performed to explore the best combination for intra-operative blood vessel reconstruction, and the usability evaluation with ten human subjects shows that the proposed framework can enhance safety during the operation without introducing an extra burden, which shows its potential in clinical applications.
Augmented reality offers a way to enhance surgical safety based on visual feedback, and another popular direction, virtual fixtures, can also enhance safety by providing force feedback. In the next step, we will implement virtual fixtures on the dVRK platform and compare visual feedback with force feedback in surgical assistance.

Fig. 1. The architecture of the markerless augmented reality framework. The image pair is fed into two different networks to estimate the disparity map and the binary mask. Then, the disparity pixels that belong to the blood vessel can be reprojected to generate the 3D intra-operative blood vessel. The pre-operative model can be overlaid on the corresponding region of the endoscopic images by registration, and the minimum distance between the instruments and the blood vessel can be detected based on the dVRK kinematics and the 3D reconstructed blood vessel.

Fig. 3. A simulated lymphadenectomy based on the dVRK platform in a dry lab. The first row presents the standard endoscopic scenes, while the second row shows the augmented scenes with AR assistance. The two gauges show the respective minimum distance between the two instruments and the blood vessel. The pre-operative model is overlaid on the intra-operative blood vessel, and its color changes automatically according to the smaller of the left and right distances. The defined task is to remove the ten lymph nodes while not touching the delicate blood vessel for safety.

Fig. 4. Qualitative surgical scene reconstruction result. The reconstructed 3D surgical surfaces based on six representative models are provided.

Fig. 6. The data of the ten human subjects based on five different metrics. "Control" means the users complete the operation based on the standard endoscopic scenes, while "Experiment" is based on the scenes with the AR assistance. The result of the Wilcoxon signed-rank test is shown as ns: 0.05 < p ≤ 1, *: 0.01 < p ≤ 0.05, **: 0.001 < p ≤ 0.01, ***: 0.0001 < p ≤ 0.001, and ****: p ≤ 0.0001.

Fig. 7. The specific SUS score distribution provided by the users as well as the average SUS scores in two different modalities.

TABLE I. QUANTITATIVE DEPTH ESTIMATION RESULT BASED ON THE SERV-CT STEREO ENDOSCOPIC DATASET (THE IMAGE RESOLUTION IS 720×576).

TABLE II. QUANTITATIVE SEGMENTATION RESULT USING THE SELF-MADE DATASET (RESOLUTION: 1920×1080) BASED ON 6-FOLD CROSS VALIDATION.

TABLE III. THE RUNNING TIME DISTRIBUTION OF THE FRAMEWORK (THE IMAGE RESOLUTION IS RESIZED INTO 640×360).

TABLE V. THE AVERAGE VALUES AND P VALUE BASED ON THE USERS' DATA.