Toward Human-Out-of-the-Loop Endoscope Navigation Based on Context Awareness for Enhanced Autonomy in Robotic Surgery

Although the da Vinci surgical system enhances manipulation dexterity and restores 3D vision in robotic surgery, it requires surgeons to asynchronously control surgical instruments and the endoscope, which hinders a smooth operation. Surgeons frequently reposition the endoscope to maintain a good field of view during operation, potentially increasing surgical time and workload. In this paper, a Human-Out-Of-The-Loop (HOOTL) endoscope navigation control with the assistance of context awareness is proposed to enhance surgical autonomy. A comprehensive comparison study using 8 state-of-the-art networks was conducted to identify the best model for surgical phase recognition. Ten human subjects were invited to perform a classic ring transferring task on a da Vinci Research Kit (dVRK) platform using three different endoscope navigation pipelines: standard endoscope navigation, semi-autonomous endoscope navigation with manual pedal control, and HOOTL endoscope navigation supported by vision-based phase recognition. The experimental results showed that the proposed endoscope navigation approach removes the need to control the pedals and significantly reduces the execution time compared to the other two navigation pipelines. The results of the NASA Task Load Index (NASA-TLX) questionnaire indicate that the proposed endoscope navigation can reduce the physical and mental load for the users.


I. INTRODUCTION
Robot-Assisted Minimally Invasive Surgery (RAMIS) has been widely adopted in medical practice because it shows the potential to reduce intra-operative bleeding and tissue trauma to patients, and to shorten the postoperative hospital stay compared to traditional open or laparoscopic surgery. It also provides surgeons with a 3D view of the surgical field and increases dexterity with the instruments [1], [2]. The da Vinci Surgical System (dVSS, Intuitive Surgical, Sunnyvale, CA, USA) is a representative surgical robot thanks to its commercial success [3], [4]. It is used in various minimally invasive surgeries in hospitals today, such as cholecystectomy [5], prostatectomy [6] and nephrectomy [7]. The dVSS overcomes a major limitation of laparoscopic surgery, in which the surgeon manipulates the surgical instruments while an assistant manipulates the endoscope, requiring a high degree of cooperation between them. In comparison, the dVSS allows the surgeon to control the instruments or the endoscope independently without the involvement of an assistant. Nevertheless, the da Vinci robot does not support the surgeon in simultaneously operating the surgical instruments and the endoscope. To position the endoscope, the surgeon needs to interrupt the manipulation of the surgical instruments and operate the foot switches on the pedal tray and the manipulators on the console, which may affect the smoothness of the operation and prolong the operating time [8].
Autonomy in medical robots is a promising but challenging direction that has attracted much attention in many research laboratories [9]. Furthering autonomy in medical robots has the potential to increase the accuracy of operations and reduce the workload of surgeons [10]. However, according to the definition of autonomy in [11], the dVSS does not yet have autonomy, as it is under the full control of surgeons during surgery. One popular topic for increasing the autonomy of the dVSS is autonomous endoscope navigation. It offers the possibility of freeing surgeons from controlling the endoscope so that they can concentrate on controlling the surgical instruments, which may relieve the fatigue of surgeons and improve surgical performance. Several works have explored this field. For instance, the authors in [8] proposed an autonomous camera navigation approach based on the da Vinci robot. The camera can track the surgical tool tips by utilizing the kinematic data, and users can select the tracking mode (tracking a single tool tip or the middle point between the two tools) by depressing the foot switch. The experimental results based on a ring transferring task showed that the designed navigation system led to better operation performance for the users than the standard setup. As an extension, the authors of [12], [13] adopted the autonomous camera navigation approach to perform an ex vivo neobladder reconstruction in a dry lab. Ten urologists were invited to conduct this operation, and the results showed that the camera navigation method can boost the system usability and reduce the operation time compared to the standard camera control. Similarly, the authors in [14] implemented an autonomous endoscope navigation approach by tracking the middle point of the instruments in a virtual reality simulator for surgical skill training. Referencing a time-accuracy metric and a camera-related metric, they found that novices obtained better skill improvement with the assistance of autonomous endoscope navigation than novices who trained with manual endoscope control.
The above human-in-the-loop endoscope navigation strategies require users to manually switch between different camera tracking modalities. To advance the autonomy of the endoscope motion, the authors in [15] proposed a navigation approach based on online gesture recognition. They exploited kinematic data containing 17-dimensional features to introduce situation awareness for endoscope motion mode switching without the involvement of humans. However, the reported prediction accuracy of 0.84 leaves room for improvement. Recognizing the surgical situation from kinematic data may not be reliable: surgeons may use different manipulation gestures even when performing the same operation, resulting in variable kinematic data such as the pose and velocity of the end-effectors. Furthermore, in addition to the information of the surgical instruments, background information is also a critical resource to be exploited in context awareness [16]. Notably, surgeons in RAMIS perceive the surgical context from the direct input of surgical video streams rather than from kinematic data.
Vision-based surgical context recognition has gradually become a mainstream direction, because the rapid development of deep learning technology has led to promising recognition performance in this field [17], [18], [19]. EndoNet is a well-known neural network proposed in [20] to recognize the surgical phases and tool labels using videos of cholecystectomy surgeries. It relies on the AlexNet architecture [21] with a customized feature processing module to perform a multi-task prediction. After that, the authors in [22] also proposed a multi-task recurrent convolutional network to predict phase labels and tool labels simultaneously. They designed an end-to-end network integrating residual units [23] and Long Short-Term Memory (LSTM) [24] modules to process features of video clips, and obtained high accuracy in both tool and phase recognition. Similarly, the authors in [25] proposed a unified framework integrating deep learning and knowledge representation to predict the surgical phases, steps and tool labels using 9 videos of robot-assisted partial nephrectomy, but its non-end-to-end prediction hinders online deployment in practice. Next, the authors in [26] built a memory bank module to model global features of long-range video clips, together with another branch containing ResNet and LSTM to extract high-level feature representations of short-range video clips; the features at different scales were then processed by an attention module for the final phase classification. The authors in [27] also introduced a memory bank module and a transformer [28] to perform accurate phase recognition. Global feature modeling may improve prediction performance, but it hinders real-time phase recognition.
Online context recognition provides the possibility to make surgical decisions automatically. In [29], the authors integrated context awareness into imitation learning to implement a human-robot shared control. The surgical instruments can move adaptively either under the control of users or following the trajectory generated by imitation learning, which is determined by three phases recognized from video streams. The experiments were conducted in a ring transferring task, and the results showed that the assistance of context awareness can support autonomous surgical decisions and improve the manipulation performance of users.
A context awareness based endoscope navigation approach is proposed in this paper to implement a Human-Out-Of-The-Loop (HOOTL) endoscope control. It can be classified as task autonomy according to the definition of autonomy in [11], since the endoscope control modes and the surgical context are predefined based on specific tasks. Unlike the work of [15], where kinematic data was used for online gesture recognition, we adopted vision (i.e., images) to recognize the surgical context, aiming to perform online phase recognition rapidly and with high accuracy. Compared to kinematics-based methods, vision-based context recognition generalizes well since it extracts the features of both the surgical instruments and the background scene. To the best of our knowledge, this is the first work to utilize vision-based phase recognition to further the autonomy of endoscope navigation. We introduced a classic ring transferring task to compare (1) the standard manual endoscope navigation, (2) the semi-autonomous endoscope navigation [8], and (3) the proposed HOOTL endoscope navigation, and explored the influence of different levels of endoscopic autonomy by analyzing the data captured from ten human subjects.
In summary, this paper has the following contributions:
(a) It is the first time that vision-based surgical phase recognition is integrated into endoscope navigation to implement a HOOTL endoscope control.
(b) A comprehensive comparison study based on 8 state-of-the-art phase recognition networks involving images and kinematics was performed.
(c) A user study with ten participants was conducted to explore the feasibility of the proposed HOOTL endoscope navigation.

II. METHODS
Fig. 1 presents the three endoscope navigation pipelines, including the standard endoscope navigation template.
The surgical instruments are mounted on the Patient Side Manipulators (PSMs), and the endoscope is mounted on the Endoscopic Camera Manipulator (ECM). The mechanical structure allows these arms to satisfy Remote Center of Motion (RCM) constraints during operation, to avoid conflicts with the skin entry point of the abdomen, and these RCM points can be repositioned by the passive Set Up Joints (SUJs) [30]. There is a foot pedal tray with four pedals at the master console: the Clutch pedal is used to extend the operating space of the surgical instruments, the Camera pedal is used to adjust the viewpoint of the endoscope, while the Bicoag and Coag pedals are not activated in the setup.
Considering that the endoscope is driven by the ECM, we provide a description of the kinematics of the ECM arm. The ECM is an actuated arm with 4 Degrees of Freedom (DoFs) that can move the endoscope around the RCM point following an RRPR sequence, where R is a revolute joint and P is a prismatic joint. Fig. 2 shows the kinematics of the ECM. The ECM end-effector pose can be described by a homogeneous transformation matrix $T^{RCM}_{EE}$ between the end-effector frame $F_{EE}$ and the RCM frame $F_{RCM}$, and it can be calculated by applying the standard Denavit-Hartenberg (DH) convention to the kinematic chain. Table I gives the DH parameters of the ECM arm [31], [32].
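As an illustration of how a pose such as $T^{RCM}_{EE}$ is obtained from DH parameters, the following minimal sketch chains standard DH link transforms for an RRPR arm. The function names and the rows in `PLACEHOLDER_DH` are hypothetical and chosen only to make the example runnable; they are not the values of Table I.

```python
import math

def dh_transform(a, alpha, d, theta):
    """Standard DH link transform: Rot_z(theta) Trans_z(d) Trans_x(a) Rot_x(alpha)."""
    ct, st = math.cos(theta), math.sin(theta)
    ca, sa = math.cos(alpha), math.sin(alpha)
    return [
        [ct, -st * ca, st * sa, a * ct],
        [st, ct * ca, -ct * sa, a * st],
        [0.0, sa, ca, d],
        [0.0, 0.0, 0.0, 1.0],
    ]

def matmul4(A, B):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def forward_kinematics(dh_rows, q):
    """Chain the link transforms; each row is (joint_type, a, alpha, d, theta)."""
    T = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for (joint_type, a, alpha, d, theta), qi in zip(dh_rows, q):
        if joint_type == "R":
            theta += qi   # revolute joint: the joint value adds to theta
        else:
            d += qi       # prismatic joint: the joint value adds to d
        T = matmul4(T, dh_transform(a, alpha, d, theta))
    return T

# Hypothetical placeholder rows for an RRPR chain (NOT the Table I values).
PLACEHOLDER_DH = [
    ("R", 0.0, math.pi / 2, 0.0, math.pi / 2),
    ("R", 0.0, -math.pi / 2, 0.0, -math.pi / 2),
    ("P", 0.0, math.pi / 2, 0.0, 0.0),
    ("R", 0.0, 0.0, 0.0, 0.0),
]

T = forward_kinematics(PLACEHOLDER_DH, [0.0, 0.0, 0.1, 0.0])
tip = [T[0][3], T[1][3], T[2][3]]  # end-effector position in the base frame of the chain
```

With these placeholder rows only the prismatic joint contributes a translation, so the tip lies at a distance equal to the insertion depth from the origin of the chain.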

B. Task Description
The ring transferring task is one of the Fundamentals of Laparoscopic Surgery (FLS) tasks for surgical skill assessment [33], and it is a common scenario for measuring the manipulation level of users on the dVRK platform [8], [14], [29], [34]. We adopted this task to evaluate the proposed endoscope navigation strategy, as demonstrated in Fig. 3. The task can be divided into three phases:
• LH: Operate the left instrument to pick up the black ring on Peg A, transfer it to Peg B, place it down and then pick it up again.
• BM: Operate both left and right instruments to swap the ring from the left instrument to the right instrument on the top of Peg C, then transfer it to Peg D.
• RH: Operate the right instrument to place the ring down and pick it up again on Peg D, then transfer it to Peg A and place it down.
The diameter of the circular pegs is slightly smaller than that of the black ring, and the users need to avoid collisions with the pegs when manipulating the ring, so the users need a good Field Of View (FOV) using the endoscope during operation.

C. Endoscope Control
Instead of image-based endoscope navigation in the 2D image plane, position-based endoscope navigation is introduced to allow endoscope motion in the 3D surgical space [35], and the positions of the surgical instruments are considered as tracking objects for the endoscope. Three tracking modes based on the instruments are introduced, including the left end-effector, the right end-effector and a variable-scale center point referencing both the left and right end-effectors [8]. The base frame is positioned at the base of the dVRK patient cart {CART}, and all subsequent position coordinates refer to the same base frame.
• Mode 1: The endoscope keeps tracking the end-effector position of the left instrument. The position of the endoscope tip $P_{T}^{Endo}$ is formulated in terms of the end-effector position of the left instrument $P_{L}^{EE}$, the RCM position of the endoscope $P_{RCM}^{Endo}$, and the zooming factor $z_l$, which is set to 0.1 in this work; $\|\cdot\|$ represents the Euclidean norm.
• Mode 2: The endoscope tracks the center point position $P^{SC}$ between the left and right end-effectors in a variable-scale manner. The endoscope tip position $P_{T}^{Endo}$ is formulated in terms of the positions of the left and right end-effectors, $P_{L}^{EE}$ and $P_{R}^{EE}$, the zooming factor $z_m$, which is equal to 0.08, and the scale factor $s_d$, which controls the variable zooming based on the distance between the left and right end-effectors and is set to 0.4. In this way, it implements a variable zooming effect influenced by both the center point and the distance between the two instruments.
• Mode 3: The calculation of the position of the endoscope tip is similar to Mode 1, but the endoscope keeps tracking the end-effector of the right instrument, where the zooming factor $z_r$ is equal to 0.1 as well.
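The three tracking modes can be illustrated with a minimal sketch. The exact tip-position equations are not reproduced in this text, so the geometry below is an assumption: the endoscope tip is placed on the line from the RCM point to the tracked point, at a standoff equal to the zooming factor, and in Mode 2 the standoff grows linearly with the instrument separation through $s_d$. All function names are hypothetical, and positions are taken to be 3D coordinates in a common base frame.

```python
import math

def _sub(a, b):
    return [x - y for x, y in zip(a, b)]

def _norm(v):
    return math.sqrt(sum(x * x for x in v))

def tip_target(track_point, rcm, standoff):
    """ASSUMED geometry: tip on the RCM-to-target line, `standoff` short of the target."""
    direction = _sub(track_point, rcm)
    length = _norm(direction)
    if length == 0.0:
        raise ValueError("tracked point coincides with the RCM point")
    return [p - standoff * d / length for p, d in zip(track_point, direction)]

def mode1_target(ee_left, rcm, z_l=0.1):
    """Mode 1: track the left end-effector."""
    return tip_target(ee_left, rcm, z_l)

def mode2_target(ee_left, ee_right, rcm, z_m=0.08, s_d=0.4):
    """Mode 2: track the center point with a separation-dependent standoff (assumed form)."""
    center = [(l + r) / 2.0 for l, r in zip(ee_left, ee_right)]
    separation = _norm(_sub(ee_left, ee_right))
    return tip_target(center, rcm, z_m + s_d * separation)

def mode3_target(ee_right, rcm, z_r=0.1):
    """Mode 3: track the right end-effector."""
    return tip_target(ee_right, rcm, z_r)

# Hypothetical positions, RCM at the origin.
rcm = [0.0, 0.0, 0.0]
left, right = [-0.05, 0.0, 0.3], [0.05, 0.0, 0.3]
target = mode2_target(left, right, rcm)
```

Under this reading, moving the instruments apart pulls the endoscope back so that both tools stay in view, which matches the variable zooming effect described for Mode 2.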

D. Context Awareness Based Tracking Mode Switching
Vision-based context awareness is introduced to enhance the autonomy of endoscope navigation by predicting three different operation phases in a ring transferring task, as shown in Fig. 3. Specifically, the endoscope adopts Mode 1 (tracking the left end-effector) when the recognized context is "LH", switches to Mode 2 (tracking the center point of the two end-effectors) when the phase is recognized as "BM", and performs Mode 3 (tracking the right end-effector) with a prediction of "RH". This navigation strategy ensures that the endoscope continuously provides a good FOV for users during operation.
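The phase-to-mode rule described above amounts to a lookup table. The sketch below is a minimal illustration; the function name is hypothetical, and holding the current mode when the label is not one of the three phases is an assumption, not a behavior stated in this paper.

```python
# Phase label -> endoscope tracking mode, as described in the text.
PHASE_TO_MODE = {"LH": 1, "BM": 2, "RH": 3}

def select_mode(predicted_phase, current_mode):
    """Return the tracking mode for a predicted phase.

    ASSUMPTION: an unrecognized label keeps the current mode unchanged.
    """
    return PHASE_TO_MODE.get(predicted_phase, current_mode)

mode = 1
for phase in ["LH", "LH", "BM", "RH"]:
    mode = select_mode(phase, mode)
```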
A series of neural networks [20], [22], [26], [29], [36] were compared to identify the best configuration for predicting specific phases based on images. We also reproduced the kinematics-based model proposed in [15], which utilizes 17-dimensional kinematic features to recognize the phases during operation. To construct the database for model training, we captured seven videos (containing 10763 images with the corresponding kinematic data) by performing the ring transferring task on the dVRK platform. Following the image preprocessing approach in [26], we resized the images from 1920×1080 to 250×250 to reduce the inference time of the networks for real-time phase recognition. The phase labels of the images were manually annotated using the open source software Anvil [37]. It supports importing a single video, provides custom tracks for frame-by-frame phase labelling, and outputs an XML file containing the specific time and labels of all frames after annotation. The models [22], [26] rely on the input of video clips, and the model [15] utilizes the input of kinematics with consecutive frames, while the other networks require a single image frame as the input. To obtain the best performance from the models that use temporal features, we considered different downsampling rates when constructing the input of video clips and kinematics, including 1, 2, 5, 10, 15, 20 and 25. Here, a downsampling rate of 1 means that the frames making up the input are consecutive, while a downsampling rate of 25 means that there are 24 frames between two sampled frames when constructing the video clip and kinematic input. The other parameters are the standard configurations of the respective models. It should be noted that the downsampling operation was only used to construct video clips and kinematics for the models that take inter-frame information as input, while all the collected images (for the vision-based models) and kinematic data (for the kinematics-based model) were used during the training and testing stages to maximize the utilization of the dataset and enhance the generalization.
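The meaning of the downsampling rate can be sketched as an index computation. The function name, the choice of a clip that ends at the current frame, and the clamping at the start of the video are assumptions for illustration; the clip length is a free parameter here.

```python
def clip_indices(end_frame, clip_len, rate):
    """Frame indices of a clip ending at `end_frame`.

    Consecutive samples are `rate` frames apart, so `rate - 1` frames are
    skipped between them. ASSUMPTION: indices before the first frame are
    clamped to 0 so that early frames still yield a full-length clip.
    """
    if clip_len < 1 or rate < 1:
        raise ValueError("clip_len and rate must be positive")
    return [max(0, end_frame - (clip_len - 1 - i) * rate) for i in range(clip_len)]

dense = clip_indices(100, 4, 1)    # consecutive frames
sparse = clip_indices(100, 4, 25)  # 24 frames skipped between samples
```

A larger rate covers a longer time window with the same number of frames, which is why the best rate can depend on how fast a given user performs the task.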
To conduct a quantitative evaluation of the models, we consider six common metrics, including the Dice coefficient. Similar to the evaluation strategy in [22], [26], the accuracy was calculated at video level by considering the correct classifications in the entire video, while the other metrics were calculated at phase level and then averaged over all phases to obtain the result for the entire video. 7-fold cross validation was adopted to ensure the reliability of the evaluation results. The models were trained on a server with an NVIDIA A100 GPU (40 GB), and evaluated on a local PC with an NVIDIA RTX 3080 GPU (16 GB).
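The difference between video-level and phase-level evaluation can be sketched for the two metrics named in this text, accuracy and the Dice coefficient. The function names and the toy label sequences are hypothetical, and the Dice form 2TP / (2TP + FP + FN) is the usual definition, assumed here rather than taken from this paper.

```python
def video_accuracy(y_true, y_pred):
    """Video-level accuracy: fraction of frames classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def phase_averaged_dice(y_true, y_pred, phases=("LH", "BM", "RH")):
    """Dice per phase, then averaged over the phases (phase-level metric)."""
    scores = []
    for phase in phases:
        tp = sum(t == phase and p == phase for t, p in zip(y_true, y_pred))
        fp = sum(t != phase and p == phase for t, p in zip(y_true, y_pred))
        fn = sum(t == phase and p != phase for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy sequences for illustration only.
truth = ["LH", "LH", "BM", "BM", "RH", "RH"]
pred = ["LH", "BM", "BM", "BM", "RH", "RH"]
acc = video_accuracy(truth, pred)
dice = phase_averaged_dice(truth, pred)
```

Averaging per phase gives short phases the same weight as long ones, so a model that misses a brief phase is penalized more than the video-level accuracy alone would show.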

E. User Study
Ten non-expert human subjects (22 to 29 years old, 6 males and 4 females, only one left-handed) were invited to perform the ring transferring task on the dVRK platform in a dry lab. Written informed consent was obtained from all the subjects included in the study (Institutional Review Board (IRB) approval number: 30/2023). Three endoscope navigation pipelines were adopted for the participants:
• Control (C): the users perform the task based on a standard endoscope navigation pipeline by controlling both the Camera foot switch and the Master Tool Manipulators (MTMs).
• Semi-autonomous experiment (E1): the users perform the task in a semi-autonomous way [8] by controlling the Bicoag pedal to manually select different endoscope tracking modes.
• Autonomous experiment (E2): the users perform the task with HOOTL endoscope navigation, i.e., the neural network keeps recognizing the specific phase and switches between the endoscope tracking modes automatically.
Before the experiments, the participants were given around 30 minutes to become familiar with the da Vinci robot as well as the three different experiments. Once the users were ready to operate the robot, the formal experiments began and were repeated for three rounds. In each round, the three endoscope navigation experiments were performed in random order. The initial distance between the endoscope and the surgical instrument was set to 0.1 m, so the users needed to actively position the endoscope to maintain a good FOV. Furthermore, the function of the Bicoag foot switch was retained in the autonomous experiment (E2), in case the users wanted to switch the endoscope tracking modes by themselves because of a wrong phase prediction by the network.

F. Performance Metrics
To quantitatively evaluate the proposed endoscope navigation system, we adopted four common performance metrics to analyze the data captured from the users:
1) The number of presses of the endoscope-related foot switches, $N_{ENDO}$. In the control experiment (C), the users need to depress the Camera pedal to position the endoscope, while the Bicoag pedal is used in the semi-autonomous experiment (E1), as well as in the autonomous experiment (E2) if necessary. It is therefore expressed by a unified formula over the $n$ data points collected in the whole task, where $S_{CAM} = 1$ means that the Camera pedal signal was detected as on, and $S_{BIC} = 1$ denotes that the user depressed the Bicoag pedal.
2) The movement path of the endoscope tip during operation, $P_{ENDO}$, formulated in terms of $P_{T,k}^{Endo}$, the k-th 3D position of the endoscope tip.
3) The movement path of the MTMs needed to perform the whole task, $P_{MTM}$, defined in terms of $P_{L,k}^{M}$ and $P_{R,k}^{M}$, the end-effector positions of the left and right MTMs at the k-th data point, respectively.
4) The execution time $T_{exe}$ needed to complete the entire task.
The Wilcoxon signed-rank test ($p < 0.05$) was adopted to check whether there are significant differences among the different endoscope navigation experiments. After the experiments, the users filled out a NASA-TLX questionnaire [38] by giving scores on six specific questions: Mental Demand, Physical Demand, Temporal Demand, Performance, Effort and Frustration. It helps to understand the subjective workload associated with the three different endoscope navigation pipelines.
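The first three metrics can be sketched from logged data as follows. The formulas themselves are not reproduced in this text, so two choices here are assumptions: a press is counted as a rising edge (0 to 1) of the binary pedal signal, and $P_{MTM}$ is taken as the sum of the left and right path lengths. The function names and sample data are hypothetical.

```python
import math

def count_presses(signal):
    """Count rising edges (0 -> 1) in a binary pedal signal (ASSUMED definition)."""
    presses, previous = 0, 0
    for s in signal:
        if s == 1 and previous == 0:
            presses += 1
        previous = s
    return presses

def n_endo(s_cam, s_bic):
    """Presses of the Camera pedal plus presses of the Bicoag pedal."""
    return count_presses(s_cam) + count_presses(s_bic)

def path_length(points):
    """Sum of Euclidean distances between consecutive 3D positions."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))

def p_mtm(left_points, right_points):
    """ASSUMED form: left MTM path length plus right MTM path length."""
    return path_length(left_points) + path_length(right_points)

# Hypothetical logged samples.
presses = n_endo([0, 1, 1, 0, 1], [0, 0, 1, 1, 0])
p_endo = path_length([(0, 0, 0), (3, 4, 0), (3, 4, 12)])
```

Counting edges rather than samples keeps a long press from being counted many times when the pedal signal is logged at a high rate.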

A. Comparison Study of the Networks for Context Awareness
Considering that some models require the input of video clips [22], [26] or kinematics [15] with consecutive frames, a preliminary comparison was made using different downsampling rates to generate the input of video clips and kinematics, and the result is shown in Fig. 4. Here, the TMRNet model contains three different architectures, and we tested their performance respectively, from S1 (the simplest structure) to S3 (the most complex structure). Then, we used the video-based accuracy metric to select the models with the best performance at the appropriate downsampling rate, and the quantitative results can be seen in Table II. The vision-based models ResNeSt [36] (single frame input) and MTRCNet-CL [22] (video clip input) achieved the highest recognition accuracy of 0.99, while the kinematics-based model [15] had the worst accuracy of 0.79 in predicting the phases. We also provide a qualitative evaluation of the phase recognition using a test video, as shown in Fig. 5. Then, we integrated the two models with the highest accuracy, ResNeSt and MTRCNet-CL, into our dVRK platform. After some practical tests by the participants, we observed that ResNeSt generalizes better than MTRCNet-CL, so ResNeSt was adopted for the following user study. A possible explanation is that MTRCNet-CL uses temporal features for phase recognition; however, the inter-frame information varies across users because of factors such as manipulation speed, and a fixed downsampling rate may not be able to handle such variability. In contrast, ResNeSt obtained a reliable prediction because it requires only a single frame as the input, i.e., it is not affected by the inter-frame variability, which ensures generalization across the manipulation of different participants. The mean inference time of ResNeSt is 9.36 ms (about 106 FPS), so it can provide real-time phase recognition during operation.

B. System Usability Evaluation
Fig. 6 shows the data distribution of the ten participants who performed the ring transferring task in the three endoscope navigation pipelines. Referencing the metric $N_{ENDO}$, the result using the proposed HOOTL autonomous endoscope navigation (E2) is significantly different from both the standard navigation (C) and the semi-autonomous navigation (E1). Nobody used the Bicoag pedal in the autonomous endoscope navigation, though this foot switch was retained in this modality. This means that the selected vision-based model can perform accurate and reliable phase prediction under the manipulation of different users, which implements a HOOTL endoscope control. Also, there are significant differences between E2 and the other two endoscope navigation experiments, C and E1, when referencing $T_{exe}$. In the standard endoscope control, the users need to manually move the endoscope or the surgical instruments separately, which is time-consuming. Similarly, in the semi-autonomous endoscope navigation pipeline, the users need to suspend their manipulation of the instruments and manually control the Bicoag pedal to switch between endoscope tracking modes. In comparison, the users can concentrate on the manipulation of the surgical instruments in the HOOTL autonomous endoscope navigation, which results in a smooth operation. When considering the other two metrics, $P_{ENDO}$ and $P_{MTM}$, there is no significant difference between E1 and E2, while the result of C is significantly different from the other two navigation pipelines. Specifically, the endoscope movement path in standard navigation is significantly shorter than in the other two pipelines, since manual navigation cannot maintain a good FOV during the task. Furthermore, the MTM path in standard navigation is significantly longer than in the semi-autonomous and HOOTL autonomous pipelines, since the users need to operate the MTMs for both the surgical instrument positioning and the endoscope positioning in the standard navigation pipeline, which potentially introduces a high physical load.
The specific mean values and standard deviations are provided in Table III. With the assistance of context awareness based HOOTL endoscope navigation, the execution time $T_{exe}$ is significantly reduced by 24.67% compared to the standard setup, and reduced by 6.57% compared to the semi-autonomous endoscope navigation. More importantly, it removes the need to control the pedals for adjusting the position of the endoscope during operation, while the mean number of pedal presses is 7.2 in the standard control (C) and 2 in the semi-autonomous modality (E1).
Fig. 7. The scores of the NASA-TLX questionnaire provided by the participants based on six specific questions. The mean scores based on the three endoscope navigation pipelines are shown as solid lines, and their standard deviations form the respective semi-transparent areas.
Fig. 7 presents the results of the NASA-TLX questionnaire filled out by the ten participants. The HOOTL autonomous endoscope navigation gets the highest mean score when referencing the metric of Performance, while the standard endoscope navigation gets the highest mean scores in the other five aspects. The users intuitively think that they have the best performance in the HOOTL experiment, while the standard navigation with manual endoscope manipulation introduces the highest workload in both physical and mental aspects. The statistical test shows that the weighted scores of the HOOTL autonomous endoscope navigation are significantly different from those of the standard navigation and the semi-autonomous navigation (p = 0.0020 and 0.0039).
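To show how a paired comparison over ten participants yields p-values of this size, the following minimal sketch computes an exact two-sided Wilcoxon signed-rank test by enumerating all sign assignments. It assumes no zero differences and no tied absolute differences, the function name and the scores are hypothetical, and it is not a statement about how the statistics in this paper were computed. With ten pairs, the smallest attainable two-sided values are 2/1024 ≈ 0.0020 and 4/1024 ≈ 0.0039.

```python
from itertools import product

def wilcoxon_exact_two_sided(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    ASSUMPTIONS: no zero differences and no ties among |differences|.
    Returns (W, p) with W = min(W+, W-).
    """
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    abs_diffs = [abs(d) for d in diffs]
    if 0 in abs_diffs or len(set(abs_diffs)) != n:
        raise ValueError("zero or tied differences are not handled in this sketch")
    order = sorted(range(n), key=lambda i: abs_diffs[i])
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w = min(w_plus, n * (n + 1) // 2 - w_plus)
    # Under the null hypothesis every sign pattern is equally likely.
    count = 0
    for signs in product((0, 1), repeat=n):
        if sum(r for r, keep in zip(range(1, n + 1), signs) if keep) <= w:
            count += 1
    return w, min(1.0, 2 * count / 2 ** n)

# Hypothetical scores: condition A is higher than condition B for every participant.
a = [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
b = [10] * 10
w_all, p_all = wilcoxon_exact_two_sided(a, b)
```

If exactly one participant had reversed the trend with the smallest difference, the same procedure would give W = 1 and p = 4/1024.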

IV. DISCUSSION
Autonomy in robotic surgery is an important research direction to reduce the operating burden for surgeons and increase the efficiency of surgery [11], and it can be mainly divided into two aspects: autonomy of the surgical instruments and autonomy of the endoscope. Endoscope autonomy is more likely to be deployed in existing medical robots in the foreseeable future than surgical instrument autonomy, because endoscopes do not directly contact soft tissues and organs during operation, while surgical instruments may damage delicate structures if accidents occur during autonomous operation. In this paper, we proposed a HOOTL endoscope navigation based on context awareness. Context awareness can adapt to the current surgical situation in the operating room and provide intelligent intraoperative decisions [39], which motivated us to integrate it into endoscope navigation to perform a HOOTL control, aiming at higher surgical benefits.
Deep learning has gradually taken a dominant position in various fields of medical image processing [40], [41], including surgical context recognition, so we introduced vision-based neural networks to recognize specific surgical phases, which were used to automatically switch the endoscope tracking modes. After a comprehensive comparison study and physical tests on the dVRK platform, we found that ResNeSt can provide satisfactory phase recognition in terms of accuracy and speed, so it was adopted to perform the user study in the ring transferring task. Nevertheless, it should be noted that the models using temporal features may perform better if they can adaptively process the inter-frame difference rather than adopting a fixed downsampling rate for the video clips. As more network architectures are proposed, models based on temporal features may generalize better, as the temporal features may be key information to be exploited in surgical context recognition.
To evaluate the difference between the three endoscope navigation pipelines in Fig. 1, we invited ten participants to perform a classic ring transferring task. The result in Fig. 6 showed that the context awareness based HOOTL endoscope navigation approach allows the users to concentrate on the manipulation of the surgical instruments and relieves the burden of manipulating the endoscope, i.e., the users can perform a smooth operation using the HOOTL autonomous endoscope navigation, which helps shorten the operation time. According to the feedback of the participants after the experiments, the standard endoscope navigation pipeline needs improvement, since they have to control both the foot switch and the MTMs while suspending the operation of the instruments when they want to position the endoscope, which makes their manipulation intermittent. In comparison, the HOOTL endoscope navigation is preferred, as they do not need to control either the pedal or the MTMs to position the endoscope, and they can concentrate on the surgical instruments to complete the task. The questionnaire scores also indicate that the HOOTL endoscope navigation can reduce the workload and improve performance from a subjective perspective.
A possible limitation comes from context awareness. Although the selected model showed strong generalization to the unseen users in our task, the complex clinical environment may affect the accuracy of context recognition. To address this issue, there are some possible solutions: (1) A foot pedal can be retained to allow users to manually switch endoscope tracking modes to correct network errors if necessary.
(2) Some authors mentioned the integration of vision and kinematics as future work that may improve context recognition [15], [16]. The integration of kinematic data may improve the generalization of the model, but it may also weaken the performance, which requires further research to verify. In our user study, the vision-based model showed a satisfactory recognition performance. (3) Emerging large models have shown high generalization, such as Segment Anything (SAM) [42] in the field of image segmentation. Better context recognition models are expected in the near future.
Furthermore, we adopted a fixed orientation in the endoscope navigation, i.e., the endoscope maintains a horizontal shooting angle, as in some other works [15], [43]. Although the endoscope usually maintains a horizontal angle during surgery, surgeons sometimes manually rotate the endoscope for an optimal FOV. How to implement adaptive orientation adjustment in autonomous endoscope navigation remains a problem to be solved. Finally, joint limit avoidance in autonomous endoscope navigation also needs to be further optimized. A promising solution proposed in [31] can be introduced to optimally constrain the motion of the ECM joints.

V. CONCLUSION
In this paper, a context-aware HOOTL endoscope navigation is proposed for augmented autonomy of robotic surgery. A comprehensive comparison study was conducted to find the best network among a number of state-of-the-art models for phase recognition. To evaluate the proposed HOOTL endoscope navigation, a user study involving ten participants was conducted based on a classic ring transferring task using three different endoscope navigation pipelines, and the experimental results showed that the proposed strategy can maintain a good FOV of the endoscope by automatically switching between endoscope tracking modes based on the recognized phases. It reduces the workload of the users and significantly shortens the operation time, showing its potential to be integrated into clinical practice.
To increase the practicality of the proposed strategy in clinical use, we consider introducing more endoscope tracking modes and more elaborate surgical tasks involving more phases.In addition, some surgeons who have experience in robotic surgery can be invited to expand our user base.

Fig. 1. A demonstration of the dVRK components and the three different endoscope navigation pipelines. (a) shows the standard pipeline for manipulating the endoscope, which requires a combination of the foot switch and manipulators; (b) presents the semi-autonomous endoscope navigation pipeline [8], where the users manually select the endoscope tracking modes by controlling a foot switch; (c) shows the autonomous endoscope navigation strategy to perform a HOOTL control based on context awareness.

Fig. 3. A demonstration of the ring transferring task based on the dVRK platform. The red path represents "LH", where the user needs to operate the left instrument to move the ring from Peg A to Peg B; the green path denotes "BM", where the user performs a bimanual operation to swap the ring and transfer it to Peg D; and the blue path demonstrates "RH", where the user uses the right instrument to move the ring from Peg D back to Peg A. It also illustrates the definition of the reference frames, and the base frame is positioned at the base of the patient cart.

Fig. 4. The evaluation result using different downsampling rates to construct the video clips and kinematics with consecutive frames used as input to the models. For the single frame input models, different downsampling rates do not affect the performance. The y-axis shows the values of the accuracy metric based on 7-fold cross validation.

Fig. 5. A qualitative comparison of the phase recognition using the advanced models based on a ring transferring task. The color-coded ribbon illustrates the three defined phases along the temporal direction.

TABLE I. DH PARAMETERS OF THE ECM ARM. $q_1, \ldots, q_4$ ARE THE JOINT VALUES

TABLE II. QUANTITATIVE EVALUATION OF THE NETWORKS BASED ON 7-FOLD CROSS VALIDATION. THE NUMBER AFTER THE MODEL NAME IS THE BEST DOWNSAMPLING RATE USED TO CONSTRUCT THE VIDEO CLIP AND KINEMATIC INPUT, AND THE INPUT IS A SINGLE FRAME WHEN THE NUMBER IS 0

TABLE III. THE MEAN VALUES AND STANDARD DEVIATIONS FROM THE USERS