Surgical Tools Detection Based on Training Sample Adaptation in Laparoscopic Videos

The performance of object detection methods plays an important role in the recognition of surgical tools and is a key link in the automated evaluation of surgical skills. In this paper, we propose a novel framework for one-stage object detection based on a sample adaptive process controlled by reinforcement learning, which retains the speed advantage of one-stage detection while achieving higher accuracy than two-stage object detection methods. We evaluate the proposed framework on m2cai16-tool-locations and AJU-Set, two datasets that cover seven surgical tools and provide spatial information collected from hospital gallbladder surgery videos. The experiments show that our proposed framework enables the one-stage object detection method to achieve 70.1% and 77.3% accuracy on m2cai16-tool-locations and AJU-Set, respectively. We further validate the effectiveness of our proposed framework by analyzing the usage patterns, motion trajectories, and mobile values of surgical tools.


I. INTRODUCTION
Surgery, as an important part of clinical medicine, plays a key role in solving human diseases. However, due to an imbalance in the level of social and economic development among regions [1], a considerable number of people cannot receive high-quality surgical treatment. In a state of lack of medical conditions, patients suffer trauma and complications due to low-quality surgical treatment, leading to a series of serious sequelae and even death. In response to this problem, the traditional model [2] in the medical field relies on assessment from senior experts to guide surgeons who need to communicate and learn, but this is limited by the impact of individual subjectivity and time-consuming processes.
To solve the abovementioned problems, in recent years, researchers have taken advantage of the rapid development of image processing technology to carry out automated assessment of surgical skills. The academic community, with the assistance of surgeons, analyzes videos recorded during operations to provide learners with a more objective, standardized and automated evaluation of surgical skills. The identification and positioning of surgical tools during surgery is the basis for automated evaluation of surgical skills and can be achieved with the support of object detection technology.
The associate editor coordinating the review of this manuscript and approving it for publication was Kok-Lim Alvin Yau.
Object detection is currently divided into two broad categories: anchor-based and anchor-free detectors. Anchor-based detectors are divided into two types, one-stage [3]-[5] and two-stage [6]-[10] detectors, due to different processing methods in the preprocessing stage. Anchor-free detectors are divided into two types, keypoint-based [11]-[13] and center-based [14]-[16] detectors, due to the different positional relationship between the predicted points and the object. From a formal point of view, anchor-free object detection methods appear to be better than anchor-based methods because they eliminate the predefined design of an anchor, but this is not the case. Recent research [17] shows that the essential difference among object detection methods lies in the strategy defined for training positive and negative samples, rather than whether to use anchors. Therefore, a factor that has an important influence on object detection is how the positive and negative training samples are divided. Based on this finding, in this paper, we propose a novel one-stage object detection framework based on a sample adaptive process controlled by reinforcement learning that is used to detect surgical tools quickly and accurately. Compared with current object detection methods, our proposed framework is more targeted towards the detection of surgical tools with complex backgrounds and small training sample sizes. With the support of the proposed framework, sample adaptation allows an object detection model to set thresholds based on a sample's own attributes to more reasonably distinguish between positive and negative training samples, and, with reinforcement learning control [18], it can deform a negative sample bounding box so that it reaches the positive sample standard.
Therefore, with the help of the proposed framework, an object detection method can focus more on the surgical tools under complex background conditions, to achieve higher accuracy while maintaining the advantage of one-stage object detection speed.
The main contributions of our work are as follows:
• (1) For the first time, we use reinforcement learning control to optimize sample adaptation, a novel definition strategy for training positive and negative samples.
• (2) We are the first to propose a one-stage object detection framework based on sample adaptation for the task of surgical tool detection.
• (3) The one-stage object detection supported by our proposed framework based on reinforcement learning control sample adaptation achieves better performance than other object detection methods on the cholecystectomy surgery datasets m2cai16-tool-locations and AJU-Set [19].

II. RELATED WORK
A. LAPAROSCOPIC CHOLECYSTECTOMY
With the rapid development of electronic information technology in various fields, the method of recording surgical procedures with micro lenses has been increasingly widely adopted. The early purpose of this recording method was to help surgeons review the operation process when reflecting on and summarizing the steps of an operation, and it has laid the foundation for subsequent research on the automated analysis of surgery. With the development of machine vision, research on the automatic analysis of surgery also bears obvious traces of the times. Early traditional research on automated surgical analysis [20]-[22] was used for stage analysis and relied on statistical models with manually designed features, such as conditional random fields [23], [24], Bayes classifiers [25], and hidden Markov models [21], [26].
With the advent of deep convolutional neural networks (CNNs) [6], which are a milestone technology, the traditional manually designed feature model has been replaced by the automatic description of features obtained by CNNs, which has led to impressive results [27]-[29] in CNN-based research on automated surgical analysis. However, most of the current models are applied to frame-level tool presence detection in the M2CAI 2016 tool presence detection challenge, which is essentially a surgical training task that is different from true surgery. Compared with true surgery, surgical training only focuses on specific tasks in an operation, which makes it unable to reflect the unpredictable conditions of smoke, lens fogging and anatomical deformation that may occur in an actual surgical environment. Only the method in [30] truly achieves a technical evaluation of the true surgical level in a complete environment: it extends the M2CAI 2016 tool detection data with spatial information to obtain a new dataset and uses faster regions with CNN features (R-CNN) [8] as the object detection method for surgical tool detection.

B. ANCHOR-BASED VS ANCHOR-FREE MODELS
Influenced by traditional classic object detection algorithms, object detection methods based on deep convolutional network technology also retain the concept of anchors. As a landmark object detection model that introduced deep convolutional networks, R-CNN [6] greatly surpassed the performance of previous related models based on traditional methods. For this reason, subsequent object detection based on deep convolutional networks has been deeply affected by R-CNN, and two-stage object detection methods [7]-[10] were developed that still occupy an important position in this field. These methods are called two-stage object detection methods because candidate boxes are first generated for the images to be recognized, and detection is then performed on these candidate boxes to identify the category and position. The two-stage object detection methods have achieved very impressive results in accuracy, but in practical applications, object detection requires higher execution efficiency to ensure real-time performance. In response to this problem, the academic community has proposed one-stage object detection methods [3]-[5], which combine the two parts of a two-stage method into one. The single shot multibox detector (SSD) [4] pioneered the use of multiscale layers to directly predict objects, ensuring high efficiency while greatly improving accuracy. Since then, the academic community has put forward much work to improve its performance in different aspects [5], [31]-[34].
Anchor-based object detection methods depend on the properties of the anchor box being designed in advance, and this design has a great influence on the effect of the object detection methods. To avoid this problem, in recent years, the academic community has proposed anchor-free object detection methods. The idea of anchor-free object detection methods is to associate objects with specific points, that is, to detect objects by predicting some points. Anchor-free object detection methods are divided into two types according to the positions of these observation points on the object. In the first type, the observation points are mainly distributed around the detected object; this type is called keypoint-based object detection [11]-[13]. In the second type, the observation points are mainly distributed in the center of the detected object; this type is called center-based object detection [14]-[16]. In [17], similar anchor-based and anchor-free methods were selected for an in-depth analysis of their essential differences. The conclusion of that analysis shows that the real difference between the two method types is not the anchor boxes but the definition of positive and negative training samples.

C. REINFORCEMENT LEARNING
Reinforcement learning is an important part of the field of artificial intelligence and has attracted the attention of the academic community. With the successful application of deep convolutional network technology in various fields, deep reinforcement learning [18] has come into being. Due to the good performance of deep convolutional networks for feature expression, reinforcement learning can be applied to high-dimensional problems that involve images [35] and videos [36].
In the field of object detection, the method in [35] links deep reinforcement learning with object detection for the first time, providing another research idea for an object detection subject. The difference between object detection methods based on reinforcement learning and current mainstream methods [4], [5], [8] is that the former treats object detection as a sequential decision problem, thus introducing a reinforcement learning framework to solve it, and the latter treats object detection as a regression problem.
Deep reinforcement learning has achieved very attractive results in the field of machine vision, especially in video games [18], [37], [38], where it has surpassed human-level performance. However, to obtain satisfactory results, the models require longer training time. This defect restricts object recognition algorithms based on reinforcement learning. Fortunately, the distributed reinforcement learning framework [39] solves this problem very well. Simply increasing the number of CPU cores can greatly improve the efficiency of model operations without increasing the GPU requirements.

III. METHODOLOGY
According to the conclusion drawn by the analysis of the sample adaptive method [17], the definition of positive and negative training samples has a substantial impact on object detection. Based on this finding, we propose a new framework for defining positive and negative training samples based on RetinaNet [5]. An overview of the framework is shown in Fig. 1. Our proposed framework includes two modules, a judgment module and an optimization module. The judgment module determines the intersection over union (IoU) threshold according to the candidate box information from the five layers of different scales output by RetinaNet, instead of using prior knowledge to fix the threshold. Its role is to differentiate the training samples adaptively, that is, to automatically distinguish between positive and negative training samples based on the statistical characteristics of the samples themselves. The optimization module uses the reinforcement learning framework [18], [35] to deform the negative sample candidate boxes so that they reach the standard of positive samples. Its purpose is to increase the proportion of positive samples among the training samples.

A. JUDGMENT MODULE
The judgment module draws on the idea of the method in [17] and classifies the candidate boxes generated in RetinaNet [5] into positive and negative samples. Algorithm 1 describes how the judgment module works after an image is input into the model. For each object in the input image, we first collect the candidate boxes for this object. As described in Line 2, for the L layers output by the Feature Pyramid Network (FPN) [40] in RetinaNet, each layer generates several candidate boxes corresponding to the ground-truth of an object. According to the L2 distance between the center points of the candidate box and the ground-truth, we select k candidate boxes per layer as training samples in Line 3. Therefore, the ground-truth of each object in the image corresponds to L × k training samples. To distinguish the positive and negative samples among these training samples, we first calculate the IoU value between the candidate boxes and the ground-truth, then we calculate the mean and variance of these IoUs in Line 6, and finally we use the sum of these two parts as the threshold to judge the training samples. Then, we add a constraint to exclude candidate boxes whose center is outside the ground-truth. Finally, we obtain positive training samples P and negative training samples N.
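The selection procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: boxes are `(x1, y1, x2, y2)` tuples, the names `judge`, `iou` and `center` are hypothetical, and the default `k=9` is an assumed value. The text adds the mean and the variance of the IoUs; the `spread` parameter also allows the standard deviation, which is an alternative some adaptive selection schemes use.

```python
import math
from statistics import mean, pstdev, pvariance


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def center(box):
    """Center point of a box."""
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)


def judge(levels, gt, k=9, spread="variance"):
    """Split candidates into positives and negatives for one ground-truth box.

    levels: one list of candidate boxes per pyramid level (L lists in total).
    k:      candidates kept per level (assumed value, not taken from the text).
    """
    g_center = center(gt)
    candidates = []
    for boxes in levels:
        # Keep the k boxes whose centers are closest (L2) to the ground-truth center.
        ranked = sorted(boxes, key=lambda b: math.dist(center(b), g_center))
        candidates.extend(ranked[:k])

    ious = [iou(b, gt) for b in candidates]
    s = pvariance(ious) if spread == "variance" else pstdev(ious)
    threshold = mean(ious) + s  # adaptive IoU threshold for this ground-truth

    positives, negatives = [], []
    for box, value in zip(candidates, ious):
        cx, cy = center(box)
        inside = gt[0] <= cx <= gt[2] and gt[1] <= cy <= gt[3]
        (positives if value >= threshold and inside else negatives).append(box)
    return threshold, positives, negatives
```

Because the threshold is computed per ground-truth box, an object with many well-aligned candidates gets a stricter threshold than one with few.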

B. OPTIMIZATION MODULE
The optimization module operates on the candidate boxes in the negative sample set N, and utilizes the agent under the reinforcement learning framework to perform a series of deformation operations so that they reach the positive sample standard.
Action. The agent deforms the negative sample candidate boxes through a series of actions to reach the positive sample standard. This series of actions includes horizontal movement, vertical movement, zoom in and out, and stop.
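As an illustration of such an action set, the sketch below applies one deformation step to a box. Several details are assumptions rather than facts from the text: the action names, the split of horizontal and vertical movement into two directions each, the relative step size `alpha=0.2`, and the reading of "zoom in" as shrinking the observed area around its center.

```python
# Hypothetical action names; the text only lists the kinds of movement.
ACTIONS = ("left", "right", "up", "down", "zoom_in", "zoom_out", "stop")


def apply_action(box, action, alpha=0.2):
    """Return the box (x1, y1, x2, y2) after one deformation step.

    alpha is an assumed step size, expressed as a fraction of the box size.
    """
    x1, y1, x2, y2 = box
    dx, dy = alpha * (x2 - x1), alpha * (y2 - y1)
    if action == "left":
        return (x1 - dx, y1, x2 - dx, y2)
    if action == "right":
        return (x1 + dx, y1, x2 + dx, y2)
    if action == "up":
        return (x1, y1 - dy, x2, y2 - dy)
    if action == "down":
        return (x1, y1 + dy, x2, y2 + dy)
    if action == "zoom_in":  # shrink around the center
        return (x1 + dx / 2, y1 + dy / 2, x2 - dx / 2, y2 - dy / 2)
    if action == "zoom_out":  # enlarge around the center
        return (x1 - dx / 2, y1 - dy / 2, x2 + dx / 2, y2 + dy / 2)
    if action == "stop":
        return box
    raise ValueError(f"unknown action: {action}")
```

Scaling each step by the current box size keeps the deformation proportionate for both small and large candidate boxes.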
State. State s is composed of two parts, feature vector o and history vector h, in a tuple (o, h). Feature vector o represents the content of the observed area, and the starting position is the negative sample candidate box. History vector h represents the record of actions adopted by the agent.
Reward Function. Reward function r represents the feedback obtained after the agent takes an action. This feedback reflects the quantified distance between the ground-truth and the observed area, that is, the negative sample candidate box after the agent has deformed it using action a. The quantified distance used here is the IoU, which describes the relative position relationship between observed area b and ground-truth g. When the agent adopts action a, the negative sample candidate box changes from initial b to b′, that is, after the agent interacts with the environment, state s changes to s′. At this time, the center of the observed area is p, the center after deformation is p′, and the center of the ground-truth is p_g. The reward function is built from these quantities, and its feedback is divided into two parts: one is the change in the IoU between the observed area and the ground-truth, and the other is the change in the L2 distance between the center of the observed area and the ground-truth center, where the coefficient λ balances the two parts. Relying on the feedback of the reward function, the agent can judge the pros and cons of an action in a certain state and trigger the stop action when the target state is reached. Here, we set the target state as IoU(b, g) ≥ T_g: to trigger the termination action, the deformed candidate box must satisfy IoU threshold T_g from the judgment module, and its center point must simultaneously lie inside the ground-truth (GT).
[Algorithm 1, fragment recovered from the page layout: in A_i, select the k anchors that are closest to the center point of g based on the L2 distance → S_i; C_g = C_g ∪ S_i; end for; compute the IoU threshold for g.]
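One plausible form of this two-part reward and of the stop condition is sketched below. The exact equation is not recoverable from the text, so the sign convention (positive reward when the IoU rises and the center distance shrinks), the default value of `lam`, and the function names are all assumptions.

```python
import math


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def center(box):
    """Center point of a box."""
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)


def reward(b, b_new, g, lam=0.01):
    """Assumed form: IoU gain plus lam times the reduction in center distance."""
    iou_gain = iou(b_new, g) - iou(b, g)
    dist_gain = math.dist(center(b), center(g)) - math.dist(center(b_new), center(g))
    return iou_gain + lam * dist_gain


def reached_target(b, g, t_g):
    """Stop condition: IoU(b, g) >= T_g and the center of b lies inside g."""
    cx, cy = center(b)
    inside = g[0] <= cx <= g[2] and g[1] <= cy <= g[3]
    return iou(b, g) >= t_g and inside
```

Since center distances are measured in pixels while IoU lies in [0, 1], `lam` would typically be small so that neither term dominates.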
Inspired by the Ape-X architecture in the distributed reinforcement learning method [39], we divide the agents in the optimization module into two categories, exploratory agents (A_e) and development agents (A_d). The role of an exploratory agent is to carry out the deformation of the candidate boxes and to add the resulting experience to the shared experience pool B, while a development agent directly uses these experiences to update its priorities. Each exploratory agent regularly updates itself with the latest network parameters from the development agents; we use six exploratory agents and two development agents. We use the Q function, which is based on the Bellman equation, to evaluate the performance of the agents in the optimization module. The optimal value of the Q function is Q*, which is defined in Equation 3, where s represents the state and a represents the action. Following the deep reinforcement learning algorithm [18], we learn the Q function of the candidate actions by minimizing the loss function at the i-th iteration, which is given in Equation 4, where R(s, a) represents reward r obtained by the agent after deforming the candidate box in state s. The six exploratory agents participate in the deformation of the candidate boxes simultaneously and add the interaction results (s, a, R(s, a)) to experience pool B. The two development agents then make use of the experience in the pool and share the results with the exploratory agents. This distributed reinforcement learning structure, with its multiagent division of labor and coordination, ensures more efficient and concurrent execution of the algorithm, which overcomes the long training time and low efficiency of the reinforcement learning algorithm.
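For reference, the standard forms of the Bellman optimality equation and of the deep Q-learning loss are written out below. They are given as an assumption about what Equations 3 and 4 look like, not as a transcription: the discount factor \gamma, the network parameters \theta_i, and the sampling of transitions from pool B are notation from the standard formulation and do not appear in the text.

```latex
% Assumed standard form of the optimal action-value function (cf. Equation 3)
Q^{*}(s, a) = \mathbb{E}_{s'}\left[ R(s, a) + \gamma \max_{a'} Q^{*}(s', a') \,\middle|\, s, a \right]

% Assumed standard form of the loss minimized at iteration i (cf. Equation 4)
L_{i}(\theta_{i}) = \mathbb{E}_{(s, a, s') \sim B}\left[ \left( R(s, a) + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_{i}) \right)^{2} \right]
```

In this form, s' is the state reached after the agent deforms the candidate box with action a, and the target uses the parameters from the previous iteration.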

C. DISCUSSION
Our work is inspired by ATSS and by object detection methods based on reinforcement learning. In this section, we compare the differences between them and our work. (1) The judgment module in our work draws on the solution in ATSS, which automatically determines the IoU threshold according to the anchor's own attributes; in our work, this threshold serves as the basis for the agent to take action in the reinforcement learning process of the optimization module. In contrast to ATSS, which only automatically adjusts the threshold based on anchor attributes, our work also uses reinforcement learning control strategies to change the anchor shape to increase the IoU value between it and the ground-truth. For object detection methods, anchors with higher IoU values indicate that the candidate observation area is closer to the ground-truth of the detected object, which can lead to higher quality detection. (2) Our work draws on the object detection method based on reinforcement learning, but unlike previous methods, we use a distributed reinforcement learning architecture. The single-agent reinforcement learning algorithm DQN requires a long training time and does not converge easily because its training is unstable, which affects its scalability and practicality as an object detection method. In response to these problems, we propose an object detection approach based on a distributed reinforcement learning architecture. Unlike previous solutions based on a single agent, we use multiple agents (A_e and A_d) with a division of labor. A_e is responsible for changing the anchor shape, generating a series of IoU values between the candidate observation area and the ground-truth of the detected object, and then adding them to experience pool B. A_d is responsible for extracting the operations from the experience pool that have a significant improvement effect on verification, updating the control strategy parameters, and sharing them with A_e.
The concurrent execution of multiple agents in a distributed architecture greatly improves the efficiency of reinforcement learning, reduces training time and makes training more stable.
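The shared experience pool B that connects the two kinds of agents can be pictured with the single-process sketch below. It is an illustration only: the class name, the capacity, the first-in-first-out eviction and the priority-weighted sampling are assumptions modeled loosely on prioritized replay, and no real concurrency is shown.

```python
import random


class ExperiencePool:
    """Shared pool B holding transitions (s, a, R(s, a)) with one priority each."""

    def __init__(self, capacity=10000):
        self.capacity = capacity  # assumed size limit
        self.items = []
        self.priorities = []

    def add(self, transition, priority=1.0):
        """Called by exploratory agents; the oldest entry is dropped when full."""
        if len(self.items) >= self.capacity:
            self.items.pop(0)
            self.priorities.pop(0)
        self.items.append(transition)
        self.priorities.append(priority)

    def sample(self, n, rng):
        """Called by development agents; draws n entries weighted by priority."""
        idx = rng.choices(range(len(self.items)), weights=self.priorities, k=n)
        return idx, [self.items[i] for i in idx]

    def update_priorities(self, idx, new_priorities):
        """Called by development agents after learning from a batch."""
        for i, p in zip(idx, new_priorities):
            self.priorities[i] = p
```

In use, each exploratory agent would call `add` after every deformation step, while a development agent would alternate `sample`, a parameter update, and `update_priorities`.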

IV. EXPERIMENTS
A. EXPERIMENTAL SETUP AND IMPLEMENTATION
We perform experiments on the m2cai16-tool-locations dataset [30] and the private AJU-Set dataset. Both datasets are composed of images extracted from video frames, which contain seven different types of surgical tools. Since there have been few previous studies on surgical tools, the available data are limited, so we adopted a variety of data augmentation techniques [41] to expand the datasets. All models in this article were trained on two NVIDIA GeForce RTX 2070 GPUs and a Core i7 9700K CPU. Our solution achieves real-time processing speed while providing excellent recognition performance.

B. DATASETS
1) m2cai16-TOOL-LOCATIONS
The m2cai16-tool-locations dataset comes from the surgeon's expansion of m2cai16-tool [29] with the spatial location information of surgical tools. The m2cai16-tool dataset, which contains 15 laparoscopic surgery videos recorded at 25 fps, yields 12541 test samples and 23287 training samples through label processing at 1 frame per second. For the task of automatically evaluating the use of surgical tools, m2cai16-tool, which only contains information on whether surgical tools are present, is not sufficient. To meet the needs of the task, with the assistance of a surgeon, we selected 2532 frames from the m2cai16-tool dataset for spatial information annotation. Following the classic partitioning strategy, we divide the dataset into a training set, a test set, and a validation set according to the proportions of 50%, 30%, and 20%, respectively.
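As a rough illustration of the 50%/30%/20% partition, the sketch below shuffles frame identifiers and cuts them into three parts. Whether the original split was random and made at the frame level is not stated, so the shuffling, the seed and the function name are assumptions.

```python
import random


def split_frames(frame_ids, ratios=(0.5, 0.3, 0.2), seed=0):
    """Shuffle frame ids and split them into (train, test, validation)."""
    ids = list(frame_ids)
    random.Random(seed).shuffle(ids)  # assumed: random frame-level split
    n_train = round(len(ids) * ratios[0])
    n_test = round(len(ids) * ratios[1])
    train = ids[:n_train]
    test = ids[n_train:n_train + n_test]
    val = ids[n_train + n_test:]  # the remainder forms the validation set
    return train, test, val
```

With the 2532 annotated frames, this gives 1266 training, 760 test and 506 validation frames.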

2) AJU-SET
Considering the complex background of the surgical environment, a single dataset may reduce the performance of the object detection model. To solve this problem, we obtained 20 laparoscopic cholecystectomy surgery videos with the assistance of the Second Hospital of Jilin University to form a new dataset, AJU-Set, which is shown in Fig. 2. AJU-Set, which has the same recording rate and label rate as m2cai16-tool-locations, contains 3164 labeled frames, and maintains the same dataset division ratio. With the assistance of a professional surgeon, all surgical tools in the video have been accurately labeled.
VOLUME 8, 2020
The above two datasets cover seven types of surgical tools: graspers, bipolars, hooks, scissors, clippers, irrigators, and specimen bags. The related category number distribution information and some sample displays are shown in Table 1 and Fig. 3.

C. BASELINE METHODS
To verify the effectiveness of our method, we set ATSS [17] as a benchmark in the experiment, which is shown in Table 3. For the first time, we adopt an object detection method based on reinforcement learning for sample adaptation in the detection and analysis of surgical tools. The two-stage object detection method usually has a higher accuracy than the one-stage object detection method because the candidate box generation stage mines richer object context information. To verify the effectiveness of the sample adaptive method, we apply the method proposed by ATSS and our own method to the one-stage object detection method and compare the results with two-stage object detection. The comparison results show that the one-stage object detection model optimized by the sample adaptive method maintains its previous speed and has higher accuracy than the two-stage object detection model. Comparing ATSS, which is also sample adaptive, with our proposed method on the one-stage object detection model, our method achieves better results than ATSS.
Thanks to the better performance of the object detection model, we can use it as a basis for identifying, positioning and tracking the trajectories of surgical tools, thus laying the foundation for the analysis of surgical behavior quality. Table 2 shows the detection results of our sample adaptive method for the surgical tools in the two datasets. From the table, we can see that the two surgical tools with higher detection accuracy are the clipper and hook. This might be because these two tools require a better angle to operate during surgery. In addition, from the table, we can observe that the detection accuracy for the two surgical tools, bipolar and irrigator, is lower. This could be attributed to the poor observation angle due to the difference in function and the more complicated background of surgery at this stage.

D. ABLATION STUDY
To verify the effectiveness of the modules in our proposed sample adaptation method, we set up several method variants and evaluate them on the datasets. For a fair comparison, we compare the different variants under the same conditions. Table 4 covers three variants: the judgment module only, the judgment module combined with the optimization module, and the judgment module combined with the optimization module over multiple iterations. We use JM to represent the judgment module. The judgment module determines the threshold for judging the positive and negative training samples according to the mean and variance of the IoU values between the candidate boxes and the ground-truth in each feature layer. Correspondingly, OM represents the optimization module. The optimization module controls the detection behavior under the reinforcement learning framework. Its purpose is to use the reinforcement learning agent to deform the candidate boxes of the negative training samples so that they reach the positive sample standard. From Table 4, we can see that when only the JM is used, the performance of the object detection method is not as good as that when the JM and OM are used in combination.
When the number of iterations is n ≥ 20, the object detection method achieves better accuracy, but the overall framework performance decreases because of a substantial increase in computing resource consumption and a decrease in speed.

E. SURGICAL SKILLS ASSESSMENT
In the following, by applying the proposed framework to the two datasets, we analyze the spatial and temporal information of surgical tools to evaluate the skill level of surgeons. To this end, we use the proposed object detection method, in which reinforcement learning controls sample adaptation, to automatically detect and evaluate the use status of surgical tools, and we complete the evaluation of the surgical process by using the surgical tools' usage patterns, motion trajectories, and mobile values. We extracted four test videos from the AJU-Set to evaluate the surgical skill level. As shown at the top of Fig. 4, we generated a heat map to represent the range of motion of surgical tools from the bounding boxes detected for the surgical tools. Medical practice and experience indicate that high-level surgical operations are carried out frequently and accurately in a specific area, which shows the higher mobile value of the operation. From the observations in Fig. 4, it can be seen that heat map a, corresponding to video 1, has the smallest range, reflecting the doctor's proficient skills and level during surgery.
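A heat map of this kind can be approximated by counting how often detected box centers fall into each cell of a coarse grid, as in the sketch below. Using box centers rather than full box areas, the cell size of 32 pixels, and the use of the number of visited cells as a proxy for the range of motion are all assumptions made for illustration.

```python
def heat_map(boxes, width, height, cell=32):
    """Count detected box centers per grid cell; boxes are (x1, y1, x2, y2)."""
    cols = (width + cell - 1) // cell
    rows = (height + cell - 1) // cell
    grid = [[0] * cols for _ in range(rows)]
    for x1, y1, x2, y2 in boxes:
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        # Clamp so that centers on the image border still map to a valid cell.
        c = min(cols - 1, max(0, int(cx // cell)))
        r = min(rows - 1, max(0, int(cy // cell)))
        grid[r][c] += 1
    return grid


def active_cells(grid):
    """Number of cells visited at least once (a proxy for the range of motion)."""
    return sum(1 for row in grid for value in row if value > 0)
```

Under this proxy, a video in which the tools stay within a small region yields few active cells with high counts.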
The separation of the triangle of the gallbladder is a critical operation in cholecystectomy, and is related to biliary tract injuries and complications. Since subtle operations are performed in a short time at this stage, we chose to observe and study the two key surgical tools, clipper and grasper. From the information shown at the bottom of Fig. 4, we can observe that the movement trajectories of the clipper and grasper in video 1 and video 2 converge in a specific range, which shows that the two surgical tools cooperate well during the operation. Correspondingly, the estimated range of movement for the two surgical tools in video 3 and video 4 is larger and irregular, showing that the surgical skills are less proficient.
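To make the notion of a converging versus irregular trajectory concrete, the sketch below derives a trajectory from per-frame boxes of one tool and summarizes it with two simple numbers. Both summary measures (total path length and mean distance to the centroid) are hypothetical choices for illustration and are not metrics named in the text.

```python
import math


def trajectory(boxes):
    """Centers of the per-frame boxes (x1, y1, x2, y2) of one tool, in frame order."""
    return [((x1 + x2) / 2.0, (y1 + y2) / 2.0) for x1, y1, x2, y2 in boxes]


def path_length(points):
    """Total distance travelled along the trajectory."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


def spread(points):
    """Mean distance of the trajectory points from their centroid."""
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    return sum(math.dist(p, (cx, cy)) for p in points) / len(points)
```

A trajectory that converges in a specific range would show a small spread, while a large and irregular one would show a larger spread and a longer path.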
In addition to the heat map and motion trajectory chart mentioned above, we also counted the usage time of the surgical tools to analyze the doctor's skill proficiency during the surgery. In the histogram in Fig. 5, we can observe that the bipolar is used for longer in video 3 and video 4, which indicates that more hemostasis operations are required during the surgery and that the surgical skills are less proficient. To prove the effectiveness of our proposed framework, we invited four surgical experts to conduct an evaluation. They agreed that the surgical skills demonstrated in video 1 are the best, and that both video 1 and video 2 show better surgical skills than videos 3 and 4, which confirms the effectiveness of our evaluation method.
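Usage time per tool can be estimated from per-frame detections, as in the sketch below. It assumes that each labeled frame stands for a fixed time interval (1 second at the 1 frame per second label rate used for the datasets) and that a tool counts once per frame regardless of how many boxes it has; the function name is hypothetical.

```python
from collections import Counter


def usage_seconds(frame_detections, label_rate_hz=1.0):
    """Estimate per-tool usage time from a list of per-frame tool labels."""
    counts = Counter()
    for tools in frame_detections:
        counts.update(set(tools))  # count each tool at most once per frame
    return {tool: n / label_rate_hz for tool, n in counts.items()}
```

For example, a tool detected in three frames sampled at 1 frame per second is counted as three seconds of use.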

V. CONCLUSION
In this paper, we propose a novel framework for one-stage object detection based on a sample adaptive process controlled by reinforcement learning. Different from the traditional method of choosing fixed thresholds for defining strategies, our adaptive framework sets the thresholds according to the statistical characteristics of the samples themselves. In addition, our proposed method uses the flexible control of the reinforcement learning framework to optimize the negative sample candidate boxes and increase the proportion of positive training samples, and thus improves the accuracy of the object detection model. For the m2cai16-tool-locations and AJU-Set datasets, which have few training samples for surgical instrument detection, our sample adaptive method allows one-stage object detection algorithms to perform better than two-stage object detection while maintaining high speed. Accurate surgical instrument detection is helpful in analyzing the operation behavior pattern, movement trajectory and movement value of an instrument during the surgical operation process, and provides powerful assistance in summarizing and improving a doctor's surgical skills and professional communication. For future work, we hope to continue to improve the accuracy and real-time performance of the object detection model to enable on-site online learning and assisted guidance of surgery.