Patch-based adaptive weighting with segmentation and scale (PAWSS) for visual tracking in surgical video

Highlights
• A simple but effective colour-based segmentation model is incorporated to assign weights to the patch-based descriptor.
• A two-level sampling strategy enables the tracker to handle both incremental and abrupt scale variations.
• The tracker achieves superior results on various datasets among top trackers with near real-time performance.
• Substantial evaluation is performed on both ex vivo and in vivo surgical datasets.


Introduction
Minimally invasive surgery (MIS) relies on endoscopic and laparoscopic video cameras to provide the surgeon with vision inside the body. Developing computer assistance for such procedures with multi-modal image overlays, robotics or novel imaging requires tracking of a variety of structures within the surgical site to estimate their motion and update their position. Visual tracking is an appealing approach for this task because it relies only on the patches. By using a simple but effective colour-based segmentation model, each patch is assigned a weight, which decreases the influence of background information within the bounding box. In addition, a two-level sampling strategy is introduced to extract multi-scale samples, which enables the tracker to handle both incremental and abrupt scale variations between frames. To relate our method to general tracking approaches, we evaluated and compared it with state-of-the-art methods on the Online Tracking Benchmark (OTB) and VOT challenge datasets. To show how it performs for surgical scenes, we used the MICCAI 2015 instrument tracking datasets, with promising results demonstrating that PAWSS is the best performing tracker, which also works in real time without any specific code optimisation.

Related work
Tracking-by-detection: Inspired by the success of object detection algorithms, tracking-by-detection methods have recently drawn on advances in machine learning, such as structured output support vector machines (SVM) (Tsochantaridis et al., 2005), boosting (Avidan, 2007; Grabner et al., 2006), Gaussian process regression (Gao et al., 2014) and deep learning. Tracking-by-detection frameworks build a classifier to distinguish the tracked object from the background and update this classifier with new positive observations as well as with negative information. Falsely labelled samples inevitably appear and degrade the model: wrongly labelled background samples confuse the classifier, ultimately leading to drift or failure. Structured Output Tracking with Kernels (Struck) (Hare et al., 2011) adopts a structured output SVM and circumvents the traditional collection of positive and negative samples by integrating the labelling procedure within the learning process. In a recent benchmark, Struck has shown excellent tracking performance compared to prior work.
Patch-based representations: Recently, patch-wise descriptors have been exploited to represent the object appearance (Kim et al., 2015; Chen et al., 2013; Zhang and van der Maaten, 2014). A bounding box is divided into cells or patches, and low-level features are used to construct descriptors of these patches, which represent local structural information. A major challenge for tracking-by-detection methods is that the bounding box usually includes not only the object but also some background. The background changes differently from the moving object, and this causes inaccurate information to be transferred through the model update. To address this problem, different methods have been proposed to decrease the effect of background information, such as assigning different weights based on pixel spatial location or appearance similarity (Comaniciu et al., 2003; He et al., 2013; Lee et al., 2014). SOWP (Kim et al., 2015) exploits this concept by incorporating Random Walk with Restart (RWR) simulations to assign weights to patches. RWR simulations exploit the similarity between neighbouring patches and their relevance or self-similarity to the object appearance. Stationary distributions are obtained to represent the likelihood that each patch belongs to either foreground or background, and patch weights are designed according to these likelihoods so that foreground patches have relatively larger weights. We introduce a different patch weighting method that incorporates a colour-based segmentation model. Previous papers have integrated a segmentation step into tracking (Godec et al., 2013; Duffner and Garcia, 2013), but these methods are sensitive to the segmentation results since they directly track the segmented object patches, free from the constraints of a bounding box. By applying the segmentation step to patch weights instead, we enhance performance and avoid this sensitivity.
Surgical instrument tracking: Information from different sources has been used for surgical instrument tracking. Typically colour, gradient or texture (Uecker et al., 1995; Cano et al., 2008) is employed to represent the appearance model. Reiter and Allen (2010) proposed to learn the instrument appearance online by combining multiple features, exploring new areas as the instrument moves in or out of view. To make the features of the instrument more distinctive, artificial markers were designed and mounted on the instrument (Wei et al., 1997; Zhang and Payandeh, 2002; Tonet et al., 2007; Zhang et al., 2017). Although attaching markers to the instrument makes tracking simpler and more robust, modifying instruments is usually avoided since it changes the surgical procedure. Artificial markers may also introduce inconveniences, such as biological hazards or the difficulty of retrofitting existing instruments. Instrument shape can be simplified or explored using a prior model to confine the search space (Pezzementi et al., 2009). To separate the target from the background, a random forest was learnt to classify the instrument in a pixel-wise fashion, and the binary classification output was then used to estimate the pose of a prior 3D instrument model through optimization within a level set framework (Allan et al., 2013). This was later improved by combining constraints from feature points and a temporal motion model with a stereo setup (Allan et al., 2014). A multi-part appearance model (Allan et al., 2015) and articulated degrees-of-freedom (Allan et al., 2018) of robotic instruments can be used to align the prior model with low level optical flow constraints. In addition, cues such as robotic kinematics (Ye et al., 2016) can also be used as external constraints.

Patch-based descriptor
Given the location (bounding box) of the object, we represent the object appearance with the patch-based descriptor shown in Fig. 2.
The bounding box is evenly decomposed into $n_\phi$ non-overlapping patches $\{\phi_i\}_{i=1}^{n_\phi}$, and a low-level feature vector $\varphi_i$ is extracted for each patch. The patch-based descriptor of the bounding box is constructed by concatenating the features of all patches in their spatial order. Since background information is potentially included in the bounding box, we incorporate a global probabilistic segmentation model (Collins et al., 2005; Duffner and Garcia, 2013) to assign weights $\{w_i\}_{i=1}^{n_\phi}$ to the patches based on their colour appearance, resulting in a weighted descriptor in which $w_i$ is the weight of the feature $\varphi_i$ of the $i$-th patch $\phi_i$.
Fig. 2. Patch-based descriptor. A bounding box is equally decomposed into $n_\phi$ patches $\{\phi_i\}_{i=1}^{n_\phi}$. For the $i$-th patch $\phi_i$, a low-level feature vector $\varphi_i$ is extracted and assigned a weight $w_i$. The descriptor is then constructed by concatenating the features of all patches, weighted by the patch weights. Example patch weights are shown by the highlighted bounding box; a warmer colour indicates a higher weight value.
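The construction of the weighted descriptor can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `weighted_descriptor` and the toy feature values are hypothetical, and the sketch assumes that weighting means scaling each patch feature by its scalar weight before concatenation.

```python
from typing import List, Sequence


def weighted_descriptor(features: Sequence[Sequence[float]],
                        weights: Sequence[float]) -> List[float]:
    """Concatenate per-patch features in spatial order, each scaled by its patch weight.

    Assumption: the weight w_i multiplies every element of the feature phi_i.
    """
    if len(features) != len(weights):
        raise ValueError("need exactly one weight per patch")
    descriptor: List[float] = []
    for phi_i, w_i in zip(features, weights):
        descriptor.extend(w_i * value for value in phi_i)
    return descriptor


# Two toy patches with 2-D features; the second patch is down-weighted.
example = weighted_descriptor([[1.0, 2.0], [3.0, 4.0]], [1.0, 0.5])
```

With this layout, a patch whose weight approaches zero contributes almost nothing to a linear classifier score, which is the intended effect of suppressing background patches.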

Probabilistic segmentation model for patch weighting
The global segmentation model is based on colour histograms and uses a recursive Bayesian formulation to discriminate foreground from background. Let $y_{1:t}$ be the colour observations of a pixel from frame 1 to $t$, and let $c$ be the class of the pixel. In our application, a pixel is classified as foreground ($c = 1$) or background ($c = 0$) by its colour observation. The foreground probability distribution $p(c_t = 1 \mid y_{1:t})$ at frame $t$ is based on the tracked results from previous frames:

$$p(c_t = 1 \mid y_{1:t}) = Z^{-1}\, p(y_t \mid c_t = 1) \sum_{c_{t-1}} p(c_t = 1 \mid c_{t-1})\, p(c_{t-1} \mid y_{1:t-1}) \qquad (2)$$

where $c_t$ is the class of a pixel at frame $t$ (0 for background, 1 for foreground), and $Z$ is a normalization constant, which can be ignored in practice. The transition probabilities for foreground and background, $p(c_t \mid c_{t-1})$ with $c \in \{0, 1\}$, are empirical choices as in Duffner and Garcia (2013). The foreground histogram $p(y_t \mid c_t = 1)$ and the background histogram $p(y_t \mid c_t = 0)$ are initialized in the first frame from all the pixels inside the bounding box and from those surrounding the bounding box (with some margin in between), respectively. For the following frames, the colour histogram distributions are updated using the tracked result.
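As an illustration of the recursive Bayesian step, the sketch below computes the foreground posterior for a single pixel (or colour bin). It is a minimal example, not the authors' implementation: the function name and the transition values `stay_fg` and `stay_bg` are hypothetical placeholders for the empirical transition probabilities, and the sketch normalises over both classes explicitly instead of dropping the constant.

```python
def foreground_posterior(lik_fg: float, lik_bg: float, prev_fg: float,
                         stay_fg: float = 0.6, stay_bg: float = 0.6) -> float:
    """One recursive Bayesian update of the foreground probability.

    lik_fg, lik_bg: histogram likelihoods p(y_t | c_t = 1) and p(y_t | c_t = 0).
    prev_fg: posterior foreground probability from the previous frame.
    stay_fg, stay_bg: assumed probabilities of keeping the same class between frames.
    """
    prev_bg = 1.0 - prev_fg
    # Prediction: propagate the previous posterior through the class transitions.
    pred_fg = stay_fg * prev_fg + (1.0 - stay_bg) * prev_bg
    pred_bg = (1.0 - stay_fg) * prev_fg + stay_bg * prev_bg
    # Correction: weight each prediction by the colour likelihood of its class.
    unnorm_fg = lik_fg * pred_fg
    unnorm_bg = lik_bg * pred_bg
    total = unnorm_fg + unnorm_bg
    if total == 0.0:
        return prev_fg  # colour never observed in either histogram: keep the prior
    return unnorm_fg / total
```

For example, a colour that is four times more likely under the foreground histogram than under the background histogram moves an uninformative prior of 0.5 to a posterior of 0.8.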
In this update, $0 \le \delta \le 1$ is the model update factor, and the new observations are taken from the tracked bounding box in frame $t$. Instead of treating every pixel equally, the weighting of a pixel also depends on the patch in which it is located: patches with a higher weight are more likely to contain object pixels, and vice versa. The colour histogram update for colour observation $y_t$ of the current frame $t$ is therefore defined as a patch-weighted ratio, where $N_{y_t \in \phi_{i,t}}$ represents the number of pixels with colour observation $y_t$ in the $i$-th patch $\phi_{i,t}$ in frame $t$, and $x_t$ represents any colour observation in frame $t$, so that the denominator is the weighted number of all pixel colour observations in the bounding box. The weights $w_{i,1}$ of all patches are initialized to 1 in the first frame and are then updated based on the segmentation model, where $\varpi_{i,t}$ denotes the average foreground probability of all pixels in the patch $\phi_{i,t}$ in the current frame $t$; it is normalized so that the highest weight update equals 1. The patch weight $w_{i,t}$ is thus updated gradually over time. We omit the background probability distribution $p(c_t = 0 \mid y_{1:t})$ since it is analogous to Eq. (2).
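The two updates described above can be sketched as follows. This is a hedged illustration only: all function names are hypothetical, colour observations are represented as integer histogram bins, and both updates are assumed to be linear blends controlled by the update factor, which is one plausible reading of "updated gradually over time".

```python
from typing import Dict, List, Sequence


def weighted_box_histogram(patch_counts: Sequence[Dict[int, int]],
                           weights: Sequence[float]) -> Dict[int, float]:
    """Patch-weighted colour histogram of the tracked bounding box.

    patch_counts[i][y] is the number of pixels with colour bin y in patch i.
    """
    hist: Dict[int, float] = {}
    total = 0.0
    for counts, w in zip(patch_counts, weights):
        for colour_bin, n in counts.items():
            hist[colour_bin] = hist.get(colour_bin, 0.0) + w * n
            total += w * n  # weighted number of all pixel observations
    if total == 0.0:
        return {}
    return {colour_bin: v / total for colour_bin, v in hist.items()}


def blend_histograms(old: Dict[int, float], new: Dict[int, float],
                     delta: float) -> Dict[int, float]:
    """Assumed linear model update: (1 - delta) * old + delta * new."""
    bins = set(old) | set(new)
    return {b: (1.0 - delta) * old.get(b, 0.0) + delta * new.get(b, 0.0) for b in bins}


def update_patch_weights(old_weights: Sequence[float],
                         mean_fg_prob: Sequence[float],
                         delta: float) -> List[float]:
    """Blend old weights with the per-patch mean foreground probability,
    normalised so that the largest update term equals 1."""
    peak = max(mean_fg_prob)
    normalised = [p / peak if peak > 0.0 else 0.0 for p in mean_fg_prob]
    return [(1.0 - delta) * w + delta * n for w, n in zip(old_weights, normalised)]
```

Because every patch is processed independently, the cost of the weight update grows linearly with the number of patches.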
Unlike the weighting strategies of other patch-based methods (Chen et al., 2013; Kim et al., 2015), which analyse the similarities between neighbouring patches, our patch weighting method is simple and straightforward to implement: the weight update of each patch is independent of the others and relies only on the colour histogram based segmentation model. We show examples of the patch weight development in Fig. 3. The patch weight thumbnails are displayed in the top corner of each frame; they indicate the objectness within the bounding box and also reflect the object deformation over time. We update the segmentation model based on the previous patch weights, and in turn the segmentation model facilitates updating the weights of all patches. This co-training strategy enhances the weight contrast between foreground and occluded patches, which suppresses background information efficiently.

Two-level sampling for scale estimation
The tracked object often undergoes complicated transformations during tracking, for example deformation, scale variation and occlusion. Fixed-scale bounding box estimation is ill-equipped to capture the accurate extent of the object, which degrades the classifier performance by providing samples that are either partially cropped or include background information.
When locating the object in a new frame, all the bounding box candidates are collected within a search window, and the bounding box with the maximum classification score is selected to update the object location. Rather than making a suboptimal decision by choosing from fixed-scale samples, we augment the sample pool with multi-scale candidates, which we refer to as the two-level sampling strategy (see Fig. 4). On the first level, all the bounding box samples are extracted with the fixed scale $s_{t-1}$ (the object scale in frame $t-1$). The search window is centred at the bounding box of frame $t-1$ with a height/width of $r_w$. The weighted patch-based descriptors of all candidates are fed into the classifier, and we select the bounding box with the maximum classification score, not as the final decision but as the search centre for the second level. After the first level, the rough location of the object is narrowed to a smaller area. We then set a smaller search window with a height/width of $r_s$, centred at the bounding box selected in the first level, and construct multi-scale candidates within this window. All the samples are evaluated by the classifier, and we select the bounding box of the sample with the maximum score as the final location of the object.
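The two-level search can be sketched as a coarse fixed-scale pass followed by a fine multi-scale pass. This is a simplified illustration under several assumptions: candidates are reduced to a centre and a scale, `r_w` and `r_s` are treated as search radii on a dense square grid, and `score` stands in for the classifier; none of these names or choices come from the original implementation.

```python
from typing import Callable, Iterable, List, Tuple

Box = Tuple[float, float, float]  # (centre x, centre y, scale)


def grid_offsets(radius: int) -> List[Tuple[int, int]]:
    """All integer offsets inside a square window of the given radius."""
    return [(dx, dy)
            for dx in range(-radius, radius + 1)
            for dy in range(-radius, radius + 1)]


def two_level_search(prev: Box, score: Callable[[Box], float],
                     r_w: int, r_s: int, scales: Iterable[float]) -> Box:
    cx, cy, s_prev = prev
    scale_list = list(scales)
    # Level 1: large window, previous scale only; gives a rough location.
    level1 = [(cx + dx, cy + dy, s_prev) for dx, dy in grid_offsets(r_w)]
    coarse = max(level1, key=score)
    # Level 2: small window around the level-1 winner, all candidate scales.
    level2 = [(coarse[0] + dx, coarse[1] + dy, s)
              for dx, dy in grid_offsets(r_s)
              for s in scale_list]
    return max(level2, key=score)
```

Restricting the multi-scale candidates to the small second-level window keeps the number of classifier evaluations close to that of a fixed-scale search.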
The scales of the augmented samples are critical. We consider two complementary strategies that together handle both incremental and abrupt scale variations. Firstly, to deal with relatively small scale changes between frames, we build a scale set $S_r$, where $\lambda$ is a fixed value slightly larger than 1.0, chosen so that the scale change can be searched accurately, $n_r$ is the number of scales in $S_r$, and $s_{t-1}$ is the scale of the object in frame $t-1$ relative to the initial bounding box in the first frame. Since the object scale usually does not vary much between frames, the scale set $S_r$ includes scales that are close to that of the previous frame. Secondly, when the object undergoes abrupt scale changes between frames, the scale set $S_r$ is unable to keep pace with the speed of the scale variation. To address this problem, we build an additional scale set $S_p$ by incorporating the Lucas-Kanade tracker (KLT) (Bouguet, 2001; Shi and Tomasi, 1994), which helps us estimate the scale change explicitly. We randomly pick $n_{pt}$ points from each patch in the bounding box of frame $t-1$ and track all these points in the next frame $t$. With sufficient well-tracked points, we can estimate the scale variation between frames by comparing the distance changes of the tracked point pairs.
The scale estimation by the KLT tracker is illustrated in the corresponding figure. Let $V$ be the set of all the distance ratios. We sort $V$ by value and pick the median element $s_p = V_{\mathrm{sorted}}(n/2)$ as the potential scale change of the object. To make the scale estimation more robust, we uniformly sample the scales ranging between $[1, s_p]$ or $[s_p, 1]$ to construct the scale set $S_p$.
Here $n_p$ is the number of scales in the scale set $S_p$. When the object is out of view, occluded or abruptly deforms, the ratio of well-tracked points will be low, and the estimate from the KLT tracker will be unreliable. In our implementation, when this ratio is lower than 0.5 we set $s_p = 1$, so the scale set $S_p$ only adds samples with the previous scale to the candidate pool. The estimate from the KLT tracker is trusted only when enough points are well tracked. We fuse the two complementary scale sets $S_r$ and $S_p$ into $S_f = S_r \cup S_p$ to enrich our candidate pool. To show the effectiveness of this step, we evaluate our proposed tracker in Section 4 with and without the scale set $S_p$ estimated by the KLT tracker.
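A minimal sketch of how the two scale sets might be built and fused is given below. Several details are assumptions, not taken from the text: $S_r$ is assumed to be a geometric progression in powers of $\lambda$ centred on the previous scale, the sampled relative changes in $S_p$ are assumed to multiply the previous scale, the median is taken as the element at index `n // 2` of the sorted ratios, and point tracking itself is assumed to be done elsewhere (the function only consumes point positions and a per-point success flag). All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def incremental_scale_set(s_prev: float, lam: float = 1.003, n_r: int = 11) -> List[float]:
    """Assumed form of S_r: powers of lam around the previous scale."""
    half = n_r // 2
    return [s_prev * lam ** k for k in range(-half, n_r - half)]


def klt_scale_change(prev_pts: Sequence[Point], next_pts: Sequence[Point],
                     tracked: Sequence[bool], min_ratio: float = 0.5) -> float:
    """Median ratio of pairwise point distances; 1.0 if too few points were tracked."""
    good = [i for i, ok in enumerate(tracked) if ok]
    if not tracked or len(good) / len(tracked) < min_ratio:
        return 1.0
    ratios = []
    for a in range(len(good)):
        for b in range(a + 1, len(good)):
            i, j = good[a], good[b]
            d_prev = math.dist(prev_pts[i], prev_pts[j])
            if d_prev > 0.0:
                ratios.append(math.dist(next_pts[i], next_pts[j]) / d_prev)
    return sorted(ratios)[len(ratios) // 2] if ratios else 1.0


def abrupt_scale_set(s_prev: float, s_p: float, n_p: int = 11) -> List[float]:
    """Assumed form of S_p: uniform samples of the relative change between 1 and s_p."""
    lo, hi = min(1.0, s_p), max(1.0, s_p)
    if n_p == 1:
        return [s_prev * s_p]
    step = (hi - lo) / (n_p - 1)
    return [s_prev * (lo + k * step) for k in range(n_p)]


def fused_scale_set(s_r: Sequence[float], s_p: Sequence[float]) -> List[float]:
    """Union of both scale sets, sorted for convenience."""
    return sorted(set(s_r) | set(s_p))
```

When the tracked-point ratio falls below the threshold, `klt_scale_change` returns 1.0 and `abrupt_scale_set` collapses to the previous scale, matching the fallback behaviour described above.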

Tracking framework
PAWSS can be combined with any tracking-by-detection method. We show the pipeline of the whole framework in Fig. 6. It includes two phases: evaluation and learning. The evaluation phase finds the target in a new frame. Given the bounding box in the previous frame $t-1$, sample candidates are extracted in a search window centred at that bounding box in the current frame $t$. Unlike other tracking-by-detection approaches, we adopt a two-level sampling strategy for accurate scale estimation (Section 3.3). Via the colour-based segmentation model, the weights of all patches are updated as in Section 3.2, and the descriptors of all samples are computed via patch weighting. The descriptors of all samples are fed into the classifier, and the one with the highest output score is picked as the best sample. The location of the best sample gives the position of the target in the current frame at time $t$. Between frames, the target appearance changes due to deformation, occlusion, and light and scale variations; therefore, the classifier and the segmentation model need to be learnt online to keep up with the changes. The best sample is the one most similar to the target. On the one hand, the pixel colour distribution of the best sample is used to update the segmentation model. On the other hand, samples are extracted around the best sample in order to collect foreground and background information. The descriptors of these samples are computed and used to train the classifier online to better discriminate the target from the neighbouring background. The procedure then starts again for the next frame.
In our implementation, we incorporate PAWSS into Struck (Hare et al., 2011). The algorithm relies on an online structured output SVM learning framework which integrates learning and tracking. It directly predicts the location displacement between frames, avoiding the heuristic intermediate step of assigning binary labels to training samples, and achieves top performance on the OTB dataset (Wu et al., 2013).
Fig. 6. Tracking framework. Given the target location in the previous frame at time $t-1$, the framework predicts the target location in the current frame at time $t$. The framework includes evaluation and learning phases. In the evaluation phase, multi-scale samples are extracted via the two-level sampling strategy and are then fed into the classifier to pick the one with the highest score. The location of this sample is taken as the new target location. The sample is also used for updating the segmentation model and the classifier in the learning phase.

Results
Implementation details: Our algorithm is publicly available online 1 and is implemented in C++. It runs at about 7 frames per second on an i7 2.5 GHz CPU without any optimisation. We list the parameter settings in Table 1. To illustrate the generalization of our proposed framework, we use the same parameter settings throughout all experiments. For the structured output SVM, we use a linear kernel, and the parameters are set empirically: $\delta = 0.1$ in Eqs. (3) and (5), $\lambda = 1.003$ in Eq. (8), and the numbers of scales in the scale sets are $n_r = n_p = 11$. The number of points extracted from each patch is $n_{pt} = 5$. The updating threshold for the classifier is set to $\eta = 0.3$. For each sequence, we scale the frame so that the minimum side length of the bounding box is larger than 32 pixels. The search window $r_w$ is fixed to $(W + H)/2$, where $W$ and $H$ represent the width and height of the scaled bounding box, respectively, and the search window $r_s$ is fixed to 5 pixels. We tested different low-level feature combinations in Section 4.1 and found that the combination of HSV colour and gradient features (HSV+G) achieves the best results. The number of patches affects the tracking performance: too many patches increase the computation, and too few patches do not robustly reflect the local appearance of the object. We tested different patch numbers and selected $n_\phi = 49$ to strike a balance.

Online Tracking Benchmark (OTB)
The OTB dataset (Wu et al., 2013) includes 50 sequences tagged with 11 attributes, which represent the challenging aspects of tracking such as illumination variation, occlusion and deformation. The tracking performance is quantitatively evaluated using both precision rate (PR) and success rate (SR), as defined in Wu et al. (2013). PR/SR scores are depicted using the precision plot and the success plot, respectively. The precision plot shows the percentage of frames whose tracked centre is within a certain Euclidean distance (20 pixels) of the centre of the ground truth. The success plot shows the percentage of frames whose intersection-over-union overlap with the ground truth annotation exceeds a threshold varying between 0 and 1, and the area under the curve (AUC) is used as the SR score. To evaluate the effectiveness of incorporating the scale set proposed by the KLT tracker, we provide two versions of our tracker, PAWSSa and PAWSSb: PAWSSa only includes the scale set $S_r$, while PAWSSb includes both $S_r$ and $S_p$ for scale estimation.
1 https://github.com/surgical-vision/PAWSS
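The two scores can be computed along the following lines. This is a sketch under stated assumptions, not the official toolkit code: boxes are assumed to be `(x, y, width, height)` tuples, the function names are hypothetical, and the AUC is approximated as the mean success rate over 21 evenly spaced overlap thresholds.

```python
import math
from typing import List, Sequence, Tuple

Rect = Tuple[float, float, float, float]  # (x, y, width, height)


def iou(a: Rect, b: Rect) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0.0 else 0.0


def centre(box: Rect) -> Tuple[float, float]:
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)


def precision_rate(pred: Sequence[Rect], gt: Sequence[Rect], max_dist: float = 20.0) -> float:
    """Fraction of frames whose centre error is at most max_dist pixels."""
    hits = sum(math.dist(centre(p), centre(g)) <= max_dist for p, g in zip(pred, gt))
    return hits / len(gt)


def success_rate_auc(pred: Sequence[Rect], gt: Sequence[Rect], n_thresholds: int = 21) -> float:
    """Mean success rate over overlap thresholds in [0, 1] (approximate AUC)."""
    overlaps: List[float] = [iou(p, g) for p, g in zip(pred, gt)]
    thresholds = [k / (n_thresholds - 1) for k in range(n_thresholds)]
    curve = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return sum(curve) / len(curve)
```

Note that with a strict "greater than" comparison, even a perfect tracker scores zero at the threshold of exactly 1, so the approximate AUC of a perfect result is slightly below 1.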
Comparison using different features: Selecting the right features to describe the object appearance plays a critical role in tracking. The most desirable property of a feature is its uniqueness, so that the object can be distinguished from the background. Raw intensities or colour features are usually used for histogram-based appearance representations, while edge or gradient information is less sensitive to illumination changes. Many tracking approaches use a combination of these diverse features to represent the object (Hare et al., 2011; Grabner et al., 2006; Li et al., 2013). To evaluate the performance of our proposed approach, we tested different low-level features for constructing the descriptor, namely HSV colour, RGB colour, and the combination of colour and gradient features (HSV+G, RGB+G), in Table 2. The RGB histogram is 24-dimensional with 8 bins for each channel, and the HSV colour histogram is 20-dimensional, with 8 bins each for the H and S channels and 4 separate bins for the V channel. The gradient histogram is a 16-dimensional histogram of signed gradients ranging from 0° to 360°. We also compared our trackers PAWSSa and PAWSSb with Struck (Hare et al., 2011) and SOWP (Kim et al., 2015). From Table 2, we observe that augmenting colour with a gradient histogram improves the tracking performance by providing diverse structural information about the object. In our experiments, the descriptor combining HSV colour and gradient features achieves the best results, so we use this setting in the following evaluation.
Comparison with state-of-the-art trackers: We use the evaluation toolkit provided by Wu et al. (2013) to generate the precision and success plots for the one pass evaluation (OPE) of the top 10 algorithms in Fig. 7. The toolkit includes 29 benchmark trackers, and we additionally include the SOWP tracker. PAWSSb achieves the best PR/SR scores among all the trackers. For a more detailed evaluation, we also compared our tracker with state-of-the-art trackers in Table 3. In every attribute field, our tracker achieves either the best or the second best PR/SR score. Our tracker achieves a 36.7% gain in PR and a 36.9% gain in SR over Struck (Hare et al., 2011). With a simple patch weighting strategy and training with adaptive scale samples, our tracker provides comparable PR scores and a higher SR score compared with SOWP (Kim et al., 2015). PAWSSa improves the SR score by 2.6% by accounting for gradual, small scale changes between frames, and PAWSSb improves the SR score by 4.8% by additionally incorporating scales estimated by the external KLT tracker. Specifically, when the object undergoes scale variation, PAWSS achieves a performance gain of 10.3% in SR over SOWP. We show tracking results in Figs. 8 and 9 for the top trackers, including TLD (Kalal et al., 2012), SCM (Zhong et al., 2012), Struck (Hare et al., 2011), SOWP (Kim et al., 2015) and the proposed PAWSSa and PAWSSb. In Fig. 8, five challenging sequences are selected from the benchmark dataset, which include illumination variation, scale variation, deformation, occlusion or background clutter. PAWSS adapts when the object deforms in a complicated scene and tracks the target accurately. In Fig. 9, we select five representative sequences with different scale variations. PAWSS tracks the object well under scale variation, while the other trackers drift away.
The results show that our proposed tracking framework PAWSS can track the object robustly throughout a sequence by using the weighting strategy to suppress background information within the bounding box, and by incorporating scale estimation, which allows the classifier to train with adaptive scale samples. Please see the supplementary video for more tracking results.

Visual Object Tracking (VOT) challenges
For completeness, we also validated our algorithm on the VOT2014 (25 sequences) and VOT2015 (60 sequences) datasets. The VOT datasets use a ranking-based evaluation methodology with two measures: accuracy and robustness. Similar to the SR score for the OTB dataset, the accuracy measures the overlap between the predicted result and the ground truth bounding box, while the robustness measures how many times the tracker fails during tracking. A failure is indicated whenever the tracker loses the target object, which means the overlap becomes zero, and the tracker is re-initialized afterwards. All the trackers are evaluated, compared and ranked with respect to each measure separately using the official evaluation toolkit from the challenge.
VOT2014
The VOT2014 challenge includes two experiments: the baseline experiment and the region-noise experiment. In the baseline experiment, a tracker runs on all the sequences and is initialized with the ground truth bounding box in the first frame, while in the region-noise experiment the tracker is initialized with a randomly perturbed bounding box, with a perturbation in the order of 10% of the ground truth bounding box size (Kristan et al., 2015b). The ranking plots with 38 trackers are shown in Fig. 10, and PAWSS is compared with the top three trackers, DSST (Danelljan et al., 2014), SAMF (Li and Zhu, 2014) and KCF (Henriques et al., 2015), in Table 4. In both experiments, PAWSS has a lower accuracy score (0.58/0.55) but fewer failures (0.88/0.78), and ranks second on average. However, once a failure is detected in these experiments, the tracker is re-initialized, so a tracker can achieve a higher accuracy score through more re-initialization steps. To eliminate this effect, we also performed the experiments without re-initialization; the results are shown in Table 4 as well. They show that PAWSS has the highest accuracy score (0.51/0.48) among all the trackers without re-initialization, which means it is more robust than the other trackers.
Table 3 Comparison of PR/SR scores with state-of-the-art trackers including Struck (Hare et al., 2011), DSST (Danelljan et al., 2014), SAMF (Li and Zhu, 2014), FCNT and SOWP (Kim et al., 2015) in the OPE based on the 11 sequence attributes: illumination variation (IV), scale variation (SV), occlusion (OCC), deformation (DEF), motion blur (MB), fast motion (FM), in-plane rotation (IPR), out-of-plane rotation (OPR), out-of-view (OV), background clutter (BC) and low resolution (LR). The best and the second best results are shown in red and blue, respectively.

Table 4
The Accuracy (Acc.) and Robustness (Rob.) results of the VOT2014 baseline and region-noise experiments with and without re-initialization, compared with the top trackers DSST (Danelljan et al., 2014), SAMF (Li and Zhu, 2014) and KCF (Henriques et al., 2015). The best and the second best results are shown in red and blue, respectively.
(Kim et al., 2015) and three conventional trackers: TLD (Kalal et al., 2012), SCM (Zhong et al., 2012) and Struck (Hare et al., 2011) on some sequences with scale variations in the benchmark.
VOT2015
Finally, we evaluated and compared PAWSS with 62 trackers on the VOT2015 dataset. The VOT2015 challenge only includes the baseline experiment, and the ranking plots are shown in Fig. 11. In VOT2013 and VOT2014, the average ranking measure is used to determine the performance of the trackers. Although the average ranking takes both the accuracy and the robustness measures into consideration, it does not have a concrete interpretation in terms of tracking performance. In VOT2015 (Kristan et al., 2015a), the expected average overlap measure is introduced, which combines per-frame accuracies and failures in a principled manner. Compared with the average rank, the expected overlap has a clearer practical interpretation.
We list the score/rank and expected overlap of the top trackers from VOT2015 (Kristan et al., 2015a) that are either quite robust or accurate, the VOT2014 top three trackers DSST (Danelljan et al., 2014), SAMF (Li and Zhu, 2014) and KCF (Henriques et al., 2015), and the baseline NCC tracker in Table 5; they are also shown in the expected average overlap plot in Fig. 11. It can be seen that the average rank is not always consistent with the expected overlap. According to Kristan et al. (2015a), a VOT2015 published state-of-the-art (sota) bound (0.2) is established by averaging the performance of trackers published in 2014/2015 at top computer vision conferences and journals. A tracker whose performance exceeds this bound is considered a state-of-the-art tracker. Our tracker PAWSS is well above the bound and is among the top trackers (it ranks 7th, outperforming 54 trackers); PAWSS also performs better than any of the VOT2014 top trackers on the VOT2015 dataset.
Fig. 11. The accuracy-robustness ranking plots and the expected overlap score ranking plot of the VOT2015 dataset. A tracker is better if its result is closer to the top-right corner of the plot. The published sota bound is established based on top trackers in recent years. Any tracker with performance over the bound is considered a state-of-the-art tracker.

Surgical instrument tracking
Since PAWSS is a general tracking framework, we also evaluate its performance on both ex vivo and in vivo surgical instrument sequences. In the Endoscopic Vision MICCAI 2015 Challenge, 4 one of the sub-challenges focuses on comparing different vision-based methods for tracking conventional and articulated laparoscopic instruments in robotic surgery. The dataset has not released ground truth for the test data. The official evaluation categorized the conventional laparoscopic instrument test set according to challenging factors including bleeding (C blood), smoke (C smoke), instrument occlusions (C occlusion), multiple instruments (C multiple) and surgical objects such as meshes and clips (C objects). The robotic laparoscopic instrument dataset includes sequences with multiple instruments (C multiple). To evaluate the tracking performance, the Euclidean distance between the centre point of the ground truth and that of the tracking result of training data is computed and compared separately for these challenging factors. We submitted our proposed method to the challenge and obtained the performance comparison from the official report.
4 https://endovissub-instrument.grand-challenge.org/
EndoVis Articulated Robotic Laparoscopic Instrument Dataset
The articulated instrument dataset is from ex vivo interventions, and the sequences are collected using the da Vinci® (Intuitive Surgical Inc., CA) system with porcine tissue samples. Example frames from each sequence are shown in Fig. 12 (a). The dataset is divided into training and test data. The training data contains four 45 s surgical video sequences. For each instrument, the tracked point is defined as the intersection between the instrument axis and the border between the shaft and the manipulator. The annotation includes the pixel coordinates of the tracked point (Fig. 12 (b)). The test data is composed of 15 additional seconds of video from each of the training sequences, and two additional new 60 s video sequences.

Original annotation
We have summarized the frame number for each sequence and have shown the accuracy evaluation separately in the original annotation section of Table 6 and Fig. 14 Left. The accuracy is defined as the percentage of tracked frames within the error threshold. Distance (pixels) is averaged over correctly tracked frames. In Fig. 14 , it shows accuracy under different threshold. In four train sequences, there are five instruments to be tracked. The average accuracy score for train data is 79.01% for 20 pixel threshold, with a distance error of 8.00 pixels. It is noted that the accuracy score (36.55% for 20 pixel threshold) for sequence 4 is relatively lower compared with the rest sequences.   As we have summarized, the target is out of view several times in sequence 4, reaching 67 frames out of 1123 frames. Trackingby-detection methods typically cannot handle out-of-view scenario without additional re-detection module. The underlying assumption is that the target is always in frame view, which means Whenever the target is out of frame, the tracker will gradually drift away. This explains the low accuracy of the performance, if the threshold is increased to 30 pixels, the performance has significantly improved, achieving 82.67% for accuracy. We show some tracking result examples in Fig. 13 . The tracked point and bounding box are shown in cyan colour, with the ground truth point shown in green colour. The first column is the first frame of each sequence. As we can see, the quality of the annotation is not consistent through the whole sequence. On certain frames, the annotation is drifted and is not labelled where it is supposed to be. This would certainly affect our performance evaluation result. It is also observed that whenever the instrument is close to the frame border, the tracker will stick to the border and not track the instrument well.
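The point-tracking accuracy and distance error defined above can be sketched as follows. This is an illustrative reading of the definitions, with assumptions labelled: the function name is hypothetical, "within the threshold" is taken as "less than or equal to", and frames without ground truth are assumed to have been removed beforehand.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def point_accuracy(pred: Sequence[Point], gt: Sequence[Point],
                   threshold: float = 20.0) -> Tuple[float, float]:
    """Return (accuracy in percent, mean distance over correctly tracked frames)."""
    errors = [math.dist(p, g) for p, g in zip(pred, gt)]
    correct = [e for e in errors if e <= threshold]
    accuracy = 100.0 * len(correct) / len(errors)
    # The distance is averaged only over frames that count as correctly tracked.
    mean_dist = sum(correct) / len(correct) if correct else float("nan")
    return accuracy, mean_dist
```

Because the distance is averaged over correct frames only, raising the threshold can increase both the accuracy and the reported distance error at the same time.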
High quality annotation
The original annotation is retrieved from the robotic system and includes the location of the intersection point between the instrument axis and the border between plastic and metal on the shaft, the normalized Shaft-to-Head axis vector and the clasper angle. Since the original annotation does not provide consistent ground truth, the accuracy result does not reflect true performance. We therefore manually labelled the training data and constructed a high quality annotation. In this annotation, we labelled multiple joints of the instrument, including the original tracked point and the Head and Shaft points. The original and our proposed annotations are shown in Fig. 12 (b).
The results for the tracked point under the high quality annotation are given in the high quality annotation section of Table 6 and in Fig. 14 (right). With the new annotation, our average accuracy increased to 98.56% at the 20 pixel threshold, with a distance error of 6.65 pixels.
We also tracked and evaluated the Head and Shaft points defined in our high quality annotation. The tracking accuracy results are displayed in Table 7 and Fig. 15. Our average accuracy reached 99.96% and 99.68% at the 20 pixel threshold, with distance errors of 5.68 and 6.51 pixels, respectively.
Comparison performance
In Table 8, the distance error (in pixels) is computed for the multiple-instrument challenge factor (C multiple) and compared across all the submitted methods (KIT, UGA and MOD) and our method, PAWSS. According to the official report, PAWSS outperforms all the other methods, with the lowest average distance error of 29.66 pixels.

EndoVis Conventional Laparoscopic Instrument Dataset
The conventional instrument dataset contains six in vivo sequences, which were collected from complete laparoscopic colorectal interventions. Similar to the robotic instrument dataset, the training data contains 45 s video sequences, and the test data is made up of an additional 15 s of video for each sequence and two new 60 s videos. Compared to the ex vivo robotic instrument dataset, these sequences reflect the complex challenges encountered during surgery, including smoke, bleeding, blur and various kinds of instruments. In Table 9, the distance error (in pixels) is computed separately for each challenging factor and compared across all the submitted methods (KIT and UGA) and our method, PAWSS. According to the official report, PAWSS outperforms all the other methods in every challenging subset, with the lowest average distance error of 96.78 pixels. We show some tracking result examples in Fig. 17. The tracked point is shown in cyan, and the first column is the first frame of each sequence in the test set. (Fig. 16)
In vivo surgical instrument experiments
We also tested on some other in vivo sequences and show the results in Fig. 18. As can be seen, the tracker works well even in a complex in vivo environment. A video is submitted to display the tracking results for the whole sequences.

Conclusions
In this paper, we propose a tracking-by-detection framework, called PAWSS, for online object tracking. It uses a colour-based segmentation model to suppress background information by assigning weights to the patch-wise descriptor. We incorporate scale estimation into the framework, allowing the tracker to handle both incremental and abrupt scale variations between frames. The learning component in our framework is based on Struck, but in principle the proposed background suppression and scale adaptation can also be combined with other online learning techniques.
The performance of our tracker is thoroughly evaluated on the OTB, VOT2014 and VOT2015 datasets and compared with recent state-of-the-art trackers. Results demonstrate that PAWSS achieves the best performance in both PR and SR in OPE on the OTB dataset. It outperforms Struck by 36.7% and 36.9% in PR and SR scores, respectively. It also provides a PR score comparable to that of SOWP and improves on its SR score by 4.8%. On the VOT2014 dataset, PAWSS has relatively lower accuracy but the lowest failure rate among the top trackers; when evaluated without re-initialization, it achieves the highest performance. On the VOT2015 dataset, PAWSS is also among the top, state-of-the-art trackers.
For instrument tracking, we evaluated our tracker both qualitatively and quantitatively on the public EndoVis robotic and conventional surgical instrument datasets, and on in vivo surgical instrument sequences. We compared our result with the official ground truth for the Tracked Point on the robotic instrument dataset, and the tracking accuracy reached 79.01% at the 20 pixel threshold. Because the official annotation is not of consistent quality, as we have shown, we manually created a high quality multi-joint annotation for the dataset. We tested multiple joints (Tracked Point, Head and Shaft Point) on the dataset, and the accuracy increased to over 98% for all the joints at the 20 pixel threshold. According to the official challenge report, our method has the lowest tracking error on both the robotic and conventional instrument datasets, and it also shows excellent tracking ability on in vivo sequences in complicated surgical environments. Our framework is designed for general single object tracking. It does not require prior information about the target or any offline training to achieve robust and realtime performance.
Our framework also has limitations. First, if the target disappears from the scene and later reappears, the framework does not recover. Second, the target position is represented by a rectangular bounding box. Even with the assistance of the segmentation model to distinguish foreground from background, the assumption is that the target occupies most of the area of the bounding box. If the target occupies only a small fraction of it, the classifier is polluted and misled by background information, which can easily cause tracking failure. In the future, we would like to focus on a re-detection module and semantic foreground segmentation.

Declarations of interest
None