High-resolution satellite video single object tracking based on ThickSiam framework

ABSTRACT High-resolution satellite videos provide short-duration staring observation of a designated area on the ground, and their emergence has improved the temporal resolution of remote sensing data to the second level. The single object tracking (SOT) task in satellite video has attracted considerable attention. However, it faces challenges such as complex backgrounds, poor object feature representation, and a lack of publicly available datasets. To cope with these challenges, this work designs a ThickSiam framework consisting of a Thickened Residual Block Siamese Network (TRBS-Net), which extracts robust semantic features to obtain the initial tracking results, and a Remoulded Kalman Filter (RKF) module, which simultaneously corrects the trajectory and size of the targets. The results of the TRBS-Net and RKF modules are combined by an N-frame-convergence mechanism to achieve accurate tracking results. Ablation experiments are conducted on our annotated dataset to evaluate the performance of the proposed ThickSiam framework and 19 other state-of-the-art trackers. The comparison results show that our ThickSiam tracker obtains a precision value of 0.991 and a success value of 0.755 while running at 56.849 FPS on one NVIDIA GTX1070Ti GPU.


Introduction
In the single object tracking (SOT) task, the position and size of the target are manually annotated in the first frame of the video, and an algorithm then tracks the specified target in subsequent frames (Hu et al. 2020). High-resolution satellite videos provide short-duration staring observation of a designated area on the ground, and their emergence has improved the temporal resolution of remote-sensing data to the second level. For satellite video interpretation, SOT is an important step in dynamic information extraction, and it is also the basis and prerequisite for estimating traffic density (Kopsiaftis and Karantzalos 2015), motion analysis (Lu et al. 2018; Thomas, Kambhamettu, and Geiger 2011), and surveillance (Vivone et al. 2017).
In recent years, trackers based on correlation filters (CF) and deep learning (DL) have achieved satisfactory results. CF-based trackers use a constructed filter template to perform correlation estimation over object candidate areas. The candidate region with the maximum response value is considered to contain the predicted object (Shao et al. 2019b). Numerous CF-based methods such as MOSSE (Bolme et al. 2010), CSK (Henriques et al. 2012), and KCF (Henriques et al. 2014) have been proposed to address the difficulties of the SOT task. C-COT (Danelljan et al. 2016) and ECO (Danelljan et al. 2017) adopt convolutional neural network (CNN) features to gradually improve tracking accuracy and speed. Recently, methods based on the fully convolutional Siamese architecture have attracted wide attention. They treat SOT as a similarity learning problem between the template branch and the search branch (Bertinetto et al. 2016b). The Siamese similarity function remains fixed during tracking after offline training (Li et al. 2019). SiamFC (Bertinetto et al. 2016b), SiamRPN (Li et al. 2018), SiamRPN++ (Li et al. 2019), SiamFC++ (Xu et al. 2020), and SiamCAR (Guo et al. 2020) have successively reformed the architecture to enhance performance on natural scene SOT.
Unlike natural scene videos, satellite videos are obtained from sensors deployed in outer space in a top-down view. In the wide-format scene of satellite video, ground targets show attributes such as small size (SS), partial occlusion (PO), persistence of vision (PoV), poor target-background discriminability (PTBD), shape deformation (SD), and poor general field illumination (PGFI) (Du, Cai, and Wu 2019; Hu et al. 2020), as shown in Figure 1. The complex background in high-resolution satellite video and the poor feature representation caused by the above attributes make natural scene-based trackers inapplicable to satellite videos. In addition, publicly available large-scale testing datasets are currently lacking, which restricts the development of satellite video SOT.
Researchers have proposed several methods dedicated to the satellite video SOT task. Du et al. combined the motion features with the color space features of the target and used interpolation among multiple frames in an optical flow tracker to improve tracking accuracy (Du, Cai, and Wu 2019). Wu et al. treated the moving object in the satellite video scene as a point target and introduced Bayesian classification for tracking (Jiaqi et al. 2017). Wang et al. built an object feature model using a Gabor filter to enhance the contrast between the target and background and to improve the discrimination ability of the tracking algorithm (Wang et al. 2020). In these methods, the hand-crafted features, such as HSV color, need to be carefully designed and lack generalization compared with deep features. Some researchers have also applied deep Siamese networks to the satellite video SOT task. Shao et al. used a shallow CNN to extract appearance features of the target and focused on solving problems such as occlusion in the tracking scene. They used natural scene datasets for training and directly tested the network on satellite video datasets, an approach that lacks domain suitability (Shao et al. 2019a). Zhu et al. constructed an ID-DSN tracker, which included a deep Siamese network and an ID-CIM module to alleviate model drift. However, under partial occlusion or when the object was poorly distinguishable from the background, ID-DSN could not completely solve the size adaptability problem (Zhu et al. 2021). In addition, all the above methods were evaluated on self-built datasets, which were not organized and systematic. The performance of these satellite video trackers has therefore not been fairly verified on a common benchmark.
In order to solve the above problems and facilitate the development of DL-based satellite video SOT, a ThickSiam framework consisting of a Thickened Residual Block Siamese Network (TRBS-Net) and a Remoulded Kalman Filter (RKF) module is designed in this work. The results of these two modules are complementarily combined frame by frame by an N-frame-convergence mechanism to obtain accurate tracking results. A manually annotated testing dataset is established to conduct comparative experiments and verify the trackers' performance. This paper makes the following contributions:
(1) The TRBS-Net is built on the Siamese architecture by stacking the thickened residual block (TRB) and the thickened maxpooling residual block (TMRB) to obtain the initial tracking results. The TRB and TMRB are remoulded from the residual block and the down-sampling residual block, respectively. The modifications mainly include doubling the number of channels in the bottleneck to enrich the representation of semantic features in the CNN, and cropping out the outermost features to eliminate the position bias caused by padding operations in convolution.
(2) To complement the tracking results of TRBS-Net, the classical Kalman filter is remoulded to form the RKF module, which simultaneously corrects the trajectory and size of the targets based on the fact that the objects in satellite video have rigid characteristics.
The remaining sections of this paper are arranged as follows. Section 2 elaborates the overall structure of the ThickSiam framework. Section 3 presents the ablation experiments and analyzes and discusses the results. Finally, Section 4 summarizes the work of this paper.

The ThickSiam framework
This section elaborates the architecture of the proposed ThickSiam framework, which consists of a Thickened Residual Block Siamese Network (TRBS-Net), equipped with the thickened residual block and the thickened maxpooling residual block to obtain the initial tracking results, and a Remoulded Kalman Filter (RKF) module to simultaneously correct the trajectory and size of objects. Since the Kalman filter has a burn-in period problem, an N-frame-convergence mechanism is proposed to combine the results of these two modules frame by frame. The overall tracking workflow is shown in Figure 2.

The architecture of TRBS-Net
The TRBS-Net includes a template branch T and a search branch S. The template branch takes an image patch with a size of 127 × 127 pixels, the search branch takes image patches with a size of 255 × 255 pixels, and the two branches share the same weights. The original residual block and down-sampling residual block (He et al. 2016) are reformed to obtain the thickened residual block (TRB) and the thickened maxpooling residual block (TMRB), and these two modules are stacked to build the TRBS-Net.
The residual block, which is used to solve the vanishing-gradient problem (He et al. 2016), consists of three stacked convolution layers and an independent shortcut connection, as shown in Figure 3(a). The stacked convolution layers include two 1 × 1 kernels to adjust the number of channels and one 3 × 3 kernel to extract semantic information. The modifications in the TRB include doubling the number of channels in the bottleneck, which enriches the representation of semantic features and promotes the characterization ability of the network, and cropping out the outermost elements of the feature map to eliminate the position bias caused by padding operations in convolution (Zhang and Peng 2019). The modifications are shown in Figure 3(b).
The down-sampling residual block is similar in structure to the residual block. Its independent shortcut connection is a 1 × 1 convolutional layer with a stride of 2, and the stride in the bottleneck is also 2, as shown in Figure 4(a). The modifications in the TMRB include doubling the number of channels in the bottleneck, cropping out the outermost features, changing the stride in the above two convolutional modules to 1, and adding a maxpooling layer to down-sample the feature map. The modifications are shown in Figure 4(b).
Considering that the objects in satellite video sequence images are small and vary in scale (e.g. a ship with a size of 14 × 15 pixels and a train with a size of 154 × 63 pixels in "Jilin-1" satellite videos), the TRBS-Net is constructed with three stages and a total stride of 8. Hence, the designed TRBS-Net can effectively extract semantic information of small targets. The basic channel number C of the TRB in stage 2 is 64, while the basic channel number C of the TRB and TMRB in stage 3 is 128. Therefore, the numbers of channels in the bottlenecks of these two stages are 128 and 256, respectively. The detailed information of the feature map in each stage is shown in Table 1.
During training, due to the lack of labeled satellite video image sequences, the ThickSiam framework uses object detection datasets to generate exemplar-search training pairs by augmenting still images (Zhu et al. 2018). In this work, the training datasets include the natural scene object detection dataset COCO (Lin et al. 2014) and the remote-sensing image object detection dataset DIOR (Li et al. 2020). The COCO dataset includes 118,287 images and plays a cornerstone role in the training process. The DIOR dataset includes 23,463 images, which are used to transfer object scale adaptability from natural scenes to the remote sensing domain. Specifically, for each annotation in the object detection datasets, the exemplar-search training pairs are constructed by a formula in which w is the width of the annotation, h is the height of the annotation, s represents the scale factor, and A represents the input image size of the Siamese network. The exemplar image size is 127 × 127 pixels, while the search image size is 255 × 255 pixels. Each annotation keeps its original center point as its processed center point. The (w + h)/2 term is the context margin added along the width and height axes in order to cover the bounding box as completely as possible. The constructed exemplar-search training pairs are shown in Figure 5.
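The crop-size computation can be sketched as follows. The formula itself is not reproduced in this text, so the sketch assumes the SiamFC-style relation s(w + 2p) × s(h + 2p) = A with 2p = (w + h)/2 and A = 127² for the exemplar. The function name `crop_sizes` and the way the search region is derived from the exemplar scale are illustrative assumptions, not the authors' code.

```python
import math

EXEMPLAR_SIZE = 127  # template input side length in pixels (from the text)
SEARCH_SIZE = 255    # search input side length in pixels (from the text)


def crop_sizes(w, h):
    """Return (scale, exemplar_region, search_region) for a w x h annotation.

    Assumed relation: s*(w + 2p) * s*(h + 2p) = A,
    with 2p = (w + h) / 2 and A = EXEMPLAR_SIZE ** 2.
    """
    margin = (w + h) / 2.0                            # context margin 2p
    exemplar_region = math.sqrt((w + margin) * (h + margin))
    scale = EXEMPLAR_SIZE / exemplar_region           # scale factor s
    # Assumed: the search crop uses the same scale and maps to 255 pixels.
    search_region = SEARCH_SIZE / scale
    return scale, exemplar_region, search_region


if __name__ == "__main__":
    # The 14 x 15 pixel ship mentioned in the text.
    s, z, x = crop_sizes(14, 15)
    print(f"scale={s:.3f}, exemplar region={z:.1f}px, search region={x:.1f}px")
```

Under this assumption, a small target such as the 14 × 15 pixel ship is magnified several times before it enters the network, which is one reason a modest total stride matters for small objects.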
After obtaining the training sample pairs, the TRBS-Net is trained with a weighted Binary Cross Entropy loss in which ŵ is the weight used to balance positive and negative samples and is set to 0.5. Y is the real-sample score for each exemplar-search pair, and Ŷ belongs to {0, 1} and is used to classify whether the elements on the feature map are positive or negative. Its form depends on d_ec, the distance between the element and the center. We consider elements close to the center to have a greater probability of being foreground and set this distance to 2.
During tracking, the region containing the manually annotated target in the first frame is cropped into a patch with a size of 127 × 127 pixels. In each subsequent frame, the contextual region containing the target is cropped, based on the result of the previous frame, as a patch with a size of 255 × 255 pixels. The image patches are processed by the template branch and search branch of the TRBS-Net, whose CNN embedding with parameters θ is denoted φ_θ. The encoding φ_θ(T) of the target in the template branch is calculated only in the first frame and is then used as the convolution kernel applied to the search image encoding φ_θ(S) for cross-correlation. The TRBS-Net ignores the category of the tracked target and obtains the score map by calculating the cross-correlation between these two branches. The sub-window with the highest score in the candidate image encodes the tracked object, denoted as:
$$f_\theta(T, S) = \varphi_\theta(T) \star \varphi_\theta(S) + b$$
where b ∈ R represents the corresponding bias at every position and f_θ(T, S) is the similarity calculated by the cross-correlation operation ⋆. Multi-scale processing is performed on the results of the template branch to cope with changes in the size of the tracked target. Shape and scale penalties are also imposed to ensure the geometric similarity of objects between two adjacent frames, and a cosine window is used to penalize abnormally large offsets.
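To make the cross-correlation step concrete, the following minimal sketch slides a template over a search map and picks the highest-scoring sub-window. It works on single-channel 2-D lists rather than multi-channel CNN feature maps, and the function names `cross_correlate` and `argmax2d` are hypothetical; it illustrates the operation only and omits the scale, shape, and cosine-window penalties.

```python
def cross_correlate(template, search, bias=0.0):
    """Slide `template` over `search` (2-D lists) and return the score map."""
    th, tw = len(template), len(template[0])
    sh, sw = len(search), len(search[0])
    scores = []
    for i in range(sh - th + 1):
        row = []
        for j in range(sw - tw + 1):
            # Inner product between the template and the current sub-window.
            total = sum(template[a][b] * search[i + a][j + b]
                        for a in range(th) for b in range(tw))
            row.append(total + bias)   # the same bias is added at every position
        scores.append(row)
    return scores


def argmax2d(scores):
    """Return (row, col) of the highest score."""
    best = max((v, i, j) for i, r in enumerate(scores) for j, v in enumerate(r))
    return best[1], best[2]


if __name__ == "__main__":
    template = [[1, 2], [3, 4]]
    search = [[0] * 5 for _ in range(5)]
    # Place a copy of the template with its top-left corner at row 2, column 1.
    for a in range(2):
        for b in range(2):
            search[2 + a][1 + b] = template[a][b]
    score_map = cross_correlate(template, search)
    print(argmax2d(score_map))
```

Because the template encoding is computed once in the first frame, only the search branch and this correlation need to be evaluated for each new frame.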

The mechanism of the Remoulded Kalman Filter (RKF)
In the object tracking task, the Kalman filter can be considered a trajectory corrector that includes a prediction process and an observation process. The prediction process estimates the current state based on the state at the previous moment. The observation process estimates the optimal values at the next time according to the predicted values and measured values at the current time. In the existing satellite video SOT methods (Guo, Yang, and Chen 2019; Xuan et al. 2020), the Kalman filter is only used to predict the center point coordinates and ignores the shape of the target. In satellite videos, ground targets are generally rigid bodies, and their shapes do not change significantly among several adjacent frames. Hence, a Remoulded Kalman Filter (RKF) is proposed to simultaneously predict and update the center point coordinates, velocity, and size of the target.
At frame t, the object's motion state consists of the center coordinates (c_x, c_y), width w, height h, and their respective change values in the time interval Δt. The state transition matrix F_t expresses each of these quantities in the current frame as the sum of its value in the last frame and its offset within the time interval Δt, while the change values themselves remain unchanged. Hence:
$$F_t = \begin{bmatrix} 1 & 0 & 0 & 0 & \Delta t & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & \Delta t & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & \Delta t & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & \Delta t \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix} \quad (5)$$
where Δt is the reciprocal of FPS. Since the tracking process is not affected by any external factors, the control vector u_t is ignored. The covariance matrix P_t, external noise covariance matrix Q_t, observation matrix H_t, and observation noise matrix R_t are initialized correspondingly. Z_t is the result of the TRBS-Net tracker, denoted as:
$$Z_t = [c_x^t, c_y^t, w^t, h^t]^T$$
where c_x^t, c_y^t, w^t, and h^t, respectively, represent the center coordinates, width, and height of the tracked object at frame t.
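The predict/update cycle can be sketched as follows. The state transition matrix described above couples each of c_x, c_y, w, and h only with its own change value, and the observation measures those four quantities directly. If P, Q, and R are additionally taken to be diagonal (an assumption here, since their initial values are not reproduced in this text), the eight-state filter decouples into four independent two-state filters. The class names `AxisFilter` and `RKFSketch` and the noise values `p0`, `q`, and `r` are illustrative assumptions, not the authors' settings.

```python
class AxisFilter:
    """Two-state (value, rate) Kalman filter for one of c_x, c_y, w, h."""

    def __init__(self, value, p0=10.0, q=0.01, r=1.0):
        self.value, self.rate = float(value), 0.0   # change value starts at zero
        self.p = [[p0, 0.0], [0.0, p0]]             # covariance (assumed init)
        self.q, self.r = q, r                       # noise levels (assumed)

    def predict(self, dt):
        # x' = F x with F = [[1, dt], [0, 1]]
        self.value += dt * self.rate
        (a, b), (c, d) = self.p
        # P' = F P F^T + q I
        self.p = [[a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
                  [c + dt * d, d + self.q]]

    def update(self, z):
        (a, b), (c, d) = self.p
        s = a + self.r                # innovation covariance with H = [1, 0]
        k0, k1 = a / s, c / s         # Kalman gain
        y = z - self.value            # innovation
        self.value += k0 * y
        self.rate += k1 * y
        # P = (I - K H) P
        self.p = [[(1 - k0) * a, (1 - k0) * b],
                  [c - k1 * a, d - k1 * b]]


class RKFSketch:
    """Corrects (c_x, c_y, w, h) using four decoupled AxisFilter instances."""

    def __init__(self, box, fps):
        self.dt = 1.0 / fps           # delta t is the reciprocal of FPS
        self.axes = [AxisFilter(v) for v in box]

    def step(self, observed_box):
        # observed_box plays the role of Z_t, the TRBS-Net result.
        for axis, z in zip(self.axes, observed_box):
            axis.predict(self.dt)
            axis.update(z)
        return tuple(axis.value for axis in self.axes)
```

A tracker would call `step` once per frame with the TRBS-Net box. During the first few frames the rate estimates are still settling, which is the burn-in behaviour that the N-frame-convergence mechanism is designed to absorb.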

N-frame-convergence mechanism
The TRBS-Net proposed in Section 2.1 employs the Siamese architecture to track the target and obtain the initial result. The RKF module proposed in Section 2.2 takes the center point coordinates, width, and height from the TRBS-Net result as inputs and predicts them to obtain the rectified result. In this section, an N-frame-convergence mechanism is proposed to fuse the results of the TRBS-Net and RKF modules to obtain the final tracking results, and it also implicitly solves the burn-in period problem in RKF. Specifically, we initialize RKF using the center point coordinates and size of the tracked object at the first frame, and set the speed and size change values to zero. The tracking result obtained by TRBS-Net is used as the observation value to continuously update RKF during tracking. In the first N frames, the object centroid coordinates obtained by the TRBS-Net are adopted as the final position results. Based on the fact that the shape of rigid objects on the ground in satellite video hardly changes, the center point coordinates, width, and height of the tracked target in the current frame are calculated by using the average values of the historical frames. This calculation is expressed by Equation 11, where $x^t_{final}$, $y^t_{final}$, $w^t_{final}$, and $h^t_{final}$, respectively, represent the final results of the target centroid coordinates, width, and height at frame t (t ≤ N). $x^t_{TRBS\text{-}Net}$, $y^t_{TRBS\text{-}Net}$, $w^t_{TRBS\text{-}Net}$, and $h^t_{TRBS\text{-}Net}$, respectively, represent the results of the target centroid coordinates, width, and height obtained by TRBS-Net at frame t. α is the weight factor of the historical frame parameters, and it is experimentally set to 0.3.
After N frames, the final tracking results of the target centroid coordinates in the current frame t adopt the results of RKF, and the size of the target is calculated by formulas in which $x^t_{RKF}$, $y^t_{RKF}$, $w^t_{RKF}$, and $h^t_{RKF}$, respectively, represent the results of the target centroid coordinates, width, and height obtained by RKF at frame t (t > N). The meanings of the other variables are the same as in Equation 11.
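The switching logic can be illustrated with a short sketch. The exact fusion formulas are not reproduced in this text, so the size blend below, α times the mean of the historical sizes plus (1 − α) times the current size, is only an assumed form chosen to match the description of α as the weight of historical frame parameters. The function name `fuse` and its arguments are hypothetical.

```python
def fuse(t, n, trbs_box, rkf_box, size_history, alpha=0.3):
    """Return the final (x, y, w, h) for frame t (1-based).

    Assumed blend (not taken from the source formulas):
        size = alpha * mean(historical sizes) + (1 - alpha) * current size
    """
    # Use TRBS-Net during the first N frames and RKF afterwards.
    x, y, w, h = trbs_box if t <= n else rkf_box
    if size_history:
        mean_w = sum(s[0] for s in size_history) / len(size_history)
        mean_h = sum(s[1] for s in size_history) / len(size_history)
        w = alpha * mean_w + (1 - alpha) * w
        h = alpha * mean_h + (1 - alpha) * h
    return x, y, w, h


if __name__ == "__main__":
    history = [(30.0, 20.0)]                      # earlier final (w, h) values
    trbs, rkf = (10.0, 10.0, 20.0, 10.0), (11.0, 11.0, 22.0, 12.0)
    print(fuse(5, 50, trbs, rkf, history))        # before convergence
    print(fuse(60, 50, trbs, rkf, history))       # after convergence
```

Delaying the hand-over to RKF until frame N gives the filter time to settle while it is still being updated with every TRBS-Net observation.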

Description of the constructed testing dataset
In this work, a manually annotated testing dataset is established to perform ablation experiments and verify the trackers' performance. It includes 12 objects derived from eight videos with a total of 5550 frames, covering four categories: airplane, ship, train, and vehicle. The tracked targets are manually labeled with horizontal bounding boxes in every frame. Because the object's appearance deforms slightly, the extents of the bounding boxes differ for the same object among different frames. The thumbnails of the constructed testing dataset are shown in Figure 6.
Videos 1, 2, 4, 5, 6, 7, and 8 were acquired from "Jilin-1" satellites provided by Chang Guang Satellite Technology Co., Ltd. The ground sample distance (GSD) of the sequence images is 0.92 m. Video 3 was acquired from the International Space Station (ISS) and provided by Deimos Imaging and UrtheCast. The GSD of the sequence images is 1 m. Each video is tagged with the attributes SS, PO, PoV, PTBD, SD, and PGFI, which represent the challenges in the satellite video SOT task. The detailed descriptions of these experimental datasets are given in Table 2.

Evaluation criteria
The precision plot and success plot widely used in natural scene SOT tasks are adopted as evaluation criteria. The precision plot uses the center location error (CLE) to evaluate the center point distance between the prediction box and the ground truth (GT) box. The success plot uses the overlap score to evaluate the Intersection over Union (IoU) between the prediction box and the ground truth box (Wu, Lim, and Yang 2013; Xuan et al. 2020; Wu, Lim, and Yang 2015). For a given tracking bounding box and GT bounding box, CLE measures the distance between the center point (x_tb, y_tb) of the tracking box and the center point (x_gt, y_gt) of the GT box, denoted as:
$$CLE = \sqrt{(x_{tb} - x_{gt})^2 + (y_{tb} - y_{gt})^2}$$
The overlap score S can be represented as:
$$S = \frac{|r_{tb} \cap r_{gt}|}{|r_{tb} \cup r_{gt}|}$$
where ∩ and ∪ represent the intersection and union of the two boxes r_tb and r_gt, and their values are measured by the number of pixels in the regions. The quantized values are represented by the Area Under Curve (AUC) of the precision plot and the success plot. An effective tracker is manifested by large AUC values.
The FPS criterion is adopted to illustrate the speed of trackers in the ablation experiments. The higher the FPS value, the faster the tracking speed.
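The two per-frame measures and a normalised AUC can be sketched as follows. Boxes are assumed to be (x, y, w, h) tuples with (x, y) the top-left corner, areas are computed continuously rather than by counting pixels, and the threshold grids in the usage example (0 to 50 pixels for precision, 0 to 1 for success) follow common benchmark practice rather than anything stated in this text. The function names are hypothetical.

```python
import math


def center_location_error(box_a, box_b):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    ax, ay = box_a[0] + box_a[2] / 2.0, box_a[1] + box_a[3] / 2.0
    bx, by = box_b[0] + box_b[2] / 2.0, box_b[1] + box_b[3] / 2.0
    return math.hypot(ax - bx, ay - by)


def overlap_score(box_a, box_b):
    """Intersection over Union of two (x, y, w, h) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2 = min(box_a[0] + box_a[2], box_b[0] + box_b[2])
    y2 = min(box_a[1] + box_a[3], box_b[1] + box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def curve_auc(values, thresholds, higher_is_better):
    """Mean fraction of frames that pass each threshold (normalised AUC)."""
    rates = []
    for th in thresholds:
        if higher_is_better:
            passed = sum(v > th for v in values)     # success: IoU above threshold
        else:
            passed = sum(v <= th for v in values)    # precision: CLE within threshold
        rates.append(passed / len(values))
    return sum(rates) / len(rates)


if __name__ == "__main__":
    tracked = [(0, 0, 10, 10), (5, 0, 10, 10)]
    truth = [(0, 0, 10, 10), (0, 0, 10, 10)]
    cles = [center_location_error(a, b) for a, b in zip(tracked, truth)]
    ious = [overlap_score(a, b) for a, b in zip(tracked, truth)]
    print(curve_auc(cles, range(0, 51), higher_is_better=False))
    print(curve_auc(ious, [i / 20 for i in range(21)], higher_is_better=True))
```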

Implementation details
Training: The ThickSiam framework uses the natural scene object detection dataset COCO (Lin et al. 2014) and the remote sensing object detection dataset DIOR (Li et al. 2020) jointly for training, and these two datasets generate 141,750 exemplar-search training pairs in total. The TRBS-Net is trained by stochastic gradient descent (SGD) for 30 epochs. The learning rate is decreased from 0.005 to 0.00001. The momentum, weight decay, and batch size are set to 0.9, 0.0001, and 16, respectively. The network parameters are initialized with weights pre-trained on the ImageNet dataset (Russakovsky et al. 2015).
Testing: During testing, the ThickSiam framework is directly tested on our constructed testing dataset without fine-tuning. The score map is upsampled by bicubic interpolation from 17 × 17 pixels to 272 × 272 pixels. The size of objects is regressed over three scales of $1.03^{\{-1, 0, 1\}}$ and updated by linear interpolation with a factor of 0.65 to cope with shape changes. These two parameters are set according to Bertinetto et al. (2016b) to ensure fair comparisons.
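The scale search and size update described for testing can be sketched as below. The text gives the scale step (1.03), the three exponents, and the interpolation factor (0.65), but not which term the factor multiplies; the sketch assumes it weights the newly estimated size, as in SiamFC-style trackers. The function names are hypothetical.

```python
SCALE_STEP = 1.03   # from the text
SCALE_LR = 0.65     # linear interpolation factor from the text


def candidate_scales():
    """Three search scales 1.03 ** k for k in {-1, 0, 1}."""
    return [SCALE_STEP ** k for k in (-1, 0, 1)]


def update_size(size, best_scale, lr=SCALE_LR):
    """Blend the previous (w, h) with the size implied by the best scale.

    Assumed form: new = (1 - lr) * old + lr * (old * best_scale).
    """
    w, h = size
    return ((1 - lr) * w + lr * w * best_scale,
            (1 - lr) * h + lr * h * best_scale)


if __name__ == "__main__":
    print(candidate_scales())
    print(update_size((20.0, 10.0), SCALE_STEP))
```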
We implemented the proposed framework using PyTorch with CUDA toolkit 10.1. The offline pre-training was executed on a GPU server in the Supercomputing Center of Wuhan University, and the tracking experiments were run on one NVIDIA GTX1070Ti GPU under Ubuntu 20.04.

Ablation experiment 1: validation of the ThickSiam framework
The ThickSiam framework includes the TRBS-Net and RKF modules, and the results of these two modules are combined by the N-frame-convergence mechanism. The overall framework was verified module by module under three training schemes: training with the COCO dataset alone, training with the DIOR dataset alone, and training with the COCO and DIOR datasets jointly. The baseline method was built by stacking the original residual block in Figure 3(a) and the down-sampling residual block in Figure 4(a) according to the structure of the TRBS-Net. The RKF module was compared with the standard Kalman Filter (KF for short), which only predicted the center point coordinates of the tracked object. The "+" represented the fusion of results between different modules, and "TRBS-Net+RKF" represented the ThickSiam framework. In this set of ablation experiments, the value of N was 70 for the methods combining the KF and RKF modules. The bold P and S represented the AUC of the precision plot and the AUC of the success plot, respectively. In each column, the bold number indicated the highest value. The experimental results of the ThickSiam framework with different training schemes are shown in Table 3.
Comparing method (a) with method (d), the ablation experiments showed the superiority of the proposed TRBS-Net. Under the three training schemes, the precision values of TRBS-Net were 0.017, 0.005, and 0.025 higher than those of the baseline, and the success values were 0.02, 0.013, and 0.022 higher than those of the baseline, respectively. This indicated that the designed TRB and TMRB modules could improve the feature extraction and representation capability of the network and yield more effective tracking performance.
Compared to the KF, the RKF mechanism took into account the changes in speed and size of the target, and used the appearance features in historical frames to correct the size of the object in the current frame. After adopting this mechanism, the success criterion, which reflects the IoU between the tracking box and the GT box, was significantly improved. These improvements could be verified by the ablation experiments comparing method (b) with method (c) and method (e) with method (f). However, in terms of the precision criterion, not all methods that incorporated the RKF mechanism improved. For example, when trained with the COCO dataset, the precision value of method (d) was 0.002 higher than that of method (f), while the precision values of method (a) and method (c) were almost the same. This indicated that a method trained only on the natural scene dataset did not effectively adapt to the size change of ground targets in remote sensing images. With the other two training strategies, the methods combined with the RKF mechanism achieved higher precision values than the methods without it. In particular, when trained jointly on the COCO and DIOR datasets, the ThickSiam framework obtained the highest precision value of 0.973 and success value of 0.745, which simultaneously verified the effectiveness of the TRBS-Net and RKF modules and the rationality of the training strategies.
In terms of tracking speed, the baseline method obtained the highest FPS value of 63.166. Compared with the baseline method, the FPS value of our TRBS-Net dropped by 4.74. This was because the number of channels in the bottleneck of the TRB and TMRB modules was doubled, making feature extraction more time-consuming. After adopting the RKF mechanism, the FPS values of the baseline method and TRBS-Net dropped slightly because the status of the RKF was updated frame by frame. Our ThickSiam framework obtained an FPS value of 56.849, which could achieve real-time processing for satellite videos. The precision plots and success plots for the comparison results of Ablation Experiment 1 are drawn in subsection 1.1 of the supplementary materials.

Ablation experiment 2: validation of the N-frame-convergence mechanism
The N-frame-convergence mechanism proposed in Section 2.3 was used to combine the results of the TRBS-Net and RKF modules after N frames to solve the burn-in period problem of the RKF mechanism. This section set N to integers between 10 and 100 at intervals of 10, and comprehensively evaluated which value gave the highest precision and success criteria. The relevant comparative experiments and results are shown in Table 4.
When N was equal to 30, the baseline network trained with the COCO dataset obtained a precision value of 0.937 and a success value of 0.708, outperforming the same method without the RKF mechanism, which obtained a precision value of 0.922 and a success value of 0.683. This indicated that the RKF mechanism could prevent the decline in precision and success stemming from the poor performance of the baseline tracker. Trained with the COCO dataset, method (b) obtained its highest precision value of 0.939 without using the RKF mechanism. When N was equal to 10, method (b) obtained its highest success value of 0.716 and a competitive precision value of 0.938. These two comparative experiments showed that adopting the RKF mechanism could not guarantee a steady improvement of the precision criterion when the COCO dataset was used for training. However, the RKF mechanism significantly improved the success criterion.
When N was equal to 80, the baseline network trained with the DIOR dataset obtained a precision value of 0.948 and a success value of 0.72, which were still higher than those of the same method without the RKF mechanism. With the same training strategy, method (b) obtained its highest precision value of 0.962 and its highest success value of 0.737 when N was equal to 70. Compared with the method without the RKF mechanism, the precision and success values increased by 0.019 and 0.036, respectively. This indicated that the RKF mechanism could improve both the center point proximity and the shape similarity between the tracking box and the GT box.
When N was equal to 60, the baseline network trained with the COCO and DIOR datasets obtained a precision value of 0.954 and a success value of 0.722. Compared with the same method without the RKF mechanism, the precision value and success value were 0.02 and 0.023 higher, respectively. With the same training strategy, the TRBS-Net obtained the global optimal precision value of 0.991 and success value of 0.755 when N was 50, which were greatly superior to those of the method without RKF. Therefore, we chose 50 as the value of N in the N-frame-convergence mechanism, and used the result of TRBS-Net jointly trained on the COCO and DIOR datasets as the tracking result in this work. The precision and success fluctuations are drawn in subsection 1.2 of the supplementary materials.
We did not display the FPS criteria in the

Ablation experiment 3: comparisons with state-of-the-art trackers on our constructed testing dataset
To illustrate the superiority of the proposed ThickSiam (TRBS-Net+RKF) framework, 19 other state-of-the-art trackers, including CF-based and DL-based methods with different features and backbones, were tested. Our framework was trained with the COCO and DIOR datasets without fine-tuning on the constructed testing dataset, and N was equal to 50. The results of TRBS-Net under this training scheme were also reported. The other tested trackers include MOSSE (Bolme et al. 2010), CSK (Henriques et al. 2012), KCF (Henriques et al. 2014), CN (Danelljan et al. 2014b), DSST (Danelljan et al. 2014a), Staple (Bertinetto et al. 2016a), SiamFC (Bertinetto et al. 2016b), DCFNet (Wang et al. 2017), ECO (Danelljan et al. 2017), STRCF (Li et al. 2018), ATOM (Danelljan et al. 2019), DiMP (Bhat et al. 2019), SiamFC+ (Zhang and Peng 2019), SiamRPN+ (Zhang and Peng 2019), SiamRPN++ (Li et al. 2019), SiamFC++ (Xu et al. 2020), and ID-DSN (Zhu et al. 2021). They were run in their original environments without any additions. The features, network backbones, and comparison results are shown in Table 5. The "CUDA" column indicates whether the method supported CUDA-based GPU acceleration. If the method supported CUDA, it used the GPU for calculation; otherwise it used the CPU. The top four results are marked in bold red, orange, green, and blue, respectively.
As shown in Table 5, the proposed ThickSiam framework, namely TRBS-Net+RKF, obtained the highest precision value of 0.991 and the highest success value of 0.755. Without the RKF mechanism, our TRBS-Net still obtained a precision value of 0.959 and a success value of 0.721, ranking second. ID-DSN, a tracker specially designed for satellite video SOT, adopted ResNet50 as the backbone and obtained the third-ranked precision value of 0.933 and success value of 0.718. Among the remaining DL-based trackers, methods with shallow backbones such as SiamFC++ (AlexNet), SiamRPN++ (AlexNet), and SiamFC (AlexNet) performed more effectively than those with deep backbones such as DiMP (ResNet18), DiMP (ResNet50), and SiamRPN++ (ResNet50). In particular, SiamFC++ (AlexNet) obtained the fourth-ranked precision value of 0.925 and success value of 0.699. This illustrated that, in satellite video, small objects could not be effectively represented in the deep layers. The shallow layers, by contrast, exhibited advantages because they focused on appearance and shape features.
Among the CF-based trackers, methods characterized by grayscale intensity such as MOSSE and CSK achieved unsatisfactory tracking performance. The MOSSE tracker obtained the global lowest success value of 0.48, which was 0.275 lower than that of our ThickSiam framework. This indicated that using only grayscale intensity features to represent objects could not effectively interpret complex scenes in high-resolution satellite videos. After adopting features (e.g. HOG, Color Table) that were more conducive to representing appearance and shape information, the performance of these CF-based trackers was enhanced to a certain extent. In particular, applying deep features to CF-based trackers such as DCFNet (conv1 from VGG) and ECO (ResNet18 with vgg-m conv1 layer) could achieve quite effective tracking performance, which showed that deep features could effectively represent the ground targets in satellite videos. It was worth noting that these contributing deep features still came from shallow networks, which paid more attention to the appearance and shape attributes of the targets. The precision plots and success plots of the comparison results with the state-of-the-art trackers on our constructed testing dataset are drawn in subsection 1.3 of the supplementary materials.
As for the tracking speed, four trackers, namely SiamRPN++ (AlexNet), SiamFC++ (AlexNet), SiamFC (AlexNet), and SiamRPN+ (ResNet22), achieved more than 100 FPS thanks to GPU-accelerated calculation, and they also obtained considerable precision and success values. It should be noted that all four of these DL-based trackers adopted shallow networks as backbones, which once again indicated that shallow networks were more suitable for the satellite video SOT task and supported the rationality of the offline training and online testing paradigm of DL-based trackers. By contrast, all CF-based trackers had very slow tracking speeds and lost the large computational advantage they hold in natural-scene SOT tasks. We attributed this to two factors: on the one hand, these methods did not utilize the GPU to accelerate calculations; on the other hand, the very large remote sensing image size, complex backgrounds, and abundant information made the satellite video SOT task more difficult, thus greatly reducing the tracking speed. Our ThickSiam (TRBS-Net+RKF) obtained an FPS value of 56.849, which could achieve real-time processing for the satellite videos used in this paper. However, compared with SiamRPN++ (AlexNet), which reached an FPS value of 144.783, the significant difference in the FPS criterion encouraged us to improve the tracker's speed in future work.
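The speed comparison above can be made concrete with a short sketch. It assumes that FPS is measured as the number of processed frames divided by wall-clock time and that "real-time" means the tracker is at least as fast as the video frame rate; the function names and the `track_fn` callback are hypothetical, and the timing protocol actually used in the experiments is not specified here.

```python
import time

def measure_fps(track_fn, frames):
    """Average frames per second of track_fn over a sequence of frames."""
    start = time.perf_counter()
    for frame in frames:
        track_fn(frame)  # hypothetical per-frame tracking callback
    elapsed = time.perf_counter() - start
    return float("inf") if elapsed <= 0 else len(frames) / elapsed

def is_real_time(tracker_fps, video_fps):
    """True when the tracker keeps up with the video frame rate."""
    return tracker_fps >= video_fps

def speed_ratio(fps_fast, fps_slow):
    """How many times faster one tracker is than another."""
    return fps_fast / fps_slow
```

With the reported values, `speed_ratio(144.783, 56.849)` is about 2.5, so SiamRPN++ (AlexNet) ran roughly two and a half times faster than ThickSiam, while `is_real_time(56.849, 10.0)` holds for a 10 FPS video (10 FPS is used here only as an illustrative frame rate).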
The tracking results of the ThickSiam (ours, TRBS-Net+RKF), SiamFC++ (AlexNet), and SiamFC (AlexNet) trackers, which respectively took first, third, and fourth places in the comprehensive ranking of precision and success criteria, were shown in Figure 7. We did not visualize the results of the second-ranked ThickSiam (ours, TRBS-Net) tracker because it was a variant of our ThickSiam framework. The yellow, red, blue, and green bounding boxes represented the results of GT, ThickSiam (ours, TRBS-Net+RKF), SiamFC++ (AlexNet), and SiamFC (AlexNet), respectively.
As Figure 7 showed, all tested trackers could locate the targets in simple scenarios, such as airplane-2, airplane-3, airplane-4, airplane-5, train-1, and vehicle. These three trackers were also very effective in complex scenarios such as ship-1, ship-2, ship-3, and ship-4, where the targets were moving ships disturbed by the surrounding waves.
However, when the target was poorly distinguishable from the background, the tracking performance of SiamFC++ (AlexNet) and SiamFC (AlexNet) was not satisfactory. In the tracking scene of airplane-1, for example, the airplane target was nearly indistinguishable from the airstrip. As could be seen from the tracking results shown in Figure 7, the SiamFC++ (AlexNet) and SiamFC (AlexNet) trackers had lost their size match by the 320th frame. At the 435th frame, the size match of the SiamFC++ (AlexNet) result was still deteriorating, while SiamFC (AlexNet) had completely lost the airplane target. In the tracking scene of train-2, the tracking area contained buildings that were similar to the train target. As early as the 66th frame, the SiamFC++ (AlexNet) and SiamFC (AlexNet) trackers began to lose their size match. At the 307th frame, both trackers could barely track the target, and at the 587th frame they completely failed to track it. Our ThickSiam (TRBS-Net+RKF) tracker showed strong performance in all tracking scenarios. The proposed ThickSiam framework built TRBS-Net by stacking carefully designed TRB and TMRB modules, combined its results with those of the RKF mechanism through the N-frame-convergence method, and yielded effective tracking performance.

Performance of the ThickSiam framework in overcoming multiple challenges
This section discussed whether the ThickSiam framework could cope with the challenges of small size (SS), partial occlusion (PO), persistence of vision (PoV), poor target-background discriminability (PTBD), shape deformation (SD), and poor general field illumination (PGFI) of the tracking targets. Six typical tracking scenes containing the above attributes were selected for analysis. The attributes of the targets and their visualization results were shown in Figure 8.
Figures 8(a) and 8(b) showed that the ThickSiam framework could fully handle the challenges of SS, PoV, SD, and PGFI of the ground objects. Due to the high speed of the flying airplanes, SS, PoV, and SD challenges existed in these two scenarios. In the tracking scene of airplane-1, there was the additional challenge of PTBD between the airplane and the airstrip. Nevertheless, the ThickSiam tracker could still effectively track the target without losing the compactness of the bounding box. The ship in Figure 8(c) was affected by surrounding waves, which showed spectral characteristics similar to those of the tracked ship and thus caused interference. Our ThickSiam tracker focused only on the ship without being disturbed by the waves. The train in Figure 8(d) was confronted with the challenges of SD and PGFI, and the whole scene was relatively gray. Even so, the ThickSiam framework could adaptively expand the tracking range to capture the object more accurately.
However, the ThickSiam tracker also obtained unsatisfactory tracking results in scenarios with PTBD and SD challenges. In the tracking scene of train-2, the ThickSiam tracker gradually lost the size match in later frames. There were two reasons for this phenomenon: on the one hand, the background areas were more complicated, and there were SD and PTBD challenges such as similar trains and buildings; on the other hand, the train, with a size of 154 × 63 pixels in this video, exceeded the range of the template frame, which had a size of 127 × 127 pixels, causing the tracker to extract incomplete features. In video 8, the tracking target was a vehicle with a size of 10 × 30 pixels. The vehicle was fully visible at the 66th frame, and it was partially occluded by a bridge at the 177th frame. As the visualized results (displayed with a red box) showed, the ThickSiam framework still recognized the full shape of the vehicle despite the partial occlusion. This resulted from the cooperation between the RKF module, which simultaneously corrected the trajectory and size of the targets, and the N-frame-convergence mechanism, which combined the tracking results of historical frames and the current frame. However, there was a large SD challenge when the vehicle turned around, and the ThickSiam tracker could not effectively adapt to the size change.
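To illustrate the general idea of smoothing both the trajectory and the size of a bounding box with a Kalman filter after a burn-in period, a minimal sketch is given below. It is not the RKF module or the N-frame-convergence mechanism of this paper: it assumes independent constant-velocity filters for each of the four box parameters (center x, center y, width, height), it simply returns the network output during the first N frames and the filtered estimate afterwards, and the noise parameters `q` and `r`, the default N of 5, and the class names are all hypothetical.

```python
class ScalarKalman:
    """Constant-velocity Kalman filter for one scalar (value + velocity)."""

    def __init__(self, z0, q=1e-2, r=1.0):
        self.p, self.v = float(z0), 0.0        # state: value and velocity
        self.P = [[1.0, 0.0], [0.0, 1.0]]      # state covariance
        self.q, self.r = q, r                  # assumed process/measurement noise

    def step(self, z):
        # Predict with F = [[1, 1], [0, 1]] and Q = q * I.
        p, v = self.p + self.v, self.v
        (a, b), (c, d) = self.P
        P00, P01, P10, P11 = a + b + c + d + self.q, b + d, c + d, d + self.q
        # Update with measurement z observed through H = [1, 0].
        S = P00 + self.r
        k0, k1 = P00 / S, P10 / S
        y = z - p
        self.p, self.v = p + k0 * y, v + k1 * y
        self.P = [[(1 - k0) * P00, (1 - k0) * P01],
                  [P10 - k1 * P00, P11 - k1 * P01]]
        return self.p


class BoxSmoother:
    """Filters (cx, cy, w, h); trusts the network during the first N frames."""

    def __init__(self, first_box, n_burn_in=5):
        self.filters = [ScalarKalman(z) for z in first_box]
        self.n_burn_in = n_burn_in
        self.seen = 0

    def update(self, net_box):
        filtered = tuple(f.step(z) for f, z in zip(self.filters, net_box))
        self.seen += 1
        # The filter still runs during burn-in so that it can converge.
        return tuple(net_box) if self.seen <= self.n_burn_in else filtered
```

In this sketch a sudden jump in the network output, such as a box that shrinks when the target is partially occluded, is only partly followed by the filtered estimate, which is the kind of behavior that helps keep the box size stable.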

Applicable scenarios and limitations of the ThickSiam framework
This work focused on SOT for short-term, high-frame-rate (over 10 FPS) satellite videos. In terms of operational applications, our approach could be used for a variety of tasks. In the military field, object tracking had been widely applied to various aspects of modern warfare such as battlefield dynamic analysis, weapon navigation, missile early warning, and reconnaissance. In the civilian field, object tracking was critical in intelligent traffic management systems, global positioning systems, video surveillance, and environmental monitoring (Song et al. 2022; Zhang et al. 2021).
Remote sensing data had disadvantages such as a large field of view with small targets, weak object features, and poor target-background discriminability, as shown in Figure 9. Therefore, satellite video SOT methods should pay more attention to improving object feature representation and enhancing the discrimination between the target and the background. An attention mechanism might be an effective choice to achieve this goal. In addition, a strong correlation strategy should be designed to explore the cross-correlation between the template and search branches.
Although the ThickSiam framework could effectively track specified targets in high-frame-rate satellite videos, the duration of these videos was usually only 90 seconds. Such a short duration made the framework less useful for tracking military targets. In addition, the ThickSiam framework was not suitable for tracking targets in satellite data (over a large area) with relatively long revisit cycles, unless a satellite constellation was available to provide continuous data support. Change detection might be one way to track targets in this type of data.
Nevertheless, the ThickSiam framework proposed in this paper still brought new ideas to the satellite video SOT task. It was a further development of the object detection method and could also provide a reference for the multi-object tracking (MOT) task.

Conclusions
In this paper, an effective ThickSiam framework is proposed to address the high-resolution satellite video SOT problem. The ThickSiam tracker consists of a well-designed TRBS-Net, built by stacking thickened residual blocks and thickened maxpooling residual blocks to obtain the initial tracking results, and an RKF module to simultaneously correct the trajectory and size of the tracking target. An N-frame-convergence mechanism is designed to deal with the burn-in period problem and to combine the results of the TRBS-Net and RKF modules frame by frame. We also construct a testing dataset for satellite video SOT and make it available to the public to facilitate this task.
Internal and external comparison experiments are implemented on the constructed testing dataset to verify the effectiveness of the proposed framework. The ThickSiam framework is compared with 19 other state-of-the-art CF-based and DL-based trackers using different features and backbones. Our ThickSiam tracker obtains the best overall precision value of 0.991 and success value of 0.755. It also obtains an FPS value of 56.849 on a single GPU and delivers real-time processing of satellite videos. Experimental results show that the ThickSiam framework can cope with SS, PoV, SD, PO, and PGFI challenges. When SD and PTBD challenges exist in the same scene, the tracking results of ThickSiam exhibit slight model drift but still maintain similarity with the GT in object shape.
In future work, we will further expand the scale of the testing dataset, incorporate semantic features of targets (Chen et al. 2021; Zhu et al. 2022) to design a more effective object tracker, and conduct MOT research in satellite videos.

Highlights
• Siamese network is used for satellite video single object tracking.

Figure 1. In high-resolution satellite videos, ground targets show attributes such as SS, PO, PoV, PTBD, SD, and PGFI. These attributes are the challenges in the current satellite video SOT task and also make natural scene-based trackers inapplicable to satellite videos.

Figure 2. The overall tracking workflow of ThickSiam. It includes TRBS-Net for extracting robust semantic features to obtain the initial tracking results and an RKF module for simultaneously correcting the trajectory and size of the targets. The results of the TRBS-Net and RKF modules are combined by an N-frame-convergence mechanism to achieve the final tracking results.

Figure 3. The structure comparison of the original residual block and the proposed TRB. Based on the original residual block, the modifications of the TRB include doubling the number of channels in the bottleneck and cropping out the outermost elements of the feature map.
(a) The paradigm of original down-sampling residual block (b) The paradigm of thickened maxpooling residual block (TMRB)

Figure 4. The structure comparison of the original down-sampling residual block and the proposed TMRB. Based on the original down-sampling residual block, the modifications of the TMRB include doubling the number of channels in the bottleneck, cropping out the outermost features, modifying the stride in the above two convolutional modules to 1, and adding a maxpooling layer to achieve down-sampling of the feature map.

Figure 5. The constructed exemplar-search training pairs. The annotation selected from the DIOR (Li et al. 2020) object detection dataset is expanded outward by 1/2 of the sum of width and height, and scaled according to the sizes of the exemplar image and search image of the TRBS-Net.

Figure 6. The constructed testing dataset used in the experiments. There are twelve objects in eight satellite scenarios, and the targets consist of airplanes, ships, trains, and a vehicle.

Figure 8. The tracking results of the ThickSiam tracker in six typical scenarios containing all attributes including SS, PO, PoV, PTBD, SD, and PGFI.

Figure 9. The field of view of "Jilin-1" satellite video. The specified area in the red box on the left was enlarged and displayed in the upper right corner. Cars on the bridge were further magnified for visual display. These targets usually had only a few to a few dozen pixels; their ultra-small size made their apparent features weak and their boundary with the background unclear.

Table 1. The detailed information of the feature map in each stage of TRBS-Net. s is the abbreviation of stride. conv1 and bn1 represent, respectively, the convolutional layer and batch normalization in stage1.

Table 2. The detailed descriptions of the constructed testing dataset. px is the abbreviation of pixel. Attributes refer to the difficulties in tracking this object.

Table 3. The experimental results of the ThickSiam framework with different training mechanisms. The baseline method was stacked by the original residual block and down-sampling residual block according to the structure of the TRBS-Net.
Table 4 because this set of ablation experiments only changed the value of N in the RKF mechanism while keeping the network structure unchanged, thus not adding extra time-cost.

Table 5. Comparisons with the state-of-the-art trackers on our constructed testing dataset.

• Remoulded Kalman Filter simultaneously corrects the trajectory and size of the target.
• Object tracking training sample pairs are constructed from static images.
• Manually annotated satellite video object tracking testing dataset for public research.