A method of real-time object tracking combining temporal and spatial information

The purpose of single-target tracking is to locate a specific object accurately and continuously while it is moving. However, when the target undergoes fast movement, severe occlusion, very small size, or shares local features with other objects, tracking algorithms based on correlation filters or convolutional neural networks produce positioning errors. To address these problems, this paper designs a single-target tracking algorithm: the relative temporal-spatial network (RTSnet). RTSnet is a multi-thread network composed of the Relative Temporal Information Network (RTInet) and the Relative Spatial Information Network (RSInet). RTInet is designed on the basis of LSTM and inherits its temporal prediction capability; it mainly obtains the relative temporal information between consecutive frames of the target. RSInet, an improved Siamese network based on the Triplet Network, performs similarity determination to obtain the spatial information between consecutive frames of the target. In the experiments, RTSnet is trained on the LASOT dataset and evaluated on the LASOT test set and the OTB100 dataset. On the LASOT test set, the accuracy of RTSnet reaches 85.5%, while Trans-T reaches 62.3% and STMTrack reaches 57.4%. Meanwhile, its tracking speed reaches 117.3 fps because RTSnet adopts dual-thread operation. On the OTB100 dataset, the accuracy of RTSnet is 81.1%.


Introduction
Target tracking is the accurate, real-time positioning of continuously moving targets. This technology is widely used in computer vision, for example in video surveillance, robotics, and drone positioning. According to the number of targets, target tracking can be divided into single-target tracking and multi-target tracking. Single-target tracking is the most basic theoretical model in target tracking. It mainly includes tracking algorithms based on correlation filtering (CF) (Henriques et al. 2012; Li & Zhu 2014; Galoogahi et al. 2017; Wang et al. 2017; Possegger et al. 2015) and tracking algorithms based on convolutional neural networks (CNN) (Sandler et al. 2018; Iandola et al. 2016; Szegedy et al. 2015; Simonyan and Zisserman 2014). This paper proposes a new single-target tracking algorithm built on the relative information generated by the continuous movement of the target.
Single-target tracking algorithms based on CF mainly use a filter template to perform correlation processing on each frame of the input image, achieving continuous positioning of the target. These algorithms include MOSSE (Bolme et al. 2010), KCF (Henriques et al. 2015), SRDCF (Danelljan et al. 2015), DSST (Danelljan et al. 2014), etc. They have certain limitations in obtaining image features: when the background of the moving target is too complex, the accuracy of their feature extraction drops greatly. Since 2012, researchers in this field have proposed many single-target tracking algorithms based on CNN, such as SiamFC (Bertinetto et al. 2016a), CFNet (Valmadre et al. 2017), SiamRPN (Li et al. 2018), and GOTURN (Held et al. 2016). Compared with CF-based trackers, CNN-based single-target trackers have achieved good results on the key indicators of tracking speed and accuracy. CNN-based trackers have greatly improved in accuracy and speed, but when the target undergoes fast motion, severe occlusion, or shares local features with other objects, their accuracy drops quickly. At the same time, CF-based single-target trackers cannot achieve real-time results in both accuracy and speed.
In response to the phenomena mentioned above, this paper designs a relative temporal-spatial information network (RTSnet) based on the relative spatial and relative temporal information of a continuously moving target. RTSnet combines the relative spatial network (RSInet) and the relative temporal network (RTInet). Within RTSnet, RSInet, a trilinear network designed based on the Triplet Network (Hoffer and Ailon 2015), Trans-T (Chen et al. 2021) and Learn-F (Cheng et al. 2021), mainly uses the relative spatial information among the targets of three consecutive frames to predict the spatial information of the next target. RTInet is a temporal prediction network based on the LSTM algorithm (Yang et al. 2016; Zhou and Xu 2015; Greve et al. 2016; Gulcehre et al. 2016) and STMTrack (Fu et al. 2021), which predicts the temporal information of the next target from the relative temporal information among the targets of three consecutive frames. Unlike STMTrack, Trans-T and Learn-F, RTInet and RSInet run as multi-threaded parallel operations. RTSnet then fuses the spatial information from RSInet and the temporal information from RTInet to obtain the target information for the next frame.
The main contributions of this paper can be summarized as follows:
1. This paper proposes an efficient tracking algorithm based on multi-type information fusion, combining relative spatial information and relative temporal information. On the one hand, fusing multiple types of information improves the algorithm's accuracy; on the other hand, relative processing between consecutive frames improves its running speed.
2. For spatial information extraction, this paper designs the RSInet model. Unlike traditional feature extraction algorithms, RSInet is built on a trilinear Siamese network and obtains the relative spatial information of the target in the next frame by processing three consecutive frames.
3. For temporal information extraction, this paper designs the RTInet model. Unlike traditional temporal feature extraction methods, RTInet is an improvement on the LSTM model and obtains the relative temporal information of the target in the next frame by reading three consecutive frames.
4. Experiments on the LASOT (Fan et al. 2018) and OTB100 (Wu et al. 2015) datasets show that spatial information benefits target tracking; comparisons with several SOTA trackers demonstrate the superiority of the proposed algorithm.

Dataset and prior-work
This section introduces the relevant datasets, explains the design ideas of RTInet and RSInet, and describes the network structure of RTSnet.

Dataset
LASOT is a long-term tracking dataset. It has 1400 video sequences, and each video has an average of 2512 frames; the shortest video has 1000 frames and the longest contains 11,397 frames. The OTB100 dataset contains 98 videos covering 100 test scenarios. The coordinates of the object to be located in each picture are recorded in the dataset's groundtruth_rect file: each line gives the coordinates of the upper-left corner of the bounding box and its width and height.
TD: We first extract the target's location information from both LASOT and OTB100. Then, we subtract the location at the current moment from the location at the next moment to obtain the relative temporal information of the target. The collection of this relative temporal information is the dataset TD.
SD: We first crop the images of each sequence in LASOT and OTB100 according to the target position at the corresponding time; the image set obtained after cropping is the spatial information dataset SD of the target.
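The TD construction described above can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the helper names are hypothetical, and the box format follows the OTB/LASOT groundtruth_rect convention (x of the top-left corner, y, width, height).

```python
# Illustrative sketch (not the authors' code): building TD samples by
# differencing consecutive target annotations. Box format: (x, y, w, h).

def to_center(box):
    """Convert a (x, y, w, h) annotation to its center point."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def relative_temporal_info(annotations):
    """Return the per-frame displacement of the target center (TD samples)."""
    centers = [to_center(b) for b in annotations]
    return [
        (cx2 - cx1, cy2 - cy1)
        for (cx1, cy1), (cx2, cy2) in zip(centers, centers[1:])
    ]

# Example: a target drifting right and slightly down.
boxes = [(10, 20, 40, 30), (14, 21, 40, 30), (19, 23, 40, 30)]
td = relative_temporal_info(boxes)
print(td)  # [(4.0, 1.0), (5.0, 2.0)]
```

A sequence of N annotations yields N-1 relative displacements, which matches the paper's observation that the relative sequence is much smoother than the raw position sequence.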

RTInet
We can draw a conclusion from StruckSiam that the position of a moving object over time exhibits continuity. We further calculate the relative temporal information of the target on the basis of that conclusion. As shown in Fig. 1, (a) is the target position sequence of airplane-6 from LASOT and (b) is the relative position sequence of the target. It can be seen from Fig. 1 that the data in (a) fluctuate in a discrete manner, while (b) is relatively stable. Therefore, we can obtain more accurate relative temporal feature information about the target through the calculation method of (b).
On the basis of the above conclusion, we design an RTInet algorithm which can extract the relative temporal information, as depicted in Fig. 2.
RTInet is designed on the basis of the LSTM algorithm and trained on the TD dataset. The data in TD are obtained as anno_t − anno_{t−1}, where anno_t and anno_{t−1} are the temporal information of the target at the current moment and the previous moment, respectively, as shown in formula 1. We obtain the relative vector r⃗_t|(x, y) from the center point (x, y) of the target position in each frame by subtracting the previous frame's center from the current frame's center; r⃗_t|(x, y) indicates in which direction the target position of the next frame will lie. Then, we feed the absolute values of r_t and r_{t−1} into RTInet to compute the predicted value r_{t+1} for the next frame. Since r_{t+1} is an absolute value, adding it to anno_t yields prediction data for the next frame in 16 different directions. To obtain a more accurate prediction, we further filter these candidates according to the angle of r⃗_t|(x, y) and the relative spatial information.
RTInet first obtains the target's temporal information from three consecutive frames P_{t−2}, P_{t−1}, P_t, namely the position information anno_{t−2}, anno_{t−1} and anno_t. Then, the relative temporal information r_{t−1}, r_t is obtained by formula 1. Finally, r_{t−1} and r_t are processed by the rti module to obtain the relative temporal information r_{t+1} of the next frame. In short, RTInet predicts the relative temporal information of the target in the next frame from the relative temporal information of three consecutive frames. Compared with STMTrack, RTInet predicts changes in relative temporal information, which reduces the error rate.
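The recurrence RTInet builds on can be illustrated with a minimal single-cell LSTM step in NumPy. This is a generic sketch, not the authors' implementation: RTInet's layer sizes, gate layout, and output head are not specified in this excerpt, so the dimensions and random weights below are placeholders.

```python
import numpy as np

# Minimal single-cell LSTM step — a sketch of the recurrence underlying RTInet.
def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4H, D), U: (4H, H), b: (4H,). Gate order: i, f, o, g."""
    z = W @ x + U @ h_prev + b
    H = h_prev.shape[0]
    i = 1 / (1 + np.exp(-z[0:H]))       # input gate
    f = 1 / (1 + np.exp(-z[H:2*H]))     # forget gate
    o = 1 / (1 + np.exp(-z[2*H:3*H]))   # output gate
    g = np.tanh(z[3*H:4*H])             # candidate cell state
    c = f * c_prev + i * g              # new cell state
    h = o * np.tanh(c)                  # new hidden state
    return h, c

rng = np.random.default_rng(0)
D, H = 2, 8                             # input: a 2-D displacement (dx, dy)
W = rng.normal(size=(4 * H, D))
U = rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)
h = c = np.zeros(H)
# Feed the relative displacements r_{t-1}, r_t; a linear head on the final
# hidden state would then regress the predicted displacement r_{t+1}.
for r in [np.array([4.0, 1.0]), np.array([5.0, 2.0])]:
    h, c = lstm_step(r, h, c, W, U, b)
print(h.shape)  # (8,)
```

Feeding only the short relative sequence (rather than absolute positions) is what keeps the inputs in the stable range shown in Fig. 1(b).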

RSInet
A moving target has not only temporal information but also spatial information. For example, SiamFC and SiamRFC use the spatial information of the target to achieve continuous positioning. When analyzing a moving target, we divide each image into foreground and background and analyze them separately: the foreground mainly describes the spatial information of the target, and the background mainly describes the background information around it. If the spatial information of the target is obtained from the entire image, it will be disturbed by the background, which not only slows down the calculation but also reduces the accuracy of target information acquisition. To address these problems, we propose relative spatial information and design a relative spatial information network, RSInet.
As shown in Fig. 3, we first crop three corresponding target images from three consecutive frames, so that we focus only on the spatial information of the target and avoid the interference of background information. After the three cropped target images are subtracted between the front and rear frames, two relative images of the target are obtained. RSInet is used to learn the relationship between consecutive relative images to predict the target image information in the next frame; its process is shown in Fig. 4. RSInet first extracts three target images c_{t−2}, c_{t−1}, c_t from the three consecutive frames P_{t−2}, P_{t−1}, P_t according to the target position. Then, we feed c_{t−2}, c_{t−1}, c_t into rsi, the feature extraction part of RSInet, which is mainly composed of 3×1 and 1×3 convolutional layers; the parameters are shown in Table 1. The images c_{t−2}, c_{t−1}, c_t are processed by rsi to obtain the corresponding high-dimensional features f_{t−2}, f_{t−1}, f_t, from which the parameter l is obtained by formula 4. The parameter l mainly represents the relative ratio of the spatial feature information generated by the target across three consecutive frames. We obtain the relative vector r⃗_{t+1}|(x, y) of the target center position in the next frame by multiplying r⃗_t|(x, y) by l.
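The ratio computation above can be sketched as follows. Note the hedges: formula 4 is not reproduced in this excerpt, so the ratio of successive feature-difference magnitudes below is one plausible reading, and the feature extractor is a fixed projection standing in for the 3×1/1×3 convolutional stack of rsi.

```python
import numpy as np

def extract_features(image, proj):
    """Stand-in for rsi: flatten the crop and project it to a feature vector."""
    return proj @ image.ravel()

def relative_ratio(f_prev2, f_prev1, f_cur, eps=1e-8):
    """Ratio of successive relative feature changes (assumed form of formula 4)."""
    d1 = np.linalg.norm(f_prev1 - f_prev2)   # change from frame t-2 to t-1
    d2 = np.linalg.norm(f_cur - f_prev1)     # change from frame t-1 to t
    return d2 / (d1 + eps)

rng = np.random.default_rng(1)
proj = rng.normal(size=(16, 8 * 8))                 # toy 64-pixel crops -> 16-D features
crops = [rng.normal(size=(8, 8)) for _ in range(3)] # c_{t-2}, c_{t-1}, c_t
f = [extract_features(c, proj) for c in crops]
l = relative_ratio(*f)
r_t = np.array([5.0, 2.0])
r_next = l * r_t   # predicted relative vector r_{t+1} for the next frame
print(l > 0)  # True
```

The key design point survives the simplification: only differences between consecutive crops enter the prediction, so static background content cancels out.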
RSInet mainly obtains the relative spatial information of the target from three consecutive frames and predicts the target information in the next frame. In terms of spatial feature extraction, compared with the relation detector (RD) of Learn-F and the attention mechanism of Trans-T, RSInet utilizes the Triplet Network to obtain the relative spatial features of consecutive targets and is therefore more focused on the relative spatial information of the target. As a result, interference from background noise is reduced and the accuracy of the relative feature information is improved.

RTSnet
RTInet extracts the relative temporal information of the target from three consecutive frames and predicts the relative temporal information of the target in the next frame. RSInet extracts the relative spatial information of the target from three consecutive frames and predicts the relative spatial information of the target in the next frame via the rsi module. We use multi-threaded parallel computing to fuse the two types of information, as shown in Fig. 5.
The main thread reads the image information. The RSInet thread reads the target's spatial information c_{t−2}, c_{t−1}, c_t from P_{t−2}, P_{t−1}, P_t and obtains the relative spatial information ratio l of the target in the next frame. The RTInet thread reads the relative temporal information r_{t−1}, r_t of the target from P_{t−2}, P_{t−1}, P_t and obtains the relative temporal information r_{t+1} of the target in the next frame. Then, the target information anno_t at the current moment, the relative temporal information r_{t+1}, and the relative spatial information ratio l are combined through the following formula to obtain the position information anno_{t+1} of the target in the next frame.
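The dual-thread step described above can be sketched with the standard library. Both workers are stubs (the real RSInet/RTInet outputs would replace the hard-coded values), and the fusion rule anno_{t+1} = anno_t + l · r_{t+1} is an assumed reading of the formula, which is not reproduced in this excerpt.

```python
from concurrent.futures import ThreadPoolExecutor

def rsinet_worker(frames):
    """Stub: would return the relative spatial ratio l from three crops."""
    return 0.5

def rtinet_worker(displacements):
    """Stub: would return the predicted relative displacement r_{t+1}."""
    return (5.0, 2.0)

def track_step(anno_t, frames, displacements):
    """Run both branches in parallel, then fuse: anno_{t+1} = anno_t + l * r_{t+1}."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        fut_l = pool.submit(rsinet_worker, frames)
        fut_r = pool.submit(rtinet_worker, displacements)
        l = fut_l.result()
        dx, dy = fut_r.result()
    x, y = anno_t
    return (x + l * dx, y + l * dy)

print(track_step((30.0, 35.0), frames=None, displacements=None))  # (32.5, 36.0)
```

Because the two branches are independent until the final fusion, their latency overlaps, which is the source of the speedup the paper attributes to dual-thread operation.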

Experiments
In this section, we discuss the training details of RTSnet and compare it with some SOTA algorithms in terms of accuracy and tracking speed on the LASOT test set and the OTB100 test set, respectively.

Experiments preparation
Experimental environment: The test platform is an AMD Ryzen 3 2200G CPU; the training platform is a GeForce RTX 2070 GPU.
Training details: The purpose of RSInet is to obtain the relative spatial information of the next frame from the relative spatial information between successive frames. When training RSInet, we first extract the target images P_{t−1} and P_t of two consecutive frames from each sequence and designate them as a similar pair; we then extract the target image P_{t+50}, 50 frames after P_t, as a non-similar image. Finally, we train RSInet by making the distance between P_t and P_{t+50} much greater than the distance between P_t and P_{t−1}. Experiments show that the accuracy of image similarity determination is highest when the value of l lies in (0.25, 0.75). RTInet generates the relative temporal information of the next frame from the relative temporal information between successive frames. When training RTInet, we first extract the temporal information of four consecutive frames from the sequence, namely the target's location information anno_{t−2}, anno_{t−1}, anno_t, anno_{t+1}. Then, the relative temporal information r_{t−1}, r_t and r_{t+1} between successive frames is obtained. Finally, RTInet is trained to fit the prediction of r_{t+1} from r_{t−1} and r_t. Training: Since RTSnet is composed of RSInet and RTInet, it is trained in two parts. RTInet is trained on the TD dataset and optimized under formula 5, aiming to make the predicted value of RTInet close to the true value.
RSInet is trained on the SD dataset and optimized under formula 6, aiming to reduce the relative distance Δf between the similar image pair (P_t, P_{t−1}) and enlarge the relative distance Δf between the non-similar image pair (P_t, P_{t+50}).
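The RSInet objective can be sketched as a triplet-style margin loss. Formula 6 is not reproduced in this excerpt, so the exact form below (the standard triplet margin loss, with the frame-t crop as anchor, the frame-(t−1) crop as positive, and the frame-(t+50) crop as negative) is an assumption; the toy feature vectors are placeholders.

```python
import numpy as np

def triplet_margin_loss(f_anchor, f_positive, f_negative, margin=1.0):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    d_pos = np.linalg.norm(f_anchor - f_positive)   # similar pair distance
    d_neg = np.linalg.norm(f_anchor - f_negative)   # non-similar pair distance
    return max(0.0, d_pos - d_neg + margin)

# Toy features: P_t (anchor), P_{t-1} (similar), P_{t+50} (non-similar).
f_t      = np.array([1.0, 0.0])
f_t_prev = np.array([1.1, 0.0])   # close to the anchor
f_t_50   = np.array([5.0, 4.0])   # far from the anchor
loss = triplet_margin_loss(f_t, f_t_prev, f_t_50)
print(loss)  # 0.0 -- the margin is already satisfied
```

Minimizing this loss realizes exactly the stated goal: consecutive-frame distances shrink while the distance to the frame 50 steps later grows until the margin is met.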

Experimental results on LASOT
We utilize the LASOT toolkit to evaluate the tracking effect of RTSnet and the contrasting algorithms. The evaluation relies on two criteria: precision plots and success plots. For the success plot, we first calculate the IoU (Ren et al. 2017) overlap between the predicted bounding box of each frame and the ground truth of that frame, and then report the fraction of frames whose IoU exceeds a given threshold. The precision plot reports the fraction of frames whose location error is within a given distance. We rank the evaluated trackers by the area under the curve of the success plot and the precision plot. To verify the efficiency and accuracy of RTSnet, we compare it with some SOTA tracking networks, including STMTrack, Trans-T, Learn-F, MDNet (Nam and Han 2016), and Staple (Bertinetto et al. 2016b).
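The success-plot metric described above is easy to make concrete. The sketch below uses the OTB/LASOT (x, y, w, h) box convention; the function names are illustrative, not the toolkit's API.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))   # intersection width
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))   # intersection height
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def success_rate(pred_boxes, gt_boxes, threshold=0.5):
    """Fraction of frames whose IoU with ground truth exceeds the threshold."""
    hits = sum(iou(p, g) > threshold for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(gt_boxes)

print(iou((0, 0, 10, 10), (5, 0, 10, 10)))  # 0.3333333333333333
```

Sweeping the threshold from 0 to 1 and plotting the success rate at each value produces the success plot; the area under that curve is the ranking score used here.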

Fig. 5 RTSnet algorithm structure diagram
These comparison trackers cover both correlation-filter-based and Siamese-based algorithms that achieve efficient tracking. The experimental results are shown in Table 2.
Compared with algorithms such as StruckSiam, D-Siam, and CSRDCF, RTSnet uses a multi-type information fusion method that ensures the diversity and accuracy of target feature information, which benefits accuracy. In terms of tracking speed, RTSnet reduces the amount of computation through relative processing between consecutive frames on the one hand, and accelerates computation through multi-threaded parallel processing on the other. Figure 6 shows the success plots on the LASOT dataset. The proposed RTSnet achieves an accuracy of 85.5%, which is 50% and 54% higher than StruckSiam and STRCF, respectively. From Table 2, we can see that RTSnet's tracking speed reaches the optimal value of 117.3 fps on the LASOT dataset.

Experimental results on OTB100
The OTB100 dataset contains 98 video sequences and 11 attributes. Here, we use the OTB100 toolkit to compare RTSnet with SiamFC, CFNet, and CSRDCF; the results are shown in Fig. 7 and Table 3. RTSnet achieves an accuracy of 81.1%, higher than the other tracking algorithms: 21% higher than SiamFC and 19% higher than CFNet. RTSnet's tracking speed reaches the optimal value of 102.3 fps on the OTB100 dataset.
Compared with other algorithms, RTSnet uses a multi-information fusion method to achieve an accuracy of 81.1%, and it combines relative processing of consecutive frames with multi-threaded parallel processing to reach a speed of 102.3 fps.

Discussion
In this section, we compare RTSnet with some SOTA algorithms under the 14 attributes of the LASOT dataset and discuss the actual tracking effect in detail.

Attribute evaluation
All sequences in the LASOT dataset are annotated with a total of 14 different attributes. We first analyze performance by dataset attribute. Using the accuracy in Fig. 6 as the benchmark, we calculate the relative change in accuracy of the three tracking algorithms RTSnet, StruckSiam and STRCF under the 14 attributes, as shown in Fig. 8. From the accuracy changes in Fig. 8, we find that under the FM, FOC, LR, and OV attributes, all three algorithms show relatively large drops, but the change of RTSnet is smaller than that of the other two algorithms. This is mainly because RTSnet adds both relative spatial information and relative temporal information to the tracking process, while StruckSiam and STRCF only use spatial information; in the comprehensiveness of information, RTSnet is therefore superior to the other two algorithms. Under the ARC, BC, ROT, VC, and SV attributes, RTSnet shows a positive change while StruckSiam and STRCF both show negative changes. This is mainly because BC, ARC, ROT, SV, and VC involve unfavorable factors such as background interference and viewing-angle changes. RTSnet exploits RTInet to obtain temporal feature information and RSInet to obtain spatial features of the target, so it can reduce the interference of background, viewing angle and other factors. StruckSiam obtains high-dimensional image features based on CNN principles, but under background interference and viewing-angle changes, CNNs make large errors in extracting target features. STRCF, which is based on a correlation filter, uses the spatial information of the preceding and following frames to locate the target, thus reducing the error to some extent.
Under the MB attribute, the accuracy of all three algorithms decreases due to camera motion. Although the accuracy of all three algorithms also drops under the POC attribute, RTSnet can still perform high-accuracy real-time tracking, mainly because the temporal feature information of the target is added to RTSnet, which reduces the influence of some occlusion factors. By contrast, the accuracy of StruckSiam and STRCF drops greatly under occlusion.
When encountering OV, FM, LR, and FOC, StruckSiam and STRCF have serious shortcomings in obtaining target information. By contrast, RTSnet, with the help of RTInet, not only obtains more temporal information about the target but also compensates for the insufficient acquisition of spatial information.
We then analyze the success plots of RTSnet and the comparison networks on the LASOT dataset, shown in Fig. 9. It can be concluded from Fig. 9 that RTSnet achieves higher accuracy under all attributes, mainly because, compared with traditional algorithms, RTSnet fuses multiple types of information during information acquisition.

Qualitative evaluation
Here we further discuss the tracking effect of RTSnet. We selected 4 video sequences from LASOT, namely Basketball-6, Bear-6, Boat-6 and Bottle-6.
As shown in Fig. 10, the RTSnet algorithm is compared with the StruckSiam and STRCF algorithms in terms of tracking effect. In the Basketball-6 video sequence, StruckSiam suffers severe target loss during long-term tracking. Under high-speed ball movement and complete occlusion, STRCF also suffers short-term target loss. Compared with the ground truth, RTSnet achieves short-term prediction and tracking of the ball's motion and is not affected by occlusion or fast motion. In the Bear-6 video sequence, StruckSiam makes target tracking errors because too many targets share the same characteristics, whereas RTSnet and STRCF can still track the target in real time despite camera shake. In the Boat-6 video sequence, because the target is too small, StruckSiam and STRCF cannot locate it precisely; by contrast, RTSnet achieves real-time tracking of small objects based on the feature relationship between consecutive frames. The Bottle-6 video sequence contains 4 identical bottles. When one bottle falls and is partially occluded by another, StruckSiam and STRCF make target positioning errors due to the similar appearance of the bottles. By contrast, RTSnet tracks the bottle according to the changes in the spatial and temporal feature information between consecutive frames.

Fig. 8 Relative accuracy comparison of RTSnet, StruckSiam and STRCF under the 14 attributes of the LASOT dataset
Fig. 9 Accuracy comparison of RTSnet and some algorithms under the 14 attributes of LASOT

Conclusion
In this paper, we propose a single-target tracking network, RTSnet, that integrates temporal and spatial information. RTSnet is composed of the relative spatial network RSInet and the relative temporal network RTInet. First, we analyze the relative information changes of continuously moving targets from the spatial and temporal perspectives. Then, the relative spatial information is obtained from RSInet and the relative temporal information from RTInet. Finally, the two kinds of relative information are merged to obtain the prediction for the next frame of the target. By acquiring both temporal and spatial information about moving targets, our algorithm effectively improves the accuracy and robustness of tracking; the proposed tracker is therefore more robust to DEF, FM, FOC, and SV. As a result, RTSnet outperforms some SOTA trackers on the LASOT and OTB100 datasets. Meanwhile, RTSnet uses relative processing of consecutive frames and multi-threaded parallel processing, so its tracking speed on the LASOT dataset reaches 117.3 fps.
Author contributions XJ conceived the algorithms, conducted experimental demonstrations, and wrote the paper; ZL wrote the paper; KL wrote the paper; SZ wrote the paper.
Funding The work was supported by the Natural Science Foundation of Guangdong Province (No. 2020A1515010784).
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval This article does not contain any studies with human participants or animals performed by any of the authors.

Fig. 10 Tracking effect comparison of RTSnet, StruckSiam and STRCF in 4 different sequences