Performance of Correlational Filtering and Deep Learning Based Single Target Tracking Algorithms

Visual target tracking is an important research topic in computer vision with a wide range of applications. Deep learning has achieved remarkable results in computer vision and has solved many complex problems that are difficult for traditional algorithms. It is therefore important to review deep learning-based visual target tracking algorithms from different perspectives. This paper follows the general framework of target tracking algorithms and discusses in detail the traditional visual target tracking methods, the mainstream single-target tracking algorithms based on correlation filtering, and the video single-target tracking algorithms based on deep learning. Experiments were conducted on the OTB100 and VOT2018 benchmark datasets, and the experimental data were analysed to identify the two visual single-target tracking algorithms with the best tracking performance. Finally, the future development of tracking algorithms is discussed.


Introduction
Visual target tracking is a fundamental and important research topic in the field of computer vision, which has received a great deal of attention from scholars. Given the state (position and size) of a target in the first frame of a video, the aim is to predict the state of the target in subsequent frames 1,2 . Visual target tracking has wide and deep applications in human-computer interaction, intelligent video surveillance, medical diagnosis, visual navigation, and other fields.
Although visual target tracking has been studied for many years and some progress has been made, it is still difficult to meet practical needs in situations involving scale change, fast motion, deformation, blur, illumination change, occlusion, and background clutter. Many academics have attempted to improve target tracking and overcome these problems 3,4 , which include challenging factors arising from the target itself and from the background, as shown in Figure 1. Often, multiple challenges are faced in a single tracking task, which makes it particularly important to design a robust tracking algorithm that can cope with a variety of complex situations. Wang et al. 1 divided the general framework of a target tracking system into five main parts: motion model, feature extraction, observation model, model update, and integration processing, as shown in Figure 2. The motion model generates the target candidate regions for the current frame; feature extraction computes features on the candidate regions to describe the properties of the target; the observation model determines whether a candidate region contains the target and uses it as the predicted target location; the model update controls the strategy for updating the observation model; and integration processing fuses the outputs of multiple sub-tracking algorithms to obtain the final output (multi-target tracking algorithm).
The literature [5][6][7][8] surveyed visual target tracking algorithms from different perspectives, but because visual target tracking is developing rapidly, especially through technical breakthroughs in deep learning-based tracking, there is still a need for a more focused and comprehensive review of visual single-target tracking algorithms. This paper therefore reviews the research progress of visual single-target tracking methods based on basic deep learning theory, hoping to provide an organized and hierarchical reference for diverse single-target tracking algorithms and valuable ideas for future research work.
The work in this paper is organised as follows: Section 2 introduces traditional target tracking algorithms; Section 3 analyses the mainstream correlation filter-based video target tracking algorithms; Section 4 explores deep learning-based video target tracking algorithms; Section 5 covers the experiments; and Section 6 includes data analysis, results and future directions.

Research on Target Tracking Algorithm Based on Correlation Filtering
Since 2010, correlation filter (CF)-based tracking algorithms have developed rapidly and gained popularity in academia and industry for their excellent performance and fast running speed. Bolme et al. 9 proposed the minimum output sum of squared error (MOSSE) tracking algorithm, which was the first to introduce a correlation filter model into target tracking, finding the best position in subsequent frames by minimizing the sum of squared errors between the actual and desired correlation outputs. In 2012, Henriques et al. 10 proposed the circulant structure of tracking-by-detection with kernels (CSK) algorithm, which uses cyclic shifts to densely sample the data and quickly trains a classifier by means of the fast Fourier transform (FFT). A number of correlation filtering algorithms have followed and built on them with a series of improvements in terms of feature representation, scale estimation, and resolution of boundary effects. Table 1 shows the technical comparison of various mainstream CF tracking algorithms.
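The MOSSE idea described above can be illustrated with a minimal sketch. It assumes NumPy is available, a single grey-scale channel, and no preprocessing (the cosine window, log transform and online update used in practice are omitted). The function names, the Gaussian width `sigma` and the regularisation constant `reg` are illustrative choices, not values taken from the original work.

```python
import numpy as np

def gaussian_response(height, width, sigma=2.0):
    """Desired correlation output: a Gaussian peak at the patch centre."""
    ys, xs = np.mgrid[0:height, 0:width]
    cy, cx = height // 2, width // 2
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))

def train_mosse(patches, target, reg=1e-2):
    """Closed-form filter (kept as its complex conjugate) that minimises
    the summed squared error between each patch's correlation output and
    the desired response. `reg` is an assumed stabilising constant."""
    G = np.fft.fft2(target)
    numerator = np.zeros_like(G)
    denominator = np.zeros_like(G)
    for patch in patches:
        F = np.fft.fft2(patch)
        numerator += G * np.conj(F)
        denominator += F * np.conj(F)
    return numerator / (denominator + reg)

def locate(filter_conj, patch):
    """Correlate in the Fourier domain and return the (row, col) peak."""
    response = np.real(np.fft.ifft2(np.fft.fft2(patch) * filter_conj))
    row, col = np.unravel_index(np.argmax(response), response.shape)
    return int(row), int(col)
```

When the filter is applied to a circularly shifted copy of the training patch, the response peak moves by the same shift, which is how the displacement of the target between frames is read off.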

Feature Improvement
Henriques et al. 10 extended CSK with multichannel features and the kernel method, proposing the Kernel Correlation Filter (KCF) tracking algorithm, in which solving for the correlation filter is formulated as a ridge regression problem. Danelljan et al. 11 mapped the original 3-channel RGB image to 11 channels in the Colour Name (CN) tracking algorithm, processing each channel individually before fusing the results. Because so many channels slow the tracker down, principal component analysis (PCA) is used to reduce the 11 channels to two principal dimensions before this processing. The efficient convolution operators tracker with hand-crafted features (ECO-HC) 12 fuses histogram of oriented gradients (HOG) and colour name (CN) 10 features and achieves good results.
Bertinetto et al. 13 proposed the sum of template and pixel-wise learners (STAPLE) tracking algorithm, in which HOG features and colour histograms, which are complementary, are used to model the appearance of the target. Their response maps are computed independently and then combined by weighted fusion, which yields a better tracking result.
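The weighted fusion of two response maps can be sketched as follows. This is a simplified illustration that assumes both maps have already been computed on the same grid; the function names and the weight of 0.3 given to the histogram response are assumptions made for the example.

```python
def fuse_responses(template_map, histogram_map, histogram_weight=0.3):
    """Blend two equally sized 2-D response maps given as lists of rows."""
    w = histogram_weight
    return [
        [(1.0 - w) * t + w * h for t, h in zip(t_row, h_row)]
        for t_row, h_row in zip(template_map, histogram_map)
    ]

def peak(response):
    """Return the (row, col) index of the largest response value."""
    best_value, best_pos = None, None
    for r, row in enumerate(response):
        for c, value in enumerate(row):
            if best_value is None or value > best_value:
                best_value, best_pos = value, (r, c)
    return best_pos

# Toy maps: the template response slightly prefers (0, 1), the colour
# histogram response clearly prefers (1, 1).
template_map = [[0.1, 0.60, 0.1], [0.1, 0.55, 0.1]]
histogram_map = [[0.1, 0.20, 0.1], [0.1, 0.90, 0.1]]
fused = fuse_responses(template_map, histogram_map)
```

In this toy case the template response alone peaks at (0, 1), while the fused map peaks at (1, 1), showing how a complementary cue can correct an ambiguous template response.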
Deep learning has achieved unprecedented results in the field of computer vision. In recent years, it has also been introduced into target tracking, where deep features are used to improve tracking performance within the correlation filtering framework.

Scale Improvement
Danelljan et al. 14 proposed the discriminative scale space tracker (DSST) algorithm, which treats target tracking as two separate problems: translation of the target centre and scale change. HOG features are used to train a translation filter and a scale filter. The translation filter is used to obtain the target centre position, while the scale filter is used to compute a response at each candidate scale; the scale with the maximum response is taken as the best scale. To better cope with scale variation, 33 scales were used, and this scaling method is also followed in the subsequent paper of Danelljan et al. 14 .
To better cope with scale changes, Li et al. 15 proposed the scale adaptive with multiple features tracker (SAMF) algorithm, which extracts HOG and CN features from the candidate region at seven scales. It thus detects target translation and scale changes at the same time and determines the location and scale of the target quickly.
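The scale search can be sketched as a geometric set of candidate scale factors followed by an argmax over the scale responses. The 33 scales come from the text; the step of 1.02 between neighbouring scales and the function names are assumptions made for this illustration, and the responses are toy values rather than the output of a trained scale filter.

```python
def scale_factors(num_scales=33, step=1.02):
    """Geometric set of scale factors centred on 1.0 (no size change)."""
    half = (num_scales - 1) // 2
    return [step ** n for n in range(-half, half + 1)]

def best_scale(scale_responses, factors):
    """Pick the factor whose scale response is largest."""
    index = max(range(len(factors)), key=lambda i: scale_responses[i])
    return factors[index]

factors = scale_factors()
# Toy responses peaking two steps above the current scale.
responses = [1.0 / (1 + abs(i - 18)) for i in range(len(factors))]
new_scale = best_scale(responses, factors)  # equals 1.02 ** 2
```

The target size from the previous frame would then be multiplied by `new_scale` before the next translation search.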

Handling Boundary Effects
Danelljan et al. 16 proposed the spatially regularized discriminative correlation filter (SRDCF) tracking algorithm to suppress the boundary effect by learning the correlation filter with larger spatial support in the detection phase. It maintains an extensive search range to better cope with the fast motion of the target.
Galoogahi et al. proposed the MOSSE-based correlation filters with limited boundaries (CFLB) tracking algorithm 16 and the HOG-based background-aware correlation filters (BACF) tracking algorithm 17,18 , which mitigate boundary effects more effectively by using larger detection image blocks and smaller filters to increase the proportion of real samples.
The Unveiling the Power of Deep Tracking (UPDT) algorithm follows ECO in using a Gaussian distribution to extract positive samples, and also separates deep and shallow features. Experiments found that different features should be used with different variances. The influence of deep and shallow features on target tracking was systematically analysed, and it was found that the deep model should be responsible for robustness while the shallow model should be responsible for accurate localization. A novel feature fusion strategy was then proposed. Danelljan et al. 19 proposed the ATOM tracking method, which has a novel architecture consisting of specialized target estimation and classification components. An online-trained classifier and an offline-trained estimation network jointly solve the target tracking problem, giving a two-stage tracking framework that is very similar to detection.
The Probabilistic Regression for Visual Tracking (PrDiMP) method 20 introduces meta-learning to carry the information of the first frame into later frames: the first-frame information provides the weights for the online update model used in later frames, where the online update model comprises the two heads for position prediction and bounding box prediction. Tracking is treated as a regression problem, and a conditional probability model is used to predict the target position in the next frame from the information of the previous frame.
[Table 1: technical comparison of mainstream CF tracking algorithms. Recoverable feature entries: SAMF 15 , raw pixels/HOG/CN; KCF 10 , raw pixels/HOG.]

Research on Target Tracking Algorithm Based on Deep Learning (DL)
Deep learning-based target tracking algorithms can be divided into deep feature-based target tracking algorithms, Siamese network-based target tracking algorithms, recurrent neural network (RNN)-based target tracking algorithms, generative adversarial network (GAN)-based target tracking algorithms and other specific network-based target tracking algorithms. Table 2 shows the technical comparison of various mainstream DL tracking algorithms.

Deep Feature-Based Target Tracking Algorithms
In deep feature-based target tracking algorithms, scholars have replaced the traditional features with deep features within existing target tracking frameworks 21 . In 2015, Danelljan et al. 22 proposed DeepSRDCF, an improved algorithm that uses VGGNet 23 for feature extraction within the SRDCF framework; it achieved better results, and the authors also explored the effect of features from different layers of convolutional neural networks on tracking accuracy. In 2016, Danelljan et al. also proposed the C-COT tracking algorithm, which uses VGGNet 23,24 to extract multi-resolution features, interpolates them into the continuous domain, and trains continuous correlation filters; it achieved excellent performance in the VOT2016 challenge. In 2017, Danelljan et al. 12 proposed the ECO algorithm on the basis of C-COT. ECO combines convolutional features, HOG features and CN features, reduces the dimensionality of the features through a factorized convolution operator, and reduces the number of training samples in the learning model to improve tracking speed and robustness.
Ma et al. 25 proposed the Hierarchical Convolutional Features for Visual Tracking (HCF) algorithm, which uses three correlation filters. Since the deeper layers provide semantic information and the shallower layers provide texture information, the correlation filters are applied in order from deep to shallow to determine the target location from coarse to fine.

Siamese Network-Based Target Tracking Algorithm
Scholars have proposed Siamese network-based target tracking algorithms to overcome the low speed caused by using pre-trained networks as feature extractors. With their higher speed and better tracking performance, Siamese networks have received much interest in target tracking.
Held et al. 26 proposed GOTURN in 2016. GOTURN introduced Siamese networks to target tracking and employed an offline-trained feedforward network in which image patches from the current and previous frames are fed into a convolutional neural network for feature extraction, and the resulting features are concatenated and passed to fully connected layers. These layers compare the target and current-frame information to determine the target's location offset; in effect, the fully connected layers learn a complicated feature comparison function and output the target motion.
Tao et al. 27 proposed the Siamese Instance Search for Tracking (SINT) algorithm based on Siamese networks. SINT trains a matching function offline on a large amount of video data, uses it to match the given target in the initial frame against candidate targets in the next frame, and returns the most similar candidate.
Bertinetto et al. 28 introduced Fully-Convolutional Siamese Networks for Object Tracking (SiamFC), which implements a fully convolutional Siamese network architecture and uses AlexNet as the backbone network to extract features from the template and search images. The feature map of the template image is cross-correlated with the feature map of the search image to create the response map. Figure 3 shows SiamFC's tracking architecture. Valmadre et al. 29 improved on SiamFC to obtain the end-to-end representation learning for Correlation Filter based tracking (CFNet) algorithm. CFNet integrates correlation filtering into the network as a layer and adds it to the template branch to update the template model, thus making the Siamese network more robust to appearance changes.
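The step that turns the two feature maps into a response map can be illustrated with a minimal single-channel sketch. It assumes the feature maps have already been extracted and are given as plain 2-D lists; in a real network the same sliding inner product is summed over all feature channels. The function name and the toy values are illustrative only.

```python
def cross_correlate(search, template):
    """Slide `template` over `search` (both 2-D lists) and return the
    map of inner products, one value per valid placement."""
    th, tw = len(template), len(template[0])
    sh, sw = len(search), len(search[0])
    response = []
    for r in range(sh - th + 1):
        row = []
        for c in range(sw - tw + 1):
            row.append(sum(
                search[r + i][c + j] * template[i][j]
                for i in range(th) for j in range(tw)
            ))
        response.append(row)
    return response

template = [[1, 2], [3, 4]]
# A 4x4 search map that contains the template pattern at row 1, column 2.
search = [
    [0, 0, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 3, 4],
    [0, 0, 0, 0],
]
response = cross_correlate(search, template)
```

The largest value of `response` lies at row 1, column 2, which is where the template pattern sits in the search map; the tracker takes this peak as the new target position.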
Li et al. 30 introduced the Region Proposal Network (RPN) 31 to Siamese network target tracking and proposed the SiamRPN tracking algorithm. SiamRPN is first trained end-to-end offline using large-scale image data. In the tracking phase, the tracking task can be viewed as a one-shot detection task that directly regresses the target to be tracked without the need for scale estimation, greatly increasing the runtime speed. SiamRPN++ 32 presents a depth-wise cross-correlation design that reduces computation without sacrificing accuracy. SiamRPN adds bounding box regression but is limited to short-term tracking.
The Recurrently Optimizing Tracking Model (ROAM) 33 algorithm provides a tracking model with a resizable response generator and a bounding box regressor. Only one anchor size is utilized for each spatial location, and its convolution filter can adapt to shape changes through bilinear interpolation. A recurrent neural optimizer trained by meta-learning speeds up convergence of the updated tracking model.
The Siamese Fully Convolutional Classification and Regression for Visual Tracking (SiamCAR) 34 algorithm uses an anchor-free approach in which the network's regression output is a feature map that encodes, at each location, the distances to the four edges of the bounding box. Classification and centre-ness score maps are used to determine the best target centre point, and the distances from that point to the four edges of the box determine the predicted tracking box.
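The anchor-free decoding step can be sketched as follows. This is a simplified illustration: it assumes the best location is the one that maximises the product of classification score and centre-ness, and that the regression output at that location gives the distances to the left, top, right and bottom box edges in image coordinates. The function names and toy values are hypothetical.

```python
def best_location(cls_scores, centerness):
    """Pick the (row, col) maximising classification score x centre-ness."""
    best = None
    for r, (cls_row, ctr_row) in enumerate(zip(cls_scores, centerness)):
        for c, (score, quality) in enumerate(zip(cls_row, ctr_row)):
            if best is None or score * quality > best[0]:
                best = (score * quality, r, c)
    return best[1], best[2]

def decode_box(center_x, center_y, left, top, right, bottom):
    """Turn a centre point and its distances to the four box edges
    into corner coordinates (x_min, y_min, x_max, y_max)."""
    return (center_x - left, center_y - top,
            center_x + right, center_y + bottom)

cls_scores = [[0.9, 0.8], [0.2, 0.1]]
centerness = [[0.3, 0.9], [0.5, 0.5]]
row, col = best_location(cls_scores, centerness)
box = decode_box(50, 40, left=10, top=5, right=12, bottom=8)
```

Here the location with the highest raw classification score, (0, 0), is rejected in favour of (0, 1) because its centre-ness is low, and the decoded box is (40, 35, 62, 48).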
The Siamese Box Adaptive Network for Visual Tracking (SiamBAN) 35 is built on the Siamese network architecture and uses an anchor-free method, which gives the bounding box greater flexibility. Removing the predetermined anchors reduces the number of model parameters and speeds up the tracker. Dilated convolution enlarges the receptive field and improves tracking performance.

Recurrent Neural Network-Based Target Tracking Algorithm
Visual tracking is strongly tied to the spatial and temporal information of video frames, hence recurrent neural networks are increasingly used in target tracking.
The Structure-Aware Network for Visual Tracking (SANet) 36 is based on recurrent neural networks. SANet employs RNNs to encode the structure of targets during learning, which improves its ability to identify the target and to recognise distractors. To supply richer information to the network, a skip connection scheme fuses the CNN and RNN features, and experiments confirm the algorithm's superior tracking performance.
Yang et al. 37 proposed Learning Dynamic Memory Networks for Object Tracking (MemTrack). MemTrack is a dynamic memory network for visual tracking. The external memory is managed by a long short-term memory (LSTM) network with an attention mechanism in order to adapt to changes in target appearance. Gated residual template learning generates the final matching template and prevents excessive model updating.
The SiamR-CNN 38 algorithm uses a hard example mining strategy to discriminate distractors and designs a tracklet dynamic programming algorithm (TDPA), by which all object candidate boxes from the previous frame are re-detected and grouped into short tracklets over time, thus tracking all potential objects, including distractors, simultaneously. The best target is then selected at the current time step by dynamic programming over the complete history of all target and distractor tracklets. As a result, the algorithm is computationally intensive and cannot run in real time.

Generative Adversarial Network-Based Target Tracking Algorithms
Generative Adversarial Networks (GANs) have been extensively employed in various research domains to capture statistical distributions and generate training samples with little or no labelled input.
Song et al. 39 applied GANs to target tracking and created an adversarial learning-based approach (VITAL). VITAL employs a generative network to randomly generate masks that adaptively drop out certain input features in order to augment the positive samples. Through adversarial learning, the network identifies the masks that retain the most robust features of the target object over time. VITAL also introduces a high-order cost-sensitive loss to lessen the influence of easily distinguished negative samples during network training.

Target Tracking Algorithms Based on Other Specific Networks
Some researchers have designed dedicated networks for target tracking. Nam et al. 40 proposed the Multi-Domain Convolutional Neural Network (MDNet) tracking algorithm. MDNet is pre-trained on a number of tracking videos to obtain a generic target representation. Each domain corresponds to a training sequence, and the shared layers learn the generic target representation during training. When a new sequence is tracked, only the domain-specific layers of MDNet are updated online, allowing the network to adapt to the current tracking environment.
Chen et al. 41 introduced a novel Transformer tracking system consisting of feature extraction, Transformer-like feature fusion, and prediction head modules. The Transformer-like fusion combines template and search region features without using correlation. The feature fusion network is built from self-context enhancement and cross-feature enhancement modules, which focus on important information such as edges and similar targets and build associations between distant features, improving the classification and regression results.

Experimental
This section gives experimental data on the performance of the two types of target tracking algorithms discussed in Sections 3 and 4 on the OTB100 and VOT2018 benchmark datasets. Table 3 gives the details of some common single-target tracking benchmark datasets.

Evaluation Methods for Single Target Tracking
To promote the development of the target tracking field, scholars have summarized and generalized the evaluation criteria for target tracking algorithms, so that the performance of different tracking algorithms is evaluated both qualitatively and quantitatively. For quantitative analysis, three kinds of evaluation criteria are commonly used: traditional evaluation methods, Visual Object Tracking (VOT) evaluation methods 42,43 and Online Object Tracking Benchmark (OTB) evaluation methods 3,4 .

Traditional Evaluation Methods
The traditional evaluation methods include two metrics, central location error (CLE) and overlap ratio (OR) 44 . CLE is the distance between the centres of the predicted and ground-truth bounding boxes, and OR is the ratio of the intersection to the union of the two boxes. The smaller the CLE value, the higher the accuracy of the algorithm. The larger the OR value, the better the tracking performance of the algorithm. In general, the average overlap ratio over a sequence reflects tracking accuracy more faithfully.
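The two metrics can be written down directly. The sketch below assumes axis-aligned boxes given as (x, y, width, height) with (x, y) the top-left corner, takes CLE as the Euclidean distance between box centres, and takes OR as intersection over union; the function names are illustrative.

```python
import math

def center_location_error(box_a, box_b):
    """Euclidean distance between the centres of two (x, y, w, h) boxes."""
    ax, ay = box_a[0] + box_a[2] / 2.0, box_a[1] + box_a[3] / 2.0
    bx, by = box_b[0] + box_b[2] / 2.0, box_b[1] + box_b[3] / 2.0
    return math.hypot(ax - bx, ay - by)

def overlap_ratio(box_a, box_b):
    """Intersection area divided by union area of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

predicted = (5, 0, 10, 10)
ground_truth = (0, 0, 10, 10)
cle = center_location_error(predicted, ground_truth)
overlap = overlap_ratio(predicted, ground_truth)
```

For these two boxes the centres are 5 pixels apart, and the overlap ratio is 50 / 150, i.e. one third.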

OTB Evaluation Methods
Building on a review of prior work, Wu et al. 3,4 presented the target tracking benchmark OTB for assessing the performance of single-target tracking algorithms. The OTB database originally had 50 video sequences; OTB not only offers evaluation metrics for testing target tracking systems, but also includes well-labelled, challenging video sequences. The benchmark also provides an evaluation toolkit with MATLAB and Python versions, whose function interface is straightforward and easy to use, so it is frequently used. OTB evaluates the performance of a tracking algorithm using the precision rate (PR), which is based on the centre position error, and the success rate (SR), which is based on the overlap rate.
The success rate plot of an algorithm is obtained from its success rate under different overlap thresholds. The area under the curve (AUC) of the success rate plot is used to rank different tracking algorithms and compare their advantages and disadvantages. OTB proposes three evaluation protocols: one pass evaluation (OPE), temporal robustness evaluation (TRE), and spatial robustness evaluation (SRE). Each protocol reports the PR and SR of a different test.
The larger the value, the better the tracking accuracy and tracking performance.
The OTB evaluation method makes it easy to evaluate a target tracking algorithm, using metrics such as the precision rate and success rate to analyse its performance and to compare different algorithms.
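The precision rate and the AUC of the success rate plot can be sketched from per-frame results. The example assumes a list of per-frame overlap ratios and centre errors is already available, uses 21 evenly spaced overlap thresholds in [0, 1], and uses a 20-pixel centre error threshold for the precision rate; these settings and the function names are assumptions for illustration rather than a reproduction of the official toolkit.

```python
def success_rate(overlaps, threshold):
    """Fraction of frames whose overlap ratio exceeds the threshold."""
    return sum(o > threshold for o in overlaps) / len(overlaps)

def success_auc(overlaps, num_thresholds=21):
    """Mean success rate over evenly spaced thresholds in [0, 1],
    used as the area under the success rate plot."""
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    return sum(success_rate(overlaps, t) for t in thresholds) / num_thresholds

def precision_rate(center_errors, threshold=20.0):
    """Fraction of frames whose centre error is within the threshold."""
    return sum(e <= threshold for e in center_errors) / len(center_errors)

overlaps = [0.8, 0.6, 0.3, 0.0]         # toy per-frame overlap ratios
center_errors = [5.0, 15.0, 25.0, 40.0]  # toy per-frame errors in pixels
auc = success_auc(overlaps)
pr = precision_rate(center_errors)
```

With these toy values, half of the frames fall within 20 pixels, so `pr` is 0.5, and `success_rate(overlaps, 0.5)` is also 0.5 because two of the four frames overlap the ground truth by more than one half.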

VOT Evaluation Methods
Since 2013, VOT has been an annual target tracking competition 43,44,46 that is typically held as a workshop at the IEEE International Conference on Computer Vision (ICCV) and the European Conference on Computer Vision (ECCV). The number of test video sequences in VOT has climbed from 16 in 2013 to 60 currently, and the complexity of the video sequences has steadily increased. Because VOT provides the evaluation criteria required to assess tracking algorithms, a large number of manually labelled test videos, open-source evaluation toolkits, and the test results of many tracking algorithms, the VOT evaluation method has been widely adopted in the field of target tracking.
Starting from VOT2016, three key metrics to evaluate the performance of target tracking algorithms are used in VOT: accuracy (A), robustness (R), and expected average overlap (EAO). The larger the accuracy value, the higher the tracking accuracy. The smaller the robustness value, the better the tracking performance. The larger the EAO value, the higher the target tracking accuracy.
In addition to the above experimental datasets, a number of others have emerged in recent years, such as UAV123 and LaSOT 45 , as shown in Table 3.

Experimental Data of Target Tracking Algorithm
To obtain an accurate picture of the performance of classical single-target tracking algorithms, we tested them on a high-performance computer with an Intel i7-12700H CPU, a GeForce RTX 3070 GPU and 32 GB of RAM, using the OTB100 and VOT2018 datasets. On OTB100, two metrics, PR and AUC, were used to measure the performance of the algorithms: PR is the precision rate based on the centre position error, and AUC is the area under the curve of the success rate plot, which is used to rank and compare the different tracking algorithms. Both values were taken as the average over the 11 attributes in OTB100, and higher values represent better performance. On VOT2018, three metrics, A, R, and EAO, were used to compare the performance of the algorithms. A stands for accuracy: for each frame of the test video, the overlap between the target bounding box predicted by the tracking algorithm and the manually annotated target bounding box is calculated, and the performance of the algorithm is measured by this degree of overlap. The higher the overlap rate, the better the accuracy of the target tracking algorithm. R stands for robustness: a tracking algorithm may not get through a test video in a single run and may need several re-initializations, so the number of re-initializations is used to characterize its robustness. The lower the number of re-initializations, the better the robustness of the algorithm. EAO stands for expected average overlap; the larger the EAO value, the higher the accuracy of the target tracking algorithm. All algorithm codes were obtained from the official download sources published by the algorithm authors, and the speeds of the algorithms are taken from the officially published data.
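The accuracy and robustness measures described above can be illustrated with a deliberately simplified sketch. It assumes a list of per-frame overlap ratios in which an overlap of zero marks a tracking failure that triggers a re-initialization; accuracy is taken as the mean overlap over the remaining frames and robustness as the failure count. The official protocol has further details (for example, how frames after a re-initialization are handled, and how EAO is computed) that are not modelled here, and the function name is hypothetical.

```python
def accuracy_and_failures(overlaps, failure_threshold=0.0):
    """Return (mean overlap over successfully tracked frames,
    number of frames counted as failures)."""
    failures = 0
    tracked = []
    for overlap in overlaps:
        if overlap <= failure_threshold:
            failures += 1          # tracker lost the target here
        else:
            tracked.append(overlap)
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures

# Toy sequence with one failure in the third frame.
accuracy, failures = accuracy_and_failures([0.8, 0.6, 0.0, 0.7, 0.5])
```

For this toy sequence the tracker fails once and its mean overlap over the four tracked frames is 0.65.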

Results and Discussion

Experimental Data Analysis
From Table 4, Figure 4 and Figure 5, it can be concluded that, from the perspective of features, the discriminative tracking approach converts the tracking problem into a detection problem, so good features are the key factor for such tracking. The success rate and accuracy results given in Table 5 show that HOG and CN features perform excellently in visual tracking, and many later methods combine deep features in different ways to construct tracking frameworks that also perform well. The biggest advantage of correlation filtering-based tracking algorithms is their speed.
As can be seen from Table 5, Figure 6 and Figure 7, the MDNet algorithm, which is trained offline on video data and updated online, and VITAL, an improved MDNet-based algorithm, achieved good tracking accuracy, but their speed was unsatisfactory and did not meet real-time requirements. The algorithms SiamFC, DSiam and SINT, which are based on the Siamese network framework, also achieved relatively good rankings. C-COT uses VGG-Net to extract deep features, taking the original colour image and the outputs of two convolutional layers as features, which significantly improves accuracy compared with similar algorithms. However, the large number of features seriously reduces computational efficiency and makes it difficult to meet real-time requirements. On the basis of C-COT, ECO reduces the feature dimensions of HOG, CN and CNN by a factorization operation: HOG is compressed to 10 dimensions, CN to 3, and the 1st and 5th convolutional layers of the CNN to 16 and 64, respectively, which reduces the number of training parameters and thus effectively reduces the computational complexity. ECO's tracking performance is very high in the experimental data for each dataset.
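The effect of this compression can be made concrete with a small calculation. The compressed dimensions below are those given in the text; the original dimensions (96 and 512 channels for the two convolutional layers, 31 for HOG) are assumptions made for this illustration, while the 11 CN channels follow the earlier description of the CN tracker.

```python
# Channels per feature type before factorisation (assumed values,
# except CN, which the text describes as 11 channels).
original_dims = {"conv1": 96, "conv5": 512, "HOG": 31, "CN": 11}
# Channels per feature type after factorisation (from the text).
compressed_dims = {"conv1": 16, "conv5": 64, "HOG": 10, "CN": 3}

def reduction_ratio(before, after):
    """Total number of channels before divided by the total after."""
    return sum(before.values()) / sum(after.values())

total_before = sum(original_dims.values())    # 650 under these assumptions
total_after = sum(compressed_dims.values())   # 93
ratio = reduction_ratio(original_dims, compressed_dims)
```

Under these assumptions the number of filter channels to be learned drops from 650 to 93, a reduction of roughly a factor of seven.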

Results
The results show that ECO is the best performing CF algorithm in terms of tracking accuracy, tracking speed, and other aspects of performance.
Among the DL algorithms, those based on the Siamese network structure alone are not outstanding in every aspect, but they are the best in terms of stability. In particular, when combined with the lightweight network model SANet 35 , they deliver the best overall performance in aspects such as tracking accuracy and tracking speed.

The Future Direction of Development
Future work could proceed in two main directions. The first is how to balance tracking performance against real-time operation, chiefly the balance between tracking accuracy and tracking speed. If an algorithm is accurate but cannot run in real time, it cannot be turned into a product.
The second covers visual saliency, attention mechanisms and the integration of the various modules of target tracking: weakening the background to highlight the foreground, guiding the tracker to focus on useful information, and combining correlation filtering with Siamese networks all leave room for researchers to explore.

Conclusion
This paper focuses closely on the visual target tracking framework. First, the traditional visual target tracking algorithms were analysed. Then, the mainstream video target tracking algorithms based on correlation filtering were analysed from three aspects: feature improvement, scale improvement, and handling of boundary effects. Next, the video target tracking algorithms based on deep learning were discussed in detail; they were divided into five major categories, and each type of algorithm was analysed in terms of research motivation, algorithmic ideas, research framework, advantages and disadvantages. Finally, the tracking algorithms analysed above were evaluated on the OTB100 and VOT2018 benchmark datasets, and the experimental data obtained were compared to draw the authors' conclusions on visual single-target tracking algorithms and to point out future development trends in the field of video target tracking.

Conflict of Interest
No potential conflict of interest was reported by the authors.

Acknowledgment
ZhongMing Liao acknowledges the support of the InnoSTRE 2022 conference organising committee.

Funding
No funding sources.