Article

Transformer-Based Visual Object Tracking with Global Feature Enhancement

by Shuai Wang, Genwen Fang, Lei Liu, Jun Wang, Kongfen Zhu and Silas N. Melo
1 School of Computer Science and Technology, Anhui Engineering Research Center for Intelligent Computing and Application on Cognitive Behavior (ICACB), Huaibei Normal University, Huaibei 235000, China
2 College of Electronic and Information Engineering, Hebei University, Baoding 071000, China
3 Department of Geography, Universidade Estadual do Maranhão, São Luís 65055-000, Brazil
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(23), 12712; https://doi.org/10.3390/app132312712
Submission received: 27 October 2023 / Revised: 19 November 2023 / Accepted: 21 November 2023 / Published: 27 November 2023

Abstract

With the rise of general models, transformers have been adopted in visual object tracking algorithms as feature fusion networks. In these trackers, self-attention is used for global feature enhancement, and cross-attention is applied to fuse the features of the template and the search region to capture the global information of the object. However, studies have found that the features fused by cross-attention do not pay enough attention to the object region. In order to strengthen cross-attention in the object region, an enhanced cross-attention (ECA) module is proposed for global feature enhancement. By calculating the average attention score for each position in the fused feature sequence and assigning higher weights to the positions with higher attention scores, the proposed ECA module can improve the feature information in the object region and further enhance the matching accuracy. In addition, to reduce the computational complexity of self-attention, orthogonal random features are introduced to implement a fast attention operation, which decomposes the attention matrix into the product of random non-linear functions of the original query and key. This module can reduce the spatial complexity and improve the inference speed by avoiding the explicit construction of a quadratic attention matrix. Finally, a tracking method named GFETrack is proposed, which comprises a Siamese backbone network and an enhanced attention mechanism. Experimental results show that the proposed GFETrack achieves competitive results on four challenging datasets.

1. Introduction

Single-object tracking is pivotal in computer vision [1,2,3]; it is tasked with estimating an object's position and motion trajectory across frames. Challenges such as size variations, occlusions, and deformations remain persistent issues [4,5].
In the field of computer vision, advancements have been made in CNN-based tracking algorithms, such as Siamese networks [6,7,8], particularly in addressing challenges like size variations, occlusions, and deformations. SiamFC [6] is a classic example that employs cross-correlation operations to compute a score map, determining the target's bounding box by convolving the feature maps of the template and search regions. However, because the score map differs significantly from the feature map, this approach overlooks the semantic information of the target and hinders the subsequent classification and regression operations. To solve this problem, many trackers [9,10,11,12] adopt transformers for feature fusion. Mixformer [10] proposes a mixed attention module (MAM), which applies the attention mechanism to simultaneously extract features and exchange information. OSTrack [11] integrates the target information more widely into the search area to better capture their correlation. In [12], the learning and memory capabilities of the transformer are utilized by encoding the information required for tracking into multiple tokens. Through cross-attention operations in the cross-feature augment (CFA) module, TransT [9] obtains fused features containing rich semantic information from the template and search regions; these fused features can be directly used for the subsequent classification and regression operations to estimate the state of the target. By introducing transformer operations [13], the proposed GFETrack, like TransT, replaces the traditional cross-correlation operations. While the attention score maps of TransT are relatively dispersed, as shown in Figure 1a, those of the proposed GFETrack are more concentrated, as shown in Figure 1b. Considering the limitations of TransT's CFA module in extracting target features and position information, the proposed GFETrack with global feature enhancement can focus more on the object itself, making it more suitable for the subsequent regression operations.
With the rise of the Vision Transformer [14,15], self-attention techniques have shown promise in visual tasks. However, their quadratic computational complexity has prompted research into mitigating this challenge. Sparse global attention patterns [16,17] and smaller attention windows [18,19] can reduce costs, but at the risk of overlooking information or sacrificing long-term dependency modeling. In Fast re-OBJ [20], the authors efficiently leverage the intermediate outputs of the instance segmentation backbone (ISB) for triplet-based training, avoiding redundant feature extraction in both the ISB and the embedding generation module (EGM). In contrast, our method employs a distinct strategy. We first pass the Q and K matrices through a feature map $\varphi$ to create new matrices $\tilde{Q}$ and $\tilde{K}$, such that the attention matrix A is equal, or approximately equal, to the product of $\tilde{Q}$ and $\tilde{K}^{T}$. This offers computational advantages: after decomposing the matrix A, the product $\tilde{K}^{T}V$ is computed first and then multiplied by $\tilde{Q}$, reducing the spatial complexity of regular attention from $O(N^2 + Nd)$ to $O(Nr + Nd + rd)$. The main contributions of this paper are as follows:
  • An enhanced cross-attention module is proposed to boost object region feature information in cross-attention. It is primarily achieved by computing the average attention weights and giving higher weights to the highly ranked positions to enhance matching accuracy.
  • Orthogonal random features are introduced to implement a fast self-attention operation. The attention matrix is decomposed into the product of random non-linear functions of the original query and key, which effectively reduces the model's spatial complexity and enhances the inference speed.
  • On publicly available datasets, such as OTB, VOT, LaSOT, etc., the proposed algorithm has demonstrated improvements in tracking accuracy and success rates compared to the baseline algorithm. In comparison to some other state-of-the-art methods, the proposed algorithm has achieved comparable results.
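To make the reordering described above concrete, the following toy PyTorch sketch, our own example rather than the authors' code, compares the two evaluation orders of $\tilde{Q}\tilde{K}^{T}V$: computing $\tilde{K}^{T}V$ first gives the same result while never materializing the $N \times N$ matrix.

```python
import torch

# Toy comparison of the two evaluation orders of Q~ K~^T V (double precision so
# the two results match to high accuracy).
N, d, r = 1024, 128, 8                     # sequence length, feature dim, feature rank
Q_t = torch.rand(N, r, dtype=torch.double)
K_t = torch.rand(N, r, dtype=torch.double)
V = torch.rand(N, d, dtype=torch.double)

slow = (Q_t @ K_t.T) @ V                   # materializes an N x N matrix: O(N^2 + Nd) memory
fast = Q_t @ (K_t.T @ V)                   # only r x d and N x d intermediates: O(Nr + Nd + rd)

print(torch.allclose(slow, fast))          # True: same result, lower complexity
```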

2. Related Work

In recent years, convolutional networks have been widely used in computer vision tasks [21], including object detection, segmentation, and tracking. SiamFC [6] was the first to apply Siamese networks for visual object tracking; it predicts the object state by calculating the similarity between the current frame and the template. SiamFC++ [8] is an improved version of SiamFC that introduces a new network architecture and training method, enabling object tracking in different scenarios. SiamMask [22] uses Siamese networks for object tracking and segmentation simultaneously. SA-Siam [23] utilizes feature mapping to extract spatial information, leading to more accurate and robust object tracking. SAT [7] employs a deep Siamese network to assess the similarity between objects in each frame, offering a novel solution for evaluating and addressing the association challenges in consecutive frames. However, the local matching strategy, which is based on cross-correlation, could lead to sub-optimal results, especially when the object is occluded or partially visible. Moreover, semantic information about the object may be lost during the correlation operation, resulting in imprecise object boundaries. Thus, in this work, an improved transformer and attention mechanism are proposed to replace the traditional correlation operations in object tracking; they can effectively extract global contextual information while preserving the object's semantic information, resulting in more robust and accurate tracking.
The transformer is a popular neural network architecture that is widely used in natural language processing (NLP) tasks [13]. It consists of attention-based encoders and decoders. Self-attention, which is the main module of a transformer, computes representations of the input sequence: it allows each position in the input sequence to attend to all other positions and to calculate a weighted average of their values. DETR [24] and ViT [14] are early methods that introduced transformer models into the field of computer vision. In the field of visual object tracking, transformer-based tracking methods achieve significant improvements compared to Siamese network-based trackers. The correlation operation of the Siamese network is replaced by a self-attention module from the transformer in TransT [9] to fuse the information between the template and the search region. Stark [25] applies an encoder–decoder structure to tracking, where the encoder models the global spatiotemporal feature dependencies between the object and the search region, while the decoder learns embedded queries to predict the spatial location of the object. SwinTrack [26] introduces a fully attention-based transformer algorithm for feature extraction and feature fusion, enabling complete interaction between the object and the search region during tracking. The encoder–decoder framework based on transformers is widely applied to sequence prediction tasks, as self-attention models the interaction between elements of a sequence. Based on these works, we find that traditional transformer methods struggle to distinguish the object to be tracked from similar interfering objects. To enhance the object information within the fused features, the ECA module is proposed to boost the attention on the tracked object. The ECA module calculates the average attention level for each position in the fused feature sequence and reweights the positions with higher attention levels, which enhances the feature information in the object region and improves the matching accuracy.
In recent years, a series of new attention mechanisms, such as fast attention [27], have garnered research interest. Sparse attention [28] reduces computational costs by constraining attention weights to consider only relationships within a neighborhood. Shaw et al. [29] proposed an attention mechanism that embeds relative positional information into the attention calculation, allowing the model to better handle dependencies in long sequences; this mechanism has shown excellent performance in tasks involving long sequences, such as text generation. In this work, orthogonal random features are introduced to achieve fast attention operations. They decompose the attention matrix into the product of random non-linear functions of the original query and key, avoiding the explicit construction of a quadratic-sized attention matrix.

3. The Proposed Method

A transformer-based tracker with global feature enhancement, named GFETrack, is proposed. The pipeline of GFETrack is shown in Figure 2.
GFETrack mainly consists of three components: the backbone network, the attention-enhanced fusion network, and the prediction head network. Similar to a Siamese network, the backbone network can extract features from both the template and the search region by sharing weights. Then, the attention-enhanced fusion network is proposed to enhance and fuse these features. Finally, the prediction head estimates the states of the objects and obtains the final results by binary classification and bounding box regression. Firstly, detailed information about each component of GFETrack is presented in this section. Then, two crucial modules, fast self-attention (FSA) and enhanced cross-attention (ECA), are proposed and presented in the attention-enhancement fusion network. Finally, some training details are described.
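As an orientation aid, the following sketch outlines the forward pass implied by Figure 2; the module names and interfaces are our assumptions rather than the authors' released implementation.

```python
import torch.nn as nn

class GFETrackSketch(nn.Module):
    """High-level sketch of the pipeline in Figure 2 (module names and interfaces
    are our assumptions, not the released implementation)."""
    def __init__(self, backbone, fusion, cls_head, reg_head):
        super().__init__()
        self.backbone = backbone      # shared-weight (Siamese) feature extractor
        self.fusion = fusion          # attention-enhanced fusion network (FSA + ECA)
        self.cls_head = cls_head      # foreground/background classification MLP
        self.reg_head = reg_head      # bounding-box regression MLP

    def forward(self, template, search):
        f_t = self.backbone(template)            # template features
        f_s = self.backbone(search)              # search-region features (same weights)
        fused = self.fusion(f_t, f_s)            # enhanced and fused features
        return self.cls_head(fused), self.reg_head(fused)
```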

3.1. Overall Architecture

The overall architecture of the proposed GFETrack can be mainly divided into three parts: the feature extraction backbone network, the attention enhancement feature fusion network, and the prediction head network. The attention enhancement feature fusion network comprises the FSA module and the ECA module.

3.2. Feature Extraction Backbone Network

Following Siamese-based trackers [6,8], we take the template image and the search region cropped from annotated initial frames as the inputs of the feature extraction network, commonly referred to as the backbone network. In the initial frame, a template patch $T \in \mathbb{R}^{H_{T_0} \times W_{T_0} \times 3}$, centered on the object's coordinates and with side lengths twice those of the object, captures both the object's appearance and its immediate surroundings. Simultaneously, the search region $S \in \mathbb{R}^{H_{S_0} \times W_{S_0} \times 3}$ is cropped around the object's center coordinates in the previous frame, with side lengths four times those of the object to cover its potential movement range. These patches are reshaped into a square format and fed into the backbone network for feature extraction. The final feature map is derived from the fourth stage of the ResNet-50 network [21]. Notably, we remove the last stage of ResNet-50 and adopt the outputs of the fourth stage as the final outputs. By changing the convolution stride of the downsampling unit in the fourth stage from 2 to 1, we obtain a larger feature resolution. Additionally, to increase the receptive field, we replace the 3 × 3 convolutions in the fourth stage with dilated convolutions with a dilation rate of 2, inspired by [9]. In general, the backbone network extracts the feature maps of the template, $F_T \in \mathbb{R}^{H_T \times W_T \times C}$, and of the search region, $F_S \in \mathbb{R}^{H_S \times W_S \times C}$, where $H_T = \frac{H_{T_0}}{8}$, $W_T = \frac{W_{T_0}}{8}$, $H_S = \frac{H_{S_0}}{8}$, $W_S = \frac{W_{S_0}}{8}$, and $C = 1024$.
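One way to realize these backbone modifications with torchvision is sketched below; it reflects our reading of this paragraph (dilation rate 2 in the fourth stage, output stride 8), not the authors' released code.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

def build_backbone():
    """Sketch of the backbone changes described above (our reading): keep ResNet-50
    up to its fourth stage (1024 channels), change that stage's stride-2 convolutions
    to stride 1, and dilate its 3x3 convolutions so the overall output stride is 8."""
    net = resnet50(pretrained=True)
    for m in net.layer3.modules():                      # layer3 corresponds to the fourth stage
        if isinstance(m, nn.Conv2d):
            if m.stride == (2, 2):
                m.stride = (1, 1)                       # stride 2 -> 1 for larger feature maps
            if m.kernel_size == (3, 3):
                m.dilation, m.padding = (2, 2), (2, 2)  # dilated 3x3 keeps the receptive field
    return net

def extract_features(net, x):                           # x: (B, 3, H0, W0)
    x = net.maxpool(net.relu(net.bn1(net.conv1(x))))
    x = net.layer3(net.layer2(net.layer1(x)))           # the fifth stage (layer4) is not used
    return x                                            # (B, 1024, H0/8, W0/8)

feat = extract_features(build_backbone(), torch.rand(1, 3, 128, 128))   # -> (1, 1024, 16, 16)
```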

3.3. Attention Enhancement Feature Fusion Network

The attention enhancement feature fusion network is designed to fuse the features $F_T$ and $F_S$. Firstly, the channel dimensions of $F_T$ and $F_S$ are reduced through a 1 × 1 convolution, resulting in two low-dimensional feature maps, $F_{T_0} \in \mathbb{R}^{H_T \times W_T \times d}$ and $F_{S_0} \in \mathbb{R}^{H_S \times W_S \times d}$, where d is set to 256. Then, $F_{T_0}$ and $F_{S_0}$ are flattened along the spatial dimensions to obtain $F_{T_1} \in \mathbb{R}^{H_T W_T \times d}$ and $F_{S_1} \in \mathbb{R}^{H_S W_S \times d}$, which can be viewed as sets of feature vectors of length d. As shown in Figure 2, $F_{T_1}$ and $F_{S_1}$ are taken as the inputs of the template branch and the search-region branch of the attention-enhanced fusion network, respectively. Two FSA modules enhance the input features by making them adaptively focus on useful information through multi-head fast self-attention. Two ECA modules then receive these feature maps and fuse them with multi-head enhanced cross-attention. These two FSA and ECA modules form the attention-enhanced fusion layer, shown by the dashed box in Figure 2. The attention-enhanced fusion layer is repeated N times, and an additional ECA module then fuses the feature maps from the two branches, resulting in the fused feature map $F \in \mathbb{R}^{d \times H_S W_S}$. N is set to 4 in this paper.
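A minimal sketch of this projection and flattening step is shown below; whether the 1 × 1 convolution is shared between the two branches is not specified, and a single shared convolution is used here for brevity.

```python
import torch
import torch.nn as nn

# Sketch of the input projection described above: a 1x1 convolution reduces the
# channel dimension from C = 1024 to d = 256, and the spatial dimensions are then
# flattened into a sequence of d-dimensional feature vectors (names are ours).
reduce_channels = nn.Conv2d(1024, 256, kernel_size=1)

def to_sequence(feat):                          # feat: (B, 1024, H, W)
    x = reduce_channels(feat)                   # (B, 256, H, W)
    return x.flatten(2).permute(0, 2, 1)        # (B, H*W, 256)

f_t1 = to_sequence(torch.rand(1, 1024, 16, 16))   # template branch: 16 x 16 = 256 vectors
f_s1 = to_sequence(torch.rand(1, 1024, 32, 32))   # search branch:   32 x 32 = 1024 vectors
```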

3.3.1. Attention Mechanism

Attention is the fundamental component of the feature fusion network. Let N denote the length of the input feature sequence. The attention mechanism can be viewed as a mapping that takes $Q, K, V \in \mathbb{R}^{N \times d}$ as inputs, where $Q$ and $K$ are two sets of feature vectors with dimension $d_k$ and $V$ is a set of feature vectors with dimension $d_v$; the process can be interpreted as querying keys and retrieving values from a continuous dictionary. The attention scores between $Q$ and $K$ are obtained through scaled dot-product operations, and the attention map is generated through a softmax operation. The proposed GFETrack reweights $V$ based on the attention map; it can thus adaptively adjust the weights of the features in $V$ by leveraging the correlation between $Q$ and $K$, allowing the algorithm to focus on the features at useful positions within $V$. The traditional form of attention is shown in Equation (1):
$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}(A)\,V, \quad A = \dfrac{Q K^{T}}{\sqrt{d_k}},$
where $A \in \mathbb{R}^{N \times N}$ is the attention matrix. Because A must be stored explicitly, the time and space complexities of Equation (1) are $O(N^2 d)$ and $O(N^2 + Nd)$, respectively. When extended to multi-head attention, the formula is as shown in Equation (2):
$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(H_1, \ldots, H_{n_h})\,W^{O}, \quad H_i = \mathrm{Attn}(Q W_i^{Q}, K W_i^{K}, V W_i^{V}),$
where $W_i^{Q} \in \mathbb{R}^{d_m \times d_k}$, $W_i^{K} \in \mathbb{R}^{d_m \times d_k}$, $W_i^{V} \in \mathbb{R}^{d_m \times d_v}$, and $W^{O} \in \mathbb{R}^{n_h d_v \times d_m}$ are parameter matrices, and $H_i$ is the output of the i-th attention head. In this paper, $n_h = 8$, $d_m = 256$, and $d_k = d_v = \frac{d_m}{n_h} = 32$.
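Equations (1) and (2) describe standard multi-head scaled dot-product attention, which is available directly in PyTorch; the snippet below only illustrates the paper's sizes ($d_m = 256$, $n_h = 8$, hence $d_k = d_v = 32$ per head) and is the reference form that the FSA module in Section 3.3.2 approximates.

```python
import torch
import torch.nn as nn

# Standard multi-head scaled dot-product attention (Eqs. (1)-(2)) with the
# paper's sizes: d_m = 256 and n_h = 8 heads, i.e. d_k = d_v = 32 per head.
mha = nn.MultiheadAttention(embed_dim=256, num_heads=8)

q = k = v = torch.rand(64, 1, 256)       # (sequence length N, batch, d_m)
out, attn_map = mha(q, k, v)             # out: (64, 1, 256); attn_map: (1, 64, 64), head-averaged
```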

3.3.2. Fast Self-Attention Modules

The architecture of the proposed fast self-attention is shown in Figure 3. It applies a residual form of multi-head fast self-attention to adaptively adjust the weights of different positions in the feature map. Because the attention in Equation (1) is insensitive to the order of the input sequence, spatial position encoding is introduced to distinguish the positional information of the input features; a sine function is used to generate the spatial positional encoding. In the conventional approach, the attention matrix is multiplied with the value input to compute the final result. With the introduction of FSA, however, the matrix multiplications can be rearranged to approximate the outcome of the traditional attention mechanism without explicitly constructing a quadratic-size attention matrix. In essence, FSA optimizes the computational efficiency of the attention mechanism, offering a more streamlined and scalable solution.
In the fast attention formulation, the attention matrix can be written as $A(i,j) = \mathcal{K}(q_i^{T}, k_j^{T})$, where $q_i$ and $k_j$ are the i-th query vector and the j-th key vector, respectively, and $\mathcal{K}(\cdot,\cdot)$ is a kernel function that can be expressed as in Equation (3):
$\mathcal{K}(x, y) = \mathbb{E}\left[\varphi(x)^{T} \varphi(y)\right],$
where $\varphi(u)$ denotes a random feature map of $u \in \mathbb{R}^{d}$. For $\tilde{Q}, \tilde{K} \in \mathbb{R}^{N \times r}$, the efficient attention mechanism can be computed by Equation (4):
$\mathrm{Att}(Q, K, V)_{F} = \tilde{Q}\left(\tilde{K}^{T} V\right),$
where the rows of $\tilde{Q}$ and $\tilde{K}$ are $\varphi(q_i^{T})^{T}$ and $\varphi(k_i^{T})^{T}$, respectively, and $\mathrm{Att}(Q, K, V)_F$ denotes fast self-attention, which approximates traditional self-attention. As shown in Figure 3, the space and time complexities of the proposed fast self-attention are $O(Nr + Nd + rd)$ and $O(Nrd)$, which are lower than those of traditional self-attention ($O(N^2 + Nd)$ and $O(N^2 d)$, respectively). Here, N is the number of patches and d is the dimension of the patches; unlike in traditional self-attention [19], r is a hyperparameter that sets the number of random features. In this work, $N = 64$, $d = 128$, and $r = 8$, so the spatial complexity is about 79% of that of traditional self-attention. The FSA module can be computed by Equation (5):
$X_{FSA} = X + \mathrm{MultiHead}(X + P_x,\; X + P_x,\; X),$
where $P_x \in \mathbb{R}^{N_x \times d}$ is the spatial position encoding and $X_{FSA} \in \mathbb{R}^{N_x \times d}$ is the output of the multi-head fast self-attention.
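Equations (3)-(5) leave the concrete choice of the random feature map $\varphi$ open; the sketch below instantiates them with Performer-style positive orthogonal random features [27], which is one plausible choice. All function names and the single-head, batched shapes are our own.

```python
import math
import torch

def orthogonal_gaussian(r, d):
    """Orthogonal random projection rows, rescaled to Gaussian norms (cf. [27])."""
    q, _ = torch.qr(torch.randn(d, r))                  # orthonormal columns, shape (d, r)
    norms = torch.randn(r, d).norm(dim=1, keepdim=True)
    return q.t() * norms                                # shape (r, d)

def phi(x, proj):
    """Positive random-feature map approximating the softmax kernel (Performer-style)."""
    x = x / x.shape[-1] ** 0.25                         # absorbs the 1/sqrt(d_k) scaling
    return torch.exp(x @ proj.t() - (x ** 2).sum(-1, keepdim=True) / 2) / math.sqrt(proj.shape[0])

def fast_self_attention(q, k, v, proj):
    """Att(Q, K, V)_F = Q~ (K~^T V), computed without the N x N attention matrix."""
    q_t, k_t = phi(q, proj), phi(k, proj)               # (B, N, r)
    kv = torch.einsum('bnr,bnd->brd', k_t, v)           # (B, r, d)
    normaliser = q_t @ k_t.sum(dim=1).unsqueeze(-1)     # (B, N, 1): row sums of the implicit A
    return (q_t @ kv) / (normaliser + 1e-6)             # (B, N, d)

proj = orthogonal_gaussian(r=8, d=128)                  # N = 64, d = 128, r = 8 as in the text
q = k = v = torch.rand(1, 64, 128)
out = fast_self_attention(q, k, v, proj)                # approximates softmax attention output
```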

3.3.3. Enhanced Cross-Attention Module

The enhanced cross-attention module is proposed to boost the feature information of the object area and improve the matching accuracy by reweighting the attention weights. As shown in Figure 4, the outputs $X_{FSA}$ of the fast self-attention modules are taken as the inputs of the ECA module. The inputs $X_{kv}$ and $X_q$ are fused by multi-head cross-attention in a residual form. Similar to the self-attention module, spatial position encoding is also used in the cross-attention module. Additionally, a feed-forward network (FFN) is applied to enhance the fitting capability of the ECA module. The FFN is computed by Equation (6):
$\mathrm{FFN}(x) = \max(0,\; x W_1 + b_1)\,W_2 + b_2,$
where x is the input feature vector, $W_1$ and $W_2$ denote the weight matrices of the first and second layers, respectively, and $b_1$ and $b_2$ are the corresponding bias vectors. The overall computation of the enhanced cross-attention module is as follows:
$X_{ECA} = \tilde{X}_{ECA} + \mathrm{FFN}(\tilde{X}_{ECA}),$
$\tilde{X}_{ECA} = X_q + \mathrm{Output},$
$\mathrm{Output} = \begin{cases} \mathrm{softmax}\!\left(\dfrac{(X_q + P_q)(X_{kv} + P_{kv})^{T}}{\sqrt{d_k}}\right) X_{kv} \times rate, & \text{if } \mathrm{softmax}\!\left(\dfrac{(X_q + P_q)(X_{kv} + P_{kv})^{T}}{\sqrt{d_k}}\right) > 0.7, \\ \mathrm{softmax}\!\left(\dfrac{(X_q + P_q)(X_{kv} + P_{kv})^{T}}{\sqrt{d_k}}\right) X_{kv}, & \text{otherwise}, \end{cases}$
where $X_q$ and $X_{kv}$ are the two inputs of the ECA module, and $P_q$ and $P_{kv}$ are the corresponding spatial position encodings. The attention scores of ECA are calculated by the scaled dot product of $X_q$ and $X_{kv}$ (with the position encodings added), as shown in Equation (9). When a score in the attention map exceeds 0.7, it is multiplied by a pre-defined coefficient, rate, to accomplish the enhancement of cross-attention. Specifically, the average attention level of each position in the search feature sequence is first calculated, yielding the average attention weights. The positions in the attention map are then sorted by their attention scores, and the positions with high scores (exceeding 0.7) are reweighted by the rate, increasing the weights of the most attended positions by an additional 30%; the weights of the other positions remain unchanged. Finally, the reweighted feature sequences are concatenated to enhance the matching accuracy. Thus, $X_{kv}$ is reweighted by the adjusted attention scores, and the result is added to $X_q$ to strengthen the representation capability of the feature map. $X_{ECA} \in \mathbb{R}^{N_q \times d}$ is the output of the enhanced cross-attention module.
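Putting Equations (6)-(9) together, a single-head sketch of the ECA module might look as follows; the class name, the FFN hidden width of 2048, and the single-head simplification are our assumptions, while the 0.7 threshold and the 30% boost (rate = 1.3) follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ECASketch(nn.Module):
    """Single-head sketch of Equations (6)-(9), following our reading above:
    attention scores above `thr` are multiplied by `rate` (a 30% boost), the
    others are left unchanged. The FFN hidden width is an assumption."""
    def __init__(self, d=256, d_ff=2048, rate=1.3, thr=0.7):
        super().__init__()
        self.rate, self.thr = rate, thr
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))  # Eq. (6)

    def forward(self, x_q, x_kv, p_q, p_kv):
        d_k = x_q.shape[-1]
        scores = F.softmax((x_q + p_q) @ (x_kv + p_kv).transpose(-2, -1) / d_k ** 0.5, dim=-1)
        scores = torch.where(scores > self.thr, scores * self.rate, scores)  # reweight high scores
        x = x_q + scores @ x_kv               # Eq. (8): residual cross-attention output
        return x + self.ffn(x)                # Eq. (7)
```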

3.4. Prediction Head Network

The prediction head network comprises a classification branch and a regression branch. Each branch is a multi-layer perceptron (MLP) with three hidden layers and ReLU activations. For each fused feature vector, the prediction head computes a foreground/background classification result and box coordinates normalized with respect to the size of the search region. Notably, the proposed GFETrack directly predicts the object state without any additional post-processing.
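A sketch of one head branch under these assumptions is given below; the hidden width of 256 and the output dimensions are our choices, since the text only specifies three hidden layers with ReLU activations.

```python
import torch.nn as nn

def mlp_head(out_dim, d=256, hidden=256):
    """One prediction-head branch: an MLP with three hidden layers and ReLU
    activations (the hidden width of 256 is our assumption)."""
    return nn.Sequential(
        nn.Linear(d, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

cls_head = mlp_head(out_dim=1)   # one foreground logit per fused feature vector
reg_head = mlp_head(out_dim=4)   # normalized box coordinates
```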

3.5. The Training Loss

The overall training loss of the proposed GFETrack consists of a classification loss and a regression loss. The prediction head receives the $H_S W_S$ feature vectors and outputs binary classification and regression results. The feature vectors falling inside the ground-truth box are treated as positive samples, while the remaining feature vectors are treated as negative samples. All samples contribute to the classification loss, but only positive samples contribute to the regression loss. To mitigate the imbalance between positive and negative samples, the loss contributed by negative samples is down-weighted by a factor of 16. The standard binary cross-entropy loss is used for classification, as shown in Equation (10):
$L_{cls} = -\sum_i \left[\, y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \right],$
where $y_i$ is the label of the i-th sample ($y_i = 1$ represents the foreground and $y_i = 0$ the background), and $p_i$ is the probability predicted by the model that the i-th sample belongs to the foreground. For regression, as in [9], the $L_1$ loss and the GIoU loss [30] are utilized. The regression loss can be expressed as:
$L_{reg} = \sum_i \left[\, \lambda_g L_{GIoU}(b_i, \hat{b}) + \lambda_1 L_1(b_i, \hat{b}) \right],$
where $b_i$ is the i-th predicted bounding box, $\hat{b}$ is the normalized ground-truth bounding box, $L_1$ and $L_{GIoU}$ represent the $L_1$ loss and the generalized IoU loss, respectively, and $\lambda_1$ and $\lambda_g$ are hyperparameters determining the relative impact of the two loss terms. In our experiments, $\lambda_g = 2$ and $\lambda_1 = 5$. Finally, the overall loss is defined as:
$L = \lambda_c L_{cls} + L_{reg},$
where $\lambda_c = 8.3$ is a weight factor, and $L_{cls}$ and $L_{reg}$ are the classification and regression losses, respectively.
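The three loss terms can be combined as in the following sketch; the `giou_loss` callable and the tensor layout are assumptions, while the weighting factors follow Equations (10)-(12).

```python
import torch
import torch.nn.functional as F

def gfetrack_loss(cls_logits, labels, pred_boxes, gt_box, giou_loss,
                  lam_g=2.0, lam_1=5.0, lam_c=8.3):
    """Sketch of Equations (10)-(12). `giou_loss` is an assumed callable that
    returns the per-box generalized IoU loss [30]; negative samples are
    down-weighted by a factor of 16 as described above."""
    # labels: float tensor of 0s/1s, one per feature vector; cls_logits: one logit each.
    weights = torch.where(labels > 0.5,
                          torch.ones_like(labels),
                          torch.full_like(labels, 1.0 / 16))
    l_cls = F.binary_cross_entropy_with_logits(cls_logits, labels, weight=weights)   # Eq. (10)
    pos = labels > 0.5                            # only positive samples contribute to regression
    gt = gt_box.expand_as(pred_boxes)[pos]
    l_reg = (lam_g * giou_loss(pred_boxes[pos], gt).mean()
             + lam_1 * F.l1_loss(pred_boxes[pos], gt))                                # Eq. (11)
    return lam_c * l_cls + l_reg                                                      # Eq. (12)
```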

4. The Experiments

The proposed GFETrack is implemented in Python 3.7 with PyTorch 1.7.1. Training is carried out on two NVIDIA GeForce RTX 4090 GPUs and takes approximately 180 h, while inference is performed on an NVIDIA RTX 2070 GPU. This section describes GFETrack from four aspects: the implementation details, an analysis of GFETrack, and quantitative and qualitative experiments on public datasets.

4.1. Implementation Details

To enhance the discriminative power of the fused features and reduce overfitting, the proposed model is trained on the COCO2017 [31], TrackingNet [32], LaSOT [33], and GOT-10k [34] datasets. Training samples are constructed by directly sampling frames from the video sequences; for the COCO dataset, image transformations are employed to generate image pairs. Data augmentation techniques such as rotation and color jittering are also applied. The sizes of the search region and the template region are set to 256 × 256 and 128 × 128, respectively. The pre-trained ResNet-50 is used to initialize the backbone network. The AdamW [35] optimizer is employed during training; the initial learning rate is set to $10^{-5}$ and is reduced by a factor of 10 after 600 epochs. Two NVIDIA RTX 4090 GPUs are used for training, with a batch size of 32 on each GPU. In total, we train for 1000 epochs, and each epoch consists of 1000 sample pairs.
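A minimal sketch of this optimization setup is given below; the placeholder model and the default weight decay are assumptions, while the learning rate, schedule, and epoch counts follow the description above.

```python
import torch

model = torch.nn.Linear(256, 5)      # placeholder for the GFETrack model (assumption)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=600, gamma=0.1)  # x0.1 after 600 epochs

for epoch in range(1000):            # 1000 epochs, each drawing 1000 template/search pairs
    # ... sample pairs, run the forward pass, compute the loss, and call backward() ...
    optimizer.step()
    scheduler.step()
```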

4.2. Analysis of GFETrack

4.2.1. The Ablation Experiments

The impact of different components on the success rate on the OTB100 dataset is analyzed to gain insights into the proposed tracker's performance. As presented in Table 1, $M_1$ represents the full GFETrack, while $M_2$, $M_3$, and $M_4$ indicate variants of the proposed tracker without the FSA module, the ECA module, and the positional encoding, respectively. $M_5$ represents the baseline tracker.
Notably, when the ECA module is removed (comparing $M_1$ and $M_3$), the success rate drops by a significant 1.1%. This decrease underscores the critical role of the ECA module in focusing on the most salient part of the object and thereby contributing to overall tracking accuracy: it facilitates enhanced feature discrimination, allowing the tracker to prioritize and attend to crucial object-related information. Comparing $M_1$ and $M_2$ shows that the FSA module brings a modest improvement of 0.5% in the success rate. This gain is not the main purpose of the FSA module, which is primarily designed to reduce the number of parameters and the computational complexity; by achieving parameter efficiency, it improves the tracking speed without compromising the success rate. This emphasizes that certain modules, while not directly targeting accuracy, play vital roles in optimizing the overall performance and efficiency of the tracker.

4.2.2. The Speed and Number of Parameters

To analyze the number of parameters and running speed of the proposed GFETrack, we compared the tracking speed and the number of parameters of the proposed tracker with three similar algorithms, TransT [9], STARK-ST50 [25], and SiamRPN++ [36].
As illustrated in Table 2, the running speed of GFETrack is measured at 55 fps, surpassing the 25 fps threshold, which indicates that our proposed method is well suited for real-time object tracking. Additionally, GFETrack achieves a notable reduction of approximately 13% in the number of parameters compared to TransT (20.1 M vs. 23.0 M). The reduction in parameters is attributed to the FSA module, which employs orthogonal random features to implement fast self-attention by decomposing the attention matrix into the product of random non-linear functions of the original query and key. By doing so, the FSA module effectively reduces the model's spatial complexity, which not only yields a more streamlined model but also enhances the inference speed. The trade-off between speed and accuracy is a critical consideration in tracking scenarios. In the case of GFETrack, the FSA module strikes a balance by reducing the number of parameters without compromising tracking accuracy: the orthogonal random features enable efficient self-attention computations, allowing the model to maintain a high level of accuracy while ensuring rapid inference. Generally, reducing the size and computational complexity of a model increases the fps; however, it can have the opposite effect on accuracy. This trade-off may partly explain the relatively weak results of our method on VOT2018.
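For reference, the parameter counts and speeds in Table 2 can be reproduced with utilities along the following lines; the two-input forward signature is an assumption about the tracker interface, and rigorous timing would also discard warm-up iterations.

```python
import time
import torch

def count_parameters(model):
    """Learnable parameters in millions, as reported in Table 2."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

def measure_fps(model, template, search, n=100):
    """Rough frames-per-second estimate for the forward pass (a sketch only)."""
    with torch.no_grad():
        start = time.time()
        for _ in range(n):
            model(template, search)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
    return n / (time.time() - start)
```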

4.2.3. Analysis of the ECA Module

To analyze how the ECA module operates within GFETrack, cross-attention feature maps of different layers are visualized in Figure 1. From the four feature maps of different layers, we observe that the background interference is progressively suppressed, layer by layer, demonstrating the effectiveness of the ECA module in improving object recognition capability. As the number of layers deepens, the ECA module becomes more focused on the object itself. Compared to the TransT tracker, GFETrack achieves better discrimination between the object and background in the first two layers. This also demonstrates the effectiveness of our ECA module.

4.3. Quantitative Experiments on Public Datasets

Four widely used tracking benchmarks, VOT2018 [37], LaSOT [33], UAV123 [38], and OTB100 [39], are used to evaluate the performance of the proposed GFETrack.
VOT2018:
This dataset consists of 60 challenging video sequences. The accuracy (A), robustness (R), and expected average overlap (EAO) are used as the evaluation criteria for VOT2018. An excellent tracker typically has high A and EAO scores, but low R scores.
Table 3 reveals that GFETrack excels in accuracy with a score of 0.637, surpassing the baseline tracker (TransT). However, it is noteworthy that our proposed tracker demonstrates relatively average performance in terms of the EAO and R scores. We attribute this observation primarily to a deliberate choice made during the feature extraction stage, wherein we removed the last stage of ResNet-50. While this enhances accuracy by focusing on specific features, it inadvertently results in the loss of some local information from the input images. This loss of local information likely contributes to the observed dip in robustness and expected average overlap. The intentional removal of the last ResNet-50 stage was aimed at optimizing the tracker for accuracy-focused scenarios; however, we acknowledge that this design choice comes at the cost of reduced robustness, as evidenced by the comparatively lower EAO score and higher R (failure) score. Future work may explore strategies to balance this trade-off, perhaps through the integration of additional contextual information or the use of alternative feature extraction techniques.
LaSOT:
LaSOT is a large-scale, long-term visual object tracking dataset that includes 1400 video sequences covering 70 object categories. In Figure 5, the success and precision plots are presented. The proposed GFETrack outperformed the other trackers, achieving the highest success rate of 67.5%, which is 2.6% higher than the baseline tracker, TransT. It also reached the highest precision rate of 73.2%, surpassing TransT by 4.2%.
To further analyze the proposed GFETrack, Figure 6 presents a radar chart for various tracking challenges on the LaSOT dataset. As shown in Figure 6, compared to some state-of-the-art trackers, such as DiMP [42], DaSiamRPN [44], ATOM [41], TransT [9], SiamRPN++ [36], TrSiam [45], and PrDiMP [46], the proposed tracker demonstrates better performance in most of the tracking challenges, especially under viewpoint change, low resolution, and camera motion.
The detailed precision plots of different tracking challenges are shown in Figure 7. The proposed tracker performs well in various scenarios, such as viewpoint change, camera clutter, and low resolution. This indicates that the proposed method has significant potential in handling challenging scenarios. We believe the primary reason is that the proposed tracker focuses more on the object itself, leading to increased precision and robustness.
OTB100:
This dataset contains 100 sequences for visual object tracking with 11 tracking challenges. Figure 8 displays the precision and success plots of the proposed tracker. GFETrack achieves a promising performance on the OTB100 dataset: its success rate is 0.708, which is 1.9% higher than TransT, and its precision is 0.911, which is 2.5% higher than TransT.
In Figure 9, we compare the success plots of the proposed GFETrack with some state-of-the-art trackers under the 11 tracking challenges. The results demonstrate that in most tracking challenges, the proposed GFETrack outperforms other trackers, such as TransT [9], SiamBAN [47], and SiamCAR [48]. In particular, for deformation and in-plane rotation, the proposed tracker achieves significant improvements over the baseline TransT, with gains of 3.4% and 2.6%, respectively. This also demonstrates that the proposed ECA module possesses strong discriminative capabilities, allowing the tracker to focus more on the object itself.
UAV123:
UAV123 is one of the most widely used datasets in the fields of visual object tracking and unmanned aerial vehicles. The AUC and precision are the important evaluation metrics in the UAV123 dataset.
Table 4 shows the results of the proposed tracker on the UAV123 dataset. The AUC score of GFETrack is 67.3, which is 0.8 lower than that of TransT. We believe the reason is that the objects in the UAV123 dataset are typically small and exhibit limited scale variation, to which transformer-based models are less sensitive, leading to less accurate tracking in this setting.

4.4. Qualitative Experiments on Public Datasets

To qualitatively compare the proposed tracker with other trackers, Figure 10 shows the tracking results of six tracking algorithms, including TransT [9], MDNet [1], DaSiamRPN [44], SiamRPN [52], Staple [2], and Ocean [43], on five challenging sequences. The five sequences, from top to bottom, are Ironman, Shaking, Bird1, Singer2, and Soccer, respectively. These five sequences cover nearly all of the 11 tracking challenges, and some conventional trackers perform poorly on them, being prone to losing the object and failing to track.
From Figure 10, we can see that, compared to the bounding boxes of other colors, the red bounding boxes and the green bounding boxes have a higher degree of overlap. This indicates that the proposed tracker can achieve better tracking results than the other methods and that the ECA module can better direct the fused features to attend to the object's region. The failures of other trackers, such as TransT [9] and Ocean [43] in the last row, indirectly demonstrate the robustness of our algorithm under different challenges.
Although the proposed method can handle most tracking challenges, we notice that the ECA module may struggle and suffer a decrease in matching accuracy under some specific conditions, such as complex backgrounds or rapid changes in the object's appearance. As the proposed GFETrack is still at the theoretical stage, it may encounter hurdles in real-world deployment. Thus, research on better feature fusion and re-detection methods will be the next step for GFETrack.

5. Conclusions

GFETrack, an enhanced transformer-based tracking method aimed at augmenting the extraction and fusion of object features, is proposed in this paper. It leverages a transformer as the feature fusion network, aggregating global information from object features through self-attention and cross-attention mechanisms. To bolster the effectiveness of cross-attention, the ECA module is proposed; it refines the feature information within the object area by reweighting the positions with the highest attention scores, which enhances the matching accuracy and contributes to overall robustness. To address the computational complexity inherent in self-attention, the FSA module is proposed, which introduces orthogonal random features and thereby reduces the spatial complexity from $O(N^2 + Nd)$ to $O(Nr + Nd + rd)$; this not only ensures computational efficiency but also preserves high-quality representations of object features, further optimizing the overall tracking process. Experiments on several challenging datasets confirm the effectiveness and practicality of the proposed GFETrack.

Author Contributions

Conceptualization, S.W.; methodology, G.F.; software, G.F.; validation, J.W.; writing—original draft preparation, G.F. and J.W.; writing—review and editing, L.L. and S.N.M.; project administration, K.Z.; funding acquisition, S.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Hebei Province, grant number F2022201013; the Scientific Research Program of Anhui Provincial Ministry of Education, grant numbers KJ2021A0528 and KJ2020A1202; the Startup Foundation for Advanced Talents of Hebei University, grant number 521100221003; and Anhui Shenhua Meat Products Co., Ltd. Cooperation Project, grant number 22100084.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. These datasets can be found here: VOT2018: https://www.votchallenge.net/index.html; LaSOT: https://pan.baidu.com/s/1k3xO6mJ-6a5UcIgM7lgKoQ#list/path=%2F, password: 9mtj; UAV123: https://cemse.kaust.edu.sa/ivul/uav123 and OTB100: https://pan.baidu.com/s/1TC6BF9erhDCENGYElfS3sw, password: 9x8q.

Conflicts of Interest

The authors declare that this study received funding from Anhui Shenhua Meat Products Co., Ltd. Cooperation Project. The funder was not involved in the study design, collection, analysis, interpretation of data, the writing of this article or the decision to submit it for publication.

Abbreviations

The following abbreviations are used in this manuscript:
ECA: Enhanced cross-attention
FSA: Fast self-attention
NLP: Natural language processing
AUC: Area under curve

References

  1. Nam, H.; Han, B. Learning multi-domain convolutional neural networks for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4293–4302. [Google Scholar]
  2. Bertinetto, L.; Valmadre, J.; Golodetz, S.; Miksik, O.; Torr, P.H. Staple: Complementary learners for real-time tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1401–1409. [Google Scholar]
  3. Fan, H.; Ling, H. Siamese cascaded region proposal networks for real-time visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7952–7961. [Google Scholar]
  4. Xing, J.; Ai, H.; Lao, S. Multiple human tracking based on multi-view upper-body detection and discriminative learning. In Proceedings of the 2010 20th International Conference on Pattern Recognition, Istanbul, Turkey, 23–26 August 2010; pp. 1698–1701. [Google Scholar]
  5. Liu, L.; Xing, J.; Ai, H.; Ruan, X. Hand posture recognition using finger geometric feature. In Proceedings of the 21st International Conference on Pattern Recognition, Tsukuba, Japan, 1–15 November 2012; pp. 565–568. [Google Scholar]
  6. Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P.H. Fully-convolutional siamese networks for object tracking. In Proceedings of the European Conference on Computer Vision Workshops, Amsterdam, The Netherlands, 7–10 October 2018; pp. 850–865. [Google Scholar]
  7. Suljagic, H.; Bayraktar, E.; Celebi, N. Similarity based person re-identification for multi-object tracking using deep Siamese network. Neural Comput. Appl. 2022, 34, 18171–18182. [Google Scholar] [CrossRef]
  8. Xu, Y.; Wang, Z.; Li, Z.; Yuan, Y.; Yu, G. Siamfc++: Towards robust and accurate visual tracking with target estimation guidelines. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 12549–12556. [Google Scholar]
  9. Chen, X.; Yan, B.; Zhu, J.; Wang, D.; Yang, X.; Lu, H. Transformer tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8126–8135. [Google Scholar]
  10. Cui, Y.; Jiang, C.; Wang, L.; Wu, G. Mixformer: End-to-end tracking with iterative mixed attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13608–13618. [Google Scholar]
  11. Ye, B.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Joint feature learning and relation modeling for tracking: A one-stream framework. In Proceedings of the European Conference on Computer Vision, Tel-Aviv, Israel, 23–24 October 2022; pp. 341–357. [Google Scholar]
  12. Di Nardo, E.; Ciaramella, A. Tracking vision transformer with class and regression tokens. Inf. Sci. 2023, 619, 276–287. [Google Scholar] [CrossRef]
  13. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
  14. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  15. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, Vancouver, Canada, 18–24 July 2021; pp. 10347–10357. [Google Scholar]
  16. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
  17. Pu, Y.; Wang, Y.; Xia, Z.; Han, Y.; Wang, Y.; Gan, W.; Wang, Z.; Song, S.; Huang, G. Adaptive Rotated Convolution for Rotated Object Detection. arXiv 2023, arXiv:2303.07820. [Google Scholar]
  18. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 568–578. [Google Scholar]
  19. Xia, Z.; Pan, X.; Song, S.; Li, L.E.; Huang, G. Vision transformer with deformable attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4794–4803. [Google Scholar]
  20. Bayraktar, E.; Wang, Y.; DelBue, A. Fast re-OBJ: Real-time object re-identification in rigid scenes. Mach. Vis. Appl. 2022, 33, 97. [Google Scholar] [CrossRef]
  21. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  22. Wang, Q.; Zhang, L.; Bertinetto, L.; Hu, W.; Torr, P.H. Fast online object tracking and segmentation: A unifying approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1328–1338. [Google Scholar]
  23. He, A.; Luo, C.; Tian, X.; Zeng, W. A twofold siamese network for real-time object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4834–4843. [Google Scholar]
  24. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
  25. Yan, B.; Peng, H.; Fu, J.; Wang, D.; Lu, H. Learning spatio-temporal transformer for visual tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10448–10457. [Google Scholar]
  26. Lin, L.; Fan, H.; Zhang, Z.; Xu, Y.; Ling, H. Swintrack: A simple and strong baseline for transformer tracking. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; pp. 16743–16754. [Google Scholar]
  27. Choromanski, K.; Likhosherstov, V.; Dohan, D.; Song, X.; Gane, A.; Sarlos, T.; Weller, A. Rethinking attention with performers. arXiv 2020, arXiv:2009.14794. [Google Scholar]
  28. Child, R.; Gray, S.; Radford, A.; Sutskever, I. Generating long sequences with sparse transformers. arXiv 2019, arXiv:1904.10509. [Google Scholar]
  29. Shaw, P.; Uszkoreit, J.; Vaswani, A. Self-attention with relative position representations. arXiv 2018, arXiv:1803.02155. [Google Scholar]
  30. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [Google Scholar]
  31. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
  32. Muller, M.; Bibi, A.; Giancola, S.; Alsubaihi, S.; Ghanem, B. Trackingnet: A large-scale dataset and benchmark for object tracking in the wild. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 300–317. [Google Scholar]
  33. Fan, H.; Lin, L.; Yang, F.; Chu, P.; Deng, G.; Yu, S.; Bai, H.; Xu, Y.; Liao, C.; Ling, H. LaSOT: A high-quality benchmark for large-scale single object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5369–5378. [Google Scholar]
  34. Huang, L.; Zhao, X.; Huang, K. Got-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1562–1577. [Google Scholar] [CrossRef] [PubMed]
  35. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
  36. Li, B.; Wu, W.; Wang, Q.; Zhang, F.; Xing, J.; Yan, J. Siamrpn++: Evolution of siamese visual tracking with very deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4282–4291. [Google Scholar]
  37. Kristan, M.; Leonardis, A.; Matas, J.; Felsberg, M.; Pflugfelder, R.; Čehovin Zajc, L.; Vojír̃, T.; Bhat, G.; Lukežič, A.; Eldesokey, A.; et al. The sixth visual object tracking vot2018 challenge results. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018. [Google Scholar]
  38. Mueller, M.; Smith, N.; Ghanem, B. A benchmark and simulator for UAV tracking. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 445–461. [Google Scholar]
  39. Wu, Y.; Lim, J.; Yang, M.H. Object tracking benchmark. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1834–1848. [Google Scholar] [CrossRef] [PubMed]
  40. Danelljan, M.; Bhat, G.; Shahbaz Khan, F.; Felsberg, M. Eco: Efficient convolution operators for tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Hawaii, HI, USA, 21–26 July 2017; pp. 6638–6646. [Google Scholar]
  41. Danelljan, M.; Bhat, G.; Khan, F.S.; Felsberg, M. Atom: Accurate tracking by overlap maximization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4660–4669. [Google Scholar]
  42. Bhat, G.; Danelljan, M.; Gool, L.V.; Timofte, R. Learning discriminative model prediction for tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6182–6191. [Google Scholar]
  43. Zhang, Z.; Peng, H.; Fu, J.; Li, B.; Hu, W. Ocean: Object-aware anchor-free tracking. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 771–787. [Google Scholar]
  44. Zhu, Z.; Wang, Q.; Li, B.; Wu, W.; Yan, J.; Hu, W. Distractor-aware siamese networks for visual object tracking. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 101–117. [Google Scholar]
  45. Wang, N.; Zhou, W.; Wang, J.; Li, H. Transformer meets tracker: Exploiting temporal context for robust visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1571–1580. [Google Scholar]
  46. Danelljan, M.; Gool, L.V.; Timofte, R. Probabilistic regression for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 7183–7192. [Google Scholar]
  47. Chen, Z.; Zhong, B.; Li, G.; Zhang, S.; Ji, R. Siamese box adaptive network for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6668–6677. [Google Scholar]
  48. Guo, D.; Wang, J.; Cui, Y.; Wang, Z.; Chen, S. SiamCAR: Siamese fully convolutional classification and regression for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6269–6277. [Google Scholar]
  49. Mayer, C.; Danelljan, M.; Bhat, G.; Paul, M.; Paudel, D.P.; Yu, F.; Van Gool, L. Transforming model prediction for tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8731–8740. [Google Scholar]
  50. Cui, Y.; Jiang, C.; Wang, L.; Wu, G. Target transformed regression for accurate tracking. arXiv 2021, arXiv:2104.00403. [Google Scholar]
  51. Yu, Y.; Xiong, Y.; Huang, W.; Scott, M.R. Deformable siamese attention networks for visual object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6728–6737. [Google Scholar]
  52. Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8971–8980. [Google Scholar]
Figure 1. The attention heat maps of different feature fusion layers. (a) The attention heat maps obtained by TransT [9]. (b) The attention heat maps obtained by the proposed GFETrack. From left to right, each row represents the heat maps from feature fusion layers 1 to 4, respectively. By comparing (a) and (b), we can find that the proposed algorithm could better focus on the objects.
Figure 2. The pipeline of proposed GFETrack tracker. It contains three basic components: the feature extractor, the attention enhancement fusion network, and the prediction head.
Figure 3. The FSA module. The FSA module is based on residual multi-head fast self-attention. Spatial location encoding is used to encode the position information. The FSA module can reduce the number of parameters of conventional self-attention and enhance contextual information. The dashed blocks represent the computational flow and the corresponding time complexity.
Figure 4. The ECA module. The inputs, $X_q$ and $X_{kv}$, are the features obtained by the FSA modules from the two branches of the attention enhancement fusion network.
Figure 5. The success and precision plots of the proposed algorithm and some other state-of-the-art trackers on LaSOT dataset.
Figure 6. The AUC scores of different tracking algorithms under 13 tracking challenges. The AUC scores on each axis have been normalized.
Figure 7. The precision plots of different tracking algorithms under different tracking challenges on LaSOT dataset.
Figure 8. The success and precision plots of different tracking algorithms on OTB100 dataset.
Figure 9. The success plots of different tracking algorithms under different tracking challenges on OTB100 dataset.
Figure 10. The qualitative experiments of six tracking algorithms on five challenging sequences. The five sequences, from top to bottom, are Ironman, Shaking, Bird1, Singer2, and Soccer, respectively.
Table 1. The ablation experiments on OTB100 dataset. “Yes” indicates the corresponding proposed module is used, while “No” indicates the corresponding proposed module is not used.
Methods | FSA | ECA | Pos | Success
M1 | Yes | Yes | Yes | 70.8
M2 | No | Yes | Yes | 70.3
M3 | Yes | No | Yes | 69.7
M4 | Yes | Yes | No | 69.5
M5 | No | No | No | 69.3
Table 2. The comparison experiments of tracking speed and the number of parameters.
Trackers | Speed (fps) | Params (M)
GFETrack | 55 | 20.1
TransT | 54 | 23.0
STARK-ST50 | 42.2 | 23.3
SiamRPN++ | 35.0 | 54.0
Table 3. The comparison experiments of different trackers on VOT2018 dataset. The red, green, and blue fonts indicate the top 3 trackers.
Metric | ECO [40] | ATOM [41] | SiamRPN++ [36] | DiMP [42] | Ocean-Online [43] | TransT [9] | GFETrack
EAO ↑ | 0.280 | 0.401 | 0.414 | 0.440 | 0.489 | 0.266 | 0.362
A ↑ | 0.484 | 0.590 | 0.600 | 0.597 | 0.592 | 0.587 | 0.637
R ↓ | 0.276 | 0.204 | 0.234 | 0.153 | 0.117 | 0.384 | 0.281
Table 4. The comparison experiments of different trackers on UAV123 dataset. The top 2 results are marked in red and blue, respectively.
Method | AUC (%) | P (%)
GFETrack | 67.3 | 87.0
TransT [9] | 68.1 | 87.6
ToMP-101 [49] | 66.9 | -
TREG [50] | 66.9 | 88.4
TrDiMP [45] | 67.5 | -
SiamAttn [51] | 65.0 | 84.5
SiamCAR [48] | 61.4 | 76.0
DiMP [42] | 65.4 | -
ATOM [41] | 64.3 | -
SiamFC [6] | 48.5 | 69.3
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
