Slight Aware Enhancement Transformer and Multiple Matching Network for Real-Time UAV Tracking

Abstract: Based on the versatility and effectiveness of the siamese neural network, the technology of unmanned aerial vehicle visual object tracking has found widespread application in various fields including military reconnaissance, intelligent transportation, and visual positioning. However, due to complex factors, such as occlusions, viewpoint changes, and interference from similar objects during UAV tracking, most existing siamese neural network trackers struggle to combine superior performance with efficiency. To tackle this challenge, this paper proposes a novel SiamSTM tracker that is based on a Slight Aware Enhancement Transformer and Multiple matching networks for real-time UAV tracking. SiamSTM leverages lightweight transformers to encode robust target appearance features while using the Multiple matching networks to fully perceive response map information and enhance the tracker's ability to distinguish between the target and background. Evaluation results on three UAV tracking benchmarks show superior speed and precision. Moreover, SiamSTM achieves over 35 FPS on an NVIDIA Jetson AGX Xavier, which satisfies the real-time requirements in engineering.


Introduction
Visual object tracking is a fundamental and challenging task in the field of computer vision, aiming to continuously and steadily track the position and size of a given target based on its initial state in video frames. Due to the superior maneuverability of unmanned aerial vehicles (UAVs), UAV tracking technology has been widely applied in various fields such as aerial photography [1] and remote sensing mapping [2]. However, UAV tracking still faces several difficulties: (1) the targets are often small in size and frequently present with partial or full occlusion, rendering it challenging for the algorithm to acquire target information; (2) given that the tracking duration is prolonged and the platform experiences continuous motion, targets may undergo alterations in appearance and movement blurring, necessitating a strong degree of algorithmic robustness; (3) the limited computational capacity of UAVs requires algorithms to be lightweight to ensure real-time tracking [3]. Therefore, designing an efficient and robust tracker for unmanned aerial vehicles remains an extremely challenging task.
Current advanced tracking methods can be divided into two directions: correlation filters [4] and siamese networks [5]. Due to the limited computing power of unmanned aerial platforms, correlation filter-based methods are often employed for target tracking in practical engineering applications [6]. However, the poor performance in terms of feature representation and insufficient optimization strategies of correlation filter methods make it difficult for them to effectively cope with complex scenes that may arise during the tracking process. In contrast, trackers based on siamese neural networks have made significant progress in recent years in terms of precision and robustness, while also continuously improving in efficiency. The deployment of lightweight siamese neural network-based trackers in real time on UAV platforms has, thus, become an area for current research activity [7]. Unfortunately, the target feature information extracted by these algorithms is insufficient to deal with complex scenarios such as changes in target appearance. Additionally, the single cross-correlation feature matching method used is unable to fully perceive the foreground and background information of the response map.
In light of these shortcomings, this article proposes a novel Slight Aware Enhancement Transformer and Multiple matching network for real-time UAV tracking. It consists mainly of a Slight Aware Enhancement Transformer (Slight-ViT) feature extraction network and a Multiple matching network, as illustrated in Figure 1. Building on the baseline, we introduce a lightweight ViT feature extraction network that effectively enhances feature salience. Furthermore, through the Multiple matching network, we strengthen the ability to discriminate between targets and the background. Testing, as depicted in Figure 1, has demonstrated the precision and robustness of our algorithm.
Figure 1. Comparison between the baseline tracker and the proposed tracker. The figures, from top to bottom, are: workflow, tracking result, and center location error comparison. When facing a constantly occluded object, our method can selectively emphasize and aggregate the effective information, avoiding interference brought by the environment. In the tracking results, the blue box represents the ground truth, the yellow box represents the baseline, and the red box represents ours.
The main contributions of this work can be summarized as follows:


• We propose a novel feature extraction network that adaptively aggregates the local information extracted by convolution and the global information extracted by the transformer, encoding both local and global information to enhance the algorithm's performance in terms of feature expression.

The rest of this article is organized as follows. Section 2 introduces the related work on visual object tracking and real-time transformer tracking. Section 3 describes the network architecture of SiamSTM and its detailed design. Section 4 presents the experimental results on the datasets and an ablation study. Conclusions are drawn in Section 5.

Related Work
Previously, correlation filtering methods, such as KCF, promoted the development of UAV tracking [8]; KCF expanded the number of training samples using circulant matrices while avoiding costly matrix operations. In subsequent research, ARCF-HC [9] used the variation of consecutive response maps to learn and suppress environmental noise. AutoTrack [10] was developed to use local and global response maps to adaptively adjust spatial and temporal weights through automatic spatiotemporal regularization. These methods have, to some extent, promoted the development of drone ground-target tracking. However, correlation filtering is limited by its capacity for feature extraction and its online updating strategies; its poor robustness and precision cannot meet the requirements of current practical engineering. Recently, the strong feature extraction ability of siamese neural networks has greatly improved the precision of object-tracking tasks, and siamese tracking algorithms have attracted wide attention in the field of object tracking.
Siamese neural networks consist of two independent feature extraction networks with shared weights. The template and search area are individually input into one of the two branches, extracting corresponding features. By matching the response maps, the search region with the highest similarity to the target template is identified to complete the tracking task. As a pioneering work, SINT [5] utilized siamese neural networks to directly learn matching functions and solve tracking problems through similarity learning. SiamFC [11] introduces translational invariance, removes image size limitations, and uses five preset scales to regress the target box. However, the resulting boxes contain a large amount of unnecessary background information, resulting in low tracking precision. SiamRPN [12] transformed the tracking task from a matching problem into a foreground-background classification problem by utilizing region proposal networks. By regressing on k anchor boxes of different sizes and aspect ratios at the same location, the precision and robustness of the task were improved to some extent. Subsequent research has aimed to improve other properties of SiamRPN, such as feature extraction, the positive-negative sample ratio, and feature enhancement [13][14][15][16]. However, the anchor mechanism, which introduces many hyperparameters, has difficulty handling object deformation and pose changes. SiamGAT [17] uses graph neural networks for target matching, enhancing target features and suppressing background information, thus improving the tracking performance of the algorithm.
Transformer [18] architecture can overcome the limitation of receptive fields and can concentrate on global information. Inspired by the transformer, many methods have emerged in recent years to improve drone tracking performance [19]. TR-Siam [20] effectively enhances target tracking robustness through a transformer feature pyramid and a response map enhancement method. However, this algorithm has a large computational cost and difficulty meeting real-time requirements. HIFT [21] adopts a transformer feature fusion method that directly integrates multi-scale feature information, but it does not provide any lightweight improvement to the transformer.
In recent years, improved lightweight algorithms have emerged specifically for the field of drone tracking: SiamAPN++ [22] employs an attentional aggregation network to enhance the robustness of processing complex scenarios; SGDViT [23] enhances spatial features by utilizing time-context data from consecutive frames; TCTrack [24] introduces temporal context information and decodes temporal knowledge to accurately adjust similarity maps; and E.T.Track [25] enables the algorithm to obtain more accurate information about the target through executive attention. However, these algorithms remain constrained by the AlexNet [26] feature extraction network, so there is still scope to refine their tracking precision and robustness. A baseline algorithm using MobileNetV2 [27] and an anchor box framework to regress target boxes provides a viable solution that effectively ensures robustness in addressing changes in a target's pose, while maintaining real-time speeds. However, the feature extraction abilities of this algorithm are suboptimal, and the target-matching process is relatively simplistic, indicating further opportunities for enhancement in both precision and robustness.

Proposed Method
In this section, our primary focus will revolve around elaborating the details of the Slight-ViT module and the multi-matching network, as well as exploring ways to integrate them seamlessly into the siamese network. The SiamSTM proposed in this paper is comprised of four principal components, which are illustrated in Figure 2: input, feature extraction, target localization, and output.

Overall Overview
Our proposed method utilizes Slight-ViT as the backbone for the feature extraction network, with an overall stride of 16. After obtaining the feature maps for the target and search regions, they are then fed into the multi-matching network. This network utilizes the target template as a kernel to match the feature map of the search region, thereby obtaining both category response maps and position response maps. This process is instrumental in improving the tracker's ability to distinguish between the target and background. Finally, the response maps are fed into the prediction head to yield the final tracking results.


Slight Aware Enhancement Transformer
It is widely acknowledged that the feature extraction network is critical to visual object tracking [13]. In terms of feature extraction algorithms, there are generally two directions: one relies on convolution and pooling to obtain deep feature maps which represent the position and semantic information of the target [5]; the other employs a large-scale vision transformer to extract global target features and positional information [28]. As convolution neural networks generate feature information with spatial locality due to the use of locally shared convolution kernels, their ability to model long-term dependencies effectively is restricted, particularly in complex scenarios where target features are limited due to factors such as occlusion or loss of sight. Moreover, transformer structures tend to focus on global features but lack local inductive bias, which hinders their ability to model local details and multi-scale information effectively [29]. Finally, the extensive computational and parameter requirements of transformers make it difficult to run them in real time.
Building upon the limitations of existing methods, this paper proposes a lightweight transformer structure that leverages both convolution neural networks and transformers. This design incorporates the global modeling capabilities of transformers while retaining the spatial locality of convolution information, thereby improving both local detail representation and global information perception. The Slight-ViT block, as shown in Figure 3, employs the Unfold-Transformer-Folding structure for global feature modeling and uses the Fusion module to merge transformer outputs with the original input feature map. The Feed Forward Network in the figure consists of three fully connected layers with identical input and output channels, which are used to integrate feature sequence information. Normalization is achieved through Layer Normalization [30], which standardizes individual samples and stabilizes the distribution of feature maps, thereby avoiding problems such as gradient vanishing that can lead to model degradation. The network uses one multi-head attention layer to compute input features, with the specific formulas as follows:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
MultiHead(Q, K, V) = Concat(head_1, ..., head_N) W

In these formulas, the query (Q), key (K), and value (V) are used to enhance the feature representation; d_k represents the number of channels; N represents the number of attention heads; and W_i^Q, W_i^K, W_i^V, and W represent weight coefficients obtained through network training.
This module calculates the similarity between Q and K, multiplies V by the normalized weights, and realizes the feature enhancement of V. Additionally, scaling by sqrt(d_k) ensures the stability of training. Multi-Head Attention is designed to handle serialized data formats, which requires transforming feature maps into sequence information. However, in traditional transformer structures, images are directly fed into the attention mechanism after being flattened, resulting in excessive computations and parameters between tokens. This approach fails to meet real-time requirements for drone tracking. In light of this, this paper proposes a special Unfold and Folding method that divides the image into four categories of tokens, which accurately distinguishes each token from those around it. Each token only undergoes attention calculation with tokens of the same category, significantly reducing computation and parameter requirements while meeting the real-time demands of drone tracking, as illustrated in Figure 4.
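To make the four-category token partition concrete, here is a minimal sketch. The exact interleaving layout is our assumption based on the stated 1/4-cost behavior, and `unfold_tokens` and `attention_cost` are illustrative helpers, not the paper's code:

```python
# Sketch: split an h x w feature map into four interleaved token categories by
# the parity of each pixel's (row, col) coordinates; attention is then computed
# only among tokens of the same category.

def unfold_tokens(h, w):
    """Partition pixel coordinates into 4 categories by (row % 2, col % 2)."""
    groups = {(0, 0): [], (0, 1): [], (1, 0): [], (1, 1): []}
    for r in range(h):
        for c in range(w):
            groups[(r % 2, c % 2)].append((r, c))
    return groups

def attention_cost(n_tokens):
    """Pairwise attention cost grows quadratically with the token count."""
    return n_tokens * n_tokens

h, w = 8, 8
groups = unfold_tokens(h, w)
full_cost = attention_cost(h * w)  # standard ViT: all 64 tokens attend to each other
grouped_cost = sum(attention_cost(len(g)) for g in groups.values())
print(grouped_cost / full_cost)  # 0.25
```

With four equal groups of N/4 tokens, the grouped attention costs 4 x (N/4)^2 = N^2/4, i.e., one quarter of standard full attention, which is consistent with the cost claim above.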
The feature extraction network used in this paper has a lower down-sampling rate, resulting in minor differences between adjacent pixel information. Additionally, in Slight-ViT, the convolution network associates image features locally with their surrounding pixels. Using traditional transformer methods would therefore incur computational costs far exceeding their benefits in terms of feature salience.
Our Slight-ViT method successfully incorporates the transformer's exceptional global modeling capabilities while requiring only 1/4 of the standard attention computation and eliminating feature map data redundancies. Following the attention operations, the Folding method recovers the four classes of tokens as image feature maps and sends them to the Fusion module. This module adaptively fuses results from both the improved transformer and the convolution network by channel concatenation and down-sampling, enhancing the network's ability to capture both local and global information. By inserting the designed Slight-ViT block into layers [layer3, layer4, layer5] of MobileNetV2, the Slight-ViT feature extraction network is obtained. In Figure 5, heat maps [31] attest to the effectiveness and robustness of this approach in providing accurate object tracking and generating feature maps.
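The Fusion module's channel concatenation followed by channel-wise down-sampling can be sketched as follows. This is a pure-Python illustration in which fixed 1x1 mixing weights stand in for the learned ones; the function and variable names are ours, not the paper's:

```python
# Sketch of the Fusion step: concatenate the convolution branch and the
# transformer branch along the channel axis (C -> 2C), then mix back down to
# C channels with 1x1 weights (learned in the real network, fixed here).

def fusion(conv_feat, vit_feat, mix):
    """conv_feat, vit_feat: C x H x W nested lists; mix: C x 2C weight matrix."""
    cat = conv_feat + vit_feat  # channel concatenation -> 2C channel maps
    C = len(mix)
    H, W = len(cat[0]), len(cat[0][0])
    return [[[sum(mix[o][i] * cat[i][y][x] for i in range(len(cat)))
              for x in range(W)] for y in range(H)] for o in range(C)]

conv_feat = [[[1.0, 2.0], [3.0, 4.0]]]        # C=1, one 2x2 channel
vit_feat  = [[[0.5, 0.5], [0.5, 0.5]]]
mix = [[0.5, 0.5]]                            # here: simple average of the two branches
print(fusion(conv_feat, vit_feat, mix))       # [[[0.75, 1.25], [1.75, 2.25]]]
```

In the actual network the mixing weights are trained, which is what lets the fusion be "adaptive" rather than a fixed average.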

Multiple Matching Network
The visual tracking task of a siamese network tracker is represented as a similarity matching problem [32]. Almost all popular siamese trackers achieve similarity learning through cross-correlation between the target branch and the search branch, sending the response map to the classification and regression networks. However, the two networks focus on different tasks, with the classification network emphasizing the similarity between the extracted features and target category, while the regression network is more concerned with the target's position and scale for bounding box parameter adjustment. Therefore, considering the differences in content focus in classification and regression, it is necessary to use different branches for computation. In this paper, the Multiple matching network proposes a carefully designed target-matching method that fully perceives the response map information of the target. Its structure is shown in Figure 6.
Figure 5. Comparison between similarity maps before Slight-ViT (second column) and after Slight-ViT (third column). It can be seen that after using Slight-ViT, the algorithm focuses more on the target itself and is less affected by various complex scenes.

Most prevalent algorithms tend to extract multi-scale features using methods such as feature pyramids to attain higher precision while significantly escalating the overall computational cost [20,21]. To surmount this obstacle, we clip the central region of the target template as a focal point for the target. The crux of the tracking task is to differentiate between the foreground and background in the search region. We believe that incorporating the focus information of the target can enhance the network's ability to discriminate foreground information and achieve a balance between precision and efficiency.
In this network, we perform feature matching on Template, Temp_Center, and the search region separately. As the Template branch contains richer feature information, we adopt PWCorr [33] to preserve the target's boundary and scale information more effectively in the response map. To generate diverse response maps while ensuring that the network is lightweight, we use DWCorr [13] to calculate cross-correlation results channel by channel in the Temp_Center branch, which highlights the focus information of the target and achieves efficient information association under the precondition of being lightweight. We cascade the Template branch with the Temp_Center branch at the channel level and perform channel-wise down-sampling to adaptively fuse response map information, enhancing the network's reasoning ability for the foreground and background.
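As a sketch of the depth-wise cross-correlation (DWCorr) used in the Temp_Center branch, the following minimal implementation slides each template channel over its matching search channel (valid padding, stride 1). This is an illustrative re-implementation of the standard operation, not the paper's code:

```python
# Depth-wise cross-correlation: channel i of the template is used as a kernel
# over channel i of the search region, producing one response map per channel.

def dwcorr(template, search):
    """template, search: C x H x W nested lists; returns C response maps."""
    out = []
    for t, s in zip(template, search):
        kh, kw = len(t), len(t[0])
        H, W = len(s), len(s[0])
        resp = [[sum(t[i][j] * s[y + i][x + j]
                     for i in range(kh) for j in range(kw))
                 for x in range(W - kw + 1)]
                for y in range(H - kh + 1)]
        out.append(resp)
    return out

template = [[[1, 0], [0, 1]]]                  # one channel, 2x2 kernel
search = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]   # one channel, 3x3 region
print(dwcorr(template, search))                # [[[6, 8], [12, 14]]]
```

Because each kernel touches only its own channel, the cost stays linear in the channel count, which is why DWCorr suits the lightweight Temp_Center branch.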
Inspired by the Decoupled head idea [34], we employ two response maps, each dedicated to either the classification task or the regression task, to minimize the information exchange between the two tasks and make the learning of the classification and regression networks more independent. This effectively enhances the network's perception of categories and its discrimination of position and scale.

Results
In this section, we standardized the evaluation of the performance of the SiamSTM tracker by selecting three video benchmarks: UAV123 [35], UAV20L, and UAVDT [36], to conduct precision and robustness analyses. The UAV123 benchmark consists of 123 aerial videos, while the longer video sequences in the UAV20L benchmark are better suited to real-world drone scenarios. The UAVDT benchmark is more focused on complex environments. These datasets all use precision and success rate as evaluation indicators. Precision refers to the percentage of frames with a center location error (CLE) smaller than a specific threshold out of the total number of frames N_f. The threshold is usually set to 20 pixels, and the formula used is as follows:

CLE = sqrt((x_pr - x_gt)^2 + (y_pr - y_gt)^2)

In the formula, (x_gt, y_gt) and (x_pr, y_pr) represent the center coordinates of the ground truth box and the predicted box, respectively. The center location error (CLE) refers to the Euclidean distance between the two center locations. The success rate of the tracking algorithm refers to the percentage of frames where the overlap score (S) is greater than a specific threshold out of the total number of frames. The threshold is usually set to 0.5. The overlap score (S) refers to the intersection over union (IoU) ratio of the predicted box and the ground truth box:

S = |B_pr ∩ B_gt| / |B_pr ∪ B_gt|

where B_pr and B_gt denote the predicted box and the ground truth box, respectively.
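The two metrics above can be computed directly from their definitions; here is a straightforward sketch (boxes are given in (x1, y1, x2, y2) corner format, and the helper names are ours):

```python
import math

def cle(pred_center, gt_center):
    """Center location error: Euclidean distance between box centers."""
    (xp, yp), (xg, yg) = pred_center, gt_center
    return math.hypot(xp - xg, yp - yg)

def precision(cles, threshold=20.0):
    """Fraction of frames whose CLE falls below the threshold (usually 20 px)."""
    return sum(c < threshold for c in cles) / len(cles)

def iou(box_a, box_b):
    """Overlap score S: intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

print(cle((13, 14), (10, 10)))                    # 5.0
print(precision([5.0, 30.0, 10.0, 15.0]))         # 0.75
print(round(iou((0, 0, 2, 2), (1, 1, 3, 3)), 4))  # 0.1429
```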

Implementation Details
The algorithm in this article uses Python 3.7 and PyTorch 1.10 as the training environment, and the experimental equipment is equipped with an Intel i7-12700 CPU, an NVIDIA RTX 3090 GPU, and 32 GB of RAM. The GOT-10K [37], COCO [38], and VID datasets were used as the training sets to train the network. During the training process, stochastic batch gradient descent was used as the optimization strategy with a batch size of 64. Weight decay was utilized to control the learning rate. Warm-up training was conducted for the first five iterations, during which the initial learning rate was set to 0.001 and increased by 0.001 for each iteration. After the warm-up phase, a learning rate gradient descent was used to continue training the network. The entire training process included 30 iterations and lasted a total of 6 h.
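The learning-rate schedule described above can be sketched as follows; the post-warm-up decay factor of 0.9 is illustrative only, since the text states merely that the learning rate descends after warm-up:

```python
def lr_schedule(epoch, warmup_epochs=5, base_lr=0.001, step=0.001, decay=0.9):
    """Warm-up: lr starts at base_lr and grows by `step` for each of the first
    `warmup_epochs` iterations; afterwards it decays geometrically (the 0.9
    factor is an assumption for illustration, not from the paper)."""
    if epoch < warmup_epochs:
        return base_lr + step * epoch
    peak = base_lr + step * (warmup_epochs - 1)
    return peak * decay ** (epoch - warmup_epochs + 1)

# Warm-up ramps 0.001 -> 0.005 over five iterations, then the rate decays.
print([lr_schedule(e) for e in range(8)])
```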

Experiments on the UAV123 Benchmark
The UAV123 dataset is currently the largest and most comprehensive drone-to-ground tracking dataset. The videos captured by drones in UAV123 have diverse scenes and targets, and the targets have various actions in the scene. The UAV123 benchmark includes twelve challenging attributes including Viewpoint Change (VC), Partial Occlusion (PO), Full Occlusion (FO), and Fast Motion (FM). In this article, our algorithm was compared with state-of-the-art algorithms, including SiamRPN++, SiamRPN, SiamAPN++, SiamAPN, SiamDW [39], as well as SiamFC and Staple [40] provided by the toolboxes.
Overall Evaluation: The success rate and precision curves are depicted in Figure 7. Our SiamSTM algorithm achieves a success rate of 0.618 and a precision rate of 0.809, demonstrating noticeable improvements over the benchmark algorithm in terms of both success rate and precision. It is noteworthy that our algorithm outperforms the tracking algorithm SiamRPN++, which employs a heavyweight ResNet-50 network for feature extraction, despite adopting a lightweight feature extraction network.
Attribute-Based Evaluation: We have selected challenging attributes that are frequently encountered and difficult to overcome in unmanned aerial vehicle tracking, including similar object (SO), viewpoint change (VC), scale variation (SV), partial occlusion (PO), full occlusion (FO), and fast motion (FM). As shown in Figures 8 and 9, our tracker ranks first on all of these challenging attributes and exhibits significant improvements compared to the benchmark algorithm. In summary, our algorithm can significantly enhance the accuracy and robustness of UAV tracking.
A ribute-Based Evaluation: We have selected challenging a ributes that are frequently encountered and difficult to overcome in unmanned aerial vehicle tracking, including similar object (SO), viewpoint change (VC), scale variation (SV), partial occlusion (PO), full occlusion (FO), and fast motion (FM). As shown in Figures 8 and 9, our tracker ranks first on all of these challenging a ributes and exhibits significant improvements compared to the benchmark algorithm. In summary, our algorithm can significantly enhance the accuracy and robustness of UAV tracking.   Compared to baseline, it can be seen that SiamSTM has significantly improved its success rate and precision in scenarios with significant changes in the appearance or background interference of targets such as similar object, partial occlusion, full occlusion, and scale variation. This demonstrates that the Slight-ViT and Multiple matching networks can enhance feature saliency, effectively perceive target information, and remove background noise. Furthermore, our algorithm offers great improvements in precision as illustrated by precision plots. This indicates that our algorithm can more precisely eliminate the background and locate the center of the target. This is consistent with the experimental conclusion of the heat map reported previously [31].
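The overall scores above follow the standard one-pass evaluation protocol used by these benchmarks: precision is the fraction of frames whose center location error (CLE) is within 20 pixels, and the success rate is summarized by the area under the IoU success curve. The following is a minimal sketch of these two metrics; the function names and the (x, y, w, h) box layout are our own illustrative choices, not taken from the benchmark toolkits.

```python
import numpy as np

def iou(pred, gt):
    """IoU of paired boxes in (x, y, w, h) format, shape (N, 4)."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / np.maximum(union, 1e-12)

def precision_at(pred, gt, thresh=20.0):
    """Fraction of frames whose center location error is within `thresh` pixels."""
    pred_center = pred[:, :2] + pred[:, 2:] / 2
    gt_center = gt[:, :2] + gt[:, 2:] / 2
    cle = np.linalg.norm(pred_center - gt_center, axis=1)
    return float((cle <= thresh).mean())

def success_auc(pred, gt, n_thresholds=21):
    """Area under the success curve: mean success rate over IoU thresholds in [0, 1]."""
    overlaps = iou(pred, gt)
    thresholds = np.linspace(0, 1, n_thresholds)
    return float(np.mean([(overlaps > t).mean() for t in thresholds]))
```

With a perfect prediction, precision is 1.0 and the success AUC approaches 1 (the strict `>` comparison excludes the threshold 1.0 itself).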
Speed Evaluation: Our method aims to enhance the robustness of tracking algorithms to cope with the complex scenarios of UAV tracking. Therefore, to comprehensively compare the efficiency and performance of our algorithm against other state-of-the-art (SOTA) algorithms in UAV tracking, we conducted further comparisons on the UAV123 benchmark. As shown in Table 1, SiamSTM exhibits improved performance compared to other SOTA trackers and achieves the best computational speed on both the GPU and AGX Xavier platforms.
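Throughput figures of this kind are commonly obtained with a simple timing loop over a video. The sketch below is illustrative only: `tracker_update` is a hypothetical stand-in for one frame of tracker inference, and a few warm-up frames are excluded so that one-off initialization cost does not skew the FPS estimate.

```python
import time

def measure_fps(tracker_update, frames, warmup=10):
    """Estimate tracker throughput in frames per second.

    `tracker_update` is called once per frame; the first `warmup` frames
    are run but not timed, to exclude initialization overhead.
    """
    for frame in frames[:warmup]:
        tracker_update(frame)
    start = time.perf_counter()
    for frame in frames[warmup:]:
        tracker_update(frame)
    elapsed = time.perf_counter() - start
    return (len(frames) - warmup) / elapsed
```

A real measurement on an embedded board such as the AGX Xavier would also fix clock frequencies and synchronize the GPU before reading the timer; this sketch only shows the basic loop structure.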

Experiments on the UAV20L Benchmark
The UAV20L benchmark comprises 20 long sequences, with an average of nearly 3000 frames per sequence. As the frame spacing increases, the position changes of objects between frames become increasingly large and irregular, making target tracking more challenging. Therefore, long-term tracking performance more intuitively reflects how tracking algorithms behave in actual unmanned aerial vehicle ground tracking scenarios, and the results of the UAV20L benchmark are usually used to demonstrate the long-term tracking ability of a tracker. We conducted comparative experiments between our algorithm, the benchmark algorithm, and the following SOTA algorithms with publicly available results: SiamRPN++, SiamFC++ [42], SiamAPN++, SiamAPN, SESiamFC, and DaSiamRPN [43].
Overall Evaluation: The success rate and precision rate curves are depicted in Figure 10. The success rate of our algorithm, SiamSTM, is 0.580, and the precision rate is 0.742. Compared to the baseline algorithm, SiamSTM exhibits a significant improvement in both the success rate (4%) and the precision rate (3.1%). It is worth noting that our algorithm achieves higher precision than SiamRPN++ and SiamFC++, which employ large-scale networks for feature extraction; this indicates that SiamSTM can effectively improve tracking performance while maintaining computational speed.
Figure 10. UAV20L comparison chart. Each evaluation index is the same as in the previous figure; the results illustrate that our method achieves superior performance against other SOTA trackers.
Attribute-Based Evaluation: To further demonstrate the extent of the improvements achieved by our algorithm, a radar chart was used to intuitively display the precision of various algorithms under multiple challenging attributes, with the precision of SiamSTM represented by the enclosed numbers. As shown in Figure 11, our approach not only performs admirably in coping with changes in target appearance (VC, CM) but also exhibits exceptional performance in preventing model degradation (such as when the target disappears from view or is occluded). These challenges are commonplace in real-world UAV tracking scenarios, which fully demonstrates the effectiveness of our algorithm in long-term UAV tracking scenarios.
Figure 11. Success rates of different attributes on the UAV20L benchmark. It can be seen that our method outperforms other excellent algorithms in these complex scenarios.
It can be seen that the algorithm in this work has achieved improvements in meeting two types of target appearance challenges, namely, changes in perspective and occlusion. This indicates that the Slight-ViT and Multiple matching network proposed here can reduce the impact of environmental noise and obtain more precise target features.

Experiments on the UAVDT Benchmark
The UAVDT benchmark concentrates on complex environmental conditions such as weather, altitude, camera views, vehicle types, and obstructions. The benchmark comprises 50 sequences and 37,084 frames. In Table 2, we present the overall precision and success rate and analyze four common attributes of UAV tracking challenges, namely, object blurring (OB), large occlusion (LO), scale variations (SV), and camera motion (CM).
Table 2. Attribute-based evaluation of the SiamSTM and 7 SOTA trackers on the UAVDT benchmark. The best two performances are, respectively, highlighted in red and blue.

The novel Slight-ViT and Multiple matching network used in the development of SiamSTM achieved performance improvements in all attributes, particularly in object blurring, where the success rate and precision improved by 4.5% and 3.3%, respectively. In the large occlusion scenarios often encountered by unmanned aerial vehicles, our algorithm made significant progress, with a 4.2% and 7.8% increase in the success rate and precision, respectively, compared to the baseline algorithm.
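Attribute-based numbers of the kind shown in Table 2 are typically obtained by averaging a per-sequence metric over the subset of sequences annotated with each attribute. The sketch below illustrates that aggregation; the data layout and function name are assumptions for illustration, not the UAVDT toolkit's actual API.

```python
from collections import defaultdict

def attribute_scores(per_sequence_scores, sequence_attributes):
    """Average a per-sequence metric over every sequence tagged with each attribute.

    per_sequence_scores: {sequence_name: metric_value}
    sequence_attributes: {sequence_name: iterable of attribute tags,
                          e.g. "OB", "LO", "SV", "CM"}
    """
    buckets = defaultdict(list)
    for seq, score in per_sequence_scores.items():
        for attr in sequence_attributes.get(seq, ()):
            buckets[attr].append(score)
    # Mean score per attribute over the sequences that carry it.
    return {attr: sum(vals) / len(vals) for attr, vals in buckets.items()}
```

For example, a sequence tagged with both OB and SV contributes its score to both attribute averages.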
Nevertheless, our algorithm still has limitations. In scale variation scenarios, although the precision increased by 5.9% compared to the baseline algorithm, it is slightly lower than that of the UpdateNet algorithm, because the estimation of the target's center point is affected by the long-term accumulation of changes in the target's appearance.

Ablation Study
To validate the effectiveness of the Slight-ViT and Multiple matching network structures proposed in this paper, Table 3 presents the success rate and precision of SiamSTM and the baseline, with different components, on four challenges of the long-term UAV20L benchmark. Analyzing the experimental data in Table 3, it can be found that Slight-ViT enhances feature saliency, precision, success rate, and the ability to handle complex scenarios compared to the baseline algorithm. This is because Slight-ViT provides the proposed algorithm with global spatial contextual information and highlights the saliency of the target's features; it therefore possesses strong robustness when dealing with changes in objects' spatial information and appearance. Furthermore, with the addition of the Multiple matching network, the information in the response map is fully utilized, highlighting the center position of the target and further improving the precision and robustness of SiamSTM.

Qualitative Evaluation
In order to visually demonstrate the performance of our proposed tracker in handling complex environments, we conducted a qualitative analysis experiment on the UAV123 benchmark by comparing our algorithm to the baseline algorithm.
Focusing on Bike1: The primary challenges of the Bike1 subset are camera motion, scale variation, and similar object. Near frame 110, due to the rapid changes in target scale and appearance, both the baseline tracker and our algorithm experienced scale confusion. In frame 851, the benchmark algorithm exhibits tracking drift due to interference from similar targets. Despite some errors caused by the complex scenario, SiamSTM was able to redetermine the target position and achieve impressive long-term tracking performance.
Using Bike3: As shown in Figure 12b, in the Bike3 subset the target is small, making it difficult to obtain feature information, and it may be partially or even fully occluded. When the target begins to be occluded in frame 106, all algorithms struggle to locate it. When the target is completely occluded in frame 121, the baseline shows tracking drift. However, the algorithm presented in this paper can still track the target correctly after it reappears, and the CLE throughout the entire process meets the requirements.

Examining Car4: As shown in Figure 12c, the main difficulty of the Car4 video sequence is that the car is fully occluded and constantly moving. In frames 370 to 415, when the vehicle is completely obscured by an obstruction, the baseline algorithm encounters tracking failure, while our proposed method still achieves stable tracking of the target despite slight fluctuations. Frame 1261 shows that, although there are similar vehicles around, the proposed algorithm can still continuously and stably track the target of interest.
The qualitative experiments described above showed that Slight-ViT yields clear and accurate target features and improves similarity extraction, while the Multiple matching network reduces interference from background information. SiamSTM can thus effectively handle occlusion and interference from similar objects, thereby precisely and reliably obtaining the target's position and size. This demonstrates the effectiveness of our algorithm in dealing with complex scenarios.

Conclusions
In order to meet the performance and feasibility requirements of real-time UAV tracking, this paper proposes SiamSTM, a novel siamese neural network tracker for real-time UAV tracking based on a lightweight Slight Aware Enhancement Transformer and Multiple matching networks. To enhance the discriminative ability and robustness of the feature extraction network, Slight-ViT is proposed. Furthermore, by fully perceiving the response map information through the Multiple matching networks, robust and clear feature information as well as position and scale information can be obtained. Finally, performance comparisons on multiple challenging UAV benchmarks show that the proposed method significantly improves robustness and precision, and the utility of the tracker is validated through speed tests. Therefore, we believe that our work can promote the development of UAV tracking-related applications.