Aerial Video Trackers Review

Target tracking technology based on aerial videos is widely used in many fields; however, this technology faces challenges such as image jitter, target blur, high data dimensionality, and large changes in target scale. In this paper, the research status of aerial video tracking and the characteristics, background complexity and tracking diversity of aerial video targets are summarized. Based on these findings, the key technologies related to tracking are elaborated according to the target type, the number of targets and the applicable scene. Tracking algorithms are classified according to the type of target, and target tracking algorithms based on deep learning are classified according to the network structure. Commonly used aerial photography datasets are described, and the accuracies of commonly used target tracking methods are evaluated on an aerial photography dataset, UAV123, and a long-video dataset, UAV20L. Potential problems are discussed, and possible future research directions and corresponding development trends in this field are analyzed and summarized.


Introduction
Visual target tracking is an important topic in the field of computer vision. Its purpose is to accurately locate, identify and track a target after obtaining continuous images through a collector. A review of research progress and visualization achievements worldwide reveals that visual target-tracking technology offers unique social application value in terms of convenience, efficiency, safety, reliability, cost performance and low energy consumption [1] in the fields of medical diagnosis, human-computer interaction, public safety [2], video surveillance and posture estimation [3].
However, there are some differences between aerial target tracking technology and standard ground target tracking technology. Differences among aerial photography instruments, environments and target states lead to high information content, multiple heterogeneity and high dimensionality in aerial photography images and videos. Available image processing algorithms such as image denoising [4], image enhancement [5] and image mosaicking [6] can satisfy the real-time processing requirements of aerial image target recognition, but difficult problems and challenges remain in the realization of target tracking, including the following.

Target Specificity
Aerial photography instruments have light sensitivities that differ among targets and are limited by their own flight height. In aerial photography images, there are often targets that are visible to the naked eye but have a small pixel size, and image objects that are blurred or resemble the actual background color and texture [7,8]. This study classifies aerial photography targets into the following six types:
1. Dim small targets: Targets whose imaging size is relatively small due to the shooting angle and shooting distance; specifically, targets whose imaging size is less than 0.12% of the total number of pixels [9].
2. Weakly fuzzy targets: Targets whose image is blurred due to the exposure time or flight jitter.
3. Weak-contrast targets: In a recognition environment with noise clutter and a low signal-to-noise ratio (SNR), the recognition target and moving background are similar in terms of color and texture features. Hence, the contrast between the recognition target and the background is low, and the texture features are not readily identified, but no target category is missing.
4. Occluded targets: Targets that are temporarily occluded by the complex environmental background or are hidden for a long time during aerial photography tracking.
5. Fast-moving targets: Targets that exhibit dodging, fleeing and fast movement, including image debris caused by shaking of the UAV fuselage, obstacle avoidance and the influence of wind speed.
6. Common targets: Targets with normal behavior and clear images.
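As a concrete illustration of the first category, the 0.12% pixel-area criterion cited from [9] can be checked directly from a target's bounding box. The function name and the default threshold constant below are our own illustrative choices, not notation from the source:

```python
def is_dim_small_target(bbox_w, bbox_h, frame_w, frame_h, threshold=0.0012):
    """Return True when the target's bounding-box area is below 0.12% of
    the frame's pixel count, the dim-small-target criterion cited from [9]."""
    return (bbox_w * bbox_h) / (frame_w * frame_h) < threshold

# A 30x25-pixel box in a 1920x1080 frame covers about 0.036% of the pixels,
# so it qualifies as a dim small target under this criterion.
```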

Background Complexity
Aerial photography scenes can be roughly divided into three types: urban architectural landscapes (e.g., urban roads, urban buildings, and large-scale event sites), suburban open areas (plains, grasslands, and open areas in urban suburbs), and complex, harsh environments (deserts, mountains, gullies and natural disaster sites). Because the environment is diverse and the pixel footprint of aerial photography targets is relatively small, the texture, spatial and color features of the background vary substantially, which causes strong interference with aerial photography targets, especially under complex environmental changes, sudden unknown static or mobile threats to the aerial photography equipment, and other aerial photography challenges. This paper summarizes methods for overcoming target occlusion caused by the high resolution-to-target pixel ratio and high feature dimensionality in aerial photography.

Tracking Diversity
Aerial image acquisition equipment results in a variety of data forms, which include ordinary red, green, blue (RGB) color images (visible light images), infrared thermal images (gray images), GPS navigation information and acquisition equipment number information. Therefore, by combining various data features, the identification and tracking of occluded targets and weak targets can be realized. By using a single-UAV working mode or multi-UAV collaborative tracking mode, the number of available target features (spatial three-dimensional and multiangle features) can be increased to increase the tracking accuracy and tracking success rate. However, problems such as collaborative path planning, data normalization and image edge calculation are encountered.
According to the characteristics of aerial video shooting targets, this study classifies and compares target-tracking methods and identifies the characteristics and usage scenarios of each. The main contributions of this paper can be described as follows.

•
We conduct a comprehensive benchmark test of aerial video trackers based on handcrafted features and deep learning.

•
We take the target scale and definition as the classification criteria and conduct a complete comparative analysis of the three tracking schemes.

•
We benchmark 20 trackers based on handcrafted features, deep features, Siamese networks and attention mechanisms.

•
We compare the performance of the tracker in various challenging environments, so that relevant researchers can better understand the research progress on aerial video tracking.
The remainder of this paper is organized as follows. In Section 1, we explain the definition of aerial video target tracking from three perspectives: the target type, the shooting background and the tracking method. In Section 2, we compare the relevant datasets that can be used for aerial target tracking. In Section 3, relevant tracking methods are introduced from three aspects: ordinary targets, weak targets and moving targets. In Section 4, we investigate and compare the structures of neural network trackers. In Section 5, we show the evaluation results of different trackers under the UAV123 and UAV20L standards through experimental comparison and discuss the differences between trackers and the potential problems of aerial target tracking. In Section 6, we discuss future research directions for aerial target tracking.

Aerial Video Datasets
Due to differences in the sensors of aerial photography equipment, parameters may vary among datasets [10]. A single-frame image in a dataset may contain multiple targets, but the targets do not appear at a stable frequency, and the target position and attitude change with the shooting angle. Therefore, although various traditional aerial photography datasets reflect the application requirements of the real world, their applicability is typically limited.
Aerial photography data are typically acquired by low-altitude drones. In Table 1, the number of videos is the number of videos in the dataset; the shortest video frames value is the number of frames in the video sequence with the fewest frames; the longest video frames value is the number of frames in the video sequence with the most frames; the total video frames value is the sum of the frame counts of all the video sequences; and the average video frames value is the total number of frames divided by the number of videos. OTB and VOT are common target datasets that are suitable for short-term tracking. The LaSOT dataset, consisting of 3.52 million manually annotated images and 1400 videos, focuses on long-term tracking and is by far the largest densely annotated target dataset. However, these datasets contain substantial amounts of nonaerial target information and are not suitable for aerial target tracking. UAV123, ALOV300++ and Temple Color 128 are excellent specialized aerial photography datasets with rich target types. Among them, objects such as dancers, completely transparent glass, octopuses, birds and camouflaged soldiers exhibit occlusion, complete occlusion and sudden movement, which are more in line with practical scenarios. UAV123 has a wide variety of scenes, including urban landscapes, roads, buildings, sites, beaches and ports. The targets include cars, trucks, ships, people, groups and air vehicles, and the activities include walking, cycling, water skiing, driving and swimming. Long-term complete and partial occlusions of the target, scale changes, light changes, view changes, background clutter, camera motion and other effects are labeled. UAV123 has recently become increasingly popular due to its practical applications, such as navigation, wildlife surveillance and crowd surveillance.
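The per-dataset frame statistics described above reduce to a few aggregates over the per-video frame counts. The sketch below (the function name is ours) mirrors the columns described for Table 1:

```python
def dataset_statistics(frame_counts):
    """Summarize a tracking dataset from its per-video frame counts,
    mirroring the columns described for Table 1: shortest, longest,
    total and average video frames."""
    total = sum(frame_counts)
    return {
        "number_of_videos": len(frame_counts),
        "shortest_video_frames": min(frame_counts),
        "longest_video_frames": max(frame_counts),
        "total_video_frames": total,
        "average_video_frames": total / len(frame_counts),
    }
```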

Traditional Target Tracking Algorithm
Combining UAVs with infrared equipment can solve the tracking problem for weak and hidden targets [21]. However, due to the high dimensionality of the data features, this approach is not suitable for the tracking analysis of fast-moving targets and exhibits low real-time performance. Many challenges remain in real-time aerial photography tracking. In addition, target loss caused by target deformation and scale variation is an urgent problem. This section summarizes the problem in terms of the target categories given in the problem definition; weak targets are defined in Section 1.

Common Targets
The traditional template-matching target tracking strategy constructs a tracker based on sparse representation. The best candidate box can be identified through template matching, but the background and target cannot be distinguished well. Reference [22] proposes the adaptive structural local sparse appearance (ASLA) algorithm, which increases the tracking accuracy and reduces the influence of occlusion by applying an alignment pooling operation on the sparse codes. Incremental subspace learning and sparse representation are then adopted in the update module to address drift and partial occlusion.
Some target trackers realize satisfactory short-term tracking performance, whereas others realize satisfactory long-term tracking performance. In Reference [23], the MUlti-Store Tracker (MUSTer) algorithm combines these two types of trackers. For short-term tracking, a powerful integrated correlation filter (ICF) method provides short-term storage. The use of key-point matching tracking and random sample consensus [24] estimation in the integrated long-term modules enables the integration of long-term memory and provides additional information for output control.
To overcome the high dimensionality of the data features, Reference [25] utilized the principal component analysis and scale-invariant feature transform (PCA-SIFT) algorithm, which improved SIFT and introduced PCA to reduce the dimensionality of aerial target features. Due to the loss of information in dimensionality reduction, this method is suitable only for processing clear aerial video images of targets. To overcome background interference and background shadows, Reference [26] uses the appearance of the target and the background environment to build a tracker from two angles. The tracker is robust to changes in the appearance of the target during tracking. First, background patch information and foreground patch information are obtained, and multiangle information is associated through camera calibration. An adaptive model update strategy based on the response distribution and prior tracking results is used to reduce the possibility of model drift and enhance tracking stability. Reference [27] designed a robust tracker based on a key-patch sparse representation and designed patches for the occluded part. First, using patch sparsity, patches are obtained from known images and scored. Second, key patches are selected according to the position and occlusion scenario, and corresponding contribution factors are designed for the sampled patches to emphasize the contributions of the selected key patches. This method increases the accuracy of tracking partially occluded targets.
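The dimensionality-reduction step at the heart of PCA-SIFT can be sketched as a plain PCA projection of descriptor vectors. This is a minimal illustration of the idea, not the tuned pipeline of Reference [25]; the component count and random data are arbitrary:

```python
import numpy as np

def pca_reduce(descriptors, n_components=36):
    """Project descriptors (e.g., 128-D SIFT vectors) onto their top
    principal components, illustrating the PCA step of PCA-SIFT."""
    X = descriptors - descriptors.mean(axis=0)      # center the data
    # The right singular vectors of the centered data are the
    # eigenvectors of the covariance matrix, ordered by variance.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:n_components].T                  # reduced features

rng = np.random.default_rng(0)
descriptors = rng.normal(size=(500, 128))           # 500 mock SIFT vectors
reduced = pca_reduce(descriptors, n_components=36)  # 128-D -> 36-D
```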

Weak Targets
In weak target tracking, two main challenges are encountered. First, the distance between the aerial photography equipment and the tracking target is relatively large, and the target occupies a relatively low percentage of pixels on the imaging plane and is vulnerable to interference by various types of noise clutter, thereby resulting in a missing target or target loss. Second, environmental factors (complex background, wind speed, and equipment jitter) lead to target blur and target loss. In this paper, weak targets, weakly contrasted targets and weak blurred targets are discussed and analyzed.

Dim Small Targets
To reduce the omission rate of dim small targets and increase the tracking accuracy, the relative local contrast measure (RLCM) multiscale detection algorithm was used in Reference [28].
The algorithm calculates the multiscale RLCM for each pixel of the original infrared image to enhance the real target and suppress all types of interference (such as high-brightness backgrounds, complex background edges and pixel-sized noise with high brightness). An adaptive threshold is then used to extract the real target. Formulas (1)-(3) calculate the RLCM of the center pixel of the center cell at each location.
where I_mean^0 / I_mean^i can be understood as an enhancement factor for the central cell [that is, cell(0)] in the i-th direction; I_mean^0 and I_mean^i denote the average gray values of the K_1 or K_2 maximal pixels in cell(0) and cell(i), respectively; K_1 and K_2 are the numbers of maximal gray values that are considered; and G_j^0 and G_j^i are the j-th maximal gray values of cell(0) and cell(i), respectively. In Reference [29], an online multitarget tracker was designed by using high confirmations (strong detections) and low confirmations (weak detections) in the framework of the probability hypothesis density particle filter; it performed well in terms of tracking accuracy, number of missed targets and speed. The calculation flowchart is presented in Figure 1.
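A minimal sketch of these ingredients follows, assuming the common reading in which the enhancement factor I_mean^0 / I_mean^i scales the raw contrast I_mean^0 - I_mean^i and the minimum over the surrounding cells is kept so that every direction must support the target hypothesis. The exact multiscale form is given by Formulas (1)-(3) of [28]; this is only an illustration of the named quantities:

```python
import numpy as np

def cell_top_k_mean(cell, k):
    """Mean of the K largest gray values in a cell (the I_mean terms)."""
    flat = np.sort(cell.ravel())
    return flat[-k:].mean()

def rlcm(center, neighbors, k1=2, k2=4):
    """Simplified relative local contrast at one location: the
    enhancement factor I_mean^0 / I_mean^i scales the raw contrast
    I_mean^0 - I_mean^i, and the minimum over the surrounding cells
    is returned (an assumption about the aggregation, see lead-in)."""
    m0 = cell_top_k_mean(center, k1)
    contrasts = []
    for cell in neighbors:
        mi = cell_top_k_mean(cell, k2)
        contrasts.append((m0 / max(mi, 1e-6)) * (m0 - mi))
    return min(contrasts)
```

A bright 3x3 cell surrounded by dark cells yields a large positive value, while a uniform patch yields zero, which is the behavior an adaptive threshold can then exploit.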

Figure 1. Probability hypothesis density particle filter framework calculation process. At time k, strong detections Z_k^+ and weak detections Z_k^- are associated with the predicted state calculated from the predicted particles. After the early association, two detection subsets are used for tracking: the associated detections Ẑ_k inherit the identities of the corresponding trajectories and are used to track the state, while the unassociated strong detections Ž_k are used to initialize new states. After updating and resampling, particles x_k are used to estimate the state X_k.
Strong detections are used to propagate target labels and promote target initialization, whereas weak detections are used only to support label propagation. Early association (EA) is executed prior to the update phase to reduce the extensive computational cost incurred by the labeling process. The associated detections Ẑ_k inherit the corresponding identity information and are used only to track the status. After the EA phase, unassociated weak detections are discarded, while unassociated strong detections Ž_k are retained for the initialization of new particles. Each strong detection generates new particles, as expressed in Formula (4), where N(·) is a Gaussian distribution, w_k^i represents the relative weight of each new particle, and x_{k,λ}^i is the i-th particle. The new particles are modeled independently from the estimated state according to N(·) and dynamically updated based on parameters such as the detection size and video frame rate through the covariance matrix Σ. Moreover, unassociated strong detections initialize new states, as expressed in Formula (5), where |·| denotes set cardinality and Z_k represents the combined detections. Σ_k is a standard-deviation matrix that changes with time; it relates the target detection tracking box to the weights of the new particles, and its values can be learned from the training set. State evaluation is conducted as expressed in Formula (6), where each state x_{k,λ} ∈ X_k is estimated as the average of all resampled particles sharing the same identity.
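The birth and estimation steps of Formulas (4) and (6) can be sketched as follows. The covariance value and particle count are illustrative assumptions, not the learned parameters of [29]:

```python
import numpy as np

rng = np.random.default_rng(42)

def spawn_particles(strong_detection, cov, n=100):
    """Draw new particles around an unassociated strong detection,
    following the Gaussian birth model N(z, Sigma) of Formula (4)."""
    return rng.multivariate_normal(strong_detection, cov, size=n)

def estimate_state(particles):
    """Formula (6): a target's state is the mean of all resampled
    particles that share its identity."""
    return particles.mean(axis=0)

z = np.array([10.0, 20.0])                       # strong detection (x, y)
particles = spawn_particles(z, np.eye(2) * 0.01, n=2000)
state = estimate_state(particles)                # should lie close to z
```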
Reference [30] realizes feature binding of the target's grayscale and spatial relations via compressed sensing, thereby constructing a Gaussian target model to overcome the high similarity between the small target and the background noise. Reference [31] combines particle swarm optimization (PSO) with a particle filter to optimize the sampling process of the particle filter and thereby overcome the scarcity of small-target features. In addition, the algorithm introduces a local PSO reset method to overcome the particle-collapse problem in particle filters for multitarget detection and tracking.

Weak Blurred Targets
Infrared detection systems are typically used to find and track weak blurred targets. Reference [32] applied the Wiener filter to the processing of the original infrared image. First, motion blur is processed and noise interference is suppressed; the gradient method is then used to sharpen the processed image and enhance the target edges. This method can substantially reduce motion blur, increase image quality and enhance the performance of the detection system. Reference [33] constructed a nonlinear blur kernel with multiple moving components. A blind deconvolution technique using a piecewise linear model was introduced to estimate the unknown kernels. This method is combined with noise reduction technology based on wavelet multiframe decomposition and the peak signal-to-noise ratio (PSNR). The algorithm is highly effective in accurately identifying various blur kernels and provides important research strategies for image deblurring. Reference [34] proposes a new motion-blur computing method for ray tracing. This method provides analysis data on the blurred visibility of each ray's motion and considers the time dimension. The algorithm can use any standard ray tracing acceleration structure without modification. Reference [35] proposes a frame-by-frame intermittent tracking method driven by an actuator, which is used for motion-blur-free video shooting of fast-moving objects. By controlling the frame and shutter timing of the camera and synchronizing the vibration with a free-vibration-type actuator, motion blur can be reduced in free-view high-frame-rate video shooting.
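The Wiener-filter-plus-gradient-sharpening pipeline of Reference [32] can be sketched as below. This is a simplified local (Lee-style) Wiener filter followed by a Laplacian sharpening step, a stand-in for the authors' implementation rather than a reproduction of it; the window size and sharpening strength are illustrative:

```python
import numpy as np

def box_mean(img, r=1):
    """Local mean over a (2r+1)x(2r+1) window via shifted sums
    (edges wrap around, which is acceptable for a sketch)."""
    acc = np.zeros_like(img, dtype=float)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            acc += np.roll(np.roll(img, dy, 0), dx, 1)
    return acc / (2 * r + 1) ** 2

def wiener_filter(img, r=1):
    """Minimal local Wiener filter: pull pixels toward the local mean
    wherever the local variance falls to the estimated noise floor."""
    m = box_mean(img, r)
    v = box_mean(img ** 2, r) - m ** 2
    noise = v.mean()                      # noise power estimated globally
    gain = np.clip(v - noise, 0, None) / np.maximum(v, noise)
    return m + gain * (img - m)

def sharpen(img, strength=0.5):
    """Gradient-based edge enhancement via a 4-neighbour Laplacian."""
    lap = (np.roll(img, 1, 0) + np.roll(img, -1, 0)
           + np.roll(img, 1, 1) + np.roll(img, -1, 1) - 4 * img)
    return img - strength * lap
```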

Weak-Contrast Targets
For the recognition and tracking of weak-contrast targets, most algorithms require prior information about the targets; otherwise, they are affected by heavy noise clutter [36]. Reference [37] proposed a new method based on image fusion and mathematical morphology: based on a steerable pyramid decomposition, the original images are fused, and target tracking on the fused image is realized via mathematical morphology. Reference [38] conducted an in-depth analysis of the background, weak-target and motion characteristics and proposed a moving average method. Based on foreground extraction, the difference between adjacent frames, which is related to the continuity of a moving target, is calculated to eliminate interference points and reduce the false alarm rate. The track-before-detect method proposed in Reference [39] operates directly on the original sensor signal without a separate explicit detection stage: the probability density function of the target state is generated at the raw pixel level, a probability indicator of target presence is calculated, and a Bayesian particle filter completes the target tracking. Reference [40] proposed a feedback neural network for weak-contrast target motion tracking against a natural cluttered background. To form a feedback loop, the model delays the output in time and forwards the feedback signal to the previous neural layer.
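The moving-average idea of Reference [38] can be sketched as follows: a running average models the background, and differences that persist across adjacent frames are kept while one-off interference points are discarded. The update rate and threshold rule below are illustrative assumptions:

```python
import numpy as np

def frame_difference_mask(frames, alpha=0.05, k=3.0):
    """Running-average background model plus adjacent-frame differencing.
    Pixels whose difference stays above an adaptive threshold in every
    frame are kept; isolated noise points fail to persist and drop out."""
    background = frames[0].astype(float)
    persistent = None
    for frame in frames[1:]:
        diff = np.abs(frame - background)
        mask = diff > (diff.mean() + k * diff.std() + 1e-9)
        # Interference points vanish between frames; real motion persists.
        persistent = mask if persistent is None else (persistent & mask)
        background = (1 - alpha) * background + alpha * frame
    return persistent
```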

Occluded Targets and Fast-Moving Targets
During UAV dynamic tracking, especially when fast movement occurs [41] and relabeling is necessary after the target is lost for a short time [42], the typical method determines the target area continuously through the video sequence [43]. Scholars worldwide have also proposed the correlation filter tracking algorithm [44] and the circulant structure of tracking-by-detection with kernels (CSK) algorithm [45]. Their tracking efficiency is high, but their tracking performance for multiscale targets is poor, and it is difficult to resume tracking of a missing target. To overcome this problem, Reference [46] improved upon kernelized correlation filters (KCF) [47] with the scale-adaptive multifeature fusion (SAMF) algorithm. A multifeature fusion method (grayscale, histograms of oriented gradients (HOG), and color names (CN)) was used to realize feature complementation, and a multiscale search strategy was used to realize scale-adaptive tracking and increase the tracking accuracy. However, because the algorithm must conduct seven scale-detection calculations, its speed is much lower than that of KCF. Reference [48] combines the filter with context-aware information [49] and uses an intermittent learning method to enhance the network's context awareness and thus its modeling performance for occluded objects. In Reference [49], the frame with the best tracking result was used as the key frame in follow-up tracking, which optimizes the quality of the training set and reduces the computational cost, thereby overcoming the poor robustness of filter methods in complex scenes.
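The multiscale search strategy attributed to SAMF above amounts to scoring a small pool of candidate scales with the filter response and keeping the best one. In the sketch below, `response_fn` stands in for the KCF-style correlation-filter evaluation, and the three-scale pool is an illustrative assumption (SAMF itself evaluates seven scales):

```python
import numpy as np

def multiscale_search(response_fn, base_size, scales=(0.985, 1.0, 1.015)):
    """Score each candidate scale with the filter response and return
    the scale whose peak response is highest, giving scale adaptivity
    at the cost of one extra evaluation per candidate scale."""
    best_scale, best_score = None, -np.inf
    for s in scales:
        w, h = base_size[0] * s, base_size[1] * s
        score = response_fn(w, h)
        if score > best_score:
            best_scale, best_score = s, score
    return best_scale, best_score
```

This is why SAMF trades speed for accuracy: every added scale multiplies the per-frame filter evaluations.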
Reference [50] used vector field guidance for multitarget tracking in aerial videos. By improving the vector field guidance method of a single UAV and defining a variable confrontation tracking track, the cooperative confrontation tracking of a UAV group on a moving target group addresses the limited visual range of a UAV when tracking multiple ground targets and is suitable for processing aerial video images of fast-moving targets. To solve the problem of visual control in visible-light aerial target tracking, Reference [51] adopted a vision-based ground target tracking control strategy to realize real-time tracking of aerial photography targets. To solve the regional cooperative search problem for multiple UAVs, Reference [52] described the changes in the environment and target state during the search process based on a search information graph model and established a motion model for the dynamic analysis of UAVs to ensure the accuracy of model prediction, thereby realizing accurate tracking of complex targets with motion trajectories. To address the abnormal filter response caused by background interference in aerial video, a clipping matrix and a regularization term were introduced in Reference [53] to expand the search area and suppress the distortion. The spatially regularized correlation filter (SRDCF) algorithm, proposed in Reference [54], adds spatial penalty terms to discriminative correlation filters (DCF) to mitigate the boundary effect and realizes superior performance under large-scale movement and in complex scenes. However, the need to reuse multiframe information in the tracking process creates a computational cost problem.
The spatial-temporal regularized correlation filter (STRCF), proposed in Reference [55], adds spatial and temporal regularization terms to address the problems encountered with SRDCF, and tracking requires only the information of the previous frame, which ensures time efficiency. Most available filter algorithms attempt to introduce a predefined regularization to improve the learning of the target object, but they struggle to adapt to special scenarios in practice. To overcome this problem, Reference [56] proposed an online adaptive spatiotemporal regularization learning method. By introducing spatially local change information into the spatial regularization, the DCF can focus on the trusted part of the target object. The algorithm realizes satisfactory tracking performance on four aviation datasets. Reference [57] evaluated the target state by establishing an unscented Kalman filter based on a multi-interaction model, which reduces the evaluation error for moving targets but also increases the computational consumption.
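The interplay of the two regularizers can be made explicit. The following is a paraphrase of the STRCF objective in standard DCF notation (treat the exact weighting as indicative rather than a verbatim reproduction of [55]): x_t^d is the d-th feature channel of the training sample, f^d the corresponding filter channel, y the desired Gaussian response, w the spatial weight map, and μ the temporal weight:

```latex
\min_{f}\;
\frac{1}{2}\Bigl\| \sum_{d=1}^{D} x_{t}^{d} \ast f^{d} - y \Bigr\|_{2}^{2}
\;+\; \frac{1}{2}\sum_{d=1}^{D}\bigl\| w \odot f^{d} \bigr\|_{2}^{2}
\;+\; \frac{\mu}{2}\bigl\| f - f_{t-1} \bigr\|_{2}^{2}
```

The first term is the standard correlation-filter data fit, the second is the SRDCF-style spatial penalty, and the third anchors the new filter to the previous frame's filter f_{t-1}, which is why STRCF needs only one past frame where SRDCF reused many.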

Target Tracking Algorithm Based on a Deep Learning Network
With the development of computer vision, many visual target tracking frameworks have been proposed and applied to aerial video target tracking. This section briefly introduces tracking algorithms based on deep features, Siamese networks and attention mechanisms.

Depth Features
A deep learning network that is represented by a convolutional neural network (CNN) can automatically learn all the effective features of the target from many training sets, which not only effectively overcomes the background noise but also realizes satisfactory tracking performance [58,59].
Reference [60] designed a lightweight CNN that learns the common attributes of multidomain videos to address scenarios such as target occlusion and target deformation in practical tracking. The network uses online fine-tuning to improve the real-time performance of the tracking algorithm. Reference [61] added RoIAlign on this basis to accelerate feature extraction and classify internal targets through a multitask loss, adding discriminative parameters for targets with similar semantics. The network structure is illustrated in Figure 2. The first three convolution layers share the multidomain features learned by the network (e.g., robustness to illumination changes, motion blur, or size changes), and the adaptive RoIAlign extracts CNN features for each region of interest (RoI) to improve the feature quality and reduce the computational complexity. Layers FC4 and FC5 are mainly used to distinguish the background from the target, and the unique characteristics of each video domain are stored in the FC6 branch with a softmax cross-entropy loss. The online tracking process of the RT-MDNet algorithm is described in Algorithm 1.

Algorithm 1 Online tracking process of the RT-MDNet algorithm
Input: Pretrained RT-MDNet convolution weights w = {w_i}, where w_i is the weight of one convolution layer, and the initial target state x_1.
Output: Adjusted target state x*.
1: Randomly initialize the weights w_6 of the last domain-specific layer.
2: Train the bounding-box regression function bbox using the bounding-box regression technique.
3: for each frame i do
4: if i == 1 then
5: acquire the convolution features α(w) of the first frame;
6: else acquire the convolution features α(w_γ) of the current frame.
7: Use S_i^+ and S_i^- to update w = {w_j}.
10: Set the long-term update frame index set T_l and the short-term update frame index set T_s.
11: Draw target candidate sample states x_i.
12: Find the optimal target state x* as the candidate with the highest network evaluation score.
13: If |T_l| > t_l, then T_l = T_l \ {min_v t_v^l}, where t_v^l is the rate of change of the appearance of the long-term target.
15: If |T_s| > t_s, then T_s = T_s \ {min_v t_v^s}, where t_v^s is the rate of change of the appearance of the short-term target.
16: Use bbox to adjust the optimal target state: x_i* = bbox(x*).
17: if i % 10 == 0,
18: then use S^+_{v∈T_l} and S^-_{v∈T_s} to update w = {w_j}: w = conv(S^+_{v∈T_l}, S^-_{v∈T_s});
19: else, when the evaluation score indicates unreliable tracking,
20: use S^+_{v∈T_s} and S^-_{v∈T_s} to update w = {w_j}: w = conv(S^+_{v∈T_s}, S^-_{v∈T_s}).
21: end for
Reference [62] proposed the EArly Stopping Tracker (EAST), which converts the adaptive tracking problem into a decision-making process. The network structure is illustrated in Figure 3. The network uses offline reinforcement learning to learn an agent for each single-frame image. Based on this agent, it decides either to select a layer from a series of feature layers to realize target localization or to pass the image to the next layer for the same processing. However, this method exhibits reduced accuracy as its speed increases. Figure 3. EArly Stopping Tracker (EAST) network structure: judgment of the optimal feature layer by action.
The action selection process for the EAST network is described in Algorithm 2, where action_4 denotes four groups of actions, and action is an action(i) value.

Algorithm 2 Action selection process for the EAST network
Input: Feature map; action index eighth_actionindex{}; the action value h_l from the first four layers; action list action{action(i)} (i ∈ 1, 2, ..., 8).
Output: Current conv layer action value.
1: Calculate the average value F_l of the first l layers: F_l = Σ_{k=1}^{l} F_k / l.
2: Construct the current state of the feature map: (F_l, h_l).
3: Use vector merging to calculate the feature sequence: feature_list = F_l + action_4.
5: sam_action = sam(feature_map, eighth_actionindex).
6: if sam_action == Stop, then early stopping ("EAST") is applied and the subsequent target localization is not conducted;
7: else output the value of sam_action.
The discriminative correlation filter [63] shows substantial advantages in visual target tracking, and combining a filter tracking framework with a deep neural network effectively improves tracking performance [64,65]. Reference [66] proposed the multiple experts using entropy minimization (MEEM) algorithm within a tracking-by-detection framework to overcome the model drift caused by tracking failures or misaligned training samples. To reduce the cost of such deep correlation-filter models, the efficient convolution operators for tracking (ECO) algorithm was proposed in Reference [67]; it simplified continuous convolution operators (C-COT) [68] by modifying the number of model update frames, thereby reducing the model size, increasing the speed and reducing the risk of model overfitting. Simultaneously, according to the tracking results on the training set, components are generated with a Gaussian mixture model (GMM) to ensure the diversity of the training set. However, the deep features of the network are not sufficiently effective, and the large amount of computation reduces the tracking speed of the network. Based on ECO, Reference [69] handled its deep features and shallow features separately, which substantially increased the robustness and tracking accuracy of the network structure.
To increase the network robustness, the multicue correlation filter tracking algorithm (MCCT), proposed in Reference [70], analyzes the fusion results that are obtained from the decision layers of multiple trackers to ensure the reliability of the results. The superimposed selection of adaptive strategies successfully distinguishes unreliable samples (in which there are occlusions or deformed data) to further avoid the problem of insufficient training due to sample contamination. Reference [56] combined the output of the Conv3 layer of the VGG-M [71] network with HOG-CN to increase the robustness of the model.
To overcome the difficulty of matching the training depth feature with the actual target information, the target-aware deep tracking (TADT) method, proposed in Reference [72], uses the global average of the backpropagation gradient to complete feature screening, evaluates the importance of each filter through a regression function, and applies a weighted supplement to the deep feature.

Siamese Network
To overcome the high computational burden and low speed of earlier deep neural network methods, the Siamese network was proposed, which introduces similarity learning into the matching of the target image and the search image; it balances tracking speed against tracking accuracy and has gradually become the preferred solution to the tracking problem [73,74].
Simplifying the target tracking problem into learning a general similarity mapping is an effective solution. The Siamese instance search for tracking (SINT) algorithm, proposed in Reference [75], learns a matching function through a Siamese network: the target feature of the first frame is used as a template, the subsequently sampled features are matched against it, and the candidate with the highest score is selected as the final target. The algorithm uses a region pooling layer to accelerate the model and demonstrates the feasibility of combining a deep neural network with traditional methods. Reference [76] likewise calculated the similarity between the template and each position of the image to be tested through template matching and selected the position with the highest similarity as the final target. The discriminative subspace learning model (DSLM) network, proposed in Reference [77], solves the problems of target occlusion and background interference by learning the relationship between the target module and the characteristics of the search area. Reference [78] constructed an asymmetric Siamese network (CFNet) that not only ensures the tracking accuracy but also simplifies the network structure. In Reference [79], a DCF was used to complete the filtering, a probability heat map of the calculated result was mapped to the target position to complete online learning and tracking, and end-to-end training was realized.
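The template-matching idea shared by these trackers can be illustrated with a minimal NumPy sketch: slide the template feature over the search-region feature, score every position by inner-product similarity (the SiamFC-style cross-correlation), and pick the maximum. Real trackers do this on learned multi-channel CNN features; single-channel arrays are used here for clarity.

```python
import numpy as np

def similarity_map(template, search):
    """Score every position of `search` by inner-product similarity with
    `template` (a naive cross-correlation). template: (h, w); search: (H, W)."""
    h, w = template.shape
    H, W = search.shape
    out = np.empty((H - h + 1, W - w + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(template * search[y:y + h, x:x + w])
    return out

def locate(template, search):
    """Select the position with the highest similarity as the target."""
    score = similarity_map(template, search)
    return np.unravel_index(np.argmax(score), score.shape)
```

For example, a 2x2 all-ones template placed over a search map containing a matching 2x2 block recovers that block's top-left corner.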
These trackers simplify the target tracking problem into learning a generic similarity map by learning the correlation between the feature representations of the target module and the search area. However, they do not consider the complex and changeable target scale, appearance or pixel-level variation in the actual tracking process. In Reference [80], tracking was decomposed into two parallel, collaborative threads: fast discriminative scale space tracking (FDSST) is used for fast tracking, and a Siamese network is used for accurate verification, thereby realizing both high accuracy and high speed. The Siamese region proposal network (SiamRPN) algorithm, proposed in Reference [81], overcomes the limitation of spatial invariance of the Siamese network. It is composed of a Siamese subnetwork and a region proposal subnetwork. The network completes offline end-to-end training on large-scale image data, constructs a one-shot detection task to avoid time-consuming multiscale tests and obtains accurate candidate regions. SiamRPN increases the model accuracy and reduces the model size. DaSiamRPN, proposed in Reference [82], enriches the types of training data in the dataset via data augmentation, reduces the impact of difficult negative samples on the network training, and improves the generalization and discrimination performance of the network. The interference recognition module in the network overcomes the low recognition accuracy caused by the lack of a self-updating model.
Earlier Siamese trackers could not employ deep backbone networks because padding destroys strict translation invariance. The SiamRPN++ algorithm, proposed in Reference [83] and built on Reference [81], effectively solves this problem by modifying the sampling strategy. The network structure is illustrated in Figure 4. The method recombines the positioning features and deep semantic features obtained by ResNet and improves the feature expression performance by aggregating features layer by layer from shallow to deep; this design is similar to the traditional feature pyramid network (FPN) [84]. To compensate for the loss of translation invariance caused by padding, the model shifts the training-sample labels to alleviate the centralization bias introduced by the deep network. Figure 4. SiamRPN++ network structure. Given a target template and a search area, the dense prediction output is obtained by fusing the outputs of the SiamRPN blocks. The middle SiamRPN block is displayed on the right, which is divided into two parts: a classification branch and a bounding-box regression branch.
The SiamRPN block of the SiamRPN++ algorithm is described in Algorithm 3.

Algorithm 3 SiamRPN block
7: Remove the anchor sequences with label = -1 from A^{cls}_{w×h×2k}.
8: Use the cross-entropy function to calculate the classification loss on the results of step 7.
9: Output the bbox regression results of step 6 and the classification results of step 8.
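Steps 7 and 8 above can be sketched in NumPy: discard the anchors labelled -1 (the "ignore" label), then average the softmax cross-entropy over the remaining foreground/background anchors. This is a hedged sketch of the generic RPN classification loss, not SiamRPN++'s exact training code; the (N, 2) logit layout is an assumption for clarity.

```python
import numpy as np

def rpn_classification_loss(logits, labels):
    """Drop anchors labelled -1, then average the cross-entropy of the rest.
    logits: (N, 2) background/foreground scores; labels: (N,) in {-1, 0, 1}."""
    keep = labels != -1                       # step 7: discard ignored anchors
    logits, labels = logits[keep], labels[keep]
    # numerically stable softmax cross-entropy (step 8)
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels].mean()
```

With confidently correct logits the loss approaches zero; with uniform logits it equals log 2, the usual sanity checks for a two-class loss.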
Siam R-CNN, proposed in Reference [85], is a redetection architecture based on the trajectory dynamic programming algorithm (TDPA). Based on the Siamese framework, the self-motion and mutual motion of all potential objects are modeled, and the detected information is summarized into tracklets to complete the detection. This method is suitable for long-term tracking and is sufficient for addressing tracking failure after the target has been blocked for a long time. The Siamese box adaptive network (SiamBAN), proposed in Reference [86], simplifies the tracking problem into a problem of parallel classification and regression and directly conducts classification and regression operations on targets in a unified fully convolutional network (FCN). This avoids the computational complexity that the introduction of an RPN adds to the Siamese network and increases the network flexibility and generalization performance. The unsupervised deep tracker (UDT), proposed in Reference [87], applies unsupervised learning to target tracking, uses three consecutive frames to evaluate the prediction deviation to increase the accuracy of the tracker, and applies a sensitive loss function to allocate a weight to each sample to overcome the noise caused by the random initialization of the target box in the unsupervised training.

Attention Mechanism
Challenges remain in ensuring the real-time performance and applicability of trackers, and some available tracking algorithms cannot distinguish between the target and the background, which makes it difficult to address changes in the target shape and background in real time. An attention mechanism module within a deep learning network reinforces important features in the image, thereby helping address issues such as target tracking failure [88].
Reference [89] proposed the residual attentional Siamese network (RASNet) algorithm, which reconstructs the filtering mode of the CNN-based Siamese network and thereby effectively avoids the overfitting problem. The algorithm separates representation learning from discriminative learning, enhances the discrimination performance and adaptability of the tracker, and realizes real-time tracking. The network structure is illustrated in Figure 5 and contains three attention mechanisms: general attention introduces the attention mechanism to integrate the common features of targets and highlight their commonality; residual attention considers the differences among learning objectives; and channel attention adapts to various objectives and eliminates noise. The attention fusion process of the RASNet algorithm is described in Algorithm 4.

Algorithm 4 Attention Fusion Process of the RASNet algorithm
Input: Feature map. Output: Trace box q with the largest response value.
1: Downsample and upsample the feature map with the residual attention mechanism to obtain the target semantic feature sequence: feature_R.
2: Use the general attention mechanism to extract information from multiframe feature maps and obtain their common feature sequence: feature_G.
5: Calculate the fusion feature sequence: feature_list = feature_D ⊗ channel_score.

Reference [90] proposed the channel-spatial attention mechanism (SCSAtt), which maintains the model's speed while increasing its robustness. SCSAtt uses weight allocation to highlight the importance of each channel's features (the channel attention module) and uses the spatial attention module to highlight the most informative area of the feature map to determine the target location. The network structure is summarized in Figure 6. The channel-spatial attention calculation process of the SCSAtt algorithm is described in Algorithm 5.

Algorithm 5 Channel-Spatial attention calculation process in the SCSAtt algorithm
Input: Feature map F_M (H × W × C). Output: Channel-spatial attention Λ(φ(z)).
1: Use global max-pooling to obtain the object feature of F_M: F_max (1 × 1 × C).
2: Use global average-pooling to obtain the context feature of F_M: F_avg (1 × 1 × C).
3: Use elementwise summation to fuse the two feature vectors: φ_c(·) (1 × 1 × C) = σ(F_max ⊕ F_avg).

Similar to SCSAtt, the feature integrated correlation filter network (FICFNet) algorithm, proposed in Reference [91], is a two-branch parallel network structure that unifies the three processes of feature extraction, feature integration and DCF learning. The feature-integration module of the network cascades the shallow feature and the deep feature and uses the channel attention mechanism to adaptively weight the channels of the integrated feature; the obtained target timing information can solve the problems of target occlusion and target deformation.
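The channel-attention computation of Algorithm 5 can be sketched directly in NumPy: squeeze each channel's H x W plane with global max- and average-pooling, fuse the two vectors by elementwise summation, and squash with a sigmoid. This sketch omits the shared MLP that the full model applies to the pooled vectors before fusion, so treat it as a minimal illustration rather than the published architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feature_map):
    """Channel attention as in Algorithm 5 (simplified, no shared MLP).
    feature_map: (H, W, C)  ->  per-channel attention weights: (C,)."""
    f_max = feature_map.max(axis=(0, 1))    # step 1: global max-pooling
    f_avg = feature_map.mean(axis=(0, 1))   # step 2: global average-pooling
    return sigmoid(f_max + f_avg)           # step 3: elementwise sum + sigmoid

def apply_attention(feature_map, weights):
    """Reweight every channel of the feature map by its attention score."""
    return feature_map * weights[None, None, :]
```

Channels with stronger responses receive weights closer to 1, so informative channels dominate the reweighted feature map.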

Baseline Assessment
To accurately evaluate the model performance, experiments were conducted on the aerial datasets UAV123 [11] and UAV20L [11]. UAV123 contains 123 fully annotated HD video sequences, over 110K frames in total, captured from a low-altitude aerial perspective. Each video sequence is annotated with 12 attribute categories.

Evaluating Indicators
In this paper, two evaluation indicators, accuracy and success, are used for the quantitative analysis of the models. Accuracy refers to the percentage of frames in which the target center-position error is within a specified range; the center-position error is defined as the average Euclidean distance between the center of the ground-truth box (x_0^gt, y_0^gt) and the center of the predicted tracking box (x_0^tr, y_0^tr), as illustrated in Figure 7a. The success rate is the proportion of frames in the video sequence whose overlap score (calculated as the intersection-over-union of the ground-truth box and the predicted box) exceeds a threshold, as presented in Figure 7b. The center-position error is a widely used criterion, but it cannot easily evaluate the performance of a tracker when the target is lost. The accuracy curve is generated accordingly, and the value at a threshold of 20 pixels is adopted as the accuracy evaluation index [16]. Because the center-position error cannot evaluate target scale changes, tracker performance is complemented by an evaluation index based on the area overlap ratio, as expressed in Formula (7):

S = |R_tr ∩ R_gt| / |R_tr ∪ R_gt|,  (7)
where R_gt represents the ground-truth target bounding box, R_tr represents the predicted box of the tracking result, and ∪ and ∩ represent the union and the intersection, respectively, of the two areas. This article uses one-pass evaluation (OPE) accuracy and success graphs to evaluate the models and ranks the tracking algorithms by the area under the curve (AUC) of the success graph. The parameter standards follow the default UAV123 settings. The algorithm codes are implemented in MATLAB and Python on a server with an NVIDIA TITAN V GPU, and the configuration parameters of the experimental environment are shown in Table 2. The codes of the trackers we reproduced are obtained from their GitHub repositories, whose URLs are shown in Table 3. The training models of all tracking algorithms are the original models without retraining.
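The two indicators are straightforward to compute. The following sketch implements the center-error-based accuracy (20-pixel threshold) and the overlap-based success rate for axis-aligned (x, y, w, h) boxes; threshold defaults are the conventional benchmark values, not prescribed by this paper.

```python
def center_error(gt_box, tr_box):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    gx, gy = gt_box[0] + gt_box[2] / 2, gt_box[1] + gt_box[3] / 2
    tx, ty = tr_box[0] + tr_box[2] / 2, tr_box[1] + tr_box[3] / 2
    return ((gx - tx) ** 2 + (gy - ty) ** 2) ** 0.5

def overlap(gt_box, tr_box):
    """Intersection-over-union of two (x, y, w, h) boxes (Formula (7))."""
    x1 = max(gt_box[0], tr_box[0]); y1 = max(gt_box[1], tr_box[1])
    x2 = min(gt_box[0] + gt_box[2], tr_box[0] + tr_box[2])
    y2 = min(gt_box[1] + gt_box[3], tr_box[1] + tr_box[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = gt_box[2] * gt_box[3] + tr_box[2] * tr_box[3] - inter
    return inter / union if union > 0 else 0.0

def accuracy(gt_boxes, tr_boxes, threshold=20.0):
    """Fraction of frames whose center error is within `threshold` pixels."""
    errs = [center_error(g, t) for g, t in zip(gt_boxes, tr_boxes)]
    return sum(e <= threshold for e in errs) / len(errs)

def success(gt_boxes, tr_boxes, threshold=0.5):
    """Fraction of frames whose overlap score exceeds `threshold`."""
    ious = [overlap(g, t) for g, t in zip(gt_boxes, tr_boxes)]
    return sum(s > threshold for s in ious) / len(ious)
```

Sweeping the thresholds and plotting the two fractions produces the accuracy and success curves used for the OPE ranking.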

Overall Evaluation
In this paper, a total of 20 tracking algorithms are compared. Figure 8 presents the results of the algorithms on the aerial photography dataset UAV123, and Table 4 summarizes them. In the UAV123 dataset, a comparison of the Siamese network model structures shows that SiamRPN++ utilizes a deep network, ResNet, to fully extract target features by recombining the features of shallow and deep layers. The network structure is relatively complex, but its advantage lies in combining a Siamese network with a deep structure to complete feature extraction. Siam R-CNN uses a Siamese network to apply Faster R-CNN to the tracking problem and uses dynamic programming to address occlusion and target disappearance; it is suitable for long-term tracking and severely occluded scenes, but its network structure is the most complex, and its computational burden is large. SiamBAN uses the representational capability of a fully convolutional network to simplify the tracking problem into classification and regression, thereby avoiding hyperparameter problems. The accuracy and success rates of the SCSAtt tracker are 0.776 and 0.69, respectively; hence, the attention mechanism effectively helps the network increase the tracking accuracy. Since the structure of the DaSiamRPN algorithm cannot utilize deep features, there are gaps in its accuracy and success rates compared with the methods based on deep features, which demonstrates the importance of deep features. The UDT algorithm is the first unsupervised tracking algorithm to be implemented in a Siamese network framework, and its accuracy is consistent with that of SRDCF.
The trackers that are based on deep characteristics are being gradually optimized. While the tracking speed of RT-MDNet far exceeds that of ECO, it realizes the same success rate and accuracy as ECO; hence, the multidomain combination method is effective. By introducing deep features on the basis of the STRCF algorithm, the result of the DeepSTRCF algorithm is improved substantially compared with that of the STRCF algorithm.
Which models perform best? Compared with the other tracking algorithms, the SiamRPN++, SiamBAN and SCSAtt networks have the best tracking performance; they not only cope with various challenges but also meet real-time requirements. This is because these algorithms do not update the network parameters during online tracking, thereby avoiding the time consumption caused by a large amount of computation.
Which models are more robust?
The Siam R-CNN algorithm uses the TDPA mechanism to address the problem of tracking failure after serious occlusion and target loss in online tracking, thus improving the robustness of the model. The ECO algorithm uses GMM to ensure the diversity of training sets and reduce the risk of model overfitting. DeepSTRCF improves the robustness of the model by fusing CNN features, HOG and CN. The MCCT algorithm comprehensively considers the tracking results of multiple trackers to ensure the reliability of the tracking results, and filters unreliable samples through an adaptive strategy to improve the robustness of the model.
Which models are lightweight?
The SiamBAN algorithm simplifies the tracking problem to parallel classification and regression and directly classifies targets in an FCN, which reduces the computational complexity and ensures a simple, flexible network structure. The RT-MDNet algorithm simplifies the tracking problem to target recognition and achieves better tracking performance by considering the interference of similar objects in the loss function. The TADT algorithm assumes that the tracking task needs only the information of the specific channels related to the target, eliminates the other redundant channels, reduces the feature information used in the tracking process, and thus increases the tracking speed.
Which models are suitable for long-term tracking?
The DaSiamRPN algorithm improves the generalization ability of the model by enhancing the diversity of training samples, and uses a local-to-global strategy to solve the problem of target loss during long-term tracking.

Attribute Evaluation
To fully evaluate the performance of the trackers in a variety of challenging scenarios, this article compares 12 different attributes in terms of accuracy and success on the UAV123 dataset. Tables 5 and 6 present the evaluation results of all target tracking algorithms on these attributes, and Figure 9 compares the methods that are based on deep learning. According to the experimental results, the trackers based on Siamese networks can effectively handle various challenging scenes; for the scenes in the categories of Aspect Ratio Change (ARC), Camera Motion (CM), Illumination Variation (IV) and Viewpoint Change (VC), the results are especially outstanding. Hence, the Siamese network structure performs satisfactorily on tracking problems such as target scale change, rapid target motion and interference from target-background similarity. In addition, compared with the "attention mechanism" approach, the "deep feature tracker" approach performs better in the categories of Background Clutter (BC), Full Occlusion (FOC), Low Resolution (LR) and Partial Occlusion (POC); thus, rich deep features can effectively overcome the problems of target occlusion and deformation.

Table 5. The precision results of the various trackers on the UAV123 dataset attributes. The best-performing tracker is displayed in red, and the second-best performer is in yellow.

Table 6. The success results of the various trackers on the UAV123 dataset attributes. The best-performing tracker is displayed in red, and the second-best performer is in yellow.

The visualization results of each tracker on the aerial photography dataset UAV123 are shown in Figure 10. The first line is the tracking result for the video sequence bike, the second line for the video sequence building, the third line for the video sequence group, and the fourth line for the video sequence boat.
As Figure 10 shows, under a simple background, as in bike, the trackers show good tracking effects. However, when the background contains objects similar to the target, as in building and group, background interference is serious, and some trackers have difficulty distinguishing the target from similar objects. When the target is severely occluded or temporarily disappears, as in group, the trackers fail. When the target is small, as in boat, the target occupies a small proportion of the image, its features are difficult to obtain, and the tracking accuracy of some trackers is poor.

Tracker
The speed comparison among all the trackers is shown in Figure 11, where the success rate vs. fps is plotted for the UAV123 dataset. Compared with the other algorithms, the SN-based trackers have higher frame rates because their network parameters are not updated during online tracking. Among the CNN-based trackers, RT-MDNet has the highest frame rate, as it adds an adaptive ROI layer between the convolutional layers and the fully connected layers; this greatly reduces the computational complexity of the tracking process and enables a higher frame rate. Among the CF-based trackers, ECO has the highest frame rate and the best performance: the factorized convolution operation makes the tracker more efficient. ECO-HC uses only hand-crafted features (HOG and Color Names), which further reduces the computation of the model and thereby achieves a higher fps than ECO. KCF also has a high fps, but it has the lowest success rate because it extracts only HOG features.
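For reference, the fps figures behind such a comparison can be obtained with a simple timing harness like the following sketch; `tracker_step` stands in for any per-frame tracking call, and excluding initialization from the timed loop follows the common benchmark practice.

```python
import time

def measure_fps(tracker_step, frames):
    """Average frames-per-second of a tracker over a sequence.
    `tracker_step` is any callable that processes one frame; the
    (untimed) tracker initialization should happen before this call."""
    start = time.perf_counter()
    for frame in frames:
        tracker_step(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed if elapsed > 0 else float("inf")
```

Averaging over whole sequences (rather than single frames) smooths out per-frame jitter such as occasional model updates.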

Evaluation in UAV20L
UAV20L is a representative aerial long-video dataset. This paper compares the performances of 10 representative long-video trackers. According to the evaluation report in Figure 12, the Siamese network trackers still perform at a high level and far surpass the other trackers that are based on deep features. In addition, we analyzed the evaluation results of the 12 independent attributes provided by UAV20L. Among the trackers, the MCCT algorithm uses an adaptive strategy to remove contaminated samples; it copes effectively with background interference and realizes a success rate nearly 20% higher than that of the Siamese networks. The TADT algorithm uses backpropagation gradients to ensure that the deep convolutional network retains the positioning features of the target after convolutional learning, which helps it cope with complete occlusion and low resolution: its Full Occlusion (FOC) success rate is 0.307, and its Low Resolution (LR) success rate is 0.432, exceeding those of Siam R-CNN by 6%.

Comparison and Summary
For a single target, the available tracking algorithms are relatively mature when the motion trajectory and background are relatively simple, and better results can be obtained by using filters, deep learning and other methods. For the problem of multicamera collaborative tracking, methods combining geographic information have been proposed, but they still cannot solve the problem of multi-UAV collaborative tracking of multiple targets in complex scenarios. Table 9 summarizes and compares 35 aerial photography target tracking algorithms with better performance.

Algorithm | Target type | Applicable scene | Number of targets
[32] | Blurred objectives | Blurred target | Single target
Vector field characteristics [50] | Fast/multitarget | Fast-moving speed/wide field of vision | Many objectives
Feedback ESTMD [40] | Moving small target | Complicated background | Single target
ARCF [53] | Moving target | Severe occlusion/background interference | Single target
DSST [41] | Moving target | Common scenario | Single target
KCF [47] | Moving target | Common scenario | Single target
SRDCF [54] | Moving target | Large range of motion/complex scenes | Single target
STRCF [55] | Moving target | Common scenario | Single target
AutoTrack [56] | Moving target | |
MEEM [66] | Multiscale target | General background | Single target
C-COT [68] | Common objectives | General background | Single target
ECO [67] | Common objectives | General background | Single target
ECO+ [69] | Common objectives | Complex background/multiscale | Single target
MCCT [70] | Common objectives | Target occlusion/complex background | Single target
TADT [72] | Target deformation | Background interference/common scenario | Single target
DeepSTRCF [55] | Similar objectives | Common scenario | Single target
Siamese network:
SiamFC [76] | Target deformation | General background | Single target
PTAV [80] | Common objectives | Common scenario | Single target
SiamRPN [81] | Weak small targets | Common scenario | Single target
DaSiamRPN [82] | Moving target | Long track | Single target
SiamRPN++ [83] | Moving target | Various scenarios | Single target
Siam R-CNN [85] | Multiscale target | Severe occlusion/common scenario | Single target
SiamBAN [86] | Common objectives | Various scenarios | Single target
UDT [87] | Multiscale target | Severe occlusion | Single target
Attention mechanism:
RASNet [89] | Common objectives | General background | Single target
SCSAtt [90] | Common objectives | Target scales vary substantially | Single target
FICFNet [91] | Moving target | Severe deformation/occlusion of the target | Single target

For aerial photography target tracking with various ranges, environments and targets, both the tracking speed and the recognition accuracy must be considered. Therefore, the methods discussed in this paper can be divided into two categories: those that increase accuracy and those that increase tracking speed. Target position information can be used to establish a motion model, which yields a fast tracking speed but poor tracking accuracy; when tracking is implemented by model matching, the tracking accuracy is high, but the processing speed is slower. With the successful application of correlation filtering in the single-target tracking field, data processing is transformed from the real domain into the frequency domain, and the processing speed is substantially increased. Therefore, for a single target with a relatively simple motion trajectory and background, the available target tracking algorithms and technologies are relatively mature, and methods combining filtering and deep learning can yield superior results.
Compared with traditional correlation filtering methods, target tracking based on deep learning realizes substantial improvements in both accuracy and detection speed, especially with network structures based on the Siamese architecture. However, due to the strong dependence of deep learning on data and the insufficient amount of data in target tracking, the current frameworks cannot yet yield fully satisfactory results, and the interpretability of deep learning methods remains insufficient. To improve on the available target tracking algorithms, the following challenges must still be overcome.

1.
Changes in the target attitude. Multiple postures of the same moving target reduce the accuracy of target recognition, which is a common interference problem in target tracking. When the target attitude changes, its characteristics differ from those at the original attitude, and the target is easily lost, thereby resulting in tracking failure. An attention mechanism can help networks focus on important information regarding targets and reduce the probability of target loss during tracking. The utilization by deep learning network algorithms of an attention mechanism to ensure the accurate positioning of network targets is a promising research direction.

2.
Long-term tracking. In a long-term tracking process, due to the height and speed limits of aerial photography, the scale of the tracked target in the video images changes as the tracking time increases. When the tracking box cannot adapt to the target scale, it contains redundant background feature information, which leads to erroneous updates of the target model parameters. Conversely, accelerated flight causes the target scale to increase continuously; since the tracking box then cannot contain all the characteristic information of the target, parameter-update errors also occur. According to the experimental results of this paper, the Siamese network realizes satisfactory performance in long-term tracking but cannot update its model online in real time. Constructing a suitable long-term target tracking model according to the characteristics of long-term tracking tasks and their connection points with short-term tracking, combining deep features and transfer learning, remains a substantial challenge.

3.
Target tracking in a complex background environment. Against a complex background, such as at night, under substantial changes in illumination intensity or with heavy occlusion, the target exhibits reflection, occlusion or transient disappearance during movement. If the moving target is similar to the background, tracking failure will occur because the corresponding model of the target cannot be found. The main strategies for solving the occlusion problem are as follows: The depth characteristics of the target can be fully extracted to ensure that the network can handle the occlusion problem. During offline training, occluded targets can be added to the training samples so that the network can fully learn coping strategies for when a target is blocked, and the trained offline network can be used to track the target. Multi-UAV collaborative tracking can utilize target information from multiple angles and effectively solve the problem of target tracking against a complex background.

4.
Real-time tracking. Real-time tracking is always a difficult problem in the field of target tracking. The current tracking method based on deep learning has the advantage of learning from a large amount of data. However, in the target tracking process, only the annotation data of the first frame are completely accurate, and it is difficult to extract sufficient training data from the network. The network model of deep learning is complex and has many training parameters. If the network is adjusted online in the tracking stage to ensure the tracking performance, the network tracking speed is severely affected. Large-scale datasets obtained via aerial photography are gradually becoming available, which include rich target classes and involve various situations that are encountered in practical applications. Many tracking algorithms have continued to learn depth characteristics from these datasets via an end-to-end approach, which is expected to further enable target tracking algorithms to realize real-time tracking while ensuring satisfactory tracking speed.

Cooperative Tracking and Path Planning of Multiple Drones
As the sensing field of a single UAV is limited and the 3D feature information of the target and scene is lost, it is necessary to cooperatively utilize multiple UAVs. However, in multiple-UAV cooperative tracking, since the information from the surveillance cameras is discrete, a mechanism for rapidly fusing information among multiple cameras is lacking, and multicamera coordination is necessary for efficient target tracking [92,93]. Thus, the problem of cooperative path planning [94] is also encountered. Although satisfactory planning and design results have been obtained, multiple challenges remain, such as those regarding locally optimal solutions [95] and the iteration time [96].

Long-Term Tracking and Abnormal Discovery
With the frequent occurrence of abnormal events in public areas, technology for the detection of abnormal crowd behavior based on aerial video has become a research hotspot at home and abroad in recent years [97]. Long-term tracking and monitoring are required, which pose new challenges for aerial photography tracking. In terms of degree, abnormal events can be divided into two groups: abnormal group events and abnormal individual events [98]. Such events must be detected, and an alarm raised, during the tracking process. The use of target behavior prediction and security situational awareness to realize real-time anomaly warning is the key problem to be solved in the future.

Visualization and Intelligent Analysis of Aerial Photography Data
UAVs rely on a variety of wireless network technologies to realize real-time video surveillance and to transfer related images or videos to a mobile command platform or background system for intelligent identification and analysis, providing a decision-making basis for manpower deployment, emergency response and technical support. However, due to the lack of corresponding technical support and solutions, it is not convenient to share information among aerial video equipment to establish and improve an integrated aerial video application platform, which constrains the role of the intelligent monitoring system in public security. With the deployment of 5G networks, realizing real-time tracking and security situational awareness prediction via a visual, intelligent-analysis-based approach is essential for the future application of the visualization platform.