GPK-YOLOv5s: Content-Aware Reassembly of Features and Self Attention for High Altitude Parabolic Detection and Tracking

Abstract. High-altitude parabolic objects are difficult to identify because of their small size, fast speed and changeable state, which complicates subsequent forensics and accountability. This paper proposes GPK-YOLOv5s, a high-altitude parabolic detection and tracking method that integrates Content-Aware Reassembly of Features (CARAFE) and self-attention. In the detection network, the backbone integrates the C3Ghost module to extract effective features and simplify the network. The C3Transformer module is embedded in the feature extraction and fusion layers to attend to global context information. The feature fusion layer uses the CARAFE module for upsampling to perceive effective features, and fuses shallow and deep features to form a new large-scale detection layer (Output4) with smaller receptive fields. The improved multi-scale detection heads are embedded with CBAM to enhance the expression ability of targets. To overcome frame loss in real-time detection, the improved multi-scale detection heads are externally connected to a Kalman filter to track targets. Experiments show that the detection Precision, Recall and F1 value of GPK-YOLOv5s reach 99.0%, 98.6% and 98.8% respectively, which are 2.8%, 4.1% and 3.5% higher than those of YOLOv5s. GPK-YOLOv5s is also lighter, with the computational cost reduced by 0.4 GFLOPs.


Introduction
High-altitude parabolic detection based on computer vision is one of the applications of intelligent video surveillance and has important practical significance. High-altitude parabolic objects are difficult to identify because of their small size, fast speed and variable state. However, there is little literature on their detection and tracking. Xu W et al. [1] proposed a multi-target tracking algorithm for high-altitude parabolic objects, in which an improved sorting algorithm and an adaptive Gaussian mixture background model are used for detection, and intersection over union (IoU) combined with a correlation filter is used for tracking. Liang X et al. [2] proposed a tracking method based on AprilTag recognition, which tracks the color features around the tag and improves the tracking accuracy when the target is occluded. Feng W K et al. [3] proposed a moving target detection method based on an improved fuzzy C-means clustering (FCM) algorithm, in which the traditional FCM algorithm is combined with a genetic algorithm and a Kalman filter to track and detect moving targets. Murate T et al. [4] proposed a fast tracking method for moving targets based on convolutional neural network (CNN) feature extraction. Introducing a self-attention mechanism can enhance the expression of target information. Lu X et al. [5] proposed an attentive graph neural network (AGNN) for pixel-level target segmentation. The task is formulated as an iterative information fusion process over a data graph, and information is captured from relational visual data through parametric message passing. However, the detection speed of the above methods is not fast enough and their recognition accuracy is not high enough; in particular, research on the recognition and tracking of high-altitude parabolic objects is lacking. One-stage target detection algorithms offer high detection accuracy and fast speed, and meet real-time requirements. In particular, YOLOv5, developed by Ultralytics, is widely used because it is lightweight, accurate and fast. Therefore, this paper adopts a one-stage target detection algorithm to realize high-altitude parabolic detection and tracking. The main contributions are as follows.
(1) A high-altitude parabolic detection and tracking method, GPK-YOLOv5s, is proposed, which integrates CARAFE and self-attention. The improved multi-scale detection CBAMheads are externally connected to a Kalman filter to detect and track high-altitude parabolic objects.
(2) C3Ghost is integrated in the backbone to obtain effective features and make the network lightweight. C3Transformer is embedded between the feature extraction and fusion layers to capture global context information.
(3) In the feature fusion layers, CARAFE is used for upsampling to fuse effective features and reduce redundancy. In the multi-scale detection layers, shallow and deep features are further fused to form a new large-scale feature map with smaller receptive fields. The improved multi-scale detection CBAMheads enhance the expression ability of targets.
(4) The effectiveness and applicability of the proposed GPK-YOLOv5s are verified on the self-made high-altitude parabolic dataset OTFH.

Methodology
The overall scheme of high-altitude parabolic detection and tracking in this paper is shown in Fig. 1. For data acquisition, high-altitude parabolic images are captured by CMOS infrared surveillance cameras. For image preprocessing, Mosaic data augmentation, color space transformation, random cropping and left/right flipping are used to enhance the expression of effective information. The proposed GPK-YOLOv5s then recognizes and tracks the high-altitude parabolic objects.

Data acquisition
Considering the test method and cost, two surveillance cameras with a zoom range of 4.7-94 mm, a resolution of 2 megapixels and an infrared imaging range of 100 m are used in residential or office areas. The installation and layout of high-altitude parabolic surveillance cameras must take into account the installation location, the distance between buildings, household privacy and safety. In this paper, the two cameras are installed on a monitoring support 20 m away from the building, one monitoring the high floors and the other the low floors.

GPK-YOLOv5s network
To address the difficulty of identifying high-altitude parabolic objects and the lack of related research, this paper designs the GPK-YOLOv5s network for high-altitude parabolic detection and tracking. As shown in Fig. 2, the overall network is composed of a detection network and a Kalman tracking module.

Fig. 2 The GPK-YOLOv5s network for high altitude parabolic detection and tracking

The detection network is composed of a backbone, feature fusion layers (Neck) and multi-scale prediction heads. First, the Focus module slices the image to reduce the amount of calculation and improve speed. Second, GhostConv and C3Ghost modules extract features, the SPP module applies pooling kernels with the same stride but different sizes to realize feature fusion, and a C3Transformer module is embedded to enhance the context information of the backbone. The feature fusion layers adopt the CARAFE module [6] for upsampling to fuse deep and shallow features. The multi-scale detection heads incorporate CBAM [7] and feed the detected target position to the Kalman filter to track high-altitude parabolic objects.
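The slicing step of the Focus module can be sketched as a space-to-depth rearrangement. The snippet below is a minimal NumPy illustration, not the authors' implementation: the function name `focus_slice` and the toy input are hypothetical, and the convolution that follows the slicing in YOLOv5 is omitted.

```python
import numpy as np


def focus_slice(x: np.ndarray) -> np.ndarray:
    """Space-to-depth slicing: (C, H, W) -> (4C, H/2, W/2).

    Every second pixel is taken along each spatial axis, and the four
    resulting sub-images are stacked along the channel axis.
    """
    _, h, w = x.shape
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    return np.concatenate(
        [x[:, 0::2, 0::2], x[:, 1::2, 0::2], x[:, 0::2, 1::2], x[:, 1::2, 1::2]],
        axis=0,
    )


# Hypothetical toy input: a 3-channel 4x4 "image".
image = np.arange(3 * 4 * 4, dtype=np.float32).reshape(3, 4, 4)
sliced = focus_slice(image)
print(sliced.shape)  # (12, 2, 2)
```

No pixel is discarded: the spatial resolution is halved while the channel count is quadrupled, so the subsequent convolution works on a smaller grid.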

C3Ghost
Because the feature extraction network plays an important role in the detection results, this paper combines the CSPBottleneck of the original backbone network with three convolution layers to form C3. C3 is fused with GhostBottleneck to form the C3Ghost module, which is used to extract effective features and simplify the network. As shown in Fig. 3 and Fig. 4, the GhostBottleneck is composed of Ghost modules [8]. Given input data $X \in \mathbb{R}^{c \times h \times w}$, where $c$ is the number of input channels and $h$ and $w$ are the height and width of the input, an arbitrary convolutional layer that generates $n$ feature maps can be written as formula (1).

$$Y = X * f + b \tag{1}$$

In formula (1), $X$ and $Y \in \mathbb{R}^{h' \times w' \times n}$ denote the input and output respectively, $f \in \mathbb{R}^{c \times k \times k \times n}$ represents the filters, and $b$ denotes the bias. $h'$ and $w'$ are the height and width of the output data, and $k \times k$ is the kernel size of the convolution filters. The backbone in this paper integrates the Ghost module, which reduces the redundant calculation of feature maps. The intrinsic feature maps are usually few and are produced by ordinary convolutional filters. Specifically, $m$ intrinsic feature maps $Y' \in \mathbb{R}^{h' \times w' \times m}$ are generated as in formula (2).

$$Y' = X * f' \tag{2}$$

In formula (2), $f' \in \mathbb{R}^{c \times k \times k \times m}$ is the filter used, $m \le n$, and the bias term is omitted for simplicity. To obtain $n$ feature maps, a series of cheap linear operations is applied to each intrinsic feature in $Y'$ to generate $s$ ghost features, as in formula (3).

$$y_{ij} = \Phi_{i,j}(y'_i), \quad \forall\, i = 1, \dots, m, \; j = 1, \dots, s \tag{3}$$

In formula (3), $y'_i$ is the i-th intrinsic feature map in $Y'$, and $\Phi_{i,j}$ is the j-th (except the last one) linear operation for generating the j-th ghost feature map $y_{ij}$. The $n = m \cdot s$ feature maps $Y = [y_{11}, y_{12}, \dots, y_{ms}]$ form the output of a Ghost module, as shown in Fig. 3. The GhostBottleneck module used in this paper has a stride of 2 and is formed by stacking two Ghost modules, as shown in Fig. 4. The first Ghost module acts as an expansion layer that increases the number of channels. The second Ghost module reduces the number of channels to match the shortcut path. The shortcut connects the input and output of the two Ghost modules. BN and ReLU nonlinearity are applied after each layer.
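Formulas (2) and (3) can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' code: the primary convolution is taken to be 1x1, the cheap operations are taken to be 3x3 depthwise convolutions, the last operation is taken to be the identity (as in the Ghost module of [8]), and all names and sizes are hypothetical.

```python
import numpy as np


def ghost_module(x, primary_w, cheap_w):
    """Ghost module sketch.

    x:         input of shape (c, h, w)
    primary_w: (m, c) weights of a 1x1 primary convolution (f' in formula (2))
    cheap_w:   (m, s - 1, 3, 3) depthwise 3x3 kernels (the cheap operations Phi)
    returns:   (m * s, h, w) output feature maps
    """
    m, s_minus_1 = cheap_w.shape[:2]
    _, h, w = x.shape
    # Formula (2): m intrinsic feature maps, bias omitted.
    intrinsic = np.einsum("mc,chw->mhw", primary_w, x)
    # Formula (3): each intrinsic map yields s - 1 ghost maps through a cheap
    # depthwise 3x3 convolution (zero padding keeps the spatial size).
    padded = np.pad(intrinsic, ((0, 0), (1, 1), (1, 1)))
    ghosts = np.zeros((m, s_minus_1, h, w))
    for dy in range(3):
        for dx in range(3):
            window = padded[:, dy:dy + h, dx:dx + w]
            ghosts += cheap_w[:, :, dy, dx][:, :, None, None] * window[:, None]
    # The remaining operation is the identity, which keeps the intrinsic maps.
    return np.concatenate([intrinsic, ghosts.reshape(m * s_minus_1, h, w)], axis=0)


rng = np.random.default_rng(0)
c, h, w, m, s = 8, 6, 6, 4, 2
x = rng.standard_normal((c, h, w))
primary_w = rng.standard_normal((m, c))
cheap_w = rng.standard_normal((m, s - 1, 3, 3))
y = ghost_module(x, primary_w, cheap_w)
print(y.shape)  # (8, 6, 6), i.e. n = m * s feature maps
```

Only the m intrinsic maps pay the cost of a full convolution over all c input channels; the remaining maps each depend on a single intrinsic map, which is where the saving comes from.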

C3Transformer
Considering the diversity of high-altitude parabolic objects, this paper embeds C3Transformer in the backbone to capture global and contextual information. The self-attention in this module models the dependencies between input and output sequences and can establish dynamic, longer-range dependencies. Multiple heads can learn different relationships separately, which also makes the model well suited to parallel computing. C3Transformer is composed of CSPBottleneck, three convolution layers and a Transformer module [9]. The Transformer module is depicted in Fig. 5. The Transformer receives a 1D sequence of token embeddings as input, so the image $x \in \mathbb{R}^{H \times W \times C}$ is reshaped into a sequence of flattened 2D patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where $(H, W)$ is the resolution of the original image, $C$ is the number of channels, $(P, P)$ is the resolution of each image patch, and $N = HW/P^2$ is the resulting number of patches, which also serves as the effective input sequence length for the Transformer. The Transformer uses a constant latent vector size $D$ through all of its layers, so the patches are flattened and mapped to $D$ dimensions with a trainable linear projection, as in formula (4); the output of this projection is the patch embeddings. A learnable embedding is prepended to the sequence of embedded patches ($z_0^0 = x_{class}$), and its state at the output of the Transformer encoder ($z_L^0$) serves as the image representation $y$, as in formula (7). Both during pre-training and fine-tuning, a classification head is attached to $z_L^0$. The classification head is implemented by an MLP with one hidden layer at pre-training time and by a single linear layer at fine-tuning time.

Position embeddings are added to the patch embeddings to retain positional information. The resulting sequence of embedding vectors serves as input to the encoder. The Transformer encoder consists of alternating layers of multi-headed self-attention (MSA) and MLP blocks, as in formulas (5)-(6). Layer norm (LN) is applied before every block, and residual connections after every block. The MLP contains two layers with a GELU non-linearity.

$$z_0 = [x_{class};\, x_p^1 E;\, x_p^2 E;\, \dots;\, x_p^N E] + E_{pos}, \quad E \in \mathbb{R}^{(P^2 \cdot C) \times D},\; E_{pos} \in \mathbb{R}^{(N+1) \times D} \tag{4}$$

$$z'_\ell = \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1}, \quad \ell = 1, \dots, L \tag{5}$$

$$z_\ell = \mathrm{MLP}(\mathrm{LN}(z'_\ell)) + z'_\ell, \quad \ell = 1, \dots, L \tag{6}$$

$$y = \mathrm{LN}(z_L^0) \tag{7}$$
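The patch embedding of formula (4) and one attention step of formula (5) can be illustrated with NumPy. This is a simplified sketch: a single attention head stands in for multi-headed self-attention, the MLP block of formula (6) is left out, the weights are random rather than trained, and every name and size below is hypothetical.

```python
import numpy as np


def patchify(image, p):
    """Reshape an (H, W, C) image into N = H*W/p**2 flattened patches."""
    h, w, c = image.shape
    if h % p or w % p:
        raise ValueError("H and W must be divisible by the patch size")
    grid = image.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(-1, p * p * c)


def layer_norm(z, eps=1e-6):
    """Normalize each token to zero mean and unit variance (no learned scale)."""
    return (z - z.mean(-1, keepdims=True)) / np.sqrt(z.var(-1, keepdims=True) + eps)


def self_attention(z, wq, wk, wv):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = z @ wq, z @ wk, z @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


rng = np.random.default_rng(0)
H, W, C, P, D = 8, 8, 3, 4, 16
image = rng.standard_normal((H, W, C))
patches = patchify(image, P)                    # (N, P*P*C)
N = patches.shape[0]
E = rng.standard_normal((P * P * C, D)) * 0.1   # patch projection
x_class = rng.standard_normal((1, D))           # class token
E_pos = rng.standard_normal((N + 1, D)) * 0.1   # position embeddings
z0 = np.concatenate([x_class, patches @ E], axis=0) + E_pos    # formula (4)
wq, wk, wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
z1_prime = self_attention(layer_norm(z0), wq, wk, wv) + z0     # formula (5), one head
print(N, z0.shape, z1_prime.shape)  # 4 (5, 16) (5, 16)
```

Because every token attends to every other token, each output row mixes information from all patches, which is the global context the section refers to.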

CARAFE
Because traditional upsampling modules tend to lose important feature information and introduce computational redundancy, this paper uses the CARAFE module for upsampling to avoid the loss of feature information. The Channel Compressor reduces the channels of the input feature map, the Content Encoder takes the compressed feature map as input and encodes the content to generate reassembly kernels, and the Kernel Normalizer normalizes each reassembly kernel. The CARAFE module predicts reassembly kernels according to the underlying content information, uses adaptive and optimized reassembly kernels at different positions, and reassembles features in predefined nearby areas. The content-aware reassembly in this module consists of two steps. The first step is to predict a reassembly kernel for each target location according to its content, and the second step is to reassemble the features with the predicted kernels. Given a feature map $X$ of size $C \times H \times W$ and an upsample ratio $\sigma$ (supposing it is an integer), CARAFE produces a new feature map $X'$ of size $C \times \sigma H \times \sigma W$. For any target location $l' = (i', j')$ of the output $X'$, there is a corresponding source location $l = (i, j)$ at the input $X$, where $i = \lfloor i'/\sigma \rfloor$ and $j = \lfloor j'/\sigma \rfloor$. Here $N(X_l, k)$ denotes the $k \times k$ sub-region of $X$ centered at the location $l$, i.e., the neighbor of $X_l$. In the first step, the kernel prediction module $\psi$ predicts a location-wise kernel $W_{l'}$ for each location $l'$, based on the neighbor of $X_l$, as shown in formula (8). The reassembly step is given by formula (9), where $\phi$ is the content-aware reassembly module that reassembles the neighbor of $X_l$ with the kernel $W_{l'}$.
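The reassembly step can be sketched as follows. The snippet assumes the kernel logits have already been produced by some content encoder (here they are random), applies the softmax of the Kernel Normalizer, and then computes each output pixel as a weighted sum over the k_up x k_up neighborhood of its source location. The function name, the channel layout of the logits and the zero padding at the borders are assumptions made for illustration only.

```python
import numpy as np


def carafe_reassemble(x, kernel_logits, sigma, k_up):
    """Content-aware reassembly sketch.

    x:             (C, H, W) input feature map
    kernel_logits: (sigma**2 * k_up**2, H, W) unnormalized kernels, laid out
                   here as (sigma, sigma, k_up**2) per source location
    returns:       (C, sigma * H, sigma * W) upsampled feature map
    """
    c, h, w = x.shape
    if k_up % 2 == 0:
        raise ValueError("k_up must be odd so that the kernel has a center")
    if kernel_logits.shape != (sigma ** 2 * k_up ** 2, h, w):
        raise ValueError("unexpected kernel_logits shape")
    # Kernel normalizer: softmax over the k_up**2 entries of each kernel.
    logits = kernel_logits.reshape(sigma, sigma, k_up * k_up, h, w)
    logits = logits - logits.max(axis=2, keepdims=True)
    kernels = np.exp(logits)
    kernels /= kernels.sum(axis=2, keepdims=True)
    r = k_up // 2
    padded = np.pad(x, ((0, 0), (r, r), (r, r)))
    out = np.zeros((c, sigma * h, sigma * w))
    for i_out in range(sigma * h):
        for j_out in range(sigma * w):
            i, j = i_out // sigma, j_out // sigma    # source location l
            di, dj = i_out % sigma, j_out % sigma    # which of the sigma**2 kernels
            neighborhood = padded[:, i:i + k_up, j:j + k_up].reshape(c, -1)
            out[:, i_out, j_out] = neighborhood @ kernels[di, dj, :, i, j]
    return out


rng = np.random.default_rng(0)
C, H, W, sigma, k_up = 3, 4, 5, 2, 5
x = rng.standard_normal((C, H, W))
logits = rng.standard_normal((sigma ** 2 * k_up ** 2, H, W))
up = carafe_reassemble(x, logits, sigma, k_up)
print(up.shape)  # (3, 8, 10)
```

If every kernel puts almost all of its weight on the center entry, the operation reduces to nearest-neighbor upsampling; content-dependent kernels generalize this by letting each output pixel draw on its whole neighborhood.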

CBAMheads
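Because high-altitude parabolic objects are generally far from the camera, the multi-scale detection layers need to detect small targets. In this paper, CARAFE is used for upsampling, and shallow and deep features are further fused to form a new large-scale feature map detection layer (Output4) with smaller receptive fields. Moreover, CBAM is embedded in the improved multi-scale detection layers to form the multi-scale detection CBAMheads. The CBAM module sequentially infers attention maps along two separate dimensions, channel and spatial, and the attention maps are then multiplied with the input feature map for adaptive feature refinement. The channel attention module aggregates the spatial information of the feature map with average pooling and max pooling, producing two descriptors, $F^c_{avg}$ and $F^c_{max}$. These descriptors are sent to a shared network to generate the channel attention map $M_c \in \mathbb{R}^{C \times 1 \times 1}$. The shared network consists of an MLP with one hidden layer. The calculation of channel attention is shown in formula (13).

$$M_c(F) = \sigma(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))) = \sigma(W_1(W_0(F^c_{avg})) + W_1(W_0(F^c_{max}))) \tag{13}$$

Here $\sigma$ denotes the sigmoid function, and $W_0$ and $W_1$ are the MLP weights. The spatial attention module uses a convolution layer to generate a spatial attention map $M_s(F) \in \mathbb{R}^{H \times W}$, which encodes where to emphasize or suppress. It aggregates the channel information of a feature map using two pooling operations, generating two 2D maps: $F^s_{avg} \in \mathbb{R}^{1 \times H \times W}$ and $F^s_{max} \in \mathbb{R}^{1 \times H \times W}$. These maps are concatenated and convolved by a convolution layer to generate the 2D spatial attention map. The calculation of spatial attention is shown in formula (14), where $f^{7 \times 7}$ denotes a convolution with a $7 \times 7$ kernel as in CBAM [7].

$$M_s(F) = \sigma(f^{7 \times 7}([\mathrm{AvgPool}(F);\, \mathrm{MaxPool}(F)])) = \sigma(f^{7 \times 7}([F^s_{avg};\, F^s_{max}])) \tag{14}$$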
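The channel and spatial attention of formulas (13) and (14) can be sketched with NumPy as below. This is an illustrative sketch rather than the authors' implementation: the weights are random, the hidden layer of the shared MLP uses a ReLU as in CBAM [7], the reduction ratio and feature map size are hypothetical, and the convolution is written as explicit loops over the kernel offsets.

```python
import numpy as np


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


def channel_attention(f, w0, w1):
    """Formula (13). f: (C, H, W); w0: (C // r, C); w1: (C, C // r)."""
    avg_desc = f.mean(axis=(1, 2))   # F^c_avg
    max_desc = f.max(axis=(1, 2))    # F^c_max

    def mlp(d):
        # Shared MLP with one hidden layer and a ReLU in between.
        return w1 @ np.maximum(w0 @ d, 0.0)

    return sigmoid(mlp(avg_desc) + mlp(max_desc))   # (C,)


def spatial_attention(f, kernel):
    """Formula (14). f: (C, H, W); kernel: (2, k, k) with odd k."""
    pooled = np.stack([f.mean(axis=0), f.max(axis=0)])   # F^s_avg, F^s_max
    k = kernel.shape[-1]
    r = k // 2
    padded = np.pad(pooled, ((0, 0), (r, r), (r, r)))
    _, h, w = f.shape
    response = np.zeros((h, w))
    for dy in range(k):
        for dx in range(k):
            weights = kernel[:, dy, dx][:, None, None]
            response += (weights * padded[:, dy:dy + h, dx:dx + w]).sum(axis=0)
    return sigmoid(response)   # (H, W)


def cbam(f, w0, w1, kernel):
    """Apply channel attention first, then spatial attention."""
    refined = channel_attention(f, w0, w1)[:, None, None] * f
    return spatial_attention(refined, kernel)[None, :, :] * refined


rng = np.random.default_rng(0)
C, H, W, r = 8, 10, 10, 4
feature = rng.standard_normal((C, H, W))
w0 = rng.standard_normal((C // r, C)) * 0.5
w1 = rng.standard_normal((C, C // r)) * 0.5
kernel = rng.standard_normal((2, 7, 7)) * 0.1
out = cbam(feature, w0, w1, kernel)
print(out.shape)  # (8, 10, 10)
```

Both attention maps lie strictly between 0 and 1, so the module can only rescale activations; it never changes the shape of the feature map, which is why it can be inserted in front of the detection heads without other changes.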

Kalman tracking
To overcome frame loss in real-time detection, this paper tracks high-altitude parabolic objects by connecting the improved multi-scale detection heads to an external Kalman filter. The detection network provides the predicted bounding box and its center coordinates, and the Kalman filter predicts the target position in the next frame from its position in the previous frame, thereby realizing high-altitude parabolic tracking and tracing. The Kalman filter model includes two parts: the target state equation and the measurement equation.

$$X_k = A_{k,k-1} X_{k-1} + B U_{k-1} + W_{k-1} \tag{15}$$

$$Z_k = H_k X_k + V_k \tag{16}$$

In formulas (15)-(16), $X_k$ and $X_{k-1}$ are the actual state vectors of the system at times $t_k$ and $t_{k-1}$, $A_{k,k-1}$ is the state transition matrix from $t_{k-1}$ to $t_k$, $B$ is the system control matrix, $Z_k$ is the measurement vector at time $t_k$, $H_k$ is the measurement matrix, and $U_{k-1}$ is the system input. $W_{k-1}$ denotes the process noise of the system and $V_k$ denotes the measurement noise of the system. The Kalman filtering process includes prediction and correction, and target tracking is realized in three steps. First, the detection network predicts the target position, the Kalman filter is initialized with $\hat{X}_{k-1}$ and $P_{k-1}$, and the current frame time $t_k$ is recorded. Second, state estimation is performed as in formulas (17)-(18): using the current frame time $t_k$ and the previous frame time $t_{k-1}$, the Kalman prediction yields the predicted target state $\hat{X}_k^-$ and the predicted covariance $P_k^-$, and the best match is searched for in the region to obtain the measured target state $Z_k$. Last, the state is updated as in formulas (19)-(21): the measured state vector $Z_k$ is used as the input of the Kalman filter to obtain the updated state $\hat{X}_k$ and the corrected covariance $P_k$.

$$\hat{X}_k^- = A_{k,k-1} \hat{X}_{k-1} + B U_{k-1} \tag{17}$$

$$P_k^- = A_{k,k-1} P_{k-1} A_{k,k-1}^T + Q_{k-1} \tag{18}$$

$$K_k = P_k^- H_k^T (H_k P_k^- H_k^T + R_k)^{-1} \tag{19}$$

$$\hat{X}_k = \hat{X}_k^- + K_k (Z_k - H_k \hat{X}_k^-) \tag{20}$$

$$P_k = (I - K_k H_k) P_k^- \tag{21}$$

$\hat{X}_k^-$ represents the a priori state estimate, $\hat{X}_k$ and $\hat{X}_{k-1}$ are the a posteriori state estimates, and $\hat{X}_k$ is the optimal estimate after filtering. $P_k^-$ represents the covariance of the a priori estimate, and the Kalman gain $K_k$ is obtained by minimizing the estimation error. $P_k$ represents the covariance of the a posteriori estimate, which is used as the input of the next iteration after updating. $R_k$ is the measurement noise covariance and $Q_{k-1}$ is the process noise covariance. $Z_k - H_k \hat{X}_k^-$ reflects the error between the predicted value $H_k \hat{X}_k^-$ and the actual measurement $Z_k$.
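The prediction and correction steps of formulas (17)-(21) can be sketched for the center of a detected box. The state model below is an assumption made for illustration: a constant-velocity state [cx, cy, vx, vy] with no control input (B U = 0), hand-picked noise covariances, and synthetic detections in which two frames are missing. The class name `KalmanTracker` and all numeric values are hypothetical.

```python
import numpy as np


class KalmanTracker:
    """Constant-velocity Kalman filter for a box center (cx, cy)."""

    def __init__(self, cx, cy, dt=1.0, q=1e-2, r=1.0):
        self.x = np.array([cx, cy, 0.0, 0.0])   # state estimate
        self.P = np.eye(4) * 10.0               # state covariance
        self.A = np.eye(4)                      # state transition matrix
        self.A[0, 2] = self.A[1, 3] = dt
        self.H = np.zeros((2, 4))               # measurement matrix
        self.H[0, 0] = self.H[1, 1] = 1.0
        self.Q = np.eye(4) * q                  # process noise covariance
        self.R = np.eye(2) * r                  # measurement noise covariance

    def predict(self):
        self.x = self.A @ self.x                        # formula (17), B U = 0
        self.P = self.A @ self.P @ self.A.T + self.Q    # formula (18)
        return self.x[:2].copy()

    def update(self, z):
        s = self.H @ self.P @ self.H.T + self.R
        k = self.P @ self.H.T @ np.linalg.inv(s)        # formula (19)
        residual = np.asarray(z, dtype=float) - self.H @ self.x
        self.x = self.x + k @ residual                  # formula (20)
        self.P = (np.eye(4) - k @ self.H) @ self.P      # formula (21)
        return self.x[:2].copy()


# Synthetic target moving 3 px/frame in x and 5 px/frame in y;
# the detector is assumed to miss frames 6 and 7.
truth = [(100.0 + 3.0 * t, 50.0 + 5.0 * t) for t in range(10)]
detections = [None if t in (6, 7) else truth[t] for t in range(10)]

tracker = KalmanTracker(*detections[0])
track = [detections[0]]
for z in detections[1:]:
    position = tracker.predict()
    if z is not None:
        position = tracker.update(z)   # correct with the detection
    track.append((float(position[0]), float(position[1])))

for t, (estimate, actual) in enumerate(zip(track, truth)):
    print(t, estimate, actual)
```

On frames without a detection only the prediction step runs, so the track continues along the estimated velocity until the detector recovers the target.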

OTFH self-made dataset
Because there is no public dataset for high-altitude parabolic objects, this paper uses the self-made OTFH high-altitude parabolic dataset for the experiments. The OTFH dataset targets daily-life and office areas and includes images of six categories of objects: bottles, books, boxes, balls, iron blocks and pockets. There are 1000 samples in each category, for a total of 6000 samples. Sample images of this dataset are shown in Fig. 8.

Implementation details
The experiments are run on Windows 10 with an NVIDIA GeForce GTX 1080Ti graphics card. The learning framework is PyTorch, the programming language is Python 3.7, and the programming tool is PyCharm. Training uses stochastic gradient descent (SGD) with a mini-batch size of 16 for 200 epochs. Warmup is used to warm up the learning rate during training: the learning rate of 0.01 is warmed up over 10 epochs to improve the stability of deep model training. After warmup, the cosine annealing algorithm is used to update the learning rate [10]. The CIOULoss function is used to calculate the position loss, the BCEWithLogitsLoss function to calculate the confidence loss, and the FocalLoss function to calculate the target class loss.
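The learning rate schedule described above can be sketched in plain Python. The base rate of 0.01, the 10 warmup epochs and the 200 total epochs come from the text; the linear shape of the warmup and the final rate of zero are assumptions, since the paper does not state them, and the function name is hypothetical.

```python
import math


def learning_rate(epoch, base_lr=0.01, warmup_epochs=10, total_epochs=200, min_lr=0.0):
    """Linear warmup followed by cosine annealing (epoch is zero-based)."""
    if epoch < warmup_epochs:
        # Assumed linear ramp up to base_lr over the warmup epochs.
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


schedule = [learning_rate(e) for e in range(200)]
for e in (0, 9, 10, 105, 199):
    print(e, round(schedule[e], 6))
```

The rate peaks at the end of warmup and then decays smoothly, which avoids large early updates while the weights are still far from a good solution.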

Evaluation metrics
To quantitatively evaluate the detection performance of the network, precision (P), recall (R), average precision (AP) and the F1 value are used as evaluation indicators, as in formulas (22)-(25). The positive samples of the OTFH dataset are the high-altitude parabolic targets, and the image background and targets of other categories are regarded as negative samples.

$$P = \frac{TP}{TP + FP} \tag{22}$$

$$R = \frac{TP}{TP + FN} \tag{23}$$

$$AP = \int_0^1 P(R)\, dR \tag{24}$$

$$F1 = \frac{2 \times P \times R}{P + R} \tag{25}$$

Here TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives. A target is counted as detected when the intersection over union (IoU) of the prediction box and the ground-truth box is greater than 0.5.

Fig. 9 and Fig. 10 show the detection precision and recall curves of each model on the OTFH validation set. In the ablation experiments, the detection precision and recall of GPK-YOLOv5s rise rapidly and are significantly higher than those of the other models, reaching 99% and 98.6% respectively. Comparing the curves shows that embedding each module improves the detection network. The detection error of this method converges faster, the model stabilizes sooner, and the generalization ability of the model is enhanced.
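The IoU criterion and the precision, recall and F1 definitions of formulas (22), (23) and (25) can be written out in a few lines. The sketch below is illustrative: the function names and the toy boxes and counts are hypothetical, and boxes are assumed to be given as (x1, y1, x2, y2) corners. The last lines recompute F1 from the precision and recall reported in this paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def f1_score(p, r):
    """Formula (25): harmonic mean of precision and recall."""
    return 2.0 * p * r / (p + r) if (p + r) > 0 else 0.0


def precision_recall_f1(tp, fp, fn):
    """Formulas (22), (23) and (25) from raw counts."""
    p = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    r = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return p, r, f1_score(p, r)


# Two 2x2 boxes that overlap in a 1x1 square: below the 0.5 threshold.
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
print(overlap, overlap > 0.5)

# Hypothetical counts: 8 true positives, 2 false positives, 2 false negatives.
print(precision_recall_f1(tp=8, fp=2, fn=2))

# F1 recomputed from the reported precision (99.0%) and recall (98.6%).
reported_f1 = f1_score(0.990, 0.986)
print(round(reported_f1, 3))  # 0.988
```

The recomputed value agrees with the 98.8% F1 value reported for GPK-YOLOv5s, which is a quick consistency check on the three headline numbers.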

Test results
Table 1 shows the test results of each model on the OTFH dataset; mAP/0.5 denotes the mean average precision at an IoU threshold of 0.5. In Tables 1-3, v5s represents YOLOv5s. Taking model complexity into account, the ablation experiments show that GPK-YOLOv5s achieves high-precision and rapid detection of high-altitude parabolic objects. Compared with v5s, v5s+C3G requires less computation (6.7 GFLOPs fewer) and has higher detection precision. The detection precision, recall, F1 score and mAP/0.5 of v5s+C3G+C3T are 0.4%, 0.8%, 1.1% and 0.6% higher than those of v5s, respectively. The C3G module lightens the backbone and extracts effective feature information, and the C3T module captures global context information. The detection precision, recall, F1 score and mAP/0.5 of v5s+C3G+C3T+CRF are 2.0%, 2.9%, 2.5% and 2.0% higher than those of v5s, respectively, and the computational cost is significantly reduced.

Through ablation experiments, GPK-YOLOv5s has made a great breakthrough on these difficult-to-identify targets. Its detection precision for bottles, books, boxes, balls, iron blocks and pockets is 7.9%, 1.2%, 0.4%, 0.2%, 5.7% and 1.2% higher than that of YOLOv5s, respectively. The ablation results show that each optimization module plays an important role in this method. Table 3 shows the recall of each model for the various high-altitude parabolic objects in the OTFH dataset. YOLOv5s has a higher recall than YOLOv3 and YOLOv3-tiny. Compared with YOLOv5s, GPK-YOLOv5s increases the recall for bottles, books, boxes, balls, iron blocks and pockets by 3.5%, 1.9%, 2.0%, 6.5%, 4.7% and 5.8%, respectively.

High-altitude parabolic targets captured by the camera are generally small. This paper therefore compares GPK-YOLOv5s with recent small target detection methods to verify its effectiveness for small target detection against complex backgrounds. The comparison results are shown in Table 4.

Other methods           Pixels     mAP/0.5    FPS
RSRGAN [11]             640*512    98.0%      55.0
YOLOv4-KCF [12]
Improved SSD [13]
Improved YOLOv3 [14]

Table 4 shows the comparison between GPK-YOLOv5s and other small target detection methods. RSRGAN [11] detects small targets faster in lower-resolution images, and its mAP reaches 98.0%, which is 0.5% lower than that of GPK-YOLOv5s. The accuracy of the improved SSD [13] for small target detection reaches 96.9%, but its detection speed is slow. YOLOv4-KCF [12] has a lower mAP for small targets. The detection accuracy of the proposed GPK-YOLOv5s for high-altitude parabolic objects in low-resolution images reaches 98.5% and its detection frame rate reaches 41.7 FPS, which compares favorably overall with the other small target detection methods listed.

Conclusion
In the intelligent detection network, the backbone integrates C3Ghost to extract effective features and simplify the network. C3Transformer is embedded between the feature extraction and fusion layers to strengthen feature extraction and capture global context information. CARAFE is used for upsampling, which works well for pixel-level prediction. The new large-scale feature map detection layer has smaller receptive fields, which helps detect small targets. The multi-scale detection heads fused with CBAM enhance the expression ability of targets. The effectiveness of the proposed method is verified through ablation experiments. The detection results show that the detection precision, recall and F1 value of GPK-YOLOv5s for high-altitude parabolic objects reach 99.0%, 98.6% and 98.8% respectively. The computational cost is reduced by 0.4 GFLOPs and the detection frame rate is 41.7 FPS, which is significantly better than the other models. To overcome frame loss during real-time detection, a Kalman filter is connected to the CBAMheads of the detection network to track high-altitude parabolic objects. In future research, the training procedure needs to be optimized, taking prediction from limited samples into account and reducing the dependence of the network on large amounts of data.

Fig. 6 CARAFE module (a feature map of size $C \times H \times W$ is upsampled by a factor of $\sigma$ (= 2))

As shown in Fig. 6, CARAFE is composed of a Kernel Prediction Module and a Content-aware Reassembly Module. The Kernel Prediction Module, which generates the reassembly kernels, is composed of a Channel Compressor, a Content Encoder and a Kernel Normalizer. With $l' = (i', j')$ a target location on the output $X'$, $l = (i, j)$ the corresponding source location on the input $X$, and $N(X_l, k)$ the $k \times k$ neighborhood of $X_l$, the kernel prediction and reassembly steps are given by formulas (8) and (9).

$$W_{l'} = \psi(N(X_l, k_{encoder})) \tag{8}$$

$$X'_{l'} = \phi(N(X_l, k_{up}), W_{l'}) \tag{9}$$

For the kernel prediction module, each source location on $X$ corresponds to $\sigma^2$ target locations on $X'$. Each target location requires a $k_{up} \times k_{up}$ reassembly kernel, where $k_{up}$ is the reassembly kernel size. This module therefore outputs reassembly kernels of size $C_{up} \times H \times W$, where $C_{up} = \sigma^2 k_{up}^2$. For each reassembly kernel $W_{l'}$, the Content-

The experimental dataset is divided into training set : verification set : test set = 6:2

Fig. 8 Sample images of the OTFH self-made dataset

Fig. 9 Comparison of Precision of the validation set

For convenience, the C3Ghost module, C3Transformer module, CARAFE module and multi-scale detection heads with CBAM are denoted as C3G, C3T, CRF and CBAMHeaders, respectively. The detection network of GPK-YOLOv5s therefore includes C3G, C3T, CRF and CBAMHeaders. To verify the effectiveness of each module in GPK-YOLOv5s, ablation experiments were carried out on the OTFH dataset.

Fig. 10 Comparison of Recall of the validation set

Fig. 11 Loss comparison for training

Fig. 11 shows the loss values of each model over 200 training epochs on the OTFH dataset. Through the series of ablation experiments, the loss of GPK-YOLOv5s was reduced to 0.0210, which is 0.0117 lower than that of YOLOv5s.

GFLOPs denotes billions of floating-point operations and measures the computational cost of a model. Considering the real-time performance of detection, one-stage detectors such as SSD, YOLOv3, YOLOv3-tiny and YOLOv5 are used for comparison. YOLOv3-tiny is a lightweight version of YOLOv3.

Table 2 shows the detection precision of each model for the various high-altitude parabolic objects in the OTFH dataset. SSD and YOLOv3 have low detection accuracy and speed for high-altitude parabolic objects. In addition, most of the bottles are transparent, and the iron blocks move fast and change greatly in apparent shape, so they are difficult to identify.

MATEC Web of Conferences 363, 01012 (2022), AMME 2022, https://doi.org/10.1051/matecconf/202236301012