Semi-Supervised Person Detection in Aerial Images with Instance Segmentation and Maximum Mean Discrepancy Distance

Detecting sparse, small, lost persons that occupy only a few pixels in high-resolution aerial images remains an important and difficult task, in which accurate monitoring and intelligent co-rescue play a vital role for search and rescue (SaR) systems. However, many problems remain unsolved in existing remote-vision-based SaR systems, such as the shortage of person samples in SaR scenarios and the low tolerance of small objects to bounding-box deviation. To address these issues, a copy-paste mechanism with instance segmentation (ISCP) for semi-supervised object detection (SSOD), together with a maximum mean discrepancy distance (MMD), is proposed, which can provide highly robust, multi-task, and efficient aerial-based person detection for the prototype SaR system. Specifically, numerous pseudo-labels are obtained by accurately segmenting the instances of synthetic ISCP samples to obtain their boundaries. The SSOD trainer then uses soft weights to balance the prediction entropy of the loss function between ground truth and unreliable labels. Moreover, a novel MMD-based evaluation metric for anchor-based detectors is proposed to elegantly compute the similarity of bounding boxes. Extensive experiments and ablation studies on the Heridal and optimized public datasets demonstrate that our approach is effective and achieves state-of-the-art person detection performance in aerial images.


Introduction
Search and rescue (SaR) of survivors is a race against time, which makes the construction of aerial SaR systems highly significant. Traditional methods put both those awaiting rescue and SaR workers in danger because of difficult terrain and inefficient rescue measures. Unmanned aerial vehicles (UAVs) can provide immediate situational awareness over large areas, making SaR operations much cheaper and safer by reducing the time and number of rescuers needed in emergencies. In recent years, drone-based SaR missions, which allow quick aerial views of huge regions with potentially difficult-to-reach terrain, have been developed in many countries [1][2][3][4]. While object detection, a crucial step in SaR missions, has advanced somewhat, it is still far from meeting the technical requirements for ground application, which warrants further research.
In recent years, aerial-based person detection (APD) has gradually become a hot and challenging research topic in the field of low-altitude remote sensing. Researchers have attempted to create datasets of persons in aerial images and have obtained some promising results by utilizing existing creative object detection algorithms based on natural images; however, the results have been weak and lack robustness [5][6][7][8]. Compared to remote sensing objects [9][10][11][12][13][14][15], such as ships, vehicles, and airplanes, persons in aerial images are frequently costly to identify in SaR scenes, difficult to label manually, covered by fewer available datasets, and subject to greatly varying multi-view shooting conditions. As a result, detecting persons is more difficult, which prevents the implementation of applications. The key step in APD is to collect a sufficient number of clear drone-based person objects and labeled instances, while also addressing the person's weak features and sparse distribution in high-resolution aerial images. Existing methods present a range of problems which urgently need to be solved, as discussed below.
To tackle the few-shot problem, some popular approaches aim to obtain enough pseudo-labels through semi-supervised learning, which can effectively improve the performance of anchor-based object detectors. Since there are few person labels in SaR scenes, the detector needs to be trained on unlabeled data generated by synthetic or online techniques. Several studies have used transfer learning [16] and active learning [17] to achieve high-accuracy results with less training data, using consistency-based and pseudo-label-based methods. First, [18] used unlabeled data to learn the model's consistency, while [19] created pseudo-labels to enable model training. The authors of [20] proposed a consistency-based semi-supervised object detection (CSD) method that works on both single-stage and two-stage detectors by flipping the unlabeled samples horizontally and feeding them into the feature map of the detector network to calculate the consistency loss at the corresponding positions. In [21], hard pseudo-labels were used for self-training, and consistency regularization was incorporated into the training as a data augmentation principle. The authors of [22] devised a co-rectify scheme in which two models are trained to check and correct each other's pseudo-labels, which prevents false predictions from accumulating and improves model accuracy by obtaining pseudo-labels online. In [23], the average RPN score of the proposal, identified through multi-stage learning, was used as an image-level measure of uncertainty to generate pseudo-labels and alleviate the overfitting to label noise caused by direct fitting. The authors of [24] observed that there is a natural category imbalance in object detection and designed an "unbiased teacher", which uses the "mean teacher" structure to generate pseudo-labels that supervise both the RPN and ROI heads.
However, existing works focus solely on confidence scores for pseudo-labels, which cannot guarantee the localization accuracy of pseudo-boxes; in terms of consistency training, the widely used random-resizing training only considers label-level consistency, while ignoring feature-level consistency. In this article, cooperative training strategies are designed with a soft threshold between reliable labels and pseudo-labels, and candidate bounding boxes are sorted by predictive probability entropy to improve detection performance.
To detect small and indistinct persons in high-resolution aerial images, anchor-based detectors with intersection over union (IoU) for label assignment can be a reliable choice. IoU, the ratio of two boxes' intersection to their union, measures the overlap of the ground truth (GT) and prediction bounding boxes (P-BBs) and has been used to assess the performance of object detection and segmentation. In anchor-based object detectors, several candidate bounding boxes are often generated simultaneously by the model; the boxes are sorted by confidence, the IoU [25] between them is calculated to determine which one is the real object, and the others are deleted by non-maximum suppression (NMS). Some typical algorithms for optimizing IoU, for example, ref. [25], determine the similarity of two boxes in terms of their shapes, but cannot reflect their distance and aspect ratio. The authors of [26] addressed the comparison of two non-overlapping bounding boxes by generalizing the way the boxes' overlap is calculated; however, this still does not capture how close two bounding boxes are in terms of distance and aspect ratio. In [27], the overlap, center distance, and scale of the two boxes were all considered, and the distance between them was minimized, which made the loss function converge more quickly. To better reflect the difference between the two boxes, the CIoU of [27] added an image similarity influence factor to the DIoU of [27]. Ref. [28] decomposed the aspect ratio and added an anchor-based focal quality weighting to improve regression accuracy, since adding aspect ratios to CIoU can make it difficult for the model to find the best way to maximize similarity.
Although the above methods have been shown to be effective in detecting natural images, using a fixed IoU threshold in aerial images could result in the loss of the small-scale object boxes, and the performance of detectors could be severely harmed by a person's slight deviation.
In this article, we propose a simple and efficient semi-supervised learning method for detecting persons in high-resolution aerial images. We introduce an instance segmentation copy-paste mechanism (ISCP) to increase the number of hard examples, and solve the training of GT and pseudo-labels by use of SSOD in unlabeled images. A maximum mean discrepancy distance (MMD) [29] is designed to improve the anchor-based detector's evaluation metrics in SaR scenes. This work represents a significant step forward for person detection in aerial images.
In summary, the article makes the following contributions:
• A semi-supervised learning strategy with pseudo-labels is developed for SaR scenarios, which mainly consists of three steps: training high-confidence teacher models from reliable labels, data augmentation with the instance segmentation copy-paste mechanism, and use of a teacher-student model with a consistency loss function.
• To further boost performance and efficiency, the algorithm utilizes the MMD distance to evaluate the detector's metrics, which can be easily embedded in single-stage and two-stage object detectors.
• The detection results on both public and optimized datasets are compared, as well as against other detection algorithms. The experimental results show that our proposed method achieves SOTA performance.
• To explore the robustness of person detection from aerial images for different SaR applications, datasets with multiple scenes are created and evaluated for non-commercial purposes.
The work presented in the article is structured as follows: Section 2 briefly describes anchor-based detectors with semi-supervised learning and maximum mean discrepancy distance for person detection, and presents a precise description of our approach. Section 3 describes a detailed investigation of detection results and quantitative performance evaluation with and without our optimized datasets and improved algorithms. Moreover, we discuss how different training strategies and hyperparameter settings affect detector performance. Section 4 summarizes the work and discusses future directions.

General Framework of Our Proposed Approach
In this work, we present a person detection method for aerial images. The overall framework, along with its technical components, is illustrated in Figure 1. The system is mainly categorized into four different modules: an object implantation module (OIM), a semi-supervised training module (SSTM), a maximum mean discrepancy distance evaluation module (MMD) and a detector. The OIM's primary function is to generate masks from hard or undetected samples and to implant them around the original objects, while retaining information about other objects. The SSTM takes real labels and high-confidence pseudo-labels to train and update the model by setting different weight ratios between the source teacher and the adapted student. The MMD evaluates the model's performance by approximating the object's bounding boxes (BBs) as a new two-dimensional distribution function and comparing the distribution differences between the candidate P-BBs and the ground truth (GT). Any two-stage or single-stage anchor-based detector is acceptable as the detector. We adopt a practical and efficient detection paradigm available today for person detection, i.e., [30], which provides multiple models pretrained on the MS-COCO and VOC datasets and incorporates many tricks from the latest detection algorithms. An overview of the framework is provided in the following subsection. The OIM is used to generate rich and diverse samples with ISCP; the SSTM improves the generalization ability of the model by iteratively training on real and pseudo-labels; the MMD is an optimized bounding box evaluation method that replaces IoU in the performance metrics. All of the above modules are embedded into the detector. Note that there are slight modifications in the tactics employed in the training and detection phases.

Anchor-Based Detectors with Maximum Mean Discrepancy Distance
The maximum mean discrepancy distance (MMD) [29] tests whether two samples are drawn from the same distribution. In transfer learning, MMD is a common metric for measuring how similar the source and target domains are in a reproducing kernel Hilbert space, where the training and test sets are taken from different but related distributions. Existing IoU evaluation methods have low tolerance for tiny object bounding boxes [31]: for example, when computing IoU in the green box of Figure 2, a slight deviation reduces the value of IoU, and the same deviation degrades objects of different sizes to varying degrees (in our experiments: small: IoU AB = 0.486, IoU AC = 0.091 (0.395↓); medium: IoU AB = 0.623, IoU AC = 0.375 (0.248↓); large: IoU AB = 0.755, IoU AC = 0.613 (0.142↓)). A better metric for tiny objects can be designed based on the maximum mean discrepancy distance, since it consistently reflects the distance between distributions even when they do not overlap. Therefore, the new metric has better properties than IoU for measuring the similarity between tiny objects. The details are as follows. The pixels of tiny objects and backgrounds in a rectangular bounding box are skewed and do not accurately represent the genuine object boundaries: more object pixels are found towards the center, while more background pixels are found near the bounding box edges. To better describe the weights of the different pixels in the bounding box, the box can be modeled as a two-dimensional K-rank Gaussian distribution, where the center pixel of the bounding box has the highest weight and the importance of pixels decreases from the center to the boundary. In this article, we follow the paradigm of taking the center point of the bounding box as the mean vector of the Gaussian distribution.
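The scale sensitivity described above can be reproduced with a few lines of code. The sketch below (not from the paper's implementation; box sizes are illustrative) applies the same 4-pixel shift to square boxes of three sizes and shows that the IoU of the tiny box degrades far more than that of the large box:

```python
import numpy as np

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Shift each square box by the same 4-pixel offset and compare the IoU drop.
for size in (16, 64, 256):          # tiny, medium, large
    gt   = (0, 0, size, size)
    pred = (4, 4, size + 4, size + 4)
    print(size, round(iou(gt, pred), 3))
```

The same deviation leaves the large box with a high IoU but can push the tiny box below a typical positive-match threshold, which is exactly the failure mode the MMD-based metric is meant to avoid.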
The horizontal bounding box is defined as R = (bb_cx, bb_cy, bb_w, bb_h), where (bb_cx, bb_cy), bb_w, and bb_h denote the center coordinates, width, and height, respectively. The inner ellipse of the horizontal rectangular box is shown in the Figure 2 distance evaluation module; the closer the inner ellipse is to the rectangular boundary, the higher the rank. The inscribed ellipse can be represented as

(x − bb_cx)² / σ_x² + (y − bb_cy)² / σ_y² = 1, (1)

Equation (1) translates the samples from the original space to a high-dimensional space using a Gaussian distribution function; applying a Gaussian kernel function then yields linearly separable samples in the high-dimensional space. Furthermore, the similarity between the bounding boxes A and B can be converted to the distribution distance between two Gaussian distributions,
where σ_x, σ_y are the lengths of the semi-axes along the x- and y-axes. The larger of σ_x and σ_y from Equations (2) and (3) is chosen as the bandwidth of the Gaussian kernel. Multiple sample points within the rectangular box distribution function can be obtained using Equation (1) to generate the two bounding box sample spaces of Equations (4) and (5), and multiple rectangular boxes are processed in the same way.
The moments of a random variable can be used to characterize high-dimensional random events that have no direct distribution variables. Because the mean and variance do not always fully characterize a distribution, higher-order moments are required. When two distributions are not identical, the moment with the greatest difference between them should be used as the measure of the two distributions. MMD [29] is a typical loss function used in transfer learning and is frequently used to estimate the distance between two distributions, which can be simplified as Equation (6): where BB_1 and BB_2 are two bounding boxes of the sample space, f(N) and f(M) are probability density functions of BB_1 and BB_2, and φ(N) and φ(M) denote distribution functions of BB_1 and BB_2. Thus, Equation (6) gives the largest difference H in the mean distances between the two distributions φ(N) and φ(M) in the Hilbert space F.
where n samples are assumed in the source domain (BB 1 ) and m samples are assumed in the target domain (BB 2 ). Until now, the key to MMD has been to select an appropriate φ(x) as the mapping function. The Gaussian kernel method is useful since it does not require explicit representation of the mapping function to obtain the inner product of two vectors. As shown in Equation (7), the MMD is squared, simplified to obtain the inner product, and written down as a kernel function.
where the inner product of x_i, y_j in the feature space equals the result of their computation by the kernel function k in the original sample space. This is frequently simplified to the matrix form of Equations (8) and (9) to aid calculation, where K is the kernel matrix and L is the MMD matrix. The K matrix can be fed into the Gaussian kernel function to calculate the distance between two bounding boxes, and so on. After calculating the distances between multiple bounding boxes, the box with the smallest distance is chosen as the positive sample matching result, while the others are considered negative samples. As shown in Figure 2, the UAV-recorded person samples are used to train the detection model. The model learns the new features of the dataset and performs detection using the testing dataset. The additionally trained improved layer is interfaced with the previously trained model via transfer learning. The results of the trained and pretrained models are assessed and processed for SaR purposes.
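The pipeline above, from modeling each box as a 2-D Gaussian to comparing the boxes with a kernelized MMD, can be sketched in NumPy. This is an illustrative reconstruction, not the paper's code: the sampling counts, the biased MMD² estimator, and the function names are our own choices, while the bandwidth rule (the larger semi-axis, per Equations (2) and (3)) follows the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def box_samples(cx, cy, w, h, n=200):
    """Sample points from a 2-D Gaussian fitted to a box: mean at the
    box center, std = the semi-axes (half-width, half-height)."""
    return rng.normal([cx, cy], [w / 2, h / 2], size=(n, 2))

def gaussian_kernel(x, y, sigma):
    """Pairwise Gaussian kernel matrix between two point sets."""
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(x, y, sigma):
    """Biased estimate of squared MMD between two sample sets,
    i.e. mean k(x,x') + mean k(y,y') - 2 mean k(x,y)."""
    return (gaussian_kernel(x, x, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean()
            - 2 * gaussian_kernel(x, y, sigma).mean())

a    = box_samples(50, 50, 8, 8)     # ground-truth box
near = box_samples(52, 52, 8, 8)     # slightly shifted prediction
far  = box_samples(80, 80, 8, 8)     # distant prediction
sigma = 4.0                           # bandwidth = larger semi-axis
# The nearby box yields a smaller MMD than the distant one, even
# when neither prediction overlaps the ground truth at all.
```

Unlike IoU, the MMD of two disjoint boxes still shrinks smoothly as the boxes approach each other, which is what makes it usable for ranking candidate matches of tiny objects.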

Synthetic Data Generation by Object Implantation
Samples from SaR are typically associated with non-urban and hidden terrains, so the number of available person instances is inevitably small. The OIM generates a sufficient number of pseudo-labels in unlabeled images, while its ISCP mechanism expands the set of object instances without affecting the accuracy metrics of the detectors during training. As shown in Figure 3, this incorporates the following three steps:

• The detector infers unlabeled images using the pretrained model and computes IoU matching between the GT and P-BBs to obtain undetected object instances.
• The generator obtains masks of undetected object instances (as shown in Figure 3).
• Synthetic samples are created by randomly combining unreliable instances and backgrounds according to the objects' mask library (OML), which comes from the results of instance segmentation.
To optimize the model parameters, the original samples are fed back into the trainer along with the pseudo-labels. However, because of their unreliability, the majority of the pixels may not be used. We treat all prediction probabilities of objects in a unified loss function with different weights on the unlabeled and labeled data, using a dual input layer to optimize the prediction entropy of the loss function between the GT and unreliable labels. The mask generator shown in Figure 4 is based on [30] with the added semantic segmentation head of [32], and creates a segmented dataset on the detected cropped maps. As shown in Figure 4, the ARM and FFM modules are added to the neck of [30]; these implement a small-stride spatial path that maintains spatial position data to build high-resolution feature maps, and a semantic path with a fast down-sampling rate to obtain sufficient receptive fields. The results are divided into two parts: the detection result of YOLOv5 with dimension 25,200 × 117, where the first 85 columns represent the result of each detection box and the last 32 columns represent the mask coefficients of each detection box; and the segmentation result of BiSeNet, which contains a prototype mask with dimension 32 × 81 × 81. Post-processing of the instance segmentation takes the weighted sum of the mask coefficients within the bounding box and the prototype mask to acquire the best instance segmentation performance. It is worth noting that the other modules are the same as [30]. For additional details on the implementation of object implantation with the instance segmentation copy-paste mechanism, please refer to Algorithm 1.
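The core of the ISCP step, pasting only the foreground pixels of a segmented instance into a background image and recording the resulting pseudo-label box, can be sketched as follows. This is a minimal illustration with a synthetic crop and mask; the function name, copy count, and array layout are our own assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def paste_instance(image, crop, mask, n_copies=3):
    """Paste a masked instance crop onto `image` at random positions
    and return the new image plus pseudo-label boxes (x1, y1, x2, y2).
    Only pixels where `mask` is True are copied, so no rectangular
    background patch is implanted with the instance."""
    out = image.copy()
    h, w = crop.shape[:2]
    H, W = image.shape[:2]
    boxes = []
    for _ in range(n_copies):
        y = int(rng.integers(0, H - h))
        x = int(rng.integers(0, W - w))
        region = out[y:y + h, x:x + w]
        region[mask] = crop[mask]          # copy only foreground pixels
        boxes.append((x, y, x + w, y + h))
    return out, boxes

# Toy example: a white 32 x 32 instance pasted onto a black background.
bg = np.zeros((256, 256, 3), dtype=np.uint8)
inst = np.full((32, 32, 3), 255, dtype=np.uint8)
msk = np.ones((32, 32), dtype=bool)
aug, labels = paste_instance(bg, inst, msk)
```

In the actual pipeline the crop and mask would come from the YOLOv5 + BiSeNet segmentation of the Patch library, and the returned boxes would be re-inferred by the pretrained detector before being kept as pseudo-labels.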

Pseudo-Label-Assisted Trainers with Semi-Supervised Learning
Typically, labeling takes more time and is much more costly than data, which is usually fairly easy to obtain. A portion of the data is labeled to create supervised data, after which additional unlabeled data can be used effectively. Semi-supervised learning has the following advantages:
• It significantly reduces the reliance of machine learning models on labeled data, especially in the early stages of a project.
• Even if the data is unlabeled, the distribution of the unlabeled data can provide a wealth of information to guide model iteration.
• In most cases, unlabeled data is easily accessible and plentiful; here, quantity is more important than quality. When used correctly, it can be extremely beneficial.
The most common and effective way to improve detection performance is self-training, which jointly trains the model using a small amount of labeled data and a large amount of unlabeled data. First, a model is trained using labeled data, with no constraints on the training method. The trained model is then used to predict the unlabeled data and obtain labels, referred to as pseudo-labels. Next, some samples are chosen from the pseudo-labels and placed in the labeled data; the selection method is flexible, for example, only pseudo-labels with high confidence may be kept. Finally, the model is retrained, and the iterations resume. Labeling the synthetic samples mentioned in Figure 3 one-by-one would be a time-consuming and labor-intensive task. We present a simple and effective semi-supervised training method that uses a high-confidence detection model to infer unlabeled images so that pseudo-labels are obtained, which are added to the original training data to strengthen the trainer's robustness. As shown in Figure 5, the model is trained with three datasets (original data, pseudo-labeled data, and original plus pseudo-labeled data) and the trainer outputs are fused at the decision level. Concretely, the loss of the trainer is computed on labeled data (Loss LabeledData) and on unlabeled data (Loss UnlabeledData), with soft weights of the total loss assigned between the different datasets. Meanwhile, the detection results are optimized with ensemble learning. For additional detail on the implementation of expanding samples with semi-supervised learning and model ensemble learning, please refer to Algorithm 2.
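The soft-weighted split between the labeled and pseudo-label losses can be written down concretely. The sketch below is a NumPy illustration under our own assumptions (per-box L1 loss, a confidence threshold `tau`, and a weighting factor `lam`); the paper's trainer operates on the YOLOv5 loss, but the weighting scheme is the same shape:

```python
import numpy as np

def pseudo_label_loss(pred, target, conf, tau=0.5):
    """Mean per-box L1 loss on pseudo-labels, softly weighted by the
    detector's confidence; boxes below the threshold tau contribute
    nothing (they are treated as unreliable)."""
    per_box = np.abs(pred - target).mean(axis=-1)       # (n_boxes,)
    soft_w = np.where(conf >= tau, conf, 0.0)
    return float((soft_w * per_box).sum() / max(soft_w.sum(), 1e-6))

def total_loss(loss_labeled, loss_pseudo, lam=0.5):
    """Total loss = Loss LabeledData + lam * Loss UnlabeledData,
    where lam is the soft weight between the two datasets."""
    return loss_labeled + lam * loss_pseudo

# One pseudo-box shifted by 1 px in every coordinate, confidence 0.9.
pred = np.array([[0.0, 0.0, 10.0, 10.0]])
target = np.array([[1.0, 1.0, 11.0, 11.0]])
lp = pseudo_label_loss(pred, target, np.array([0.9]))
```

Setting `lam` corresponds to the Loss_Weights ratio tuned later in the experiments; a low-confidence pseudo-box (below `tau`) is simply excluded rather than allowed to drag the gradient.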

Algorithm 2: Training process of semi-supervised learning and model ensemble learning based on the YOLOv5 detector
Input: The set of original labeled images for the current batch, L_bi; the set of unreliable labeled images for the current batch, U_bi; the sample ratio of labeled and unlabeled datasets, ζ; weight thresholds λ_1/λ_2; the co-training ensemble of detectors on former batches, CE_{n−1}.
Output: The co-training ensemble of detectors on the current batch, CE_n.
1: Extract the set of reliable ground truth and/or pseudo-labels P_bi from U_bi with the help of L_bi and the detectors;
2: Train the ensemble of detectors CE on L_bi ∪ P_bi, with the help of data in former batches, where the sample ratio is set to ζ;
3: CE_n = CE_{n−1} ∪ CE;
4: Train the ensemble of detectors with different weight thresholds λ_1 or λ_2 with the help of the data L_bi ∪ P_bi;
5: CE_n = CE_{n−1} ∪ CE / λ_1 or λ_2;
6: Detect samples in U_bi − P_bi with CE_n;
7: Delete some weak detectors from CE_n so as to keep the capacity of CE_n;
8: return CE_n;
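One batch of the loop above can be sketched as plain Python. The `ToyDetector` class and all thresholds are stand-ins invented for this illustration (a real implementation would wrap a YOLOv5 model); the sketch only shows the control flow: extract confident pseudo-labels, train on the union, grow the ensemble, and prune weak members to keep its capacity:

```python
class ToyDetector:
    """Stand-in for a trained detector: 'predicts' a label and a confidence."""
    def __init__(self, bias=0.0):
        self.bias = bias

    def predict(self, x):
        return x + self.bias, 0.9        # (pseudo-label, confidence)

    def train_on(self, data):
        pass                              # real training omitted in this sketch

def cotrain_batch(ensemble, detector, labeled, unlabeled,
                  conf_thr=0.5, capacity=5):
    """One batch of Algorithm 2: pseudo-label the unlabeled set, train on
    labeled ∪ pseudo, append the detector (CE_n = CE_{n-1} ∪ CE), and
    prune the ensemble back to `capacity` (step 7)."""
    pseudo = [detector.predict(x)[0] for x in unlabeled
              if detector.predict(x)[1] >= conf_thr]      # step 1
    detector.train_on(labeled + pseudo)                   # step 2
    ensemble.append(detector)                             # step 3
    return ensemble[-capacity:]                           # step 7

ens = []
for batch in range(3):
    ens = cotrain_batch(ens, ToyDetector(), [1.0, 2.0], [3.0, 4.0])
```

The pruning step is what keeps inference cost bounded as batches accumulate; without it, the ensemble would grow by one detector per batch indefinitely.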

Experimental Results and Analysis
This section focuses on benchmarking some representative object detectors and effectiveness verification of our proposed method. Firstly, the experimental settings, including the datasets, parameter settings and evaluation metrics, are presented. Then, a large-scale benchmark based on a comprehensive series of detectors is provided. Finally, the effectiveness of our proposed method is validated by extensive comparative experiments and ablation studies.

Datasets
Pixels smaller than 32 × 32 are generally viewed as tiny objects, as defined in the MS-COCO dataset. Research and experience show that persons essentially meet this condition in the drone scenario. Some typical publicly available datasets containing persons are described below:
• VisDrone dataset [33], which contains details of urban areas and neighborhoods. It includes 288 video clips, 261,908 frames, and 10,209 still images, with labels covering three domains: object detection, object tracking, and crowd counting.
• TinyPerson dataset [16], which consists of 1610 labeled images with 72,651 person instances and 759 unlabeled images, referring to dense and complex environments at sea or the seaside in faraway and large-scale scenes.
• Heridal dataset [5], based on images collected by unmanned helicopters, including 1650 high-resolution images (4000 × 3000) containing persons from non-urban areas, such as mountains, forests, oceans, and deserts.
• AFO dataset [4], which contains 3647 images with close to 40,000 labeled floating objects (human, wind/sup-board, boat, buoy, sailboat, and kayak).
In these datasets [4,5,16,33], we pay particular attention to samples of persons, removing the influence of other categories. Table 1 presents a comparison of the original and optimized datasets. The resolution of all samples in the optimized dataset is 1536 × 1536. Figure 6a-d refer to the samples, rectangular labels, center points, and aspect ratios of persons in the different datasets, respectively. These datasets have smaller label sizes than the MS-COCO dataset, with the person instances in the Heridal dataset having the smallest label size. Once the aspect ratios and centroids of the labels in these datasets are recorded, K-means clustering is first used to calculate the best preset anchor sizes for the different datasets. Samples from the four datasets [4,5,16,33] are combined to create a new dataset (VHTA), with the goal of validating our proposed methods in rich scenarios with numerous viewpoints on SaR scenes, and establishing a theoretical foundation for the acquisition of our own dataset in the future. The Heridal and TinyPerson datasets have no validation set, which is critical for the training of machine learning models, so the samples were re-partitioned in the optimized dataset.
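The anchor-preset step can be illustrated with plain K-means over (width, height) label pairs. This is a simplified sketch with toy data: real YOLO-style anchor clustering typically uses a 1 − IoU distance rather than the Euclidean distance used here, and the cluster counts and sizes below are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans_anchors(wh, k=3, iters=50):
    """Plain Lloyd k-means on (width, height) label pairs; the final
    cluster centers serve as preset anchor sizes."""
    centers = wh[rng.choice(len(wh), k, replace=False)].astype(float)
    for _ in range(iters):
        d = ((wh[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = wh[assign == j].mean(0)
    return centers

# Toy label sizes: two tight clusters of person boxes (w, h) in pixels.
wh = np.concatenate([rng.normal([12, 20], 1, (50, 2)),
                     rng.normal([28, 48], 2, (50, 2))])
anchors = kmeans_anchors(wh, k=2)
```

With well-separated label-size clusters, the recovered centers sit near the true per-cluster mean sizes, which is why datasets with very small persons (e.g. Heridal) end up with much smaller anchor presets than MS-COCO defaults.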

Parameter Settings
The experiments were carried out on a Red Hat 4.8.5-44 operating system with an Intel(R) Xeon(R) Gold 6134 CPU @ 3.20 GHz, two NVIDIA Tesla V100 GPUs (16 GB memory per card), CUDA v10.2 and cuDNN v8.0.1, PyTorch 1.8.0, JetBrains PyCharm Community Edition 2021.3.2 × 64, and Python 3.8. An ImageNet-pretrained ResNet-101 was used as the backbone. All models were trained using the stochastic gradient descent (SGD) optimizer for 300 epochs with a momentum of 0.9 and a weight decay of 0.0001. The initial learning rate was set to 0.01, with decay at epoch 10. The input resolution of images was [1536, 1536], and multiscale training was not used. Scratch-low hyperparameter values were used for the nano and small models, and scratch-high values for the rest. mAP@val was used for single-model, single-scale evaluation on these datasets. Test-time augmentation (TTA) included reflection and scale augmentation. In the inference stage, a preset score threshold of 0.25 was used to remove background bounding boxes, and NMS with an IoU threshold of 0.45 was used to obtain the top 3000 confident bounding boxes. The above training and inference parameters were used in all experiments unless specified otherwise.

Evaluation Metrics
Recently, the evaluation metrics for object detection have generally consisted of two categories: the prediction metrics (IoU) and the classification metrics with precision, recall, etc., where precision and recall merely show the proportion of accurate and inaccurate instance predictions to all cases. Current main object detectors generally set the default IoU to 0.45 in the training stage. The mAP (mean average precision), including mAP@0.5, which refers to the mean AP with an IoU threshold higher than 0.5, and mAP@0.5:0.95, which refers to the mean AP above the IoU threshold (from 0.5 to 0.95 with a step size of 0.05), is used to evaluate the detection performance of our proposed method. The models are also tested with the COCO evaluation parameters. Specifically, AP 0.5 means the IoU threshold defining true positive (TP) is 0.5, AP 0.75 means the IoU threshold defining TP is 0.75, and AP 0.5:0.95 means the average value from AP 0.5 to AP 0.95 , with an IoU interval of 0.05. Note that AP 0.5 , AP 0.75 and AP 0.5:0.95 take objects of all scales into consideration. Furthermore, in the Heridal dataset, AP l , AP m , and AP s are for large-, medium-, and small-scale evaluations, respectively. Average recall with a maximum detection number of 1, 10, and 100 is denoted by AR 1 , AR 10 , and AR 100 , respectively.
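The mAP@0.5:0.95 definition above is simply an average of AP over ten IoU thresholds. The sketch below makes that explicit; the toy `ap` curve is invented for illustration (real AP values come from precision-recall integration at each threshold):

```python
import numpy as np

def map_range(ap_fn, lo=0.5, hi=0.95, step=0.05):
    """mAP@0.5:0.95 = mean of AP evaluated at the IoU thresholds
    0.50, 0.55, ..., 0.95 (ten thresholds in total)."""
    thrs = np.arange(lo, hi + 1e-9, step)
    return float(np.mean([ap_fn(t) for t in thrs]))

# Toy AP curve that decays linearly as the IoU threshold tightens;
# in practice ap_fn(t) would be computed from matched detections.
ap = lambda t: 1.0 - t
m = map_range(ap)
```

Because tiny-object detections lose IoU quickly under small deviations, their AP collapses at the stricter thresholds, which is why mAP@0.5:0.95 penalizes small-object localization far more harshly than mAP@0.5 alone.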

Experiments on Optimized Heridal Dataset
The optimized Heridal_ForestPerson dataset was used in the experiments; the original Heridal represents the first real, high-resolution images captured for the purpose of person detection from UAVs. We demonstrate how our proposed approach can greatly improve the performance of object detectors by removing their sensitivity limitations when dealing with small objects. Several groups of experiments are considered, including experimental results for the MMD evaluator, analysis results with OIM and SSTM, and an ablation study, which are discussed below.

Experimental Results of MMD Evaluator
The CIoU loss of the YOLOv5 framework was replaced by our distance evaluation method and tested under models of different depths to validate the effectiveness of the MMD evaluator. The training curves are presented in Figure 7. As shown, our proposed method outperformed CIoU at AP 0.5 and AP 0.5:0.95, with the magnitude of the improvement varying with the depth of the model. As illustrated in Table 2, the pretrained model 'x' achieved better results in the MMD evaluation experiment, with the training evaluation metrics (recall, AP 0.5 (val), AP 0.5 (test), AP 0.5:0.95 (val), AP 0.5:0.95 (test)) improving over YOLOv5 by 3.27%, 3.0%, 3.77%, 1.5%, and 1.51%, respectively; however, precision decreased by 3.43%. The pretrained model 's' achieved better detection performance with only 6.7 M parameters and 15.2 GFLOPs, and its detection accuracy was acceptable to some extent. Table 2 also reveals that MMD performed better on the 's' model; hence, the 's' pretrained model was utilized in the following tests. It should be noted that, based on training experience, batch size has little or no effect on detection performance, so varied batch sizes were used simply to increase training efficiency and were not compared under the same batch size. In Table 2, the blue markers indicate the maximum percentage growth, and bold font indicates the optimal value of each metric.

Analysis Results with OIM and SSTM
Since there are not enough person examples in the Heridal_ForestPerson dataset, a copy-paste mechanism was used to insert instances into the original images, and semi-supervised learning was then employed to fine-tune the training model. Specifically, positive and negative examples from the Heridal Patch library (crops of ground-truth objects saved from all samples of the Heridal dataset, with a unified instance size of 81 × 81) were copied onto the samples in a specified proportion, with a repeat option to increase the number of samples. Then, 1000+ segmented samples from the Patch library were manually labeled as the training set for instance segmentation; all instances in the Patch library were segmented with the help of YOLOv5 and BiSeNet to obtain clean masks without redundant backgrounds, and these were then pasted into the original samples. As shown in Table 3, the same sample size was used for copy-paste with instance segmentation as for simple copy-paste. The original instance size in the Patch library was 81 × 81, but all instances were resized to 32 × 32 to ensure that they qualified as small objects. It is worth noting that no samples were added to the test subset; the same test subset as Heridal_ForestPerson's was maintained. Objects implanted by copy-paste with instance segmentation in Table 3 lack label information, so all newly generated samples must be re-inferred on the pretrained model to obtain high-confidence labels as pseudo-labels. As a result, pseudo-labels and real labels are added to the trainer in a certain proportion. The trainer's loss function is divided into two parts, the loss of real labels and the loss of pseudo-labels, and the complete loss is the weighted sum of the two. It should be noted that the CIoU loss of the YOLOv5 framework, rather than our proposed MMD-based loss, is used in this subsection.
As shown in Figure 8, object implantation with instance segmentation and semi-supervised learning with pseudo-labels effectively improve the training metrics (recall, AP 0.5, AP 0.5:0.95), but the improvement in precision is not obvious. The likely explanation is that increasing the number of instances raises the leakage rate considerably while having less influence on detection accuracy.
Figure 8. Training curves on the Heridal_ForestPerson dataset with object implantation and semi-supervised learning. (a-d) show the object detection metrics precision, recall, mAP_0.5, and mAP_0.5:0.95. The red, green, and blue curves show the training outcomes of YOLOv5, YOLOv5 + copy-paste with instance segmentation, and YOLOv5 + copy-paste with instance segmentation + semi-supervised learning, respectively. Different lines represent different object implantation strategies. All experiments used the 's' pretrained model.
Table 4 defines four critical fine-tuning parameters: Copy-paste is the number of object implantations; Scale is the scaling factor applied to implanted instances; Loss_Weights is the weight of the pseudo-label loss relative to the real-label loss; Repeat indicates whether repeated samples are used. These parameters were adjusted to obtain better detection performance. With the parameters [Copy-paste, Scale, Loss_Weights, Repeat] = [3, 0.5, 1, No], the Baseline method achieved its best results. With [3, 0.9, 0.5, Yes], the OIM method achieved its best results, and with [3, 0.9, 1, Yes], the OIM + SSTM method achieved its best results. We therefore believe that implanting more objects is not always better; too many instances may lead to model overfitting.
Second, the scaling of implanted instances should be kept as close to the native instance size as possible, since instances that are too large or too small degrade the model's performance. Owing to training-time constraints, the trainer used as many real labels as pseudo-labels and no further ablation experiments were performed, so there is still room for improvement. Sample repetition (Repeat) was used in all of our approaches to improve the model's robustness. With the above method, the best performance metrics [precision, recall, AP test 0.5, AP test 0.5:0.95] were [0.8914, 0.7162, 0.8117, 0.5436].
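The mask-guided paste at the heart of the ISCP mechanism can be sketched as below. This is a minimal, hypothetical version assuming single-channel images stored as nested lists; the function name and data layout are illustrative, not the authors' implementation.

```python
def paste_instance(image, instance, mask, top, left):
    """Paste a segmented instance into image at (top, left), copying only
    pixels where the binary mask is 1, so no redundant background from
    the Patch library crop is carried over."""
    h, w = len(instance), len(instance[0])
    for i in range(h):
        for j in range(w):
            if mask[i][j]:
                image[top + i][left + j] = instance[i][j]
    return image

# Example: paste a 2x2 instance with a diagonal mask into a blank image.
canvas = [[0] * 4 for _ in range(4)]
patch = [[9, 9], [9, 9]]
diag_mask = [[1, 0], [0, 1]]
paste_instance(canvas, patch, diag_mask, 1, 1)
```

Only the masked pixels land in the target image, which is what distinguishes segmentation-guided copy-paste from pasting the full rectangular crop.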

Error Analysis
To evaluate the causes of the decline in mAP, several error categories were designed; these are useful for analyzing the model's strengths and weaknesses and help determine which flaws a given trick corrects, thereby improving the mAP metrics [34]. As shown in Figure 9, the IoU thresholds t_f and t_b were set to 0.5 and 0.1, respectively. The six error categories are defined as follows:
• Classification error (Cls): IoU_max ≥ t_f, i.e., localized correctly but classified incorrectly.
• Localization error (Loc): t_b ≤ IoU_max < t_f, i.e., classified correctly but localized incorrectly.
• Classification and localization error (Both): t_b ≤ IoU_max < t_f, i.e., both classified and localized incorrectly.
• Duplicate detection error (Dupe): IoU_max ≥ t_f, i.e., multiple detection boxes with various confidence levels match a ground truth that has already been detected.
• Background error (Bkg): IoU_max ≤ t_b, i.e., a detection box fired on background with no instance.
• Missed GT error (Miss): all undetected ground truths not already covered by Cls and Loc errors.
Fixing an error type raises the mAP, so the importance of each error category can be evaluated by comparing the resulting mAP improvement. The twelve strategies of Table 4 are analyzed by the six error types of Figure 9. As shown in Table 5, the six main errors (Cls, Loc, Both, Dupe, Bkg, Miss) and two special errors (false positive detections (FP) and false negative detections (FN)) are counted. Because the Heridal_ForestPerson dataset contains only one class (person), the Cls and Both errors are zero.
Miss and FN show the highest error counts, caused primarily by: (1) the detailed features of small persons practically vanishing after multi-layer convolution, so that regression fails to map back to a person's specific location in the original image; (2) minor deviations in a person's position and posture being treated as negative samples when the loss is computed with CIoU, lowering detection performance. As shown in Figure 10, the percentage of Miss exceeds 50% for all methods, and the Miss error becomes more apparent as more objects are implanted. However, the Bkg error reaches lower values, alleviating the imbalance between positive and negative samples. In addition, object implantation combined with pseudo-label training helps to reduce the Loc and Dupe errors. Therefore, under certain parameter settings, methods 5-12 (ours) achieve better detection performance.
Table 5. Statistics and analysis of multiple error types with the various strategies of Table 4 on the Heridal_ForestPerson dataset.


Ablation Study
The effectiveness of our strategy was evaluated by adding various modules, including CP (copy-paste), OIM (object implantation module), SSTM (semi-supervised training module), CIoU, and MMD (maximum mean discrepancy). As shown in Table 6, the detection performance of different combinations of OIM, SSTM, and MMD was compared, with YOLOv5 using CP and CIoU set as the Baseline. The AP test 0.5 obtained with OIM+CIoU was 0.7855, a 3.35% improvement over the Baseline with CIoU alone and a 2.25% improvement over the Baseline with CP+CIoU. The AP test 0.5 of OIM+SSTM+CIoU reached 0.8073, a 2.18% improvement over OIM+CIoU, showing that the semi-supervised learning strategy is effective. After numerous ablation experiments, the optimal training metrics [precision, recall, AP test 0.5, AP test 0.5:0.95] were [0.9152, 0.7389, 0.8079, 0.5436] with the combination of OIM+SSTM+MMD. Furthermore, we found that combining CP and OIM did not always work well and could cause the AP test 0.5 to drop (from 0.8115 to 0.8079, a decline of 0.36%). In Table 6, the blue markers represent our proposed method's percentage improvement over the Baseline.

Discussions
During the inference stage, a variety of practical tips were applied to improve the accuracy of the trained model, most notably: test-time augmentation (TTA), which makes multiple augmented copies of each image in the test subset, lets the model predict on each copy, and then merges the resulting sets of predictions; model ensembling (ME), which fuses multiple trained models via a voting method (the method used in this article) to achieve better multi-model detection results; weighted boxes fusion (WBF), which sorts boxes in decreasing order of confidence score and merges overlapping boxes into confidence-weighted averages; and low-precision parameter quantization (LPPQ) with batch inference (BI) to accelerate model inference. As shown in Table 7, TTA with an input size of 2400 × 2400 achieved a better AP test 0.5 than the Baseline (improved by about 1%). Compared to TTA alone, TTA+ME improved by a further 0.5%, and TTA+ME+WBF by a further 1.1%. Overall, employing these augmentation methods during inference improved the model's detection performance by 2.1%. BI and LPPQ were then applied to speed up the inference process, with the various tips stacked on top of each other.
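The fusion step of WBF can be sketched as follows: sort by confidence, greedily cluster boxes whose IoU exceeds a threshold, and replace each cluster by its confidence-weighted average box. This is a simplified, single-class illustration (the greedy clustering against the cluster's first box and the returned max-score are our simplifications); production code would typically use a full implementation such as the ensemble-boxes library.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def weighted_boxes_fusion(boxes, scores, iou_thr=0.55):
    """Cluster overlapping boxes (processed in decreasing confidence) and
    fuse each cluster into one confidence-weighted average box."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    clusters = []  # each cluster: list of (box, score)
    for i in order:
        for c in clusters:
            if iou(c[0][0], boxes[i]) > iou_thr:
                c.append((boxes[i], scores[i]))
                break
        else:
            clusters.append([(boxes[i], scores[i])])
    fused = []
    for c in clusters:
        w = sum(s for _, s in c)
        box = [sum(s * b[k] for b, s in c) / w for k in range(4)]
        fused.append((box, max(s for _, s in c)))
    return fused
```

Unlike NMS, which discards all but the highest-scoring box in a cluster, WBF lets every overlapping prediction contribute to the fused coordinates, which tends to tighten boxes produced by TTA and model ensembles.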

Comparisons with the State-of-The-Art
Firstly, five current techniques were chosen for comparison with our proposed data augmentation: Copy-Paste [35], Simple Copy-Paste [36], Mixup [37], CutMix [38], and Mosaic [39]. All experimental results are shown in Table 8. The CP+SSL method improved on the AP IoU=0.5 metric of the above methods by 3.45, 1.7, 2.26, 0.42, and 0.6 AP points, respectively. Secondly, detection performance with label alignment on top of data augmentation was investigated, using IoU evaluation (CIoU [25], DIoU [27], GIoU [26], EIoU [28], Alpha-IoU [40], and SIoU [41]) and distance evaluation (DotD [42] and NWD [43]). Distance evaluation yielded higher detector AP metrics than IoU evaluation, and the CP+SSL+MMD method outperformed the best distance evaluation method by 1.16%. Finally, multiple detection algorithms built on data augmentation, including the single-stage anchor-based detectors of the YOLO series [30,39,44,45], SSD [46], and RetinaNet [47], as well as a two-stage anchor-based detector, Faster R-CNN [48], were compared with our proposed multi-strategy collaboration method. Compared to existing methods, our algorithm offers advantages in both AP and AR; the performance improvements are encouraging and become even more obvious when the objects are extremely tiny.

Experiments on Other Datasets
To further demonstrate the effectiveness of the proposed method, we compared it to typical detection algorithms based on MMDetection (https://github.com/open-mmlab/mmdetection, accessed on 15 March 2023) and the YOLOAir (https://github.com/iscyy/yoloair, accessed on 15 March 2023) toolkit. The original and optimized datasets used in the trials are listed in Table 1.

Comparative Experiments on Optimized Datasets
To verify the resilience of our approach on diverse datasets, it was compared separately with the SOTA algorithms. The default public test results were obtained on the original datasets, while the training metrics on the optimized datasets were obtained with our method. As shown in Table 9, our proposed approach significantly improved detection performance on the optimized datasets, paving the way for commercial applications of SaR tasks. Figure 11 depicts visualization results for the baseline detectors and our proposed detector in five scenarios drawn from our created VHTA dataset. The detection results show a significant improvement over the baseline detectors. In particular, the following observations can be made. The most noticeable improvement is that our proposed method significantly reduces FN; with baseline detectors, FN is common when detecting tiny objects because of a lack of supervisory information. It is also shown that anchor-based detectors can learn sufficient supervision information from positive small samples when equipped with the YOLO series. Furthermore, a considerable number of FPs appear in the SSD detection results, indicating that SSD fails to separate correct predictions from the numerous detection candidates. Our method handles these FP detections correctly, meaning that the assigned positive/negative samples are of higher quality. In Table 9, bold fonts indicate the optimal value of each metric, and blue markers indicate the outcomes of the optimal method of each type compared with our proposed method.
Figure 11. Visualization of detection results using the baseline detectors (first four rows) and our proposed detector (fifth row) on our created VHTA dataset. The test results of the various methods are partially enlarged for clarity.

Conclusions
In this article, a method for person detection in aerial images using multi-strategy collaboration is proposed, aiming to tackle the difficulty of small object detection and the problem of few samples in the SaR task. Practical, reliable, and multi-scene SaR data are created by fusing person labels (also covering pedestrians, people, and human beings) from publicly available datasets. In response to small and unbalanced samples, synthetic samples are generated with object implantation methods and virtual synthesis software; meanwhile, combining instance segmentation with semi-supervised learning yields abundant additional pseudo-labels. To address the low tolerance of existing IoU evaluation methods for tiny objects, a better metric based on the maximum mean discrepancy distance is designed to measure the similarity of bounding boxes. The experimental results demonstrate that the proposed method is more accurate than the considered benchmark methods for person detection in SaR scenarios.