Deep learning-based small object detection: A survey

Small object detection (SOD) is significant for many real-world applications, including criminal investigation, autonomous driving and remote sensing. SOD remains one of the most challenging tasks in computer vision because small objects have low resolution and noisy representations. Deep learning has been introduced to boost the performance of SOD. In this paper, focusing on the difficulties of SOD, we analyze deep learning-based SOD research from four perspectives: boosting the resolution of input features, scale-aware training, incorporating contextual information and data augmentation. We also review the literature on crucial SOD tasks, including small face detection, small pedestrian detection and aerial image object detection. In addition, we conduct a thorough performance evaluation of generic SOD algorithms and methods for crucial SOD tasks on four well-known small object datasets. Our experimental results show that network configurations that boost the resolution of input features can enable significant performance gains on WIDER FACE and Tiny Person. Finally, several potential directions for future research in the area of SOD are provided.


Introduction
Object detection (OD) [1][2][3][4][5][6][7][8] is an essential task that forms the basis of many other computer vision tasks, such as object tracking [9,10], instance segmentation [11][12][13], action recognition [14][15][16][17][18], environment surveillance [19][20][21], video checking in sports [22,23] and scene understanding [24][25][26][27][28]. Although large-scale datasets, such as Microsoft Common Objects in Context (MS COCO) [43], ImageNet [44] and PASCAL VOC [45], have contributed to the growth of object detection methods, these methods fail to accurately detect small objects. Taking Co-DETR [46], one of the state-of-the-art methods, as an example, the mean average precision (mAP) for small objects on COCO obtained by Co-DETR was only 48.4%, which significantly lags behind that for medium and large objects (67.1% and 77.3%, respectively). The main reason for the poor performance of SOD is that small objects have lower resolution and occupy fewer pixels than larger objects; the loss of spatial position information caused by the down-sampling and pooling operations in convolutional networks makes it more challenging for the detection head to locate small objects. The scarcity of small object datasets is another obstacle to the advancement of SOD.
There are also recent surveys on SOD. In the review by Chen et al. [63], the four pillars of SOD are discussed. However, they did not connect the basic module design of the detector to the challenges in SOD; rather, they only reviewed studies on SOD from the viewpoint of the model framework (Figure 2), such as MMDetection [64], which divides the framework of the detector into a backbone, neck and head. The current SOD methods based on deep learning have been reviewed by Tong et al. [65] from five perspectives; they analyzed the evaluation results for two general datasets. Tong et al. limited their work to generic SOD and did not consider models developed for specific SOD tasks. In addition to summarizing and contrasting current deep learning approaches for SOD, Liu et al. [66] also provided a brief overview of related methods, such as traditional object detection, face detection, image segmentation and remote sensing images. However, they only evaluated the performance of a few networks: Faster R-CNN, SSD and YOLO. Such a partial performance evaluation cannot illustrate the broad picture of SOD. Tong and Wu refined and differentiated between small and tiny objects [67]. To examine this expanding field, Rekavandi et al. [68] presented a thorough analysis of more than 160 research publications released between 2017 and 2022. Other significant works include that by Cheng et al. [69], who constructed two large-scale SOD datasets, SODA-D and SODA-A, focusing on driving and aerial scenarios, respectively. In contrast to these earlier object detection surveys, we focus on the difficulties related to SOD, investigate recent deep learning-based SOD algorithms and present a taxonomy to illustrate the novel strategies developed to improve SOD performance. In addition to providing an in-depth description of deep learning-based SOD algorithms developed in three areas, our study also offers meaningful comparisons of the associated experimental results. It reviews the deep learning-based object detection models and the difficulties of SOD.
These reviews offer a thorough summary of object detection. However, they concentrate on regular-sized object detection rather than small objects.

Title | Publication | Strengths | Limitations
Object detection in optical remote sensing images: A survey and a new benchmark [58] | ISPRS 2020 | It constructs DIOR, a large remote sensing dataset.
Imbalance problems in object detection: A review [59] | arXiv 2020 | It reviews the imbalance problem of object detection.
Continual object detection: A review of definitions, strategies, and challenges [60] | arXiv 2022 | This survey investigates continual object detection.
New generation deep learning for video object detection: A survey [61] | TNNLS 2022 | It systematizes the latest video object detection models and analyzes the performance of these models on two datasets.
A survey of deep learning-based object detection [62] | IEEE Access 2022 | It reviews detection methods, general datasets and typical applications.
Deep learning-based detection from the perspective of small or tiny objects: A survey [67] | IVC 2022 | Aims to discuss small- or tiny-object datasets, detection techniques and the performance of these techniques.
[68] | arXiv 2022 | Reviews the SOD methods and investigates the performance of SOD in maritime environments.
Towards large-scale small object detection: Survey and benchmarks [69] | arXiv 2022 | It presents a detailed study of SOD and yields two large-scale benchmarks for a driving scenario and aerial scene.
Deep learning based small object detection: A survey | Ours | We comprehensively discuss the definition of small objects, the challenges encountered in detecting small objects, the strengths and weaknesses of generic SOD algorithms and three crucial SOD tasks. We also analyze the performance of SOD on three datasets and summarize meaningful conclusions.
In summary, our contributions are as follows:
1) Systematic overview of deep learning-based SOD algorithms. We analyze state-of-the-art deep learning-based SOD algorithms in accordance with the challenges in SOD, and we provide a taxonomy that summarizes the strategies for improving SOD performance from the perspective of boosting the resolution of input features, scale-aware training, incorporating contextual information and data augmentation. Additionally, we provide a thorough review of the methods for crucial SOD tasks, including small face detection, small pedestrian detection and aerial image detection.
2) Performance evaluation of SOTA deep learning-based SOD algorithms. We not only analyze the performance of generic SOD methods with the general large-scale dataset, but we also evaluate the performance of state-of-the-art SOD methods on three crucial SOD tasks, including small face detection, small pedestrian detection and aerial image detection.
3) Finally, according to the taxonomy methods and performance analysis of SOD, we discuss potential directions for future research, including suitable metrics for SOD optimization, weakly supervised SOD methods, multi-task joint optimization, and open world or few-shot SOD.
The remainder of the paper is organized as follows. The generic SOD algorithms are discussed in Section 2. Section 3 summarizes the methods proposed for three SOD tasks. Section 4 introduces the datasets and evaluation metrics and evaluates generic SOD methods as well as small face, small pedestrian and aerial image SOD methods. Future directions are discussed in Section 5. Finally, the conclusion of this paper is presented in Section 6.

Generic SOD algorithms
In this section, we will extensively review the methods of generic SOD. To deal with the challenges of SOD, existing SOD methods typically have complex designs added to the current pipeline that excels at generic object detection. We will describe these methods from four perspectives, including boosting the resolution of input features, scale-aware training, incorporating contextual information and data augmentation. The advantages and disadvantages of each perspective, as shown in Tables 2-6, are then discussed in detail.

Boosting resolution of input features
The difficulty in precisely locating small objects is mainly due to the down-sampling operations of the CNN, which cause the features of small objects to disappear; in addition, the low spatial resolution of high-level feature maps seriously degrades the spatial position information of small objects. A reasonable solution is to use high-resolution feature maps or high-resolution images. However, employing high-resolution images or increasing the feature map resolution results in higher computing costs. Numerous scholars have constructed feature pyramids by reusing the multi-scale feature maps produced by network forward propagation, and then used the low-level, high-resolution feature maps, which retain finer spatial details, to detect small objects. Additionally, some models learn a mapping function from low-resolution features to high-resolution features so that small objects can be detected as well as large objects. Both approaches substantially increase the resolution of the predictive feature layer. Several typical models that boost the resolution of input features, namely RetinaNet [74], MDSSD [75] and GAN-based SOD [87], are shown in Figure 3. SSD [36] is a multi-scale object detection technique that detects objects by placing reference windows of different scales in different layers of the network. However, it did not greatly improve the detection accuracy for small objects. The primary explanation is that low-level feature maps have a limited receptive field and a significantly poorer ability to represent features than deep feature maps. Therefore, Lin et al. proposed FPNs [35]. 
The core idea behind FPNs is to use forward propagation of the network to create four feature maps of different scales, merge the high-level feature maps with the lower-level feature maps through layer-by-layer up-sampling, and fuse the features from different network depths to achieve feature enhancement. Predictions are then made on the fused feature maps, so that each layer only needs to predict objects of one scale. The results of the experiments show that the FPN significantly increases SOD accuracy and can guarantee a detection speed of 6 FPS. Since the FPN was proposed, numerous enhanced variants have been developed, including PANet [70], BiFPN [71], ASFF [72] and NAS-FPN [73]. Detection techniques based on object proposals (two-stage) have long had modestly better detection accuracy, whereas integrated (one-stage) convolutional detection models have had a significantly faster detection speed. After investigating the reasons behind this, Lin et al. presented RetinaNet [74], with which a one-stage network outperformed two-stage networks for the first time. Lin et al. argued that the foreground-background class imbalance mostly accounts for the inferior detection performance of integrated convolutional networks, so focal loss was proposed as an improvement on cross-entropy loss. Focal loss is given by Eq (1):

$$\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t), \qquad p_t = \begin{cases} p, & y = 1 \\ 1 - p, & \text{otherwise} \end{cases} \tag{1}$$

where $\alpha_t$ is the balancing variant and $p \in [0, 1]$ stands for the predicted probability when y = 1 (positive sample). The rate at which simple examples are down-weighted is adjusted by the focusing parameter $\gamma \ge 0$. RetinaNet can achieve the "focus" on hard samples and the redistribution of network learning ability by reducing the learning weight of simple background samples during the network training process. The lightweight feature fusion module proposed for FSSD [75] uses down-sampling to create a new feature pyramid. 
MDSSD [76] applies deconvolution to high-level feature maps with powerful semantic information and then fuses them with low-level feature maps through a fusion module, which preserves rich spatial details and a strong feature representation capability for small objects. The architectures of RetinaNet and MDSSD are shown in Figure 3.
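To make the down-weighting effect of focal loss concrete, the following is a minimal Python sketch of the per-sample binary focal loss described above. The default values alpha = 0.25 and gamma = 2.0 and the helper names are assumptions chosen for illustration, not settings taken from this survey.

```python
import math


def cross_entropy(p, y):
    """Binary cross-entropy for one prediction p = P(y = 1)."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one prediction.

    p     : predicted probability of the positive class, in (0, 1)
    y     : ground-truth label, 1 (positive) or 0 (negative)
    alpha : balancing weight for the positive class (assumed default)
    gamma : focusing parameter, gamma >= 0 (assumed default)
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma shrinks the loss of well-classified samples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy background sample versus a hard foreground sample.
easy_background = focal_loss(0.01, 0)
hard_foreground = focal_loss(0.10, 1)
```

With gamma = 0 the modulating factor disappears and the expression reduces to an alpha-weighted cross-entropy, which is a convenient sanity check when experimenting with the focusing parameter.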
At the last layer of a backbone, small object features have almost disappeared, so the top-down path makes it nearly impossible for FPNs to fuse the features of small objects. Additionally, as the network gets deeper, the deep feature map gains more semantic information but loses spatial information. This causes an offset between anchors and convolutional features, meaning that, after several convolutions, the position of the anchor on the deep feature map differs from the position on the original map. Moreover, the deep features and shallow features cannot be effectively aligned by the FPN fusion. Gong et al. [77] proposed a fusion factor that describes the coupling degree of adjacent layers in FPNs; it can be calculated from dataset statistics or learned implicitly. Adjusting the fusion factor of adjacent layers in an FPN can adaptively drive the shallow layers to focus on learning tiny objects, thus improving the detection of tiny objects. The high-resolution detection network (HRDNet) [78] accepts multiple resolution inputs via multi-depth backbones. To cut down on computational costs, the multi-depth image pyramid network (MD-IPN) uses a multi-depth backbone to output multi-scale, multi-level feature maps: the high-resolution input is fed into a shallow network to preserve more positional information, and the low-resolution input is fed into a deep network to extract more semantics. Multi-scale FPNs align and fuse the multi-scale feature groups produced by an MD-IPN to decrease the information mismatch between these multi-scale, multi-level features. Liu et al. [79] proposed the IPG-Net to mitigate the disappearance of small object features following serial down-sampling and the dislocation between spatial information and semantic information; it includes an IPG transformation and IPG fusion module. 
IPG-Net receives an image pyramid as the input; the IPG transformation module extracts shallow features from image pyramids of various resolutions that include rich spatial information and detailed information; the IPG fusion module fuses the shallow features extracted by the IPG transformation module and the deep features of the backbone. RHF-Net [80] applies top-down and bottom-up feature fusion. It contains a recursive execution of the hybrid fusion module that enables RHF-Net to both connect high-level semantic features to the low-level features (top-down direction) and reshape the rich spatial features of low-level feature maps to the deeper layer (bottom-up direction), thus improving the contextual features of objects of all scales.
The spatial distribution of small objects on the high-resolution feature map of the feature pyramid is very sparse, accounting for only a small part of the high-resolution feature map. QueryDet [81] uses a query technique to accelerate the inference speed of feature pyramid-based object detectors by preventing the detection head from performing resource-intensive calculations on the entire high-resolution feature map. It includes a query head, in parallel with the classification and regression heads, to predict the locations (query keys) of possible small objects in the features of the previous layer. The current layer uses these locations to generate a sparse value feature map (query value). Then, it predicts the query keys of this layer to be given to the following layer.
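The query mechanism can be illustrated with a short Python sketch. It assumes, for illustration only, that adjacent pyramid levels differ by a factor of two in resolution, so that each coarse location flagged as a query key expands to a 2 × 2 block of locations on the next finer level; the function name and the threshold value are hypothetical.

```python
def query_positions(coarse_scores, threshold=0.15):
    """Return the fine-level cells selected by coarse-level query keys.

    coarse_scores : 2D list (H x W) of predicted small-object
                    probabilities on the coarser pyramid level
    threshold     : hypothetical score above which a cell becomes a key
    """
    # Query keys: coarse cells likely to contain a small object.
    keys = [(y, x)
            for y, row in enumerate(coarse_scores)
            for x, score in enumerate(row)
            if score > threshold]

    # Each key maps to its four children on the 2x finer level.
    fine_cells = set()
    for y, x in keys:
        for dy in (0, 1):
            for dx in (0, 1):
                fine_cells.add((2 * y + dy, 2 * x + dx))
    return sorted(fine_cells)


scores = [[0.0, 0.9],
          [0.1, 0.0]]
cells = query_positions(scores)  # 4 of the 16 fine-level cells
```

In this toy case the detection head would only need to evaluate a quarter of the fine-level feature map, which is the source of the speed-up when small objects are sparse.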
Super-resolution is another effective method that directly enriches the information of small objects by increasing the resolution of the input image. EFPN [82] adds a super-resolution layer to an FPN; it uses a feature texture transfer module to super-resolve features by extracting regional texture features from the reference features. This adds convincing details to the EFPN features and improves the accuracy of SOD. To eliminate the representational disparity between large and small objects and allow small objects to attain the same detection accuracy as large objects, Li et al. [83] used a GAN to enhance the feature representation of small objects to a super-resolved representation. However, the super-resolved feature might not be convincing, as the large object image and the small object image are not from the same image. The SOD-MTGAN [84] learns the mapping between low-resolution image patches and high-resolution image patches, which reduces the computational cost. Noh et al. [85] used high-resolution features for direct supervision: under the guidance of a super-resolution discriminator, low-resolution features are passed to the super-resolution feature generator to generate high-resolution features. MARE [86] uses a network to obtain attention weights, which serve as weights for each level of feature maps, to generate the final attention feature maps; it then performs feature fusion to further enhance the information that is useful for small targets. The EESRGAN [87] adds edge-enhanced sub-networks (EENs) [88] to the ESRGAN [89]. EENs perform edge enhancement on the intermediate super-resolution (ISR) images generated by the generator to produce the final super-resolution image. Together, the discriminator and detector perform the role of the discriminator, and the discriminator trains the generator by using relativistic loss [90]. 
Eqs (2) and (3) show the relativistic loss of the discriminator and the adversarial loss [91] of the generator:

$$L_D^{Ra} = -\mathbb{E}_{I_{HR}}\left[\log D_{Ra}(I_{HR}, I_{ISR})\right] - \mathbb{E}_{I_{ISR}}\left[\log\left(1 - D_{Ra}(I_{ISR}, I_{HR})\right)\right] \tag{2}$$

$$L_G^{Ra} = -\mathbb{E}_{I_{HR}}\left[\log\left(1 - D_{Ra}(I_{HR}, I_{ISR})\right)\right] - \mathbb{E}_{I_{ISR}}\left[\log D_{Ra}(I_{ISR}, I_{HR})\right] \tag{3}$$

where $D_{Ra}(I_{HR}, I_{ISR})$ is the probability that a real image ($I_{HR}$) is relatively more realistic than a generated intermediate image ($I_{ISR}$), $\mathbb{E}_{I_{ISR}}[\cdot]$ is the operation that calculates the average over all generated intermediate images in a mini-batch, and $\mathbb{E}_{I_{HR}}[\cdot]$ is the operation that calculates the average over all real images in a mini-batch. Additionally, the EESRGAN employs end-to-end training to backpropagate the detector loss to the generator. Thus, the generator receives gradients from both the detector and the discriminator to enhance the quality of super-resolution images. Cao et al. proposed the MHN [92], which splits the network into three distinct branches (branch-l, branch-m, branch-s), where each branch produces equivalent high-level semantic feature maps with a variety of resolutions, allowing it to better match objects of various scales.

Scale-aware training
The largest object in the COCO dataset is 20 times larger than the smallest, and the scale invariance of CNNs is not robust against such large-scale variances. Scale-aware training strategies can make the detector more robust against scale variance. A common process of the scale-aware training model is shown in Figure 4.
Earlier approaches use image pyramids [93,94] to improve the accuracy of object detection at various scales, at the cost of larger memory requirements. Scale normalization for image pyramids (SNIP) [95] is a training strategy that trains the model on an image pyramid and only backpropagates the loss of objects whose size falls within a predetermined range. Going further, SNIPER [96] chooses chips with a fixed resolution of 512 × 512 pixels from each layer of the pyramid to act as the training unit, unlike SNIP, which processes every pixel in an image. Because of the smaller chip resolution, it can train with a larger batch size, which improves both training efficiency and detection accuracy. Kim et al. proposed a scale-aware network (SAN) [97] that maps the convolutional features from different scales onto a scale-invariant subspace to make CNN-based detection methods more robust against scale variation; they also constructed a learning method that considers only the relationship between channels, without the spatial information, for efficient learning of the SAN. This method essentially improves the quality of convolutional features in the scale space and can be generally applied to many CNN-based detection methods to enhance the detection accuracy with a slight increase in the computing time.
Trident [98] is a multi-branch parallel network, where each branch adopts an appropriate dilation rate to provide a receptive field size that aligns with the object size. Moreover, a scale-sensitive training approach is applied to enhance each branch's capacity for scale perception and prevent the training of objects of extreme scale on branches with unmatched receptive fields. Each branch's effective range, $[l_i, u_i]$, is given by Eq (4):

$$l_i \le \sqrt{wh} \le u_i \tag{4}$$

where w and h are the width and height of a region of interest; only the objects that satisfy Eq (4) are used to train branch i. Peng et al. [99] show that local and dense continuous scales, which are hard to optimize, are not necessary, and that a network can be granted scale-awareness through a collaboration of well-learned global scales on layers. Therefore, they designed a global scale learning module to replace the normal convolutional module and learn the appropriate global scale for different layers.
FPN dramatically improves the detection accuracy of small objects.
The feature representation capability will be diminished by the semantic gap between feature layers of various scales.
Lower detection speed than SSD.
The fusion factor further improves FPN performance for small objects.
HRDNet acquires more details for small objects with high resolution.
Large numbers of parameters.
IPG-Net [79] CVPR20: IPG-Net alleviates the vanishing of small object features. This method is inefficient.
It provides a practical solution for multi-resolution feature extraction without using a GAN, and it is time-efficient.
Multi-branch technique makes the receptive field size align with the object size.
RHF-Net
It may bring about the over-fitting problem in each branch, as caused by too few effective samples.
This method makes the network more sensitive to scale invariance.

Figure 4. Architecture of TridentNet [98]; d is the dilation rate.
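The scale-aware assignment used in multi-branch training can be sketched in a few lines of Python. The sketch assumes that an object's scale is measured as the geometric mean of its box width and height and that each branch has a lower and upper bound on this scale; the three ranges below are hypothetical values for illustration, not settings reported in this survey.

```python
import math

# Hypothetical valid scale ranges (in pixels) for three branches;
# math.inf means that the branch has no upper bound.
BRANCH_RANGES = [(0, 90), (30, 160), (90, math.inf)]


def valid_branches(width, height, ranges=BRANCH_RANGES):
    """Return the indices of the branches that should train on this box."""
    scale = math.sqrt(width * height)
    return [i for i, (lower, upper) in enumerate(ranges)
            if lower <= scale <= upper]


small = valid_branches(20, 20)     # scale 20, small-object branch only
medium = valid_branches(50, 50)    # scale 50, inside two overlapping ranges
large = valid_branches(300, 200)   # scale about 245, large-object branch only
```

Overlapping ranges let boundary-sized objects contribute to two branches, which is one simple way to limit the shortage of effective samples per branch.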

Incorporating contextual information
Visual objects frequently coexist with other relevant objects in a certain setting, which provides rich contextual associations to be exploited. Researchers [100] have shown that utilizing the context as extra information can help to detect small objects with obscure features. Two typical models that incorporate contextual information are shown in Figure 5. Chen et al. [42] extended the R-CNN model by using ContextNet and a small region proposal generator to improve SOD. Regarding the region proposal network (RPN), Chen et al. used smaller RPN anchor sizes (16², 40², 100² vs. 128², 256², 512²). ContextNet integrates contextual information to calculate the final classification score. Bell et al. [101] proposed ION, which utilizes information inside and outside of the ROI to improve detection performance. Regarding the inside part, ION extracts the features of the ROI at several levels and scales by using skip pooling to enhance the ability to detect small objects. Regarding the outside part, ION extracts the contextual information outside of the ROI by using a spatial recurrent neural network to enhance the feature information and promote the subsequent classification and regression performance. The DSSD [102] fuses deep semantic information as context with shallow semantic information. The CSSD [103] is a context-aware framework that incorporates context by integrating deconvolutional or dilated convolutional layers into SSD. In object detection, there are two common types of context. Image-level context refers to modeling the contextual information of each pixel in the whole image, which is implicitly incorporated into the deep convolutional network, while instance-level context, which models object-object relationships, is an important clue for object detection and reasoning. A spatial memory network (SMN) [104] was proposed to capture the instance-level context. 
The network detects an object, remembers it and then uses it as a priori knowledge to help detect previously missed targets in the next iteration. Fu et al. [105] introduced a contextual reasoning method for SOD that models and infers the relationships between objects' inherent semantic and spatial layouts. The semantic module defines learnable semantic association functions from the standpoint that proposals belonging to the same category share semantic co-occurrence information. In the corresponding formula, Eq (5), an indicator function is combined with a function that maps the initial region features to latent representations. The spatial layout module disregards semantic similarity and builds relationships based on spatial similarity and spatial distance in the internal spatial layout, allowing small objects that have a high degree of spatial similarity and appear in clusters to communicate contextual information about the spatial layout to one another. FA-SSD [106] is a combination of F-SSD and A-SSD. F-SSD uses higher-level feature maps as context to concatenate with low-level feature maps. A-SSD uses an attention mechanism to suppress unnecessary shallow features in the background.
Both image-level context and instance-level context are commonly used by SOD.
The gradient will vanish as the reasoning signal and perceptual signal cancel each other out.
IRR updates the initial regional features to boost SOD.
Small objects are associated with difficulty in extracting semantic features.
FA-SSD is more accurate than SSD.
It has lower accuracy than DSSD.

Data augmentation
High-quality large-scale datasets can greatly improve the performance of deep learning SOD. However, the amount of labeled data is still far from sufficient due to the high cost of annotation. Data augmentation is a common method to enrich the diversity of the dataset, thus improving the generality and robustness of the model to some extent. This can also help to mitigate the degradation of object detection accuracy due to the uneven distribution of different scale objects in the dataset.
Many data augmentation techniques have been developed, such as affine transformation, Mosaic [107], MixUp [108] and CutMix [109], but these methods perform better on medium- or large-sized objects than on small objects. Kisantal et al. [110] thoroughly investigated the MS COCO dataset and discovered a sample imbalance problem: only a small fraction of the images in the dataset contain small objects, the number of small objects in each image is low and the locations where they occur lack diversity. Kisantal et al. proposed oversampling images with small objects to increase the number of small objects during training. Chen et al. [111] found that random copying and pasting led to background mismatch and object size mismatch. To solve this, they employed adaptive data augmentation, which uses a semantic segmentation network to obtain an a priori roadmap and samples an effective position, guided by the roadmap, at which to place the object. Ünel et al. [112] proposed a tiling-based technique where the input images are deliberately split into overlapping tiles to increase the relative pixel area of small objects.
To address scale variance, DST [113] receives the loss proportion caused by small objects as feedback. If the loss proportion is smaller than the predetermined threshold, the training images are enlarged and spliced in the following iteration to compensate for the missing small objects. Zoph et al. [114] used AutoAugment [115] to find the optimal data augmentation method for object detection by applying an augmentation strategy search to the training set. An RNN controller and a reinforcement learning methodology are included in the search strategy. Chen et al. [116] proposed scale-aware automatic data augmentation, which includes a scale-aware search space with augmentations at the image and box levels, as well as a search metric called the Pareto scale balance. The metric is realized by recording accumulated loss and accuracy over various scales. This approach achieves better object detection accuracy for small objects.
Random copying and pasting may cause background mismatch.
Free-anchor and adaptive resampling result in excellent performance for very small objects.
The method provides a good trade-off between accuracy and time cost.
DST [113] arXiv20 Uses the feedback information to guide data preparation.
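The tiling idea mentioned above can be illustrated with a small Python sketch that computes overlapping tile windows for an image. The tile size, the overlap fraction and the function name are hypothetical choices for illustration; the original tiling method may use a different layout.

```python
def tile_boxes(img_w, img_h, tile, overlap):
    """Return (x0, y0, x1, y1) windows of side `tile` that cover the image.

    overlap : fraction in [0, 1) of a tile shared with its neighbour
    """
    stride = max(1, int(tile * (1.0 - overlap)))

    def starts(size):
        # Window origins along one axis; the last window is placed
        # flush with the image border so that no pixel is left out.
        if size <= tile:
            return [0]
        origins = list(range(0, size - tile, stride))
        origins.append(size - tile)
        return origins

    return [(x, y, min(x + tile, img_w), min(y + tile, img_h))
            for y in starts(img_h)
            for x in starts(img_w)]


tiles = tile_boxes(1000, 600, tile=512, overlap=0.25)

# Relative area of a 16 x 16 object in the full image and in one tile.
rel_full = (16 * 16) / (1000 * 600)
rel_tile = (16 * 16) / (512 * 512)
```

Because every tile is resized to the network input size, the same object covers a larger share of the input than it would in the full image, at the price of running the detector once per tile.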

Other strategies
Samet et al. [117] proposed a new labeling technique in which the predictions derived from individual features are aggregated into one prediction to reduce the labeling noise of the anchor-free detector. Duan et al. proposed CenterNet++ [118], which uses a triplet of a center key point and a pair of corners to represent an object. The corners can locate objects with any geometry. Wang et al. [119] evaluated the sensitivity of the Intersection over Union (IoU) to position variations of small objects, and they suggested replacing the IoU with a new measure that models each box as a Gaussian distribution and uses the normalized Wasserstein distance (NWD) to determine the similarity of the two distributions. Xu et al. [120] presented the receptive field distance to quantify the similarity between the Gaussian receptive field and the ground truth directly, rather than assigning samples with IoU-based sampling strategies. Lee et al. proposed C3Det [121], an interactive, multi-class, tiny-object annotation framework that reduces the effort and expense of annotation in practice. SAHI [122] divides the input images into overlapping slices so that small objects occupy a larger proportion of the image fed to the network.
These two metrics are more effective than the IoU metric for small object detection.
RFLA [120] ECCV22
C3Det [121] CVPR22 Annotation framework for tiny objects.
It alleviates the expense of tiny-object annotation.
This scheme is plug-and-play, does not require pre-training and improves the accuracy of detecting small objects.
Larger feature maps require more memory and computing cost.
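The sensitivity of the IoU to small position shifts, and the smoother behaviour of a Wasserstein-based measure, can be checked numerically. The Python sketch below assumes that each box (cx, cy, w, h) is modeled as a 2D Gaussian with mean (cx, cy) and standard deviations (w/2, h/2), for which the 2-Wasserstein distance has a closed form; the normalizing constant c is dataset dependent, and the value used here is only a placeholder.

```python
import math


def iou(box_a, box_b):
    """IoU of two boxes given as (cx, cy, w, h)."""
    ax0, ay0 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax1, ay1 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx0, by0 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx1, by1 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union


def nwd(box_a, box_b, c=12.8):
    """Normalized Wasserstein distance between two Gaussian-modeled boxes.

    c is a dataset-dependent constant (placeholder value, an assumption).
    """
    cxa, cya, wa, ha = box_a
    cxb, cyb, wb, hb = box_b
    # Closed-form 2-Wasserstein distance between the two Gaussians.
    w2 = math.sqrt((cxa - cxb) ** 2 + (cya - cyb) ** 2
                   + ((wa - wb) / 2) ** 2 + ((ha - hb) / 2) ** 2)
    return math.exp(-w2 / c)


tiny = (10.0, 10.0, 6.0, 6.0)
shifted = (16.0, 10.0, 6.0, 6.0)   # same box moved 6 px to the right

iou_value = iou(tiny, shifted)     # boxes only touch, so the IoU is zero
nwd_value = nwd(tiny, shifted)     # still a graded, non-zero similarity
```

For a 6 × 6 box, a 6-pixel shift already drives the IoU to zero, whereas the Gaussian-based measure still decreases smoothly with distance, which is what makes it usable for label assignment on tiny objects.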

Crucial SOD tasks
In this section, we present a systematic review of SOD in terms of small face detection, small pedestrian detection and aerial image detection tasks. We first thoroughly describe the current approach to each task. Then, a comprehensive summary of the strengths and weaknesses of each method is presented.

Small face detection
Multi-scale modeling [123] was proposed following a thorough investigation of image resolution, object scale variation and contextual information. This algorithm uses SSD as the foundation, and a sparse discrete image pyramid is fused to handle the scale shift of objects. Rich contextual information is necessary for SOD, but the low-level feature maps used for SOD lack semantic information, whereas deep feature maps contain rich contextual and semantic information. As a result, multi-layer feature fusion is incorporated into SOD, which enhances the performance of small face detection. S3FD [124] incorporates a scale-equitable face detection network to adapt face detection to various scales. Additionally, the effective receptive field and equal-proportion interval principles are used to define the scales of the anchors, ensuring that anchors of different scales are distributed uniformly across the image, and that anchors at different layers match their corresponding effective receptive fields. Then, by using a scale compensation anchor-matching approach, the recall rate of small faces is increased. Lastly, the false positive rate of small faces is decreased by predicting the number of background anchors for each matched small anchor. The method in [125] uses a generative adversarial network to generate high-resolution faces. Face-MagNet [126] employs ConvTranspose (kernel = 8, stride = 4) layers that pass the features of small faces from the lower feature layer to the prediction layer inside an RPN and classifier to magnify the feature maps for the better detection of small faces.
The three tricks (scale-equitable framework, max-out and scale compensation anchor-matching) achieve superior performance.

Method | Venue | Remarks
MagNet [126] | WACV18 | Feature fusion approach to integrating contextual information. ConvTranspose is more helpful than skip connections or context pooling. The improvement is not obvious.
Zhu et al. [127] | CVPR18 | EMO metric to get a high IoU. The EMO score inspired several effective strategies for a new anchor design to obtain a higher facial IoU score.
TinaFace [128] | | Simple improvements of RetinaNet achieve better performance.
Zhang et al. [132] | | It handles the imbalance between images.
Zhu et al. [127] pointed out that anchor-based face detectors do not handle small faces well because anchors and small faces rarely overlap sufficiently, which makes it difficult to regress an anchor close to the ground truth. Therefore, Zhu et al. proposed an expected max overlapping (EMO) score, which guides anchor design so that anchors and faces obtain a high IoU. By increasing the number of small-scale anchors, the method raises the likelihood of matching a face. Additionally, to obtain a high IoU between these faces and the anchors, the algorithm randomly shifts the face positions during training. Finally, a compensation strategy for anchor matching was proposed to improve the chance of detecting hard faces. TinaFace [128] involves modifications to RetinaNet, and it achieved a 92.4% average precision (AP). First, a DCN [129] is introduced into the backbone to learn complex geometric transformations; then, Inception is used to improve the multi-scale representation. The bounding box regression loss is changed from smooth L1 to DIoU [130] because DIoU is more accommodating for small objects. Finally, an IoU-aware branch is included to address the mismatch between localization accuracy and classification scores. Hard example mining techniques like OHEM [131] identify hard positive and hard negative examples and focus more training effort on those hard instances to improve detector performance. Zhang et al. [132] increased the effectiveness of OHEM by combining it with hard image-level mining to train the face detector, which automatically adjusts the training weights of images according to their difficulty. Additionally, they used a detector that produces only a single high-resolution feature map with small anchors to specifically learn small faces, and trained it with the hard image mining strategy. The strengths and disadvantages of small face detection methods are shown in Table 7.
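The switch from smooth L1 to a DIoU-based regression loss can be illustrated with a short sketch. It assumes the commonly used formulation, loss = 1 − IoU + ρ²/c², where ρ is the distance between the two box centers and c is the diagonal of the smallest enclosing box; the function name and the (x1, y1, x2, y2) box layout are illustrative choices, not details taken from TinaFace.

```python
def diou_loss(pred, gt):
    """DIoU loss for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection and union areas
    iw = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    ih = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = iw * ih
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_p + area_g - inter
    iou = inter / union if union > 0 else 0.0

    # Squared distance between the two box centers
    rho2 = (((pred[0] + pred[2]) - (gt[0] + gt[2])) ** 2
            + ((pred[1] + pred[3]) - (gt[1] + gt[3])) ** 2) / 4.0

    # Squared diagonal of the smallest box enclosing both boxes
    cw = max(pred[2], gt[2]) - min(pred[0], gt[0])
    ch = max(pred[3], gt[3]) - min(pred[1], gt[1])
    c2 = cw ** 2 + ch ** 2

    penalty = rho2 / c2 if c2 > 0 else 0.0
    return 1.0 - iou + penalty


# Two disjoint 10 x 10 boxes: IoU is 0, but the center-distance term still
# tells the regressor how far apart the boxes are.
print(diou_loss((0, 0, 10, 10), (20, 0, 30, 10)))
```

The center-distance term is what makes this loss useful for small faces: a plain IoU loss is constant for all non-overlapping boxes, whereas the penalty above still shrinks as the prediction moves toward the ground truth.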

Small pedestrian detection
Song et al. [133] proposed a topological line localization (TLL) network, i.e., a topological line detection network based on the pedestrian torso. It was designed to reduce the effects of small-scale pedestrian boundary blur, appearance blur and bounding-box annotations that introduce too much noisy background for small objects. TLL and ConvLSTM are combined into a single time-aware architecture that aggregates the features of consecutive video frames, which enhances the performance of small pedestrian detection. Furthermore, a Markov random field is employed as a post-processing strategy to deal with crowd occlusion. Das et al. [134] constructed the ISI pedestrian dataset, which includes 13,129 annotated video frames with 82.3 thousand labeled pedestrians. Additionally, Das et al. provided a three-phase detection algorithm. First, the prospective regions in each frame are identified using a zone classifier, which uses an improved Inception network to lower the error. Second, only these candidate regions are searched to locate the pedestrians, which significantly improves the frames per second. Finally, non-maximum suppression (NMS) is applied to remove redundant bounding boxes of the same pedestrian.
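The final NMS phase can be sketched as the standard greedy procedure below. This is a generic illustration rather than the implementation of Das et al.; the 0.5 IoU threshold and the (x1, y1, x2, y2) box layout are assumptions.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thr=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap any already kept box too much
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # indices of the boxes that survive
```

In this toy input, the second box overlaps the first with an IoU of about 0.68 and is suppressed, while the distant third box is kept.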
CNNs can learn not only low-level features but also high-level semantic features. Therefore, CSP [135] simplifies pedestrian detection into center and scale prediction tasks through convolutional operations. The detection head applies a convolution to the feature map generated by the feature extractor and adds two parallel 1 × 1 convolutions to generate a center heat map and a scale prediction map, respectively. Cross-entropy loss is employed for center point prediction and L1 loss for scale prediction. Yu et al. [136] constructed the TinyPerson dataset, which focuses on persons at and around the seaside for quick maritime rescue. Pedestrians in TinyPerson are much smaller than those in other datasets; most are under 20 pixels in size, and the aspect ratio of persons varies widely. To address the large gap between the distribution of the pre-training dataset and that of the task-specific dataset, the authors proposed scale match, which makes the object size distribution consistent between the pre-training dataset E and the task-specific dataset D, as shown in Eq (6):

$$P_{size}(s; T(E)) \approx P_{size}(s; D) \quad (6)$$

where $P_{size}(s; D)$ is the probability density function of objects of size s in the dataset D, and T is the scale change function.
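The idea behind scale match can be sketched in a few lines. The sketch below is a simplified illustration under stated assumptions, not the authors' implementation: object size is taken as a single number per object (for example, the square root of the box area), the distribution of D is represented by an empirical list of sizes, and the function name and sample values are hypothetical.

```python
import random


def scale_match_factor(obj_sizes_in_image, target_sizes, rng):
    """Resize factor for one pre-training image.

    obj_sizes_in_image: sizes of the objects in one image of dataset E
    target_sizes: empirical sample of object sizes from dataset D
    """
    mean_size = sum(obj_sizes_in_image) / len(obj_sizes_in_image)
    sampled = rng.choice(target_sizes)  # draw a size from the distribution of D
    return sampled / mean_size          # rescale so the mean size matches the draw


rng = random.Random(0)
tiny_sizes = [8.0, 12.0, 16.0, 20.0]    # hypothetical object sizes in D
factor = scale_match_factor([96.0, 160.0], tiny_sizes, rng)
print(factor)  # the image (and its boxes) would be resized by this factor
```

Applying such a factor to every pre-training image pushes the size histogram of the transformed dataset toward that of the target dataset, which is the goal expressed by the scale change function T.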
The FSAF [37] allows each instance to freely choose the best layer to optimize the network, instead of using the traditional pyramid, which puts several anchors of a fixed size at each level. The best feature layer for each instance is dynamically selected throughout the training phase based on the content of the instance, rather than just its size; the selection function is given by Eq (7):

$$l' = \lfloor l_0 + \log_2(\sqrt{wh}/224) \rfloor \quad (7)$$

where w and h are the width and height of the instance, 224 is the ImageNet pre-training size and $l_0$ is the initial feature layer. The strengths and disadvantages of small pedestrian methods are shown in Table 8.

Method | Venue | Remarks
SaYwF [134] | arXiv19 | A three-phase detection model. Achieves a trade-off between detection accuracy and detection speed.
CSP [135] | CVPR19 | Pedestrian detection is converted to high-level semantic feature prediction. No additional postprocessing is required for CSP. Objects with a large variance in aspect ratio need to be examined.
FSAF [37] | | Dynamically assigning each instance to the most suitable feature level is more robust. Separate anchor-free branches do not have many advantages over anchor-based branches.
Yu et al. [136] | WACV20 | Scale match of the pretrained dataset to the task-specified dataset. Scale match can better utilize the existing annotated data. It has poor performance on TinyPerson.
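The level formula of Eq (7), which involves the 224-pixel ImageNet pre-training size, can be illustrated as follows. The default initial level of 5 and the clamping range of 3 to 7 are assumptions made for this sketch (they correspond to a P3 to P7 pyramid) and are not stated in the survey.

```python
import math


def assign_level(w, h, l0=5, l_min=3, l_max=7):
    """Pyramid level for a w x h instance, computed from its size alone.

    l0 is the level that a 224 x 224 instance maps to; l_min and l_max clamp
    the result to the levels that actually exist (illustrative defaults).
    """
    level = math.floor(l0 + math.log2(math.sqrt(w * h) / 224))
    return max(l_min, min(l_max, level))


for side in (16, 224, 448):
    print(side, assign_level(side, side))
```

Under these assumptions, a 16 × 16 instance falls below the finest available level and is clamped to it, which illustrates why a purely size-based rule leaves small objects with little choice of feature level.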

SOD in aerial images
Object detection in aerial images is crucial in many real-world applications, including urban planning, emergency rescue [137], traffic detection [138,139], etc. Since aerial images are usually taken from high altitudes looking down, objects appear with widely varying rotations and in arbitrary orientations. In addition, aerial images contain highly dense scenes and many small objects, making SOD a complex problem for aerial remote sensing images. Innovative detection algorithms have emerged to address these issues.
S²A-Net [140] contains a feature alignment module and an oriented detection module to maintain consistency between the classification score and localization accuracy. SCRDet [141] designed a supervised multidimensional attention module to highlight small object regions and reduce the effect of background noise. Oriented R-CNN [142] and MRDet [143] both proposed a lightweight region proposal network to generate oriented proposals. The authors of [144] proposed a model with four parts. The first part serves as the backbone and extracts feature maps from the input images; it incorporates a ResNet50 network with deformable convolutional layers because regular convolutions cannot adapt to the viewpoint variations in images taken by drones. The second part uses an FPN to exploit and improve the feature maps obtained from ResNet50. The third part is the RPN, which extracts candidate object proposals from the image. The last part is a task head in which bounding box and mask prediction are performed by an interleaved cascade architecture. Yi et al. [145] extended the center key-point object detector for oriented object detection. A U-shaped network [146] is the foundation of the model. In the process of up-sampling, skip connections are used to combine feature maps. Four maps make up the output of the architecture: the heat map, offset map, box parameter map and orientation map. The heat map and offset map are used to deduce the locations of the center points. After the center points are detected, the box boundary-aware vectors (BBAVectors) are regressed to capture the oriented bounding boxes. According to Han et al. [147], CNNs lack rotation invariance, which means that, after an image is rotated, the features extracted from it also change. ReDet was therefore proposed, allowing CNNs to have rotation invariance.
They incorporate rotation-equivariant networks into the backbone to extract rotation-equivariant features, which allows for precise prediction of the orientation. Then, the rotation-invariant RoI Align module was developed based on RROI Align [148] to align both the channel dimension and the spatial dimension and obtain rotation-invariant features. DarkNet-RI [149] uses DarkNet53 [7] as a backbone that contains a rotation-invariant layer to extract rotation-invariant multi-scale features, and it uses classification solutions to directly predict the location of objects. After that, a box refinement module carries out additional NMS to eliminate overlapping and redundant bounding boxes. RepPoints [150] develops adaptive point sets and can capture the geometric structure of aerial objects with abrupt changes in direction in a cluttered environment. Three oriented conversion functions were presented by Li et al. [151] to transform adaptive points into oriented bounding boxes for various oriented objects. They apply MinAeraRect in post-processing to produce the standard rotated rectangle prediction, and the NearestGTCorner and MinAeraRect functions are applied to enhance adaptive point learning during training. Xu et al. [152] proposed Dot Distance (DotD), i.e., a normalized Euclidean distance between the centroids of two bounding boxes, to solve the problem of IoU being sensitive to minute offsets between bounding boxes when detecting tiny objects. S²ANET-SR [153] uses super-resolution to enhance the feature extraction of small objects in remote sensing images and incorporates perceptual loss and texture matching loss to train S²ANET-SR jointly with the detection loss. The authors of [154] developed a cross-layer attention module to extract non-local features from small objects to enhance their features.
The authors of [155] utilized a Gaussian mixture model to generate focal regions, as well as an incomplete box suppression method to mitigate the truncated box problem, which improved the performance of SOD. The strengths and weaknesses of aerial image methods are shown in Table 9.
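As an illustration of the Dot Distance idea mentioned above, the sketch below computes a normalized center distance between two boxes. It assumes the exponential form exp(−D/S), where D is the Euclidean distance between the box centers and S is a dataset-level average object size; both the exact normalization and the value of S used here are assumptions for this sketch.

```python
import math


def dotd(box_a, box_b, avg_size):
    """Normalized center distance between boxes given as (x1, y1, x2, y2).

    avg_size is a dataset-level scale (an assumed value, e.g. the average
    object size in pixels) used to normalize the center distance.
    """
    cxa, cya = (box_a[0] + box_a[2]) / 2.0, (box_a[1] + box_a[3]) / 2.0
    cxb, cyb = (box_b[0] + box_b[2]) / 2.0, (box_b[1] + box_b[3]) / 2.0
    d = math.hypot(cxa - cxb, cya - cyb)
    return math.exp(-d / avg_size)


# Two disjoint 4 x 4 boxes have an IoU of 0, but their DotD stays informative.
print(dotd((0, 0, 4, 4), (5, 0, 9, 4), avg_size=12.0))
```

Because the score depends only on the center offset relative to a typical object size, a shift of a few pixels changes it smoothly instead of collapsing it to zero as IoU does for tiny boxes.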

Evaluation of SOD
This section provides an overview of the SOD datasets that are currently available. The performance of SOTA SOD approaches is also evaluated on four large-scale datasets. We chose well-known image datasets: MS COCO for the general SOD evaluation, WIDER FACE for SOD tasks with small faces, TinyPerson for SOD tasks with small pedestrians and DOTA for SOD tasks on aerial images.

Dataset
A high-quality dataset is important for developing advanced object detection algorithms. In recent years, many well-known datasets for object detection have been published, such as MS COCO [43] and VOC [45]. VOC is a dataset for the Pascal VOC challenge object detection subtask, which has two versions: VOC2007 and VOC2012. More than 27 thousand object instance bounding boxes are labeled in 33,043 images in VOC2012. MS COCO is a sizable multi-task dataset, as it has 91 object categories in all (80 object categories are used for object detection tasks) and 2.5 million labeled instances in 328 thousand images. The tasks on the COCO dataset are more challenging because, in contrast to VOC, COCO contains more small objects and more complicated backgrounds in the images. COCO also has a more balanced object distribution. Less than 20% of the images in the COCO dataset have only one category, and each image contains on average 3.5 categories and 7.7 object instances. In contrast, over 70% of the images in the VOC dataset have only one category; on average, there are only 1.4 categories and 2.3 object instances per image. These benchmarks have boosted the development of methods for detecting regular-sized objects. Unfortunately, the detection of small objects still lags behind, owing both to the characteristics of small objects themselves and to the scarcity of benchmarks designed for SOD. To provide a comprehensive review of datasets, we investigated datasets containing a large number of small objects that span various SOD tasks, such as face detection, pedestrian detection, traffic sign/light detection and aerial image object detection, as shown in Tables 10-13.

Table 10. Overview of face detection datasets.

Dataset | Year | Description
WIDER FACE [47] | 2016 | WIDER FACE is a large-scale dataset of face images. Images are selected from the publicly available WIDER dataset.
IJB [156] | 2015 | IJB-A/B/C is a dataset for face detection and recognition. IJB-A contains 1845 objects, 11,754 images, 55,026 video frames, 7011 videos and 10,044 non-facial images.
DarkFace [157] | 2019 | The DarkFace dataset offers 6000 nighttime low-light photos from real-world locations, all labeled with bounding boxes of human faces. Additionally, this dataset has 9000 unlabeled low-light images taken in the same environment.
UFDD [158] | 2018 | UFDD, an unconstrained face detection dataset, consists of more than 6000 images and 11,000 faces, and it contains seven scenes: rain, snow, haze, blur, illumination, lens impediments and distractors.
WildestFaces [159] | 2018 | The WildestFaces dataset includes 67,889 pictures. Along with annotations for face detection and recognition, it also includes tags for blur severity, scale and occlusion.

Dataset | Year | Description
TinyPerson | | TinyPerson is a challenging benchmark for tiny object detection in a complex context and at a long distance. A total of 72,651 labeled very small objects are included in the dataset.
WiderPerson [160] | 2020 | The WiderPerson dataset contains 32,203 images with a total of 393,703 instances.
EuroCity [161] | 2018 | The EuroCity person dataset was collected in several European countries by in-vehicle cameras; it includes about 47,300 images with more than 238,200 annotated instances of people.
Citypersons [162] | 2017 | The Citypersons dataset is a subset of Cityscapes; it offers 5,000 images from 27 cities with 30 fine-grained, pixel-level annotations.
Caltech [163] | 2009 | Caltech is a challenging dataset that contains low-resolution, frequently obstructed objects. There are 192,000 and 155,000 pedestrian instances in the training and testing sets, respectively.

Evaluation metrics
Frames per second refers to the speed of object detection, and it indicates the number of images that can be processed within each second. A higher value implies that the method is faster and can potentially be applied to real-time SOD.
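IoU measures the similarity between the areas of the predicted bounding box ($bbox_{pred}$) and the ground truth bounding box ($bbox_{GT}$). The IoU function is given by Eq (8):

$$\mathrm{IoU} = \frac{\mathrm{area}(bbox_{pred} \cap bbox_{GT})}{\mathrm{area}(bbox_{pred} \cup bbox_{GT})} \quad (8)$$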
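For axis-aligned boxes, the IoU definition above reduces to a few arithmetic operations. The sketch below assumes boxes in (x1, y1, x2, y2) corner format; the function name is illustrative.

```python
def iou(pred, gt):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Width and height of the overlap region (zero if the boxes are disjoint)
    iw = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    ih = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = iw * ih
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0


# Two 10 x 10 boxes offset by 5 pixels in each direction:
# intersection 25, union 175.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))
```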
Dataset | Year | Description
UAVDT [165] | 2018 | UAVDT is a sizable UAV-based video dataset with 80,000 total frames that is intended for vehicle detection and tracking.
DOTA [166] | 2018 | DOTA has three versions so far; DOTA-v1.0 includes 188,282 instances in 2806 aerial images across 15 main categories.
NWPU VHR-10 [167] | 2016 | The NWPU VHR-10 dataset contains a total of 800 very high-resolution optical remote sensing images, which were acquired from Google Earth and Vaihingen.
UCAS-AOD [168] | 2015 | The UCAS-AOD dataset includes many small objects with intricate backgrounds, with a total of 2420 images and 14,596 instances.

AP is a common metric for object detection tasks, and the following definitions are used in the AP calculation.
1) Positive sample: a sample that contains a detection object, and the prediction bbox confidence score is larger than the set threshold.
2) Negative sample: a sample that does not contain a detection object, and the prediction bbox confidence score is larger than the set threshold.
In the VOC dataset, the IoU threshold is typically set to 0.5. Positive samples with IoU values higher than 0.5 are labeled as TP, and positive samples with IoU values lower than 0.5 are labeled as FP. FN indicates the number of objects in the ground truth that are not found. Then, the precision rate and recall rate are given in Eqs (9) and (10):

$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (9)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (10)$$

AP is calculated across different recalls. Specifically, for a given recall value r, the precision is taken as the maximum precision over all recall values that are greater than or equal to r. Then, the area under the resulting precision-recall (P-R) curve is referred to as the AP value. The mAP is the mean AP value across all categories. AP and mAP are given in Eqs (11) and (12):

$$AP = \int_0^1 P(R)\, dR \quad (11)$$

$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i \quad (12)$$

where N is the number of categories.
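The AP computation described above can be sketched for a single category as follows. The sketch assumes that each detection has already been labeled as TP or FP by IoU matching, and it uses all-point interpolation of the P-R curve; benchmark toolkits differ in interpolation details, so this is an illustration rather than a reference implementation.

```python
def average_precision(scores, is_tp, num_gt):
    """Area under the interpolated precision-recall curve for one category.

    scores: confidence score of each detection
    is_tp:  True if the detection matched a ground-truth box, else False
    num_gt: number of ground-truth objects (TP + FN)
    """
    if num_gt == 0:
        return 0.0
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))

    # Interpolation: precision at recall r is the maximum precision
    # observed at any recall greater than or equal to r.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])

    # Sum the rectangles under the interpolated curve.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


ap = average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, False], num_gt=2)
print(round(ap, 4))
```

The mAP is then the mean of this value over all categories.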
The stricter COCO evaluation metric is more widely used than the PASCAL VOC evaluation metric. Its IoU thresholds range from 0.5 to 0.95 with a step size of 0.05. AP is also calculated separately for small (area < 32²), medium (32² < area < 96²) and large (area > 96²) objects in MS COCO.

Table 14 shows the performance evaluation results for generic SOD algorithms applied to the COCO dataset; note that AP has the same meaning as mAP. AP50 and AP75 denote the AP when the IoU threshold is set to 0.5 or 0.75, respectively, while APs, APm and APl denote the AP for small, medium and large objects, respectively. As shown, IENet [179] achieves the best AP (51.2). In general, the detection performance for large objects is much higher than that for other sizes. HRDNet [78] achieves a value of 32.1 for small objects, whereas MRCenterNet [118] achieves a value of 27.8 for small objects. These results show that increasing the resolution of the input feature with multi-scale training can yield better performance on small objects. All experiments were conducted on a Linux system with an NVIDIA GeForce RTX 2080Ti and CUDA 11.7.
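The COCO-style size buckets and IoU thresholds referred to in this subsection can be written down compactly. This is a minimal sketch: the handling of areas that fall exactly on a boundary is an assumption, and the helper names are illustrative rather than part of any evaluation toolkit.

```python
# IoU thresholds 0.50, 0.55, ..., 0.95
COCO_IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def coco_size_bucket(area):
    """Assign an object to a size bucket from its area in pixels."""
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"


def mean_over_thresholds(ap_by_threshold):
    """Average the per-threshold AP values into a single score."""
    values = list(ap_by_threshold.values())
    return sum(values) / len(values)


print(COCO_IOU_THRESHOLDS)
print(coco_size_bucket(20 * 20), coco_size_bucket(50 * 50), coco_size_bucket(200 * 200))
```

Averaging over the ten thresholds rewards precise localization, which is one reason this metric is stricter than a single 0.5 threshold.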

Performance for small face detection
In Table 15, we evaluate small face detection methods on WIDER FACE [47]. WIDER FACE defines three levels of difficulty, i.e., 'easy', 'medium' and 'hard', based on the detection rate of EdgeBox [180]. As shown, TinaFace [128] achieves the best AP; the AP values for the easy, medium and hard test sets are 96.3, 95.7 and 92.1, respectively. IENet [179] also achieves competitive results; the AP values for the easy, medium and hard test sets are 96.1, 94.7 and 89.6, respectively. TinaFace and IENet both increase the resolution of the prediction feature map, which fully utilizes the fused feature map. IENet also fully incorporates the contextual information. This suggests that boosting the resolution of the prediction feature map and incorporating contextual information may be the key to enhancing face detection.

Table 16 shows the results of typical small pedestrian detection methods on the TinyPerson [136] dataset. MR [184] denotes the miss rate. The superscripts of MR and AP indicate the size range, where tiny denotes the size range (2, 20) and small denotes the size range (20, 32). The subscripts of MR and AP indicate the IoU thresholds used for the evaluation. Among these algorithms, FCOS [39] achieves the best results for all MR evaluations. With an IoU of 0.5, the FPN produced the best AP for small and tiny objects, whereas the Grid R-CNN [185] did so with IoU values of 0.25 and 0.75, respectively.

Performance on aerial images
In Table 17, we compare the performance of state-of-the-art aerial image object detection algorithms on DOTA-v1.0 [166], which consists of 15 categories: plane (PL), baseball diamond (BD), bridge (BR), ground track field (GTF), small vehicle (SV), large vehicle (LV), ship (SH), tennis court (TC), basketball court (BC), storage tank (ST), soccer-ball field (SBF), roundabout (RA), harbor (HA), swimming pool (SP) and helicopter (HC). ReDet and Oriented R-CNN achieve the best mAP value of 76.3. The best AP in each category is marked in bold.

Further discussion
Based on the experimental results, we further discuss some limitations of existing SOD methods as follows.
1) SOD frameworks are generally adapted from popular models like Faster R-CNN, SSD and YOLO; these architectures may not be suitable for small objects, leading to poor performance.
2) Using super-resolution to enhance the resolution of a small object can improve the precision of SOD, but the detection speed becomes significantly lower and cannot meet the demands of real-world scenarios like real-time monitoring.
3) Transformers have been widely applied in the computer vision field, like DETR [190] in object detection. However, there has not been much research on using transformers for SOD.
4) CNNs are not sensitive to scale changes. There is a need to design feature extractors that are scale-aware.
5) MS COCO may not be an ideal benchmark for small objects because small objects account for a relatively small percentage of the dataset.

Challenges of SOD
In addition to the common challenges in object detection, such as continual object detection and imbalance problems, SOD poses its own typical challenges, including feature representation with noise, small object information loss, the effect of the receptive field, location variation sensitivity and the scarcity of small object datasets.
Feature representation with noise. The features of small objects are often contaminated by background noise after passing through the CNN, making it difficult for the network to capture the discriminative information that is pivotal for the localization and classification tasks. Besides, small objects are often occluded and clustered, so it is particularly difficult to distinguish small objects from noisy clutter and precisely locate their boundaries.
Small object information loss. Because each small object occupies only a small number of pixels, its features are virtually eliminated by the down-sampling operations in deeper neural networks. Losing this already weak signal is fatal to SOD because it is hard for the detection head to give accurate predictions from the remaining highly structural representations.
Effect of the receptive field. Large receptive fields are typically chosen by deep neural networks to prevent the loss of information. However, the receptive field of the low-resolution prediction feature map may not match the size of small objects. If the receptive field is much larger than the small object, the object is dominated by the background and few of its own features are extracted by the backbone network, resulting in poor SOD performance.
Location variation sensitivity. Small location deviation of the bounding box in the IoU-based metric produces a more significant disturbance for small objects than for larger objects, which makes it difficult to find a suitable IoU threshold and deliver high-quality positive and negative samples to train the networks.
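A short numeric experiment makes this sensitivity concrete. The sketch below shifts a square predicted box by the same number of pixels relative to ground-truth boxes of different sizes and reports the resulting IoU; the box sizes and the 2-pixel diagonal shift are arbitrary illustrative choices.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def shifted_iou(size, shift):
    """IoU between a size x size box and a copy shifted diagonally by `shift` pixels."""
    gt = (0, 0, size, size)
    pred = (shift, shift, size + shift, size + shift)
    return iou(pred, gt)


for size in (8, 32, 128):
    print(size, round(shifted_iou(size, 2), 3))
```

With a 2-pixel diagonal shift, the IoU is about 0.39 for an 8 × 8 object, about 0.78 for a 32 × 32 object and about 0.94 for a 128 × 128 object, so the same localization error turns a small-object prediction into a negative sample under a 0.5 threshold while barely affecting a large one.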
Scarcity of small object datasets. Large-scale, general-purpose small object datasets remain scarce, in part because annotating small objects is costly. Although MS COCO has a reasonably large proportion of small objects (31.62%), each image has too many instances, which leads to an uneven distribution of small objects.

Future directions
According to the challenges of SOD and the analysis of performance results, we discuss several potential directions for future research in SOD:
1) Weakly supervised, unsupervised and self-supervised SOD. Existing deep learning-based SOD techniques use a fully supervised model. For model training, a sizable number of images with bounding-box annotations (fully supervised information) are required. However, the annotation work is labor-intensive and time-consuming. Weakly supervised object detection can use image-level labels (such as image categories) as supervised signals to train object localization models without the need for pixel-level annotation, which lessens the workload associated with the annotation. Unsupervised salient object detection [191] and self-supervised learning tasks [192] based on contrastive learning have become hot research topics in the past 2 years. Therefore, it is crucial to continue researching the development of weakly supervised learning-based SOD algorithms.
2) Suitable metric for SOD. IoU-based metrics, including the original IoU and its extensions (DIoU, GIoU, etc.), are extremely sensitive to the position deviation of small objects and significantly reduce the detection performance when utilized in anchor-based detectors. The authors of [119] use a new Wasserstein distance-based SOD metric, which outperformed the standard fine-tuning baseline by 6.7 AP points and the state-of-the-art model by 6.0 AP points. Therefore, designing a suitable metric for small objects will be crucial to further research.
3) Multi-task joint optimization. Even though techniques like scale-aware training strategies, incorporating contextual information, data augmentation and increasing the resolution of input features help to improve SOD performance, they are still far from adequate, and the combined use of these methods may be able to further improve SOD performance.
4) Open world or few-shot SOD. Few-shot object detection [193] has produced prominent achievements, and SOD in the few-shot scenario is also in urgent need of solutions. Open world SOD seeks to overcome the SOD conundrum while enabling incremental learning in the model, and this type of issue will be a significant research topic in the future.
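To illustrate the kind of Wasserstein distance-based metric mentioned in direction 2), the sketch below assumes a common formulation in which each box (cx, cy, w, h) is modeled as a 2D Gaussian with mean (cx, cy) and covariance diag(w²/4, h²/4). Under that assumption, the squared 2-Wasserstein distance has a closed form, and an exponential maps it to a similarity in (0, 1]. The normalizing constant c is a hypothetical, dataset-dependent value, and the sketch is not claimed to reproduce the exact metric of [119].

```python
import math


def gaussian_wasserstein_similarity(box_a, box_b, c=12.8):
    """Similarity of two boxes given as (cx, cy, w, h), each modeled as a Gaussian.

    c is a hypothetical normalizing constant (roughly a typical object size
    in pixels); it must be chosen per dataset.
    """
    cxa, cya, wa, ha = box_a
    cxb, cyb, wb, hb = box_b
    # Closed-form squared 2-Wasserstein distance between the two Gaussians
    w2_sq = ((cxa - cxb) ** 2 + (cya - cyb) ** 2
             + ((wa - wb) / 2.0) ** 2 + ((ha - hb) / 2.0) ** 2)
    return math.exp(-math.sqrt(w2_sq) / c)


# Two 8 x 8 boxes that do not overlap at all still receive a graded similarity.
print(gaussian_wasserstein_similarity((4, 4, 8, 8), (14, 4, 8, 8)))
```

Unlike IoU, this similarity varies smoothly with the center offset and size difference even when the boxes do not overlap, which is the property that makes such metrics attractive for tiny objects.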

Conclusions
An in-depth review of state-of-the-art deep learning-based SOD algorithms is provided in this paper. We focus on SOD optimization approaches that aim to address the challenges of SOD, including scale-aware training, contextual information incorporation, data augmentation and boosting the resolution of input features. We have summarized the strengths and limitations of these approaches. We have also reviewed methods for crucial SOD tasks, including tiny face detection, tiny pedestrian detection and aerial image object detection. Additionally, detailed experiments were carried out to evaluate the performance of generic SOD algorithms, as well as methods for crucial SOD tasks; we found that boosting the resolution of input features is the most efficient way to improve SOD performance. Finally, we have presented four potential future directions for SOD.