ESRTMDet: An End-to-End Super-Resolution Enhanced Real-Time Rotated Object Detector for Degraded Aerial Images

The degradation of image resolution reduces detection performance in aerial imagery because it generates a large number of small objects, and accurately detecting these small objects remains a challenge. To solve this problem, existing methods mostly use a superresolution (SR) model to first obtain the SR image of the low-resolution degraded image (<inline-formula><tex-math notation="LaTeX">$I^{\text{LR}}$</tex-math></inline-formula>) and then use this image as the input of the object detection (OD) network. However, executing a complex SR network before the detector is time-consuming and makes real-time model inference hard to achieve. To address this challenge, we propose a simple and effective rotated small OD method, named end-to-end superresolution enhanced real-time rotated object detector (ESRTMDet). First, we design a lightweight embedded feature map superresolution module (ESRM) inside the detection model to enhance and amplify the backbone output features, making it easier for the detection heads to detect small objects. Furthermore, we simultaneously train a parallel SR network branch (PSRB) that uses the backbone feature to restore a high-resolution image. Through our proposed feature alignment loss and feature affinity layer, the PSRB effectively guides the feature map enhancement of the ESRM. Finally, through end-to-end joint optimization of the detector and the PSRB, the detection performance on <inline-formula><tex-math notation="LaTeX">$I^{\text{LR}}$</tex-math></inline-formula> is significantly improved. Extensive experiments on DOTA and UCAS-AOD demonstrate that our method achieves state-of-the-art results. In addition, we discard the PSRB and use <inline-formula><tex-math notation="LaTeX">$I^{\text{LR}}$</tex-math></inline-formula> as the input during inference, which reduces the inference time of our model. As a result, our ESRTMDet-X not only achieves 77.11% mean average precision (mAP) on the degraded DOTA dataset but also reaches an inference speed of 337 FPS, thus obtaining the best speed–accuracy tradeoff.


I. INTRODUCTION
AERIAL images obtained from Earth observation and remote sensing technologies provide a bird's-eye view of the Earth's surface, depicting complex spatial scenes and numerous diverse objects. Image classification and object detection for aerial images (also known as remote sensing images) are among the most fundamental and challenging research topics in the geoscience and remote sensing communities. In recent years, significant progress has been achieved in these areas thanks to the development of deep learning techniques. For aerial imagery classification tasks, a combination of graph convolutional networks and convolutional neural networks (CNNs) has been used to extract diverse and discriminative features [1], resulting in superior classification performance for hyperspectral remote sensing images. Furthermore, a multimodal deep learning framework [2] has been developed to effectively utilize information from remote sensing images of different modalities, achieving state-of-the-art (SOTA) performance in pixel-level remote sensing image classification. Object detection in aerial images (ODAI) is a challenging task due to the presence of a great number of small, cluttered, large-aspect-ratio, and arbitrarily oriented objects [3]. In recent years, significant progress has been achieved in ODAI with the development of deep CNNs [3], [4], [5], [6], [7], [8], [9]. However, these methods rely on high-resolution (HR) aerial images ($I^{\text{HR}}$) that have a resolution of up to half a meter and good image quality.
In practice, due to harsh imaging conditions, such as aerial camera shake, limited transmission bandwidth, long-range shooting, and undersampled imaging, degraded aerial images [10], [11], [12], [13] are commonly captured. Hence, object detection for degraded aerial images has gained more attention in recent years. In particular, resolution degradation is a common type of degradation, and the resulting images are also called low-resolution (LR) degraded images ($I^{\text{LR}}$), as shown in Fig. 1. Compared to $I^{\text{HR}}$, $I^{\text{LR}}$ often lacks texture features, and object regions are more blurred, leading to poor detection results [14]. To address the problem of resolution degradation, recent works [14], [15], [16], [17], [18], [19] have introduced superresolution (SR) methods to restore missing texture and features in $I^{\text{LR}}$ before or after object detection (OD). For instance, Rabbi et al. [17] proposed the edge-enhanced superresolution generative adversarial network (EESRGAN) to obtain SR images prior to executing the detector network. They backpropagated the detection loss and the discriminator's loss into the generator's parameters, optimizing the generative adversarial network (GAN) jointly to produce SR images that more closely resemble HR images. Bai et al. [20] used a two-stage OD method (FasterRCNN) to obtain object patch images first. They then employed the superresolution generative adversarial network (SRGAN) to obtain SR patches and used the discriminator to refine the classification and regression results. Yang et al. [14] proposed the mutual-feed learning (MFL) architecture, which also used SRGAN to obtain SR images prior to executing FasterRCNN. They added a feedback path to the SRGAN discriminator, forming a closed-loop structure. MFL used a discriminator to distinguish the region of interest (RoI) features extracted by the region proposal network (RPN), the RoI features cropped on the SR image, and the RoI features cropped on the HR image.
This approach makes the SRGAN pay more attention to the regions where objects may exist. However, the abovementioned methods are time-consuming and struggle to enhance the features that are useful for detection: the optimization of the generator is not guided by object information, and the two tasks are optimized separately and iteratively. Moreover, these methods do not consider the problem of rotated ODAI, as they all use horizontal bounding boxes (HBBs) to represent objects. Furthermore, we note that training the detector and running inference directly on the LR input image ($I^{\text{LR}}$) significantly reduces the computational burden. This approach can bring more noticeable inference acceleration than many model compression techniques [21], which makes it more conducive to real-time inference.
Overall, the motivation of our article is to use a rotated object detection (ROD) method that integrates an SR network to solve or alleviate the loss of detection performance caused by image resolution degradation, under the constraint that the model can be easily deployed in practical unmanned aerial vehicle (UAV) systems. Through analysis of existing methods, we have identified the following four remaining challenges that need to be addressed to achieve our goals.
1) The challenge of achieving precise detection of small rotated objects. The degradation of resolution in aerial imagery leads to the presence of numerous small targets, as shown in Figs. 2 and 3. However, existing methods continue to use HBBs to locate objects, overlooking the important feature that objects in aerial images can be arbitrarily oriented. As a result, the challenge of achieving accurate detection of small rotated objects in the field of OD remains unsolved.
2) The challenge of achieving a lightweight overall model. The majority of existing OD methods that integrate an SR network use SOTA GAN-based image SR methods to directly upsample degraded images and then run a two-stage detector on these SR images. This results in a complex model architecture and a large number of model parameters, which makes it difficult to keep the overall model lightweight.
3) The challenge of jointly optimizing different types of models. Because SR and OD networks are designed for different types of tasks, the features they extract and the aspects of the image they focus on are quite different. As a result, an architecture that combines OD with SR is hard to optimize jointly in an end-to-end manner. Existing methods typically adopt a training process in which the two types of models are trained separately and then fine-tuned through joint training. This training pipeline is often suboptimal and inefficient [14], [22] because information is not exchanged between the models and significant time is spent on independent model training in advance.
4) The challenge of ensuring fast model inference. Considering the actual deployment requirements of possible UAV systems, the detection model must have a fast inference speed, achieving at least 60 FPS to ensure stable and reliable detection. However, existing OD methods that combine SR usually demand full execution of the complex SR network during inference, which intensifies the computational burden and leads to slow inference speeds (often below 60 FPS) that fail to meet actual deployment requirements.
Therefore, to realize a real-time ROD method for aerial images, we propose a series of simple and effective models named ESRTMDet. Our method not only addresses the drawbacks of the abovementioned detectors combined with SR but also obtains, to the best of our knowledge, the fastest detection speed. In summary, our key contributions are as follows.
1) We propose a lightweight embedded feature map superresolution module (ESRM) that is placed after the detection backbone and before the neck. Our ESRM effectively uses the valuable texture enhancement features learned by the parallel superresolution network branch (PSRB) to enhance the detection head's ability to extract small object features, without adding much computational burden. Through the ESRM, we alleviate the difficulty of detecting small rotated objects (challenge 1), and by embedding the ESRM into a lightweight detection model, we keep the overall model lightweight (challenge 2).
2) We use the PSRB as an auxiliary network and employ the feature affinity layer (FAL) and the feature alignment loss ($L_{\text{AL}}$) to guide the ESRM in restoring high-frequency texture information, thus enhancing the amplification quality of the feature maps. In addition, our PSRB is not involved in model inference, which ensures real-time detection ability and addresses challenge 4.
3) To enable the PSRB to focus more on the regions where detection objects are present, we generate RoI weights using the predicted output of the classification branch of the detector heads. Our RoI weights optimize the training of the PSRB, ESRM, and FAL, allowing for effective end-to-end joint optimization between these two heterogeneous learning tasks; hence, challenge 3 is also addressed.
4) A series of experiments on the DOTA and UCAS-AOD datasets demonstrate the effectiveness of our method. Our ESRTMDet-X achieves 77.11% mAP on DOTA with single-scale training and testing, as well as 89.5% and 95.0% on UCAS-AOD using the VOC2007 and VOC2012 metrics, respectively, achieving SOTA detection performance on aerial $I^{\text{LR}}$. Furthermore, our proposed model achieves an inference speed of 337 FPS, making it, as far as we know, the best tradeoff between speed and accuracy for ROD in degraded aerial images.
Therefore, our research provides significant practical value for the deployment of deep learning algorithms in actual UAV systems.

II. RELATED WORK
In this section, we review recent related works on three aspects: rotated ODAI, SR networks, and the methods of combining SR and OD (SR+OD) in aerial images, as our proposed method integrates both SR and ROD.

A. Rotated ODAI
In the last decade, significant progress has been made in the field of OD, with notable advancements in [23], [24], [25], [26], [27], [28], [29]. At the same time, significant progress has been made in ODAI. For example, Wu et al. [30] combined a novel spatial-frequency channel feature with fast image pyramid estimation and ensemble classifier learning in the classic VJ [31] detection framework to achieve the most advanced detection performance among non-deep-learning methods. However, this approach is limited by its detection framework, which only allows HBBs to represent objects, so it is difficult to extend it to the more precise rotated bounding boxes (RBBs). Oriented OD, also known as ROD, is a subfield of OD that uses RBBs to represent objects. In recent years, it has attracted considerable attention due to its potential applications in various fields, such as management, remote sensing, precision agriculture, national defense, emergency rescue, and disaster relief [4], [32], and it has become the most important research subtopic in remote sensing image OD tasks.
To tackle the challenge of detecting rotated objects in aerial images, one possible approach is to use rotated anchors, such as rotated RPN [32], which places anchors with different angles, scales, and aspect ratios on each location. However, densely rotated anchors result in extensive computations and memory usage. To address these issues, Ding et al. [4] proposed the RoI transformer that learns rotated RoIs from horizontal RoIs produced by RPN, which significantly improves the accuracy of oriented OD. However, this method increases the network complexity and requires fully connected layers and RoI alignment operations during the learning of rotated RoIs. On the other hand, Xu et al. [5] proposed a new representation called gliding vertices, which achieves ROD by learning four vertex gliding offsets on the regression branch of the FasterRCNN head [23]. Although this method simplifies the computation by avoiding RoI alignment and fully connected layers, it still uses horizontal RoIs and is based on a two-stage detection architecture, which is time-consuming and computationally expensive.
To overcome these limitations, some studies [6], [33] explored one-stage oriented OD frameworks based on RetinaNet [24], which output object classes and RBBs without region proposal generation and RoI alignment operations. In addition, anchor-free detectors have developed rapidly in general OD tasks in recent years [26], [27], [28], [29], [34]. The anchor-free mechanism significantly reduces the number of design parameters that require heuristic tuning and tricks for good performance, which simplifies the detector, especially during the training and decoding phases [26], [34]. Several studies have explored anchor-free mechanisms for ROD. Pan et al. [35] developed a dynamic refinement network based on the anchor-free CenterNet [28]. He et al. [36] utilized attention mechanisms to refine the performance of remote sensing OD in a one-stage anchor-free network framework. Gong et al. [37] proposed an anchor-free oriented proposal generator to replace the RPN for horizontal boxes in the FasterRCNN detector, which resulted in improved performance. Li et al. [38] proposed an effective anchor-free method called oriented RepPoints, which uses an adaptive point set to capture the semantic and geometric features of an oriented object as a fine-grained representation. Liu et al. [8] used a Gaussian distribution to constrain the RBB and proposed a new assignment method suitable for rotated detection tasks. They combined this method with YOLOX to obtain stronger detection performance. However, the RBB representation causes problems, such as boundary discontinuity and square-like issues, and makes rotated IoU losses non-differentiable, which hinders the use of anchor-free methods. Therefore, Yang et al. designed the GWD [39] and KLD [40] regression losses based on the Gaussian Wasserstein distance and the Kullback-Leibler divergence, respectively. These losses can be used with the anchor-free method FCOS [26] and result in performance improvements in ROD tasks.

B. SR Network
SR is a technique that generates an HR image from an LR image, with the aim of recovering high-frequency texture information [41], [42]. Superresolution CNN (SRCNN) [43] was the first to successfully use a CNN for the SR problem. SRCNN's structure is straightforward, consisting of only three CNN layers, and it processes preupsampled images obtained by bicubic interpolation. Residual learning, which uses skip connections to avoid vanishing gradients, makes the design of deep networks possible compared with the original stacked CNN [44]. Inspired by the ResNet architecture, the enhanced deep superresolution (EDSR) network [45] removes the batch normalization (BN) layers in each residual block (ResBlock) of ResNet, since BN layers remove range flexibility from the network, and thereby achieves a performance improvement. With the development of deep learning, GANs have shown a remarkable ability for SR problems. Superresolution GAN (SRGAN) [46] focuses the generator on recovering high-frequency texture information using a perceptual loss. Enhanced SRGAN [47] removes the BN layers in the generator and replaces the normal ResBlock with a residual-in-residual dense block, achieving a more significant performance improvement. However, most SR methods pursue better SR results by using models with a large number of parameters, leading to a higher computational burden and lower network inference speed.

C. Methods of Combining SR and OD (SR+OD) in Aerial Images
The use of SR as a preprocessing step in OD has proven to be effective in various OD tasks [48]. Shermeyer et al. [49] also demonstrated the usefulness of SR for OD performance on satellite imagery. Courtrai et al. [50] used an SR network based on GAN to generate SR images, which are then fed into the detector to improve detection performance. Rabbi et al. [17] used a Laplacian operator to extract edges from input images to enhance the ability to reconstruct HR images, resulting in improved performance in object localization and classification. Small-object detection (SOD)-multitask GAN (MTGAN) [20] proposed using an OD network to adaptively generate RoI object patches for subsequent restoration and detection. Wang et al. [19] introduced the effectiveness of SR for OD in the remote sensing field, as well as an SR model based on multifeature fusion and CycleGAN structure, to enhance images. Bashir et al. [18] improved the SR framework by incorporating a cyclic GAN and residual feature aggregation (RFA) and used YOLO as the detection network to detect objects on SR images. Yang et al. [14] added a feedback path to take FasterRCNN's RPN results to the SRGAN discriminator, forming a closed-loop structure and making the SRGAN pay more attention to the region where the object may exist. In these works, the SR structure has effectively addressed the challenges of small objects and LR inputs. However, compared with single detection models, additional computation is introduced due to the enlarged scale of the input image to HR size, and the cost of the SR network cannot be ignored. Unlike the aforementioned work, where SR is applied at the start stage, using the SR network only as an auxiliary method to enhance SOD performance without participating in model inference is a more promising architecture. Zhang et al. [22] adopted this architecture, using EDSR as an auxiliary network and YOLO v5 backbone fusion features as input to EDSR to restore the HR image. 
However, this method still lacks information communication between these two different tasks.
Moreover, Wu et al. [51] addressed SOD problems by converting them into semantic segmentation problems and proposed UIU-Net for infrared SOD by utilizing an interactive cross attention mechanism and the ReSidual U-blocks module to improve the classical UNet framework, resulting in the most advanced segmentation performance. However, while the minimum external rectangle postprocessing method can be used to obtain the rotation detection box from the mask, it still cannot be directly applied to the rotating small target detection problem of optical remote sensing images because the commonly used aerial image target detection data lacks fine semantic segmentation masks of the objects.

III. METHOD
In this section, we introduce our proposed method. First, we provide a brief overview of the baseline model, which is the basic rotated detection method we adopt. Second, we introduce the specific network used in our PSRB, which is a paralleled SR network branch. Next, we describe our proposed ESRM in detail. This module is embedded after the detector backbone and before the neck, and it aims to enhance the feature maps to improve the detection performance. Finally, we present the overall architecture of our proposed method and provide the optimization details.

A. Basic Rotated Detection Method as Baseline
In previous works [14], [17], [22], the object orientation in $I^{\text{LR}}$ was not taken into account, and these methods still used HBBs to indicate objects. However, the number of small objects in resolution-degraded aerial images increases greatly, as shown in Figs. 2 and 3, and the conventional HBB representation introduces background information that is not conducive to accurately locating objects. Therefore, research is necessary to detect objects in resolution-degraded aerial images using the more accurate RBB representation. The RBB is usually represented by five parameters $(x, y, w, h, \theta)$, where $\theta \in [-\pi/2, \pi/2)$ denotes the clockwise rotation angle from the positive x-direction of the image coordinate system to the positive x-direction of the bounding box's own coordinate system. We use the long edge definition format [39], where the width $w$ must be larger than the height $h$. Recently, the success of the transformer architecture [52], [53], [54] in the image comprehension field has drawn attention to improving classic CNNs. Among them, ConvNeXt [55] uses large-kernel convolutions to increase the feature receptive field and capture global context, which overcomes the shortcomings of classic 3 × 3 convolutions and achieves significant performance improvements. The real-time models for object detection (RTMDet) series [9], based on the YOLOX series, uses large-kernel depthwise convolutions to replace classic 3 × 3 convolutions when building the basic CSP layers [56], which balances performance and inference overhead well. Beyond large-kernel depthwise convolutions, the RTMDet model has compatible capacities in the backbone and neck, introduces soft labels when calculating matching costs in the dynamic label assignment, and uses better training techniques, all of which efficiently improve detection performance.
Furthermore, this model can be easily adapted to ROD tasks by changing the number of outputs of the regression branch from 4 to 5 (adding the prediction of a rotation angle) and using the simplest rotated IoU loss. We name the resulting model rRTMDet in this article and choose this one-stage anchor-free rotated detector as our basic rotated detection model and baseline. We trained rRTMDet directly on the DOTA $I^{\text{LR}}$ images; the OD results are shown in Table I.
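To make the long edge definition concrete, the following is a minimal Python sketch that converts an arbitrary box parameterization into one with the width as the longer side and the angle wrapped into the half-open interval. The function names `to_long_edge` and `corners` are hypothetical helpers introduced only for this illustration, and the interval $[-\pi/2, \pi/2)$ is assumed.

```python
import math

def to_long_edge(cx, cy, w, h, theta):
    """Return an equivalent box with w >= h and theta in [-pi/2, pi/2)."""
    if w < h:
        # Swapping the sides turns the box's own x-axis by 90 degrees.
        w, h = h, w
        theta += math.pi / 2
    # A rectangle is unchanged by a 180-degree turn, so wrap modulo pi.
    theta = (theta + math.pi / 2) % math.pi - math.pi / 2
    return cx, cy, w, h, theta

def corners(cx, cy, w, h, theta):
    """Corner points of the box, used to check that two encodings coincide."""
    c, s = math.cos(theta), math.sin(theta)
    offsets = ((w / 2, h / 2), (-w / 2, h / 2), (-w / 2, -h / 2), (w / 2, -h / 2))
    return [(cx + dx * c - dy * s, cy + dx * s + dy * c) for dx, dy in offsets]

box = (10.0, 20.0, 2.0, 6.0, 0.3)   # width shorter than height
long_edge_box = to_long_edge(*box)  # same rectangle, w >= h
```

Both encodings describe the same rectangle, which can be verified by comparing the corner sets returned by `corners`.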

B. Parallel Superresolution Network Branch
In this section, we present the structure of the PSRB, which is depicted in Fig. 5. The PSRB comprises three key components: the feature encode module (FEM), the feature decode module (FDM), and the feature up-sample module (FUM). The FEM is based on the stem network architecture of the rRTMDet detection network, as detailed in Fig. 4. Classic SR networks [45], [47] directly use $I^{\text{LR}}$ as input and retain low-level structural image information in their extracted features; this information also exists in the stem part of the detector's backbone network. To better leverage these features and promote the learning of high-level features in the SR network, we place the FEM before the classic SR network. This makes the lowest-level input features of the two tasks as similar as possible, which facilitates end-to-end joint training and optimization. The FDM and FUM are taken from the EDSR model [45], as illustrated in Fig. 5. Specifically, we adjust the number of channels in the first convolution layer of EDSR to match the output feature channel dimension of the FEM. In addition, based on our experiments in Section IV-C, we use only four stacked ResBlocks in our FDM instead of the original sixteen, since a deeper PSRB did not improve performance but significantly prolonged training time. The FUM consists of subpixel convolution layers [57], the same as in EDSR. Since the feature maps are downsampled by a factor of 2 after the FEM, we use the FUM to upsample them by a factor of 4 and then reduce the channel dimension to 3 through the final convolution to obtain the final SR image. Therefore, relative to the $I^{\text{LR}}$ input, our proposed PSRB performs a ×2 image SR task.
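The rearrangement step at the heart of a subpixel convolution layer, together with the scale arithmetic of the PSRB, can be illustrated with a small plain-Python sketch. It follows the usual pixel-shuffle channel ordering and is not the authors' implementation; `pixel_shuffle` and `psrb_output_size` are hypothetical names, and the preceding convolution that produces the extra channels is omitted.

```python
def pixel_shuffle(x, r):
    """Rearrange a [C*r*r][H][W] nested list into [C][H*r][W*r]."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    assert c_in % (r * r) == 0, "channel count must be divisible by r*r"
    c_out = c_in // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):          # vertical offset inside each r x r cell
            for j in range(r):      # horizontal offset inside each r x r cell
                src = x[c * r * r + i * r + j]
                for row in range(h):
                    for col in range(w):
                        out[c][row * r + i][col * r + j] = src[row][col]
    return out

def psrb_output_size(lr_size, fem_stride=2, fum_scale=4):
    """Spatial size after the FEM (downsample by 2) and the FUM (upsample by 4)."""
    return lr_size // fem_stride * fum_scale

# Four 1x1 channels become one 2x2 channel.
shuffled = pixel_shuffle([[[1]], [[2]], [[3]], [[4]]], 2)
# A 512-pixel LR side length maps to 1024 pixels, i.e. a x2 SR task.
sr_side = psrb_output_size(512)
```

The second helper only restates the arithmetic in the text: halving in the FEM followed by ×4 upsampling in the FUM gives a net ×2 enlargement.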

C. Embedded Feature Map Superresolution Module
Previous studies [22], [58] have demonstrated that incorporating a PSRB into the original architecture can enhance the performance of general OD tasks and semantic segmentation tasks when using $I^{\text{LR}}$ as input. However, in our experiments (see Section IV-C), we found that adding the PSRB to the architecture for ROD tasks in aerial images resulted in only minor performance improvements.
The limited performance gains from adding the PSRB to the original architecture can be attributed to two factors. First, ROD requires more accurate RBB annotation, so effective information interaction between the two task models is needed to avoid interference from background information. Second, because the $I^{\text{LR}}$ input is half the size of $I^{\text{HR}}$, the neck feature maps are correspondingly smaller, which makes it harder for the model to pay attention to small objects and greatly reduces the performance of the rotated detection model. Hence, we propose a lightweight ESRM to improve the performance of ROD tasks in aerial images when using $I^{\text{LR}}$ as the input image. We embed our ESRM after the C2, C3, and C4 outputs of the detector's backbone, using lightweight large-kernel depthwise convolutions and 3 × 3 convolutions to build the basic ResBlocks, following the approach used in [9]. Our ESRM doubles the output feature map size of the backbone, so that the feature maps have the same size as those obtained with $I^{\text{HR}}$ as the model input. See Fig. 6 for our ESRM structure and Fig. 4 for the detailed composition of the ResBlock.
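The size argument above can be checked with a short Python sketch. The backbone strides of 8, 16, and 32 for C2, C3, and C4 are an assumption made only for this illustration (the text does not state them), and `feature_sizes` is a hypothetical helper.

```python
def feature_sizes(input_size, strides=(8, 16, 32), esrm_scale=1):
    """Side lengths of the C2, C3, C4 feature maps for a square input.

    strides: assumed backbone strides, not taken from the text.
    esrm_scale: 2 when the ESRM doubles each feature map, 1 without it.
    """
    return [input_size // s * esrm_scale for s in strides]

hr_sizes = feature_sizes(1024)                    # detector fed with the HR image
lr_sizes = feature_sizes(512)                     # detector fed with the LR image
lr_esrm_sizes = feature_sizes(512, esrm_scale=2)  # LR image plus ESRM
```

Under these assumptions, the LR input with the ESRM yields the same feature map sizes as the HR input, while the backbone itself still runs on the smaller image.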

D. Feature Alignment Loss and FAL
We design an additional FAL, as shown in Fig. 4, to enhance the information interaction between the two different task models. First, we downsample the output feature map of PSRB's FDM four times and process it through the FAL. Next, we calculate our proposed feature alignment loss (L AL ) between the output feature of our ESRM on the C2 backbone and the FAL to minimize the similarity difference between these two output features. This enables us to jointly optimize the two types of tasks so that effective information can be exchanged between these two types of models, thus optimizing the overall architecture.
Our proposed feature alignment loss uses the normalized Gram matrix to calculate the internal structure similarity of the feature map, as shown in (5). Specifically, for any feature map $F \in \mathbb{R}^{C \times H \times W}$, we can flatten its spatial dimensions to obtain a feature map $F \in \mathbb{R}^{C \times HW}$. This $F$ can be represented by its row vectors $f_i \in \mathbb{R}^{1 \times HW}$, $i = 1, \ldots, C$, as $F = [f_1; f_2; \ldots; f_C]$. We use the Gram matrix (2), whose entries are the inner products between pairs of row vectors, to calculate the similarity between the row vectors. However, there may be numerical issues when using the Gram matrix directly. Therefore, we first normalize each $f_i$ by its L2 norm, $\text{norm\_f}_i = f_i / \|f_i\|_2$, before calculating the Gram matrix. We actually use the normalized Gram matrix, as shown in (4), to address the numerical problems. We represent the matrix composed of the normalized row vectors as $\text{norm\_F} = [\text{norm\_f}_1; \text{norm\_f}_2; \ldots; \text{norm\_f}_C]$. A more direct calculation method of the normalized Gram matrix is shown in (5).
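As an illustration of the normalized Gram matrix, the following plain-Python sketch flattens a C×H×W feature map, L2-normalizes each channel, and takes pairwise inner products, so that every entry is a cosine similarity. This is a sketch of the general technique under the assumption that the normalization is a per-channel L2 normalization; `normalized_gram` is a hypothetical name.

```python
import math

def normalized_gram(feature):
    """feature: nested list [C][H][W] -> [C][C] matrix of cosine similarities."""
    rows = [[v for line in channel for v in line] for channel in feature]  # C x HW
    unit = []
    for f in rows:
        norm = math.sqrt(sum(v * v for v in f))
        # An all-zero channel is left as zeros to avoid division by zero.
        unit.append([v / norm for v in f] if norm > 0 else list(f))
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in unit] for fi in unit]

feat = [[[3.0, 4.0]], [[4.0, 3.0]]]      # C = 2, H = 1, W = 2
gram = normalized_gram(feat)
# Rescaling the features leaves the normalized Gram matrix unchanged.
scaled = [[[v * 1000.0 for v in line] for line in ch] for ch in feat]
gram_scaled = normalized_gram(scaled)
```

Because the entries are bounded by 1 in absolute value regardless of the feature magnitude, the numerical range problem of the raw Gram matrix does not arise.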
We use the normalized Gram matrix to calculate the similarity between different feature maps. Specifically, we measure the difference in structural relations between two input feature maps using a weighted Euclidean distance, which we define as our proposed feature alignment loss $L_{\text{AL}}$, as shown in (6), where $\odot$ represents elementwise multiplication and $W_{\text{roi}}$ denotes our proposed RoI weights. $F_1$ and $F_2$ represent the two input feature maps that need to be aligned with each other.
We want to improve the attention of the PSRB to the image RoI while enhancing the low-level structural features in the corresponding region of the detection feature. To achieve this, we generate RoI weights using the classification branch of the detection model. The detailed calculation method for generating RoI weights is provided in Algorithm 1, where the variable cls_scores corresponds to the output of the detector's classification branch. The hyperparameter $\alpha$ serves as the weight ratio between the object-containing region and the background area. We set the default value of $\alpha$ to 5. Our experiments, depicted in Fig. 8, indicate that our proposed model is not highly sensitive to this hyperparameter. Our proposed weights $W_{\text{roi}}$ are used as the weighting coefficients for $L_{\text{SR}}$ [see (8)] and $L_{\text{AL}}$ [see (6)] in our model. In this way, the results of the detection model can influence the SR model, and through $W_{\text{roi}}$ they form a closed loop in our overall architecture.
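The RoI weight generation of Algorithm 1 can be sketched in plain Python on nested lists. Several details are assumptions made for this illustration and are not confirmed by the text: the head levels are indexed from 1, interpolation is nearest-neighbour, and the population standard deviation is used. The names `roi_weights`, `upsample_nearest`, and `downsample_nearest` are hypothetical.

```python
from statistics import mean, pstdev

def upsample_nearest(mask, factor):
    """Nearest-neighbour upsampling of a 2-D list by an integer factor."""
    out = []
    for row in mask:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out

def downsample_nearest(mask, factor):
    """Keep every `factor`-th sample along both axes."""
    return [row[::factor] for row in mask[::factor]]

def roi_weights(cls_scores, alpha=5.0):
    """cls_scores: one [num_classes][H][W] score map per head level.

    Level i (1-based, an assumption) is upsampled by 2**(i + 1) so that all
    masks share one resolution before they are merged.
    """
    merged = None
    for i, level in enumerate(cls_scores, start=1):
        h, w = len(level[0]), len(level[0][0])
        # Highest class score at every spatial location.
        max_score = [[max(c[y][x] for c in level) for x in range(w)] for y in range(h)]
        flat = [v for row in max_score for v in row]
        threshold = mean(flat) + pstdev(flat)
        mask = [[1.0 if v >= threshold else 0.0 for v in row] for row in max_score]
        mask = upsample_nearest(mask, 2 ** (i + 1))
        if merged is None:
            merged = mask
        else:
            # Logical OR of binary masks.
            merged = [[max(a, b) for a, b in zip(r1, r2)] for r1, r2 in zip(merged, mask)]
    region = downsample_nearest(merged, 4)
    # Object regions get weight alpha, background gets weight 1.
    return [[v * (alpha - 1.0) + 1.0 for v in row] for row in region]

level1 = [[[0.9, 0.1, 0.1, 0.1]] + [[0.1] * 4 for _ in range(3)]]  # 1 class, 4x4
level2 = [[[0.1, 0.1], [0.1, 0.9]]]                                 # 1 class, 2x2
weights = roi_weights([level1, level2], alpha=5.0)
```

With the default $\alpha = 5$, locations flagged by any head level receive five times the loss weight of the background.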

E. Overall Architecture and Optimization
The overall architecture of our proposed method is shown in Fig. 4, and the end-to-end training pipeline is given in Algorithm 2.
Using this end-to-end training pipeline, the PSRB and rRTMDet models can be trained jointly. Our method does not require training a generator and a discriminator separately, as the GAN models in the literature [14], [17], [18] do, nor does it require training the different task networks independently. As a result, our method is simpler and easier to deploy. The detection loss $L_{\text{Det}}$, SR loss $L_{\text{SR}}$, and total loss $L_{\text{Total}}$ of our method are calculated as in (7)-(9), respectively, where $N$ indicates the number of positive samples in the rRTMDet head, $i$ is the index of a positive sample in a batch, and $p_i$ and $b_i$ are the predicted object category and the decoded bounding box in the head. $l_i$ represents the ground-truth category of the $i$th object, and $g_i$ is the ground-truth bounding box. Following the RTMDet default setting, we employ the quality focal loss [59] as $L_{\text{cls}}$ and the rotated IoU loss [60] as $L_{\text{reg}}$. We also follow the common practice in general SR tasks of using the L2 loss in $L_{\text{SR}}$, with our proposed $W_{\text{roi}}$ as the loss weights. In addition, $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ are loss balance parameters, which we set to {1, 2, 1, 10} by default.

Algorithm 1: Generation of the RoI weights W_roi
  Input: classification branch outputs cls_scores, hyperparameter α
  For the classification output of each head level i:
    max_score ← max(cls_scores, dim = 1)
    mean ← mean(max_score)
    std ← std(max_score)
    mask ← float(max_score ≥ mean + std)
    mask ← interpolate(mask, scale_factor = 2^(i+1))
    masks ← append(mask)
  masks ← logical_or(masks)
  attention_region ← interpolate(masks, scale_factor = 1/4)
  W_roi = attention_region × (α − 1) + 1
  return W_roi
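For orientation, one way to write the three losses described in this subsection is sketched below in LaTeX. This is an assumed form pieced together from the surrounding definitions, not a reproduction of the authors' equations: the assignment of $\lambda_1, \ldots, \lambda_4$ to the individual terms, the averaging in $L_{\text{SR}}$, and the resizing of $W_{\text{roi}}$ to the image resolution are all assumptions, and $L_{\text{AL}}$ here stands for the sum of the alignment terms used in training.

```latex
\begin{align}
L_{\text{Det}} &= \frac{1}{N} \sum_{i} \Big[ \lambda_1 L_{\text{cls}}(p_i, l_i)
                 + \lambda_2 L_{\text{reg}}(b_i, g_i) \Big] \\
L_{\text{SR}} &= \operatorname{mean}\Big( W_{\text{roi}} \odot
                 \big( I^{\text{SR}} - I^{\text{HR}} \big)^{2} \Big) \\
L_{\text{Total}} &= L_{\text{Det}} + \lambda_3 L_{\text{SR}} + \lambda_4 L_{\text{AL}}
\end{align}
```

Under this reading, the default setting {1, 2, 1, 10} would weight classification and regression by 1 and 2 and give the alignment term ten times the weight of the SR term.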

IV. EXPERIMENTS
Our method was evaluated on two challenging aerial ROD datasets, i.e., DOTA and UCAS-AOD.

A. Datasets
DOTA [3] is a large-scale aerial OD dataset consisting of 2806 aerial images ranging from 800×800 to 4000×4000 pixels and containing a total of 188 282 instances of 15 common object categories: planes (PL), baseball diamonds (BD), bridges (BR), ground track fields (GTF), small vehicles (SV), large vehicles (LV), ships (SH), tennis courts (TC), basketball courts (BC), storage tanks (ST), soccer-ball fields (SBF), roundabouts (RA), harbors (HA), swimming pools (SP), and helicopters (HC). Both the training and validation sets are used for training, while the test set is used for testing. In accordance with [6], we extract a series of 1024×1024 patches with a 200-pixel overlap from the original images to create our HR ($I^{\text{HR}}$) dataset for experimentation. We then use bicubic interpolation to downsample $I^{\text{HR}}$ by a factor of 2, yielding 512×512 resolution-degraded images ($I^{\text{LR}}$). The distribution of small, medium, and large objects on the DOTA dataset after the image degradation process is shown in Fig. 2: the number of small objects increased by 45.2% after ×2 image resolution degradation, while the number of large objects decreased by 80.9%. This change in object distribution significantly affects the performance of the baseline model, as demonstrated in Table I.
UCAS-AOD [61] is an aerial image dataset designed for rotated SOD, which contains 1510 images, including 510 car images and 1000 plane images, with a total of 14 596 instances. As is customary, we randomly divided it into a training set, a validation set, and a test set with a ratio of 5:2:3.
To experiment with the UCAS-AOD dataset, we resized all images to 836×836 to obtain HR ($I^{\text{HR}}$) images and used the same method as in the DOTA experiments to obtain the corresponding resolution-degraded images ($I^{\text{LR}}$) with a size of 416×416. Fig. 3 shows the change in the number of objects of different sizes in the UCAS-AOD dataset after downsampling. The analysis reveals that the number of small objects increased by 62.4% after the typical ×2 resolution degradation. Because detecting small objects accurately is more challenging than detecting medium and large objects, the detection accuracy of the baseline model applied directly to $I^{\text{LR}}$ decreased significantly.

Algorithm 2 (excerpt):
    Input C2, C3, C4 into the Neck to obtain the P2, P3, and P4 feature maps;
    Input P2, P3, P4 into the Head to obtain $O_{\text{Det}}$ and the output cls_scores of each classification branch;
    Step 2: PSRB forward:
    Input $I^{\text{LR}}$ into FEM to obtain the SR encoder feature map $E_{\text{SR}}$;
    Input $E_{\text{SR}}$ into FDM to obtain the SR decoder residual feature map $R_{\text{SR}}$;
    Input $R_{\text{SR}}$ into FUM to obtain $I^{\text{SR}}$;
    Step 3: Joint optimization:
    Input $R_{\text{SR}}$ into FAL to obtain the SR affinity residual feature map $A_{\text{SR}}$;
    Downsample $A_{\text{SR}}$ to the feature map size of $C2_{\text{RF}}$, named $A_{\text{SR}}$;
    Input cls_scores into Algorithm 1 to obtain the attention region weights $W_{\text{roi}}$;
    Use $W_{\text{roi}}$ as the weights in the alignment loss $L_{\text{AL}}$ and the SR loss $L_{\text{SR}}$;
    Calculate $L_{\text{AL}}$ between $A_{\text{SR}}$ and $C2_{\text{RF}}$, and $L_{\text{AL}}$ between $E_{\text{SR}}$ and C0;
    Calculate $L_{\text{SR}}$ and the detection loss $L_{\text{Det}}$;
    Through $L_{\text{Total}}$, use an arbitrary optimizer to jointly optimize our model.

B. Implementation Details
We followed the experimental configuration of RTMDet, using CSPNetXt [9] as the backbone and CSPNetXt-PAFPN as the neck for our ESRTMDet. For fair comparisons with other methods, we used CSPNetXt-L and CSPNetXt-X as backbones,

C. Ablation Studies
In this section, we conduct a series of experiments on the DOTA dataset to verify the effectiveness of our proposed method. All ablation experiments are performed using single-scale training and testing.
Evaluation of baseline performance on $I^{\text{LR}}$: To demonstrate the impact of resolution degradation, we conducted experiments on the DOTA dataset using the rRTMDet series models trained and tested directly on $I^{\text{LR}}$. The detection performance of each model size is presented in Table I. We observe that the loss in detection performance grows as the model size decreases. The smallest model, rRTMDet-tiny, shows the greatest decrease, with a reduction of up to 6.43% mAP, while the largest model, rRTMDet-X, shows a reduction of only 1.53% mAP. We attribute this to large models having more channels and parameters, which makes it easier for them to identify small-object features than compact models. Overall, the performance of all models decreases significantly on $I^{\text{LR}}$ compared with that on $I^{\text{HR}}$. We believe that this decrease is primarily attributable to the abundance of small objects in the resolution-degraded images, as demonstrated in Fig. 2. In addition, we note that, compared with detection on $I^{\text{HR}}$, direct detection on $I^{\text{LR}}$ has lower computational complexity and a significantly higher inference speed, as shown in Table X. According to our analysis, this acceleration comes from the smaller input image, because the number of model parameters has not been reduced. Thus, it is important to investigate methods to improve the detection performance on $I^{\text{LR}}$. We choose these models as our baselines and use the rRTMDet-tiny model for subsequent ablation experiments.

TABLE III. Ablation experiment for the insertion position of the feature upsampling method. Our rRTMDet-tiny + Bicubic means using the classical bicubic method as the feature upsampling method. After Backbone means the upsampling method is embedded behind the backbone and before the neck. After Neck indicates the upsampling method is inserted after the neck and before the head.
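The observation in the baseline evaluation that a smaller input speeds up inference without reducing the parameter count can be checked on a single convolution layer. The sketch below uses a hypothetical 3×3 stem convolution (3 to 32 channels, stride 2), which is not taken from the rRTMDet configuration; `conv2d_cost` is an illustrative helper.

```python
def conv2d_cost(in_ch, out_ch, kernel, in_h, in_w, stride=1):
    """Return (params, macs) for a square-kernel convolution.

    Assumes 'same'-style padding and no bias term.
    """
    out_h = -(-in_h // stride)   # ceiling division
    out_w = -(-in_w // stride)
    params = kernel * kernel * in_ch * out_ch   # independent of input size
    macs = params * out_h * out_w               # one kernel pass per output pixel
    return params, macs

params_hr, macs_hr = conv2d_cost(3, 32, 3, 1024, 1024, stride=2)
params_lr, macs_lr = conv2d_cost(3, 32, 3, 512, 512, stride=2)
print(params_hr == params_lr)   # parameters are unchanged
print(macs_hr / macs_lr)        # MACs scale with the number of pixels
```

Halving the input side length leaves the weights untouched but cuts the multiply-accumulate count of this layer by a factor of four, which is the source of the speed-up when detecting directly on the degraded image.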
Evaluation on PSRB: We first attempted to enhance the original rRTMDet by directly adding a PSRB, using EDSR as the SR network. Rather than using our proposed FEM, we adopted the method from [22] to merge the C2 and C4 feature maps as the input of the PSRB, and we also experimented with various other backbone output feature maps as input. The results are reported in Table II. We found that using high-level feature maps or combining multiple feature maps sometimes reduced performance, which we attribute to the absence of low-level information in high-level feature maps and to the excessive SR scale factor this demands of the EDSR. For example, C2 corresponds to a ×8 downsampling of the input image, and our EDSR must produce a ×2 SR output relative to the input image size, which necessitates a ×16 upsampling in the FUM. Higher-level feature maps therefore correspond to larger SR scale factors. Consequently, in later experiments we used only the C0 feature map in the PSRB. This achieves comparable performance improvements while avoiding excessively high SR scale factors in the EDSR.
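The ×16 figure for C2 follows directly from the feature stride and the ×2 SR target. The small helper below reproduces that arithmetic for the 512×512 input used in this work; `required_upsampling` is a hypothetical name, and only the stride of C2 (×8) is taken from the text.

```python
def required_upsampling(input_size, feature_stride, sr_scale=2):
    """Upsampling factor needed to go from a feature map to the SR image."""
    feature_size = input_size // feature_stride   # spatial size of the feature map
    target_size = input_size * sr_scale           # spatial size of the SR output
    return target_size // feature_size

# C2 is a x8 downsampling of the 512x512 input: 64 -> 1024 requires x16.
print(required_upsampling(512, feature_stride=8))
```

The factor equals the feature stride times the SR scale, so every additional stride-2 stage in the backbone doubles the upsampling the SR branch must perform.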
In our experiments, we first used 16 stacked ResBlocks (EDSR-ResBlock), as in the original EDSR, but found that this increased the training time by approximately three times. To overcome this challenge, we reduced the number of stacked EDSR-ResBlocks. The experimental results are presented in Fig. 7. We found that reducing the number of stacked EDSR-ResBlocks did not harm detection performance; in fact, it improved it. We believe that excessive stacking of EDSR-ResBlocks increases the number of EDSR parameters, so that more training time is required to reach convergence. In addition, the feature maps are much larger than the typical 64 × 64 size used in general SR tasks. The model therefore needs to learn more features to complete the SR task, which in turn requires more training iterations. Hence, when trained for the same number of epochs as the detection model, an SR network with fewer stacked EDSR-ResBlocks may achieve better performance. As a result, we employed a structure with only four EDSR-ResBlocks in our subsequent experiments.

Evaluation on ESRM: The experiments in Table II show that adding only a PSRB to the baseline has a limited impact on detection performance. We believe that simply adding a parallel SR network and relying on backpropagation to teach the backbone to enhance its features is insufficient to improve SOD. We note that, with $I^{\text{LR}}$ inputs, the feature map size is halved compared with $I^{\text{HR}}$ inputs. In previous methods [17], [18], [19], [50], $I^{\text{LR}}$ images were enlarged to the size of $I^{\text{HR}}$ and then processed by the detection network, so that the feature maps in the network have the same size for $I^{\text{LR}}$ and $I^{\text{HR}}$. However, this requires the SR network to participate in the inference stage, making it challenging to achieve real-time inference.
Therefore, we adopt a more intuitive scheme: directly upsampling the feature maps of the model instead of enlarging the image. Our experiments demonstrate that enlarging the feature maps output by the backbone yields larger performance improvements than enlarging the feature maps output by the neck, as shown in Table III. We attribute this to the fact that the backbone output retains more low-level features, and the neck and head together are better suited to learning small-object features from the upsampled feature maps. In contrast, it is more challenging to extract small-object features using only the head. Therefore, in our subsequent experiments, we embed the feature map upsampling method behind the backbone and before the neck.
The abovementioned experiment demonstrates that directly enlarging the feature maps can effectively amplify the characteristics of small objects and allow the model to focus more on them. We then tested several upsampling methods, as shown in Table IV, including bicubic interpolation, the deconvolution [62] method commonly used in semantic segmentation, the subpixel convolution [57] method mainly applied to SR tasks, and our proposed method (see Section III-C). Our proposed ESRM demonstrated the best performance, indicating that it has the strongest feature map upsampling effect. We attribute the improvement mainly to the fact that ESRM adds a lightweight residual structure on top of the subpixel convolution method. This modification gives the module an architecture consistent with the EDSR while maintaining our lightweight design. As a result, the ESRM can be easily inserted into detection models without significantly increasing the computational burden, as Table X shows, and it demonstrates powerful SR performance, making it an effective feature map upsampling method for improving SOD. Accordingly, in subsequent experiments, we used ESRM as the feature map upsampling method.
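Subpixel convolution [57], on which ESRM builds, upsamples by rearranging channels into space after an ordinary convolution. The sketch below shows only that rearrangement step in NumPy; the preceding convolution and the residual structure of ESRM are omitted, the function name `pixel_shuffle` is illustrative, and the channel ordering follows the common convention in which output pixel (h·r + i, w·r + j) of channel c is read from input channel c·r² + i·r + j.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C * r**2, H, W) array into (C, H * r, W * r)."""
    channels, height, width = x.shape
    assert channels % (r * r) == 0, "channel count must be divisible by r**2"
    out_ch = channels // (r * r)
    x = x.reshape(out_ch, r, r, height, width)   # split channels into (c, i, j)
    x = x.transpose(0, 3, 1, 4, 2)               # reorder to (c, h, i, w, j)
    return x.reshape(out_ch, height * r, width * r)

# Four 1x1 channels become one 2x2 map.
features = np.arange(4).reshape(4, 1, 1)
print(pixel_shuffle(features, 2))
```

Because the step is a pure reshape, its cost is negligible compared with the convolutions around it, which is consistent with the lightweight design goal described above.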
Evaluation on FAL and feature alignment loss: Table V shows that directly combining PSRB and ESRM results in a detection performance of 71.69% mAP. In addition, we propose a feature alignment loss $L_{\text{AL}}$ in Section III-D to allow more effective information interaction and joint optimization of the two networks. However, we found that directly using the backbone C0 feature map as the input of the PSRB limits the flexibility of SR task learning. We attribute this to the fact that the features extracted by the lightweight backbone lean toward the high-level features needed for detection, so its ability to extract low-level features for the SR task is limited. To address this, we added FEM to the PSRB and combined it with $L_{\text{AL}}$, resulting in improved performance, as shown in Table V. To further reduce the training instability caused by the discrepancy in feature distribution between rRTMDet and PSRB, we appended a FAL to the output feature map of the PSRB before applying our proposed feature alignment loss. This FAL is a 1×1 convolution layer, as shown in Fig. 4. Combining all of these improvements resulted in a 3.75% mAP improvement over the baseline.
In addition, we conducted ablation experiments on the hyperparameter α, which is used in Algorithm 1 to calculate $W_{\text{roi}}$ for $L_{\text{AL}}$, as shown in Fig. 8. The experimental results demonstrate that the impact of different values of α is negligible as long as α is greater than 1. When α is set to 1, $W_{\text{roi}}$ becomes an all-ones matrix, so the foreground and background regions receive equal weight and the weighting has no effect.
Visualization and qualitative analysis: From the perspective of qualitative analysis, we visualize the feature maps learned by ESRTMDet-X in Figs. 9 and 10. Fig. 9 illustrates that our proposed $L_{\text{AL}}$ and FAL can effectively restore more low-level high-frequency information to the upsampled feature map C2' [...] Tables VI and VIII.

TABLE X. ESRTMDet series model and rRTMDet series model parameters, MACs, FLOPs, FPS, and mAP comparison. The mAP is the result on DOTA with single-scale training and testing. MACs, FLOPs, and FPS are the results of inference with 512×512 image size under a single [...]

D. Comparison With SOTA
In this section, we compare our proposed ESRTMDet with other SOTA methods on two challenging aerial detection datasets, i.e., DOTA and UCAS-AOD.
Results on DOTA: In Table VI, we compare the performance of our ESRTMDet series with other SOTA SR+OD methods on DOTA task 1 (i.e., the rotated detection task). As the test set annotations for DOTA are not available, we use the evaluation metrics of the COCO dataset to assess the detection accuracy of small ($AP_S$), medium ($AP_M$), and large ($AP_L$) objects on the DOTA validation dataset. Our ESRTMDet-X model achieves the highest accuracy in detecting small objects, and the $AP_S$ of each of our ESRTMDet series models surpasses that of its corresponding baseline model. These results confirm the effectiveness of our proposed method.
In addition, we achieved a new SOTA performance of 77.11% mAP with our ESRTMDet-X model. This performance is comparable to that of many well-known anchor-based two-stage and one-stage methods on the original DOTA. As shown in Table VII, our method significantly improves the detection accuracy of small objects, especially in the SV, LV, SH, and HC categories. Qualitative detection results of our proposed ESRTMDet are shown in Fig. 11.

Fig. 11. Some detection results of our proposed ESRTMDet-X with single-scale training and testing on the DOTA $I^{\text{LR}}$ dataset. The confidence threshold is set to 0.05 when visualizing these results, and one color stands for one object class.
Results on UCAS-AOD: The UCAS-AOD dataset contains a large number of small objects, which are often overwhelmed by complex surrounding scenes in aerial images. To comprehensively compare our method with other SOTA SR+OD methods, we use the VOC2007 and VOC2012 metrics to evaluate detection performance. In addition, as in the DOTA experiments, we used the COCO evaluation metrics on the UCAS-AOD test dataset to obtain the detection accuracy of small ($AP_S$), medium ($AP_M$), and large ($AP_L$) objects. As shown in Table VIII, our ESRTMDet-X model outperforms other methods with mAP values of 95.0% and 89.5% under the VOC2012 and VOC2007 metrics, respectively. These results demonstrate the superiority of our proposed method, particularly on the $AP_S$ and $AP_M$ metrics. The detection accuracy of each category on UCAS-AOD is presented in Table IX. We also visualize the results of vehicle and airplane detection in Fig. 12.
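The two VOC metrics mentioned above differ in how average precision is computed from the precision-recall curve, which is why the same detector receives two different mAP values. The sketch below contrasts the commonly used definitions: 11-point interpolation for VOC2007 and the area under the monotone precision envelope for VOC2012. The function names and the toy curve are illustrative; the exact evaluation code used in the experiments is not specified in this section.

```python
def voc07_ap(recall, precision):
    """11-point interpolated AP; `recall` must be cumulative and ascending."""
    ap = 0.0
    for k in range(11):
        t = k / 10
        candidates = [p for r, p in zip(recall, precision) if r >= t]
        ap += (max(candidates) if candidates else 0.0) / 11
    return ap

def voc12_ap(recall, precision):
    """Area under the monotonically decreasing precision envelope."""
    mrec = [0.0] + list(recall) + [1.0]
    mpre = [0.0] + list(precision) + [0.0]
    for i in range(len(mpre) - 2, -1, -1):      # enforce a decreasing envelope
        mpre[i] = max(mpre[i], mpre[i + 1])
    return sum((mrec[i + 1] - mrec[i]) * mpre[i + 1]
               for i in range(len(mrec) - 1) if mrec[i + 1] != mrec[i])

# Toy curve: two detections, the first correct at recall 0.5, then precision 0.5.
recall, precision = [0.5, 1.0], [1.0, 0.5]
print(round(voc07_ap(recall, precision), 4), voc12_ap(recall, precision))
```

On this toy curve the 11-point metric gives about 0.7727 and the area-based metric gives 0.75, so the two numbers are not directly interchangeable when comparing methods.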

V. DISCUSSION
We chose the most advanced ROD model, rRTMDet, as our baseline, which can achieve satisfactory results on degraded images without using any SR enhancement. The baseline outperforms several SR+OD methods based on Faster R-CNN, as Tables VI and VIII show. Combining the rRTMDet baseline with our proposed SR enhancement method resulted in a performance close to that obtained by directly using $I^{\text{HR}}$ as input. According to Tables VII and IX, all models achieved a performance improvement of about 2% mAP over their corresponding baseline models. The improvement for small objects was more significant, confirming the efficacy of our proposed SR enhancement method.
Using $I^{\text{LR}}$ as the input image allowed our ESRTMDet to achieve faster inference (quantified by FPS), a smaller computational burden (quantified by FLOPs), and lower computational complexity (quantified by MACs) than the baseline that used $I^{\text{HR}}$ as input, as shown in Table X. Our models add only a small number of additional parameters (quantified by Params) but achieve better detection performance (77.11% mAP) and retain an impressive inference speed (330+ FPS). Our method's inference speed significantly exceeds that of many model pruning and compression methods executed directly on $I^{\text{HR}}$. Thanks to the lightweight design of the ESRM and the fact that the PSRB does not participate in model inference, our method adds minimal parameters and computational burden, as shown in Table X. These results confirm the efficiency of our method. Our proposed method achieves real-time inference on our device, making it suitable for deployment on actual drone platforms in the future.

From the perspective of macromodel architecture, our method can be regarded as a form of knowledge distillation (KD) between heterogeneous tasks, which differs from the traditional distillation approach. The traditional KD method [63] aims to train a compact model using a large model's knowledge so that the detection performance of the small model is comparable to that of the large model. In contrast, our objective is to leverage an SR model's knowledge to enhance the OD model's performance, and these are two distinct tasks. We employ an end-to-end joint optimization training pipeline and do not require a pretrained SR model. Instead, we allow the SR model to learn useful information flexibly through our training process (Algorithm 2) and optimize it in conjunction with the detection model.
In this study, we investigate only the most representative case, ×2 resolution degradation. However, more severe resolution degradation can be addressed by increasing the upsampling factor in our ESRM and the SR scale factor in our PSRB. Nonetheless, we have yet to explore the impact of more severe blur and irregular noise, which we intend to study in future work.
In future research, we plan to explore the combination with traditional KD methods, jointly using a large detection model and the SR model to enhance the detection ability of a compact model. In addition, we will consider multitask optimization and adopt a more appropriate method to optimize these heterogeneous tasks simultaneously. Furthermore, inspired by UIU-Net [51], we can convert the problem of rotated SOD into one of semantic segmentation. This conversion would enable us to use the most advanced foundation models for computer vision, such as the segment anything model [64], to solve the rotated SOD problem effectively. This is also a promising direction for our future research.

VI. CONCLUSION
In this article, we propose ESRTMDet, an end-to-end real-time object detector for degraded aerial images that incorporates SR techniques. We enhance the baseline using the PSRB and ESRM modules and employ the feature alignment loss and FAL to enable interaction between the different tasks. We extensively evaluate our method on two challenging aerial OD benchmarks. Our ESRTMDet-X model achieves 77.11% mAP at 337 FPS, which not only outperforms other SR+OD methods in detection accuracy but also achieves the best inference speed.
In future work, we plan to enhance the detection performance of compact models, such as ESRTMDet-tiny or ESRTMDet-S, by combining them with traditional KD methods, with the goal of deploying these models in actual UAV systems. In addition, we will investigate the impact of more severe aerial image degradation to further improve the robustness of our model.