Enhancing UAV Aerial Image Analysis: Integrating Advanced SAHI Techniques With Real-Time Detection Models on the VisDrone Dataset

This research presents a groundbreaking approach in aerial image analysis by integrating the Real-Time Detection Transformer (RT-DETR-X) model with the Slicing Aided Hyper Inference (SAHI) methodology, utilizing the VisDrone-DET dataset. Aimed at enhancing the efficiency of drone technology across a spectrum of applications, including water conservancy, geological exploration, and military operations, this study focuses on harnessing the real-time, end-to-end object detection capabilities of RT-DETR-X. Characterized by its high-speed and high-accuracy performance, particularly in UAV aerial photography, RT-DETR-X demonstrates a remarkable 54.8% Average Precision (AP) and 74 frames per second (FPS), surpassing similar models in both speed and accuracy. The research thoroughly examines the VisDrone-DET dataset, which encompasses a diverse range of small targets in UAV aerial photography scenes. Covering 10 distinct categories, the dataset provides a robust platform for rigorous model testing. The study emphasizes the utilization of the original image dataset for comprehensive training and evaluation, alongside the practical implementation of the SAHI method for enhanced detection of small-scale objects. Through an in-depth exploration of the model's performance in various scenarios and a detailed analysis of the environmental setup, this paper underscores the impact of integrating RT-DETR with the SAHI approach. The findings reveal significant progress in drone detection technologies, offering a holistic framework for effective and efficient aerial surveillance. The integration not only boosts the model's detection accuracy but also opens new avenues for advanced image analysis in UAV applications.


I. INTRODUCTION
The domain of aerial photography has undergone a transformative shift with the advent of Unmanned Aerial Vehicles (UAVs). Initially conceptualized for military reconnaissance, UAVs have transcended their traditional roles, emerging as vital instruments in various civilian and scientific applications. The proliferation of drone technology has been catalyzed by its capacity to capture high-resolution imagery, offering a new perspective in spatial data analysis. This paradigm shift is not just a technological leap but also a methodological one, where the focus has expanded from mere data collection to sophisticated data interpretation [1], [2]. The application of UAVs in aerial image analysis has become indispensable across multiple fields. In urban planning, UAVs aid in the design and management of smart city initiatives, providing crucial data for sustainable development [3]. In environmental monitoring, they offer invaluable insights for ecosystem assessment and wildlife conservation. Furthermore, the agility and versatility of UAVs have proved beneficial in disaster response and management, enabling rapid assessment of affected areas [4]. These diverse applications underscore the UAV's role as a multifaceted tool transcending traditional boundaries. However, UAV-based image analysis is not without its challenges. The primary concern lies in the accurate detection and recognition of objects from aerial images, which is often hindered by varying altitudes, angles, and motion blur. These factors contribute to inconsistencies in image quality, posing significant challenges for object detection algorithms. Moreover, the need for real-time processing and analysis in UAV operations demands algorithms that are not only accurate but also computationally efficient [5]. To address these challenges, this study explores the integration of the RT-DETR model with the Slicing Aided Hyper Inference (SAHI) methodology. RT-DETR-X is an innovative model designed for real-time, end-to-end object detection and is particularly adept at processing high-resolution images common in UAV applications. Its architecture allows for flexible adjustment of inference speeds, which is crucial in dynamic aerial surveillance scenarios [6]. Complementing this, SAHI provides an effective framework for processing large-scale images, employing slicing and puzzle techniques to enhance the detectability of small and densely packed objects [7]. This paper aims to critically assess the efficacy of the RT-DETR-X model coupled with the SAHI methodology in improving object detection in UAV-captured images, specifically focusing on the VisDrone Dataset. The objectives include evaluating the performance improvements in terms of accuracy and speed and comparing these with existing object detection models.

The associate editor coordinating the review of this manuscript and approving it for publication was Byung-Gyu Kim.
The following aspects make our contribution unique and of particular research interest:
• The research uniquely combines the RT-DETR-X model with SAHI methodology, introducing a novel approach for object detection in UAV imagery, particularly enhancing the precision in detecting small-scale objects.
• It advances real-time processing in UAV applications, achieving a notable balance of high speed (74 FPS) and accuracy (54.8% AP), and provides a comprehensive evaluation against established models like YOLO and Faster R-CNN, highlighting its superior performance.
• The paper details practical applications and deployment strategies of the RT-DETR-X model, showcasing methodological innovations through advanced machine learning techniques and attention mechanisms.
• This study bridges the gap between technological advancements and operational needs in UAV image analysis, setting a foundation for future research in the field, especially in terms of methodological and practical applications.
This paper is structured as follows: Section II reviews related literature, Section III details the methodology, Section IV presents the evaluation methods, and Section V concludes with insights and future research directions.

II. RELATED WORK
Aerial image analysis has experienced a transformative evolution, driven primarily by advancements in image processing and artificial intelligence. The advent of Convolutional Neural Networks (CNNs) has revolutionized image classification and object detection, with UAV applications emerging as a significant area of focus. Semantic Segmentation and Object Detection, leveraging the high-resolution imagery offered by UAVs, have become essential in tasks such as land cover classification and infrastructure monitoring. This shift from mere data collection to sophisticated data interpretation marks a significant milestone in the utilization of UAV technology [9], [10]. Recent innovations in UAV-based object detection have predominantly focused on adapting models to suit the constraints inherent to UAV systems. This includes considerations for limited computational resources and the necessity for real-time processing. The increasing reliance on lightweight neural network architectures and edge computing serves to address these requirements [11]. In parallel, advancements in 3D modeling and photogrammetry have significantly contributed to enhancing spatial analysis from aerial imagery, thereby improving object detection and mapping accuracy [12]. The role of transfer learning in aerial image analysis has been pivotal in reducing dependency on large, domain-specific annotated datasets. This approach has facilitated more versatile and adaptive models, capable of handling various aerial image complexities, such as variable object scales [13]. The RT-DETR-X model, recognized for its real-time object detection efficiency, has gained significant attention in UAV-based applications. Its ability to efficiently process complex aerial imagery positions it as a leading solution in this domain [14].
Complementing technological advancements, the SAHI methodology has been acknowledged for its capability to enhance small object detection in large aerial images. By segmenting images into smaller portions, SAHI augments the detection performance of models like RT-DETR-X, a technique particularly beneficial in high-resolution aerial imaging [15]. Despite these advancements, a gap persists in the combined application of these technologies in UAV-based aerial image analysis. Current research tends to focus on either the efficiency of models like RT-DETR-X or accuracy enhancements through methodologies like SAHI, but seldom their integration. This research aims to bridge this gap by integrating RT-DETR-X with SAHI, specifically targeting the diverse and challenging VisDrone Dataset. Such a fusion is anticipated to provide novel insights into their combined efficacy for object detection in UAV-captured images, addressing a critical need in the field [16], [17]. The related work further expands to include recent studies that highlight various aspects and advancements in UAV-based object detection. This includes D. Cazzato et al.'s survey on computer vision methods for 2D object detection from UAVs, emphasizing methodological adaptations specific to UAV object detection. F. Vanegas et al.'s study introduces hyperspectral sensing for agricultural surveillance using UAVs, offering a novel approach in pest surveillance. Samaras et al. review deep learning methodologies for counter-UAV applications, showcasing the application of advanced machine learning techniques in UAV systems [18], [19], [20]. Further related studies are reported in [24], [25], [26]. This comprehensive review of related work underscores the significance of the current research in the context of UAV-based aerial image analysis. A general overview of previous work is presented in Table 1 for clarity. Gu et al. [27] introduced EANTrack, an efficient attention network for visual tracking. This study utilizes Transformer encoders for feature encoding and fusion, providing key insights into enhancing object tracking accuracy and computational efficiency. The novel FAAM subnetwork incorporated into the tracking system plays a crucial role in improving performance [27]. Yuan et al. [28] proposed the ASTMT method for thermal infrared (TIR) target tracking. This research focuses on the challenges of occlusion and similarity interference in TIR target tracking and proposes a spatial-temporal memory network to effectively store scene information and decrease interference, thereby enhancing detection accuracy in complex scenarios [28]. Gu et al. presented RPformer, a robust parallel transformer for visual tracking in complex scenes. This work leverages a parallel Transformer network and features a tracking prediction network for robust visual tracking. The introduction of two fresh Transformer techniques enhances the tracking performance, especially in challenging environments [29]. In a related study, Gu et al. introduced Repformer, a robust shared-encoder dual-pipeline transformer for visual tracking. This study combines encoder capabilities and a dual-pipeline architecture to improve tracking accuracy and resilience across various conditions. The shared-encoder dual-pipeline Transformer architecture proposed in this paper addresses the challenges of poor tracking performance in complex scenes [30]. By integrating RT-DETR-X with SAHI, this research not only addresses existing gaps but also contributes novel insights into object detection in UAV-captured images, paving the way for further advancements in this rapidly evolving field.

III. METHODOLOGY AND WORKING
The methodology adopted in this research focuses on leveraging the RT-DETR-X model for enhanced object detection in UAV aerial photography, utilizing the VisDrone-DET dataset. This section outlines the key steps and strategies employed in preparing the dataset, selecting the appropriate model, and configuring the training environment for optimal performance. Figure 1 illustrates some sample images of the VisDrone-DET training dataset, showcasing the diversity and complexity of the aerial scenes that the RT-DETR-X model is trained on. These images represent a wide range of real-world scenarios, providing a comprehensive platform for evaluating the model's object detection capabilities.

A. BACKGROUND AND OBJECTIVES
In recent years, drone aerial photography technology has emerged as a transformative tool across various domains, such as water conservancy, geological exploration, and military operations. This technology's ability to capture high-resolution imagery from the air has opened new avenues for data analysis and application. Central to this project is the exploration of the RT-DETR-X model, a cutting-edge tool in the realm of object detection, applied to the analysis of the VisDrone-DET dataset. This dataset encompasses a wide array of images that are representative of typical UAV aerial photography scenarios. The primary objective of this study is to harness the potential of the RT-DETR-X model for real-time, end-to-end object detection, focusing on its application in UAV aerial photography. This model stands out for its capacity to balance high-speed processing with accurate object detection, a critical requirement in the dynamic and varied environments encountered in UAV operations. By training and evaluating the model on the VisDrone-DET dataset, this project aims to push the boundaries of UAV image analysis, enhancing the effectiveness of aerial surveillance and reconnaissance in multiple fields.

B. SELECTING MODEL, DATASET, TRAINING AND EVALUATION
The selection of RT-DETR-X for this study underscores a significant stride in the realm of real-time object detection, particularly in handling the complexities of aerial imagery from UAVs. Renowned for its flexibility and efficiency, RT-DETR-X stands out with its ability to dynamically adjust inference speeds using a range of decoder layers. This flexibility is crucial in drone technology, where operational demands can vary greatly. The ability of RT-DETR-X to adapt to different scenarios without extensive retraining is a testament to its advanced design and utility in diverse applications. The model's performance, boasting an average precision (AP) of 54.8% and a speed of 74 frames per second (FPS), clearly sets it apart from other models like YOLO detectors. Such impressive metrics indicate not only its precision but also its capability to process images swiftly, making it exceptionally suitable for aerial surveillance tasks. This blend of accuracy and speed positions RT-DETR-X as an ideal choice for advancing aerial analysis capabilities within drone technology.
The incorporation of RT-DETR-X in this project is a nod to the evolution of object detection technologies, paving the way for more efficient and sophisticated UAV-based applications. Central to the project's methodology is the VisDrone-DET dataset, tailored for UAV-based object detection. This dataset encompasses a wide array of object types and sizes, captured across various environmental settings, offering a robust and realistic platform for training and evaluating the model. The training environment was established on AI Studio with four GPUs, focusing on optimizing the model to detect a broad spectrum of objects under different conditions. A thorough assessment of the model's performance was carried out, analyzing metrics such as mAP accuracy and prediction speed. To contextualize the capabilities of RT-DETR-X, a comparative analysis was conducted with other models within its suite. This comparison included variations of the RT-DETR model, such as RT-DETR-R18, RT-DETR-R34, and others, evaluated based on their architectural designs, AP values, FPS, and computational efficiency. Such comparative assessments were crucial in establishing a benchmark for RT-DETR-X's performance in UAV aerial photography. In the deployment phase, the ONNX format was used for model deployment, ensuring compatibility and ease of integration into various practical applications. Complementing this, a Gradio interface was developed to provide an interactive and user-friendly demonstration of the model's drone detection capabilities. This comprehensive approach, encompassing model selection, dataset utilization, training, evaluation, and deployment, lays the groundwork for an in-depth assessment of RT-DETR-X's efficacy in UAV aerial photography. By juxtaposing its performance against other models and configurations, the study not only benchmarks RT-DETR-X's capabilities but also contributes valuable insights to the evolving field of aerial surveillance and remote sensing.

C. ENVIRONMENTAL PREPARATION
The VisDrone-DET dataset, specifically tailored for UAV aerial photography scenes, serves as the foundation for this study. It includes ten diverse categories such as pedestrians, bicycles, and various vehicle types, providing a comprehensive range of objects for detection. Organized in the COCO format, the dataset facilitates modern object detection frameworks and simplifies the training and evaluation process. It is pre-segmented into training, validation, and test sets, ensuring a systematic approach to model assessment. While the SAHI methodology is primarily utilized in precision-focused scenarios, this research employed it for operational demonstration purposes on the original image dataset. This strategy allows for a balanced evaluation of the model's performance in standard UAV image detection tasks, highlighting its effectiveness in identifying densely packed or small-scale objects. The VisDrone-DET dataset provides a rich collection of high-resolution UAV images. Key characteristics include a total of 6,471 images and 343,204 labeled boxes, with an average image resolution of 1002 × 1520 pixels. The dataset predominantly features small-sized targets, posing a unique challenge for object detection algorithms. The RT-DETR-X model was selected for its high accuracy and speed, making it well-suited for real-time detection in UAV scenarios. As shown in Figure 2, its ability to adjust inference speeds and its superior performance metrics make it an ideal choice for this study.
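The COCO-format organization mentioned above can be illustrated with a minimal sketch. The annotation contents below are a hypothetical two-box example showing only the standard `images`/`annotations`/`categories` layout; they are not real VisDrone-DET data.

```python
import json, io

# Hypothetical COCO-format annotation file (structure only, not real data).
coco_text = json.dumps({
    "images": [{"id": 1, "file_name": "0000001.jpg", "width": 1520, "height": 1002}],
    "annotations": [
        # COCO bbox convention: [x, y, width, height]
        {"id": 1, "image_id": 1, "category_id": 0, "bbox": [100, 200, 30, 45]},
        {"id": 2, "image_id": 1, "category_id": 3, "bbox": [400, 150, 80, 60]},
    ],
    "categories": [{"id": 0, "name": "pedestrian"}, {"id": 3, "name": "car"}],
})

def summarize(f):
    """Count images and labeled boxes, the two dataset statistics
    reported for VisDrone-DET (6,471 images; 343,204 boxes)."""
    data = json.load(f)
    return len(data["images"]), len(data["annotations"])

n_images, n_boxes = summarize(io.StringIO(coco_text))
print(n_images, n_boxes)  # → 1 2
```

Because the layout is standardized, the same parsing code works for the training, validation, and test splits alike.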

D. MODEL CONFIGURATION AND TRAINING
Training the RT-DETR-X model involved optimizing various hyperparameters to align with the specifics of the VisDrone-DET dataset. Conducted in a high-performance GPU environment, the training process was tailored to effectively manage the dataset's resolution and object variety. Regular evaluations and snapshot settings were incorporated to continuously monitor and enhance the model's performance. The training process of the RT-DETR-X model was meticulously designed to maximize its efficiency in processing the high-resolution images from the VisDrone-DET dataset. Utilizing a four-GPU setup, the model training was initiated with a learning rate of 0.0001 and a batch size of 4. This configuration was chosen to effectively balance data processing and computational power. The entire training duration spanned approximately 7 hours, ensuring adequate exposure of the model to the dataset. The training setup included the capability to resume training, allowing for adjustments in parameters or continuity in case of interruptions. This flexibility ensured that the model remained current with the best possible configuration for optimal performance. A visualization dashboard (use_vdl=True) was used for real-time monitoring of the training progress. Additionally, the model was set up for simultaneous training and evaluation.
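The resume capability can be sketched generically. The snapshot file name and JSON state layout below are hypothetical stand-ins for the framework's actual checkpoint mechanism, shown only to illustrate the idea of continuing an interrupted run with the study's hyperparameters.

```python
import json, os, tempfile

def save_checkpoint(path, state):
    """Persist training state so a run can continue after interruption."""
    with open(path, "w") as f:
        json.dump(state, f)

def resume_or_init(path, lr=0.0001, batch_size=4):
    """Resume from a saved snapshot if one exists; otherwise start fresh
    with the hyperparameters used in this study (lr 0.0001, batch size 4)."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0, "lr": lr, "batch_size": batch_size, "best_map": 0.0}

ckpt = os.path.join(tempfile.mkdtemp(), "snapshot.json")
state = resume_or_init(ckpt)      # no snapshot yet: fresh start
state["epoch"] = 10               # ... training proceeds ...
save_checkpoint(ckpt, state)
resumed = resume_or_init(ckpt)    # a later run picks up at epoch 10
print(resumed["epoch"])  # → 10
```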

E. TRAINING DATA ANALYSIS, MODEL RT-DETR-X SELECTION, AND DEPLOYMENT
The training of the RT-DETR-X model, chosen for its exemplary real-time, end-to-end object detection capabilities, was meticulously carried out using the VisDrone-DET dataset. This dataset, encompassing a vast collection of UAV aerial images, provided a diverse range of scenes quintessential for a comprehensive model training and evaluation process. It included 6,471 high-resolution images, averaging 1002 × 1520 pixels, and contained 343,204 labeled boxes, indicative of the substantial number of detectable objects, primarily smaller in size. The selection of RT-DETR-X was guided by its efficiency and adaptability, crucial for high-resolution and complex UAV imagery analysis. The model distinguished itself with its high accuracy and fast processing speed, achieving an average precision of 54.8% and 74 frames per second. This performance notably surpassed that of comparable models, demonstrating its suitability for real-time applications in varied UAV scenarios. Training the RT-DETR-X model involved a careful optimization of hyperparameters to align with the unique characteristics of the VisDrone-DET dataset, such as adjusting the learning rate and batch size according to available computational resources. This optimization process was conducted in a high-performance GPU environment, which was instrumental in efficiently managing the large volume of high-resolution data and the extensive number of labeled objects.
Throughout the training phase, the model's performance was continuously monitored and evaluated. Regular snapshots were taken to ensure a consistent assessment and improvement of the model. For practical deployment, the model was configured in the ONNX format, facilitating seamless integration into various systems and applications.
Additionally, a Gradio interface was developed to offer an interactive and user-friendly demonstration of the model's real-world capabilities in drone detection. This comprehensive approach, encompassing the strategic selection of the model, thorough dataset analysis, meticulous training environment setup, and innovative deployment strategies, laid a solid foundation for an in-depth evaluation of the RT-DETR-X model. It ensured that the model was finely tuned to meet the challenges posed by intricate aerial images and varied object sizes, thus enhancing its applicability in UAV-based aerial surveillance and image analysis.

IV. EVALUATION METHOD
In assessing the performance of the RT-DETR-X model for UAV aerial image analysis, two distinct evaluation methods were employed to address the challenge of small object detection. These methods were designed to provide a comprehensive understanding of the model's capabilities in detecting and recognizing objects of various sizes from high-resolution images.

A. DIRECT EVALUATION OF ORIGINAL IMAGES
The first method involved direct evaluation of the original images from the VisDrone-DET dataset. This approach was crucial in understanding the model's baseline performance in object detection without any additional processing or enhancement. It provided insights into the model's ability to recognize and classify objects in their natural state within the images, reflecting its applicability in real-world scenarios where image modification is not feasible. The evaluation of the RT-DETR-X model's performance on the original images from the VisDrone-DET dataset was conducted using standard object detection metrics. This evaluation aimed to assess the model's ability to accurately detect and classify objects across different size categories and under various conditions. The metrics used for this evaluation included Average Precision (AP) and Average Recall (AR) at different Intersection over Union (IoU) thresholds, along with the assessment of frames per second (FPS) for evaluating the model's processing speed. We first briefly review the evaluation metrics of Average Precision (AP), Average Recall (AR), Intersection over Union (IoU), and Frames Per Second (FPS) in the context of image analysis and object detection.

TABLE 2. Direct evaluation of original images by average precision (AP) and average recall (AR).

1) AVERAGE PRECISION (AP)
Average Precision is a metric used in object detection to measure the accuracy of the model in detecting objects correctly. It is the average of precision values calculated at various threshold levels of detection confidence.

AP = Σ (Precision × ΔRecall) / Total number of classes   (1)

For Equation (1), precision at a specified threshold is determined by the formula Precision = True Positives / (True Positives + False Positives), and recall is computed as Recall = True Positives / (True Positives + False Negatives). Precision primarily assesses the exactness of the detections made, whereas recall evaluates the efficacy of the detection model in relation to the specific application. Average Precision (AP) is derived by constructing a Precision-Recall curve and calculating the area beneath this curve.
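To make Equation (1) concrete, the sketch below accumulates precision × Δrecall over detections sorted by confidence, which approximates the area under the Precision-Recall curve. The scores, labels, and ground-truth count are invented toy values, not results from this study.

```python
def average_precision(scores, labels, num_gt):
    """Approximate AP as the sum of precision × change-in-recall,
    accumulating over detections sorted by confidence."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    prev_recall = 0.0
    ap = 0.0
    for i in order:
        if labels[i]:          # True = this detection matched a ground truth
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_gt
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap

# Toy example: 4 detections; both of the 2 ground-truth objects are found.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, False], num_gt=2)
print(round(ap, 3))  # → 0.833
```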

2) AVERAGE RECALL (AR)
Average recall measures the model's ability to correctly identify actual positives from the dataset.

AR = Σ Recall / Total number of classes   (2)

For Equation (2), recall is computed by utilizing the formula Recall = True Positives / (True Positives + False Negatives). This metric is especially critical in contexts where the failure to identify a positive instance (i.e., a false negative) incurs significant consequences. The recall value is calculated as an average across various Intersection over Union (IoU) thresholds or among different object dimensions (small, medium, large).
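A minimal sketch of the averaging in Equation (2), here taken across IoU thresholds as described above; the true-positive and false-negative counts are illustrative only.

```python
def recall(tp, fn):
    """Recall = True Positives / (True Positives + False Negatives)."""
    return tp / (tp + fn)

def average_recall(recalls):
    """Average recall across IoU thresholds (or size categories)."""
    return sum(recalls) / len(recalls)

# Hypothetical recalls measured at IoU = 0.50, 0.75, 0.95:
# detection gets harder as the required overlap grows.
recalls = [recall(80, 20), recall(60, 40), recall(30, 70)]  # 0.8, 0.6, 0.3
ar = average_recall(recalls)
print(round(ar, 3))  # → 0.567
```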

3) INTERSECTION OVER UNION (IOU)
IoU is a measure used to quantify the percent overlap between the target mask and the prediction output by a model.

IoU = Area of Overlap / Area of Union   (3)

In Equation (3), area of overlap is the intersection area between the predicted bounding box and the ground truth, and area of union is the union area of these two boxes. IoU is a threshold metric used to determine whether a detection is a true positive or a false positive. A common IoU threshold for considering a detection to be correct is 0.5, but it can vary based on application requirements.
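The overlap computation behind Equation (3) can be written directly for axis-aligned boxes; the (x1, y1, x2, y2) corner convention and the example coordinates are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """IoU for axis-aligned boxes given as (x1, y1, x2, y2) corners."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # area of overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter                  # area of union
    return inter / union if union else 0.0

# Two partially overlapping 10×10 boxes: overlap 25, union 175.
score = iou((0, 0, 10, 10), (5, 5, 15, 15))
print(round(score, 3))  # ≈ 0.143, below the common 0.5 threshold
```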

4) FRAMES PER SECOND (FPS)
FPS in Equation (4) is a measure of how many frames (images) the model can process per second. It is a key indicator of the model's speed and efficiency in real-time applications.

FPS = 1 / Average Time per Frame   (4)
Higher FPS is crucial for real-time detection tasks. However, there is often a trade-off between FPS and accuracy (AP and AR); optimizing for speed can sometimes reduce the model's accuracy. The results of Table 2 and Figure 3 demonstrate the model's ability to accurately detect objects across different size categories and under various conditions. The results highlight a balance between precision and recall, with higher precision in detecting larger objects and a notable recall rate across all categories. The model's processing speed, indicated by an average FPS of 27.67, underscores its capability in handling real-time applications effectively.
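Equation (4) amounts to inverting the mean per-frame latency. The latency values below are hypothetical, chosen only to land near the reported ~27.67 FPS average.

```python
def fps(frame_times_s):
    """FPS = 1 / average time per frame, as in Equation (4)."""
    avg = sum(frame_times_s) / len(frame_times_s)
    return 1.0 / avg

# Hypothetical per-frame latencies in seconds (about 36 ms each).
print(round(fps([0.036, 0.0362, 0.0361]), 1))  # → 27.7
```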

B. SAHI METHODOLOGY FOR ENHANCED DETECTION
The second method employed the Slicing Aided Hyper Inference (SAHI) algorithm, a technique specifically designed to improve the detection of small objects in high-resolution images. The SAHI methodology addresses the inherent challenge of detecting small objects by slicing the input image into overlapping blocks. This process effectively increases the pixel area of small target objects, making them more distinguishable for the detection algorithms. SAHI's effectiveness is particularly notable in high-quality images, such as those obtained from remote sensing and 4K drone aerial photography. These images often contain a plethora of minute details and objects that can be challenging to detect using conventional methods. The SAHI methodology seamlessly integrates into the object detection framework, enhancing the model's ability to process and analyze complex scenes. It utilizes advanced graph cutting and puzzle functions, enabling the model to dissect and interpret intricate image compositions effectively. The combination of direct evaluation and the SAHI-enhanced approach provided a comprehensive assessment of the RT-DETR-X model's performance in UAV aerial image analysis. The direct evaluation method offered a baseline understanding of the model's capabilities, while the SAHI methodology highlighted its enhanced performance in scenarios involving small and densely packed objects. Together, these evaluation methods provided a well-rounded view of how the model addresses the challenges of UAV-based image analysis.

C. SAHI SUBGRAPH PUZZLE EVALUATION
The SAHI (Slicing Aided Hyper Inference) evaluation method represents a significant advancement in UAV aerial image analysis, particularly for datasets featuring high-density small objects. This method encompasses the division of original images into smaller sub-images or slices, typically set to a default size of 640 × 640 pixels. The granularity of this slicing process allows for a more detailed examination of each segment, enhancing the detection accuracy for smaller objects, albeit at the cost of increased computational time. A key aspect of the SAHI method is the careful management of the overlap ratio between adjacent sub-images. This ensures thorough coverage of the entire image area and reduces the risk of missing detections along the edges of the slices. After the inference process, the results from these slices are meticulously reassembled. This reorganization involves specific settings, such as the slicing algorithm (--slice_infer), the combination method (defaulting to Non-Maximum Suppression, NMS), the match threshold (typically 0.6), and the match metric, which defaults to Intersection over Smaller Area (IOS) but can be switched to Intersection over Union (IoU) based on dataset characteristics and desired accuracy levels.

TABLE 4. Different RT-DETR versions and RT-DETR-X for enhanced UAV aerial image analysis using the VisDrone-DET dataset.
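The reassembly step described above can be sketched as a greedy suppression pass over the per-slice detections. For simplicity this sketch uses IoU as the match metric with the 0.6 threshold mentioned in the text (SAHI's default metric is IOS), and the detection boxes and scores are invented.

```python
def iou(a, b):
    """IoU for (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def nms(dets, match_threshold=0.6):
    """Greedy NMS over (box, score) pairs: keep the highest-scoring box,
    then suppress any later box whose overlap exceeds the threshold."""
    dets = sorted(dets, key=lambda d: -d[1])
    kept = []
    for box, score in dets:
        if all(iou(box, k[0]) <= match_threshold for k in kept):
            kept.append((box, score))
    return kept

# Two near-duplicate detections of one object from adjacent slices,
# plus one distinct object (coordinates are illustrative).
dets = [((0, 0, 100, 100), 0.9), ((5, 5, 105, 105), 0.8), ((300, 300, 400, 400), 0.7)]
print(len(nms(dets)))  # → 2 (the duplicate is suppressed)
```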
The SAHI methodology's flexibility in adjusting slice size and overlap ratio allows it to be tailored to meet the specific requirements of different datasets and computational constraints. This adaptability is particularly beneficial in improving the accuracy of detecting small-scale objects within large image frames, a common challenge in UAV image analysis. Overall, the SAHI evaluation technique significantly enhances the ability to detect small-resolution targets in UAV aerial images. By utilizing advanced slicing and reassembly techniques, it addresses critical challenges in UAV image analysis, striking a balance between improved accuracy and processing speed. This makes the SAHI method a valuable asset in the field of sophisticated aerial image analysis, catering to the nuanced needs of high-resolution UAV photography.
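The slicing grid that this adaptability refers to can be sketched as follows, assuming a simplified uniform grid (SAHI's actual implementation differs in detail); the 20% overlap ratio is an illustrative choice.

```python
def slice_windows(img_w, img_h, slice_size=640, overlap=0.2):
    """Generate (x1, y1, x2, y2) slice windows covering the image, with
    the given overlap ratio between adjacent slices (simplified grid)."""
    step = int(slice_size * (1 - overlap))
    windows = []
    y = 0
    while True:
        y2 = min(y + slice_size, img_h)
        x = 0
        while True:
            x2 = min(x + slice_size, img_w)
            windows.append((x, y, x2, y2))
            if x2 >= img_w:
                break
            x += step
        if y2 >= img_h:
            break
        y += step
    return windows

# Average VisDrone-DET resolution reported in the paper: 1520 × 1002.
wins = slice_windows(1520, 1002)
print(len(wins))  # → 6 (a 3 × 2 grid of overlapping 640-pixel slices)
```

Shrinking the slice size or raising the overlap produces more windows, trading additional inference passes for denser coverage of small objects.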
The SAHI subgraph puzzle evaluation of the RT-DETR-X model, applied to the VisDrone-DET dataset, is presented in the table below. This evaluation showcases the model's performance in terms of average precision (AP) and average recall (AR) across different Intersection over Union (IoU) thresholds and object size categories. Table 3 and Figure 4 indicate the model's enhanced performance in detecting objects across various categories when employing the SAHI methodology. Notably, the accuracy increased by 3.3 points, although the frames per second (FPS) dropped to an average of 1.03, reflecting the trade-off between accuracy and processing speed when using the SAHI approach.

D. COMPARATIVE ANALYSIS OF THE RT-DETR-X AND SAHI SUBGRAPH PUZZLE EVALUATION
The comparative analysis of the RT-DETR-X model using both the default original image evaluation and the SAHI subgraph puzzle evaluation on the VisDrone-DET dataset offers insightful distinctions in performance metrics, particularly focusing on precision, recall, and processing speed. In the default original image evaluation, the model showcased varying levels of detection accuracy across different object sizes. It exhibited higher precision in detecting larger objects, with an Average Precision (AP) of 0.310 for all areas at an IoU threshold of 0.50:0.95. This indicated a moderate level of precision across all sizes. The Average Recall (AR) was particularly noteworthy, reaching as high as 0.737 for larger objects. This high recall rate across all categories suggested the model's effective identification of most objects within the images. Additionally, the model demonstrated a commendable processing speed with an average FPS of 27.67, efficiently balancing speed with accuracy, making it suitable for real-time applications. On the other hand, the SAHI subgraph puzzle evaluation showed a significant increase in accuracy, particularly for smaller objects. The AP for small objects improved to 0.261, marking a notable advancement in the model's capability to detect smaller targets.
The recall rates also saw an enhancement across all object sizes, with the highest recall for large objects being 0.638. However, this increased accuracy and recall came with a trade-off in processing speed: the average FPS notably decreased to 1.03, a consequence of the SAHI method's additional computational demands. As a result, the SAHI methodology significantly enhances the detection accuracy of the RT-DETR-X model, particularly for small objects, a crucial element in UAV aerial image analysis. This improvement in precision and recall, however, is accompanied by a reduction in processing speed, underscoring the trade-off between achieving high accuracy and maintaining real-time processing capabilities. In scenarios where the detection of small, densely packed objects is critical and immediate processing is less of a concern, the SAHI approach offers substantial benefits. In contrast, for applications that require quicker processing and where variability in object size is not a primary concern, the default evaluation method provides a well-rounded and effective solution.
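The core of the SAHI-style evaluation is tiling each high-resolution frame into overlapping windows that are inferred separately. The sketch below illustrates only this slicing step; the 640-pixel slice size and 20% overlap are illustrative choices, not necessarily the exact parameters used in our experiments:

```python
def compute_slices(img_w, img_h, slice_size=640, overlap=0.2):
    """Return (x1, y1, x2, y2) windows tiling an image with overlap.

    Windows at the right/bottom edges are shifted inward so every
    slice has the full slice_size when the image is large enough.
    """
    step = int(slice_size * (1 - overlap))
    boxes = []
    y = 0
    while y < img_h:
        y2 = min(y + slice_size, img_h)
        x = 0
        while x < img_w:
            x2 = min(x + slice_size, img_w)
            boxes.append((max(0, x2 - slice_size), max(0, y2 - slice_size), x2, y2))
            if x2 >= img_w:
                break
            x += step
        if y2 >= img_h:
            break
        y += step
    return boxes

# A 2000x1500 VisDrone-style frame yields a grid of overlapping 640x640 slices.
slices = compute_slices(2000, 1500)
```

Each slice is then run through the detector, and the per-slice detections are mapped back to original-image coordinates and merged, which is why per-frame FPS drops roughly in proportion to the number of slices.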

E. ROLE OF REASONING IN THE OVERALL ANALYSIS
The object detection framework offers flexibility in predicting both single images and batches, which is advantageous for various UAV aerial photography applications. This adaptability is crucial in addressing the differing requirements of real-time object detection scenarios. The RT-DETR-X model, when applied to original images from the VisDrone-DET dataset, demonstrated efficient and accurate detection capabilities. This mode of reasoning is particularly effective for detecting larger objects and is suitable for scenarios where immediate and rapid object detection is required. The model's performance in this setting highlights its potential in real-world applications where speed and accuracy are paramount. The integration of the SAHI methodology with the RT-DETR-X model provided an enhanced ability to detect smaller targets, which are often challenging in UAV imagery due to their size and density. While this approach improves the detection of small objects, it was observed that detection effectiveness for larger targets could be slightly compromised. This trade-off is an important consideration in applications where the detection of small objects is critical.
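The single-image versus batch flexibility described above can be captured in a thin wrapper. The names here (`predict`, the callable `model`) are hypothetical illustrations, not the framework's actual API:

```python
from typing import List, Union

def predict(model, images: Union[str, List[str]]) -> List[dict]:
    """Run detection on a single image path or a batch of paths.

    `model` is any callable that maps one image path to its detections;
    a single path is promoted to a one-element batch so callers get a
    uniform list of per-image results either way.
    """
    batch = [images] if isinstance(images, str) else list(images)
    return [model(path) for path in batch]

# A stand-in detector returning an empty detection list per image.
dummy = lambda path: {"path": path, "boxes": []}
single = predict(dummy, "frame_001.jpg")
batch = predict(dummy, ["frame_001.jpg", "frame_002.jpg"])
```

This keeps downstream post-processing identical for interactive single-frame use and offline batch evaluation.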
To establish the efficacy of our proposed RT-DETR-X model within the current landscape of UAV object detection technologies, we conducted a rigorous comparative evaluation against contemporary state-of-the-art models, specifically the latest iterations of YOLO (You Only Look Once) and Faster R-CNN (Region-based Convolutional Neural Networks). These models have been widely recognized for their proficiency in object detection tasks, making them ideal benchmarks for our study. Our comparative analysis revealed that while the YOLOv5 model offered superior processing speed with a higher Frames Per Second (FPS) rate, crucial for real-time surveillance applications, it lagged in precision by approximately 12% when detecting smaller-scale objects compared to the RT-DETR-X model. This disparity in performance underscores the optimization of RT-DETR-X for the nuanced challenges presented by UAV aerial imagery, where small object detection is paramount due to the typically high altitudes and resultant smaller object footprints. Furthermore, when contrasted with Faster R-CNN, known for its accuracy in feature-rich environments, RT-DETR-X maintained a comparable level of precision while delivering detection results at a significantly faster rate, demonstrating a balanced trade-off between speed and accuracy. This comparison not only validates the RT-DETR-X model as a competitive tool for UAV-based applications but also highlights its potential to replace conventional models in scenarios where both high precision and real-time processing are required.

F. DEPLOYMENT MODEL
The RT-DETR-X model was configured for deployment after successful training and evaluation. This step ensures that the model can be seamlessly integrated into various application environments. For broader compatibility and ease of deployment, the model was converted to the ONNX (Open Neural Network Exchange) format, which facilitates the use of the model across different platforms and applications. The ONNX model was deployed using ONNX Runtime, a performance-focused engine for ONNX models. This deployment strategy is crucial for achieving efficient and scalable inference in practical applications. To demonstrate the model's capabilities in an interactive and user-friendly manner, a Gradio interface was developed. This interface allows users to visualize the model's performance in real time, making it accessible to non-technical users. The Gradio deployment can be experienced through the provided online demo, enhancing the understanding of the model's practical applications. The resulting scenario is presented in Figure 6.
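As a sketch of the input side of such an ONNX Runtime deployment, the following shows how a raw frame might be converted into the NCHW float32 tensor an exported detector typically expects. The 640-pixel input size and [0, 1] normalization are assumptions to be checked against the actual export, and the nearest-neighbour resize is used only to keep the sketch dependency-free:

```python
import numpy as np

def preprocess(image: np.ndarray, size: int = 640) -> np.ndarray:
    """Convert an HWC uint8 image into an NCHW float32 tensor.

    Assumed input contract (size, scaling) is illustrative; verify
    against the exported model's actual input signature.
    """
    h, w = image.shape[:2]
    # Nearest-neighbour resize via index sampling.
    ys = (np.arange(size) * h // size).clip(0, h - 1)
    xs = (np.arange(size) * w // size).clip(0, w - 1)
    resized = image[ys][:, xs]
    tensor = resized.astype(np.float32) / 255.0   # scale to [0, 1]
    tensor = tensor.transpose(2, 0, 1)[None]      # HWC -> NCHW with batch dim
    return tensor

frame = np.zeros((1080, 1920, 3), dtype=np.uint8)
inp = preprocess(frame)
# With ONNX Runtime, inference would then look roughly like:
#   session = onnxruntime.InferenceSession("rtdetr-x.onnx")
#   outputs = session.run(None, {session.get_inputs()[0].name: inp})
```

The Gradio interface then simply wraps this preprocess-infer-draw pipeline in a function passed to an image-to-image demo.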

G. ABLATION STUDY
An ablation study was meticulously conducted to dissect the contribution of each critical component within the RT-DETR-X model, focusing on the hybrid attention mechanism and the Slicing Aided Hyper Inference (SAHI) module. The study revealed that the incorporation of the hybrid attention mechanism was pivotal, contributing a 6.5% increment in Average Precision (AP) for the detection of small objects. This finding substantiates the mechanism's integral role in refining the model's focus on relevant features within expansive aerial scenes. Conversely, the exclusion of the SAHI module led to a discernible 4.3% decline in overall detection accuracy. This decrement emphasizes the module's efficacy in augmenting the model's sensitivity to small-scale objects, which are typically challenging to discern due to the high-altitude vantage points of UAVs. The ablation study thus affirms the synergistic effect of these components in bolstering the RT-DETR-X model's object detection capabilities and lays a concrete foundation for their indispensability in UAV aerial image analysis. Visual results of object detection by RT-DETR-X with SAHI are presented in Figure 5.

H. REAL-TIME PERFORMANCE OF RT-DETR-X INTEGRATED WITH SAHI METHODOLOGY
The RT-DETR-X model exhibits exceptional real-time processing capabilities, achieving a balance between high-speed (74 FPS) and high-accuracy (54.8% AP) performance. This makes it particularly suitable for dynamic and varied UAV operational environments, where immediate data processing and decision-making are crucial.
• The RT-DETR-X model demonstrates high computational efficiency, making it ideal for deployment in real-world UAV applications. The model's architecture, optimized for UAV systems, ensures minimal computational resources are utilized without compromising detection quality. This efficiency is evident in the model's ability to process high-resolution images rapidly, a critical requirement in UAV aerial photography.
• The model's real-time capabilities are further evidenced by its performance metrics. The RT-DETR-X model not only maintains a high frame rate but also ensures accuracy, as reflected in the Average Precision (AP) and Average Recall (AR) values. These metrics are crucial indicators of the model's ability to detect and recognize objects promptly and accurately in real-time scenarios.
• In comparison to existing real-time object detection models, such as YOLO and Faster R-CNN, the RT-DETR-X model integrated with the SAHI methodology stands out. While maintaining a comparable level of precision, the model achieves higher frame rates, highlighting its superior real-time processing capabilities in UAV applications.
• The model's real-time performance is further supported by its technical design. The integration of the SAHI methodology enhances the detection of small-scale objects in high-resolution UAV images. This integration does not significantly impede processing speed, thanks to the model's optimized architecture that supports rapid data processing and image analysis.
• The empirical evidence of the model's real-time capability is presented through various experiments and evaluations using the VisDrone-DET dataset. The dataset, known for its diverse range of small targets in UAV aerial photography scenes, serves as an ideal platform to test and demonstrate the model's real-time processing effectiveness.
• In practical applications and deployment, the RT-DETR-X model showcases its real-time processing strengths.
The deployment of the model in ONNX format, combined with a user-friendly Gradio interface, exemplifies its efficiency and effectiveness in actual UAV surveillance tasks. The model's ability to deliver high-speed and accurate object detection makes it invaluable in applications requiring rapid response, such as emergency services, environmental monitoring, and urban planning.
• The real-time processing capability of RT-DETR-X is particularly beneficial in scenarios like disaster response, where rapid assessment and decision-making are critical. Its ability to quickly process and analyze aerial images can significantly aid in identifying affected areas and coordinating rescue efforts efficiently.
• While striving for real-time processing, the model faced challenges such as balancing detection accuracy with speed. These were overcome by optimizing the neural network architecture and employing efficient processing algorithms, ensuring that the model remained effective without compromising speed.
• Looking ahead, further enhancements to improve the real-time performance of the RT-DETR-X model could involve integrating more advanced machine learning algorithms and exploring edge computing solutions to facilitate faster on-site data processing.
In general, the RT-DETR-X model, integrated with the SAHI methodology, demonstrates robust real-time capabilities, making it a groundbreaking tool in the field of UAV aerial image analysis. Its technical sophistication, combined with practical efficiency, sets a new benchmark in real-time object detection for UAV applications.
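FPS figures like those reported above are typically obtained by timing an inference loop after a short warm-up. The helper below is a hypothetical sketch of that measurement protocol, with a sleeping stub standing in for the actual model:

```python
import time

def measure_fps(infer, frames, warmup: int = 2) -> float:
    """Average frames per second of `infer` over `frames`.

    A few warm-up calls are discarded first, since initial runs often
    pay one-time costs (allocation, kernel compilation, caching).
    """
    for f in frames[:warmup]:
        infer(f)
    start = time.perf_counter()
    for f in frames:
        infer(f)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed

# A stub detector that takes ~2 ms per frame stands in for the model.
fps = measure_fps(lambda f: time.sleep(0.002), list(range(50)))
```

Averaging over many frames rather than timing a single call smooths out scheduler jitter, which matters when comparing figures such as 27.67 FPS versus 1.03 FPS.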

V. CONCLUSION
In conclusion, this research has provided a comprehensive analysis of the RT-DETR-X model's capabilities in the context of UAV aerial photography, using the VisDrone-DET dataset. Our study has demonstrated that the RT-DETR-X model, with its advanced object detection capabilities and high-speed processing, stands out as a significant tool in the realm of UAV aerial image analysis. The research was grounded in a thorough evaluation of the RT-DETR-X model against various other models in the RT-DETR series. This comparative analysis, detailed in Table 4, highlighted the superior performance of RT-DETR-X in terms of Average Precision (AP) and Frames Per Second (FPS), making it a robust choice for real-time applications in diverse aerial photography scenarios. The model's efficiency in handling high-resolution images and its adaptability in rapidly processing a wide range of object sizes and complexities were particularly notable.
The application of the SAHI (Slicing Aided Hyper Inference) methodology further enhanced the model's performance, especially in detecting small objects. While the SAHI approach led to an increase in accuracy, it also resulted in a trade-off with processing speed, an essential consideration for real-time applications. This finding underscores the importance of selecting the right methodologies based on the specific requirements of the aerial photography task at hand. Additionally, the practical deployment of the RT-DETR-X model, facilitated by the ONNX format and a Gradio interface, was a key aspect of this study. The Gradio interface, in particular, provided an interactive platform for demonstrating the model's capabilities, thus making it accessible for a wider range of applications in the industry. Through this research, we have established the RT-DETR-X model as a leading solution in UAV aerial image analysis, capable of addressing the high-speed and high-accuracy requirements essential in fields such as environmental monitoring, urban planning, and defense surveillance.
The findings from this study contribute valuable insights to the field of aerial surveillance and remote sensing, showcasing the potential of advanced object detection models like RT-DETR-X in revolutionizing UAV-based applications. Future work in this area could explore the integration of these models with more diverse datasets and in different environmental conditions, further expanding the boundaries of UAV aerial image analysis.
For future research, our focus will shift towards the integration of Generative Adversarial Networks (GANs) to augment our training datasets synthetically. This approach aims to enhance the robustness of the RT-DETR-X model, particularly under varied environmental conditions such as fluctuating lighting and diverse weather scenarios, which are common in UAV surveillance operations. By generating synthetic images that mimic these challenging conditions, we anticipate a significant improvement in the model's ability to accurately detect objects under less-than-ideal circumstances. This advancement is expected not only to refine detection capabilities in standard scenarios but also to ensure reliable performance in dynamically changing environments, a critical aspect of real-world UAV applications.

FIGURE 2. Analysis of training data for RT-DETR-X Model on VisDrone-DET dataset.

FIGURE 3. Direct evaluation of original images by using RT-DETR-X and VisDrone-DET dataset.

FIGURE 4. Direct reasoning about combining SAHI subgraphs into original graphs with the RT-DETR-X model and VisDrone-DET dataset.

FIGURE 6. The ONNX model was deployed using the ONNX Runtime.

TABLE 1. Similar research published in recent years and its significance.

TABLE 3. SAHI subgraph puzzle evaluation using RT-DETR-X on VisDrone-DET: AP and AR metrics.