Object-Centric Masked Image Modeling-Based Self-Supervised Pretraining for Remote Sensing Object Detection

Masked image modeling (MIM) has been proven to be an effective pretext task for self-supervised pretraining (SSP): it helps the model capture a task-agnostic representation at the pretraining step and thereby advances the fine-tuning performance of various downstream tasks. However, under the high random masking ratio of MIM, scene-level MIM-based SSP struggles to capture small-scale objects or local details in complex remote sensing scenes. Consequently, when pretrained models that mainly capture scene-level information are directly applied to the object-level fine-tuning step, there is an obvious representation learning misalignment between the pretraining and fine-tuning steps. Therefore, in this article, a novel object-centric masked image modeling (OCMIM) strategy is proposed to make the model better capture object-level information at the pretraining step and thereby further advance the object detection fine-tuning step. First, to better learn object-level representations involving full scales and multiple categories during MIM-based SSP, a novel object-centric data generator is proposed to automatically set up targeted pretraining data according to the objects themselves, which provides a suitable data condition for object detection model pretraining. Second, an attention-guided mask generator is designed to generate a guided mask for the MIM pretext task, which leads the model to learn more discriminative representations of highly attended object regions than the random masking strategy does. Finally, experiments are conducted on six remote sensing object detection benchmarks, and the results show that the proposed OCMIM-based SSP strategy is a better pretraining approach for remote sensing object detection than commonly used methods.


I. INTRODUCTION
Object detection plays a critical role in the field of remote sensing, enabling the interpretation of remote sensing imagery for various applications, including urban planning, traffic control, and military and civil intelligent surveillance systems [1], [2], [3], [4]. Since the object detection task needs to simultaneously localize and recognize various objects in complex remote sensing scenes, its challenge goes beyond scene-level analysis, which only identifies the semantic labels of scene images. Accordingly, apart from the backbone network, object detectors also contain feature fusion networks and regression layers, as shown in Fig. 1(b). Most recent progress on remote sensing object detectors [5], [6], [7], [8] focuses on improvements to the feature fusion and regression layers and has demonstrated promising results on a few benchmark datasets. However, the backbone network acts as the premise of the feature fusion network and regression layers and is responsible for capturing basic semantic features from images, which are pivotal for identifying and localizing objects [9]. Thus, the feature representation capability of the backbone network is the foundation for improving detection performance, and building a stronger visual representation can facilitate the subsequent feature fusion and parameter regression so as to obtain better results.
Currently, as shown in Fig. 1(a), most researchers [10], [11], [12], [13], [14] tend to rely on a free lunch, whereby the backbone networks adopt off-the-shelf pretrained weights learned on large-scale labeled natural scene datasets (e.g., ImageNet [15]) with the pretext task of image classification. Although this largely promotes detection performance compared with training from scratch, supervised pretraining (SP) inevitably suffers from limitations such as task-aware discrepancy. Recently, masked image modeling (MIM)-based self-supervised pretraining (SSP) has been proven to be superior to previous SP methods and alleviates the dependency on labeled data [16], [17], [18], [19], [20], [21], [22], [23], because MIM can provide a powerful feature representation for the vision transformer (ViT) via random masking and image reconstruction. Inspired by the progress of MIM-based SSP, several studies [24], [25], [26], [27], [28] have also begun to employ MIM-based SSP on unlabeled remote sensing data to obtain pretrained models. These methods proved that MIM-based SSP is capable of capturing a generic visual representation for various downstream tasks and promotes model fine-tuning performance in the remote sensing domain (RSD). However, as observed in [29] and [30], whether for SP or SSP, scene-level representation learning cannot be aligned with the representation learning of dense prediction tasks, especially remote sensing object detection, because a global understanding of the scene fails to learn properties that are important for perceiving local dense objects, as presented in Fig. 1(a) and (b).
To this end, we aim to develop an MIM-based SSP with the goal of aligning the representation learning of the pretraining step and the fine-tuning step for remote sensing object detection. Masked autoencoder (MAE) [17] and SimMIM [19] have demonstrated that the great success of MIM-based SSP is attributed to the high random masking ratio, because it forces the ViT-based model to holistically understand a scene image from a few fragmentary patches and clearly reveals, via the image reconstruction task, whether the scene image is well understood. However, remote sensing scenes are more complicated than natural scenes due to abundant background interference. If we follow the practice of randomly masking the original input images with a high masking ratio (e.g., 75%), representation learning would easily ignore the small-scale objects that are prevalent in remote sensing scenes and instead heavily attend to object-irrelevant background regions at the pretraining step. As a consequence, due to the lack of small-scale object-level representation, the pretrained models cannot provide a comprehensive feature representation capability for object detection tasks with full-scale objects at the fine-tuning step.
Motivated by the abovementioned analysis and following our previous study of consecutive pretraining (CSPT) [26], an object-centric masked image modeling (OCMIM) pretraining strategy is proposed to handle the misalignment between the scene-level pretraining and object-level fine-tuning steps and to endow the pretrained models with a reasonable object-level feature representation capability for remote sensing object detection. First, to address the issue of missing representation information for small-scale objects and to set up object-level representation information acquisition at the pretraining step, an object-centric data generator (OCDG) is specifically designed for the pretraining step, as illustrated in Fig. 2(a). It enables the model to learn object-level context information involving full scales and multiple categories. Second, due to the complicated background interference in remote sensing scenes, we advocate making MIM-based SSP adaptively attend to the reconstruction of object regions by masking the highly attended object regions instead of masking randomly. To achieve this, an attention-guided mask generator (AGMG) is proposed in Fig. 2(c). By calculating the self-attention map between the cls token and the image patch tokens, the attention-guided regions can be identified, and then a certain proportion of regions with high attention scores are chosen to be masked. Finally, under the guidance of the AGMG, the images generated by the OCDG are partially masked and reconstructed by the MIM pretext task, as shown in Fig. 2(b). In addition, several experiments are carried out on six remote sensing object detection datasets (i.e., NWPU-VHR10 [31], DIOR [32], HRSC2016 [33], UCAS-AOD [34], ITCVD [35], and HRSID [36]), and the experimental results prove that the proposed OCMIM-based SSP applied to the second stage of CSPT [26] can further advance detection performance. In general, the main contributions of this work can be summarized as follows.
1) A novel OCMIM-based SSP is proposed for remote sensing object detection. It endows the backbone of detectors with reasonable object-level feature representation by aligning the pretraining step with the object detection fine-tuning step.
2) An OCDG is designed for acquiring object-level image data for the pretraining step, which consists of multiscale object-centric cropping, tiny object splicing, and sample balancing. It enables the model to sufficiently learn object-level context information and meanwhile ensures that object information involving full scales and multiple categories can be gracefully captured.
3) Given that objects are surrounded by complex background in remote sensing scenes, an AGMG is proposed to promote the learning of more discriminative object information, which utilizes the prior knowledge of a well-pretrained model to guide the pretraining step to attend to and reconstruct foreground object regions.

II. RELATED WORK

A. Pretraining for Object Detection
Large-scale pretraining has become the consensus solution for promoting the performance of the object detection task. Regarding CNN models pretrained on large-scale datasets, such as ImageNet [15] and Million-AID [37], with the task of image classification, many works [7], [8], [13], [14], [38], [39], [40], [41] used them as the backbone network and then fine-tuned the detector on specific downstream object detection datasets, which yields significant performance improvements over training from scratch. Then, because of the task difference between image classification at the pretraining step and object detection at the fine-tuning step, Li et al. [42] provided a detailed analysis of CNN models pretrained with the task of object detection and proved that detection pretraining is very beneficial when a higher degree of localization is desired. Moreover, different from the abovementioned SP, which maps images to class labels or object boxes, contrastive learning-based SSP methods [43], [44], [45] attract positive embedding pairs while dispelling negative pairs to learn image representations, and they have achieved great success. Inspired by the pretext task of contrastive learning, some researchers have begun to consider pretraining paradigms specialized for the object detection task. For example, Wei et al. [29] proposed a novel object-level contrastive learning method named SoCo, which aims to align pretraining with the object detection fine-tuning step. Meanwhile, Dang et al. [46] studied an SSP approach for object detection based on sampling random boxes and maximizing spatial consistency. Considering the balance between recognition and localization in object detection, Bai et al. [47] proposed a point-level region contrast pretraining approach, which performs contrastive learning by directly sampling individual point pairs from different regions. In addition, as the transformer model sweeps the field of computer vision, the authors in [48] and [49] also studied dedicated unsupervised pretraining methods for the newly designed detector called DETR [50]. From the abovementioned works, it can be found that, whether for supervised or self-supervised object detection model pretraining, how to align the pretraining step with the object detection fine-tuning step is an active research topic. Thus, in this article, we also focus on improving our previous work on CSPT [26] to make it more suitable for the object detection task.

B. MIM-Based SSP
Recently, MIM-based SSP, motivated by BERT [51], has risen to be a promising pretraining method in the computer vision field. Unlike contrastive learning, which needs to carefully allocate positive and negative pairs by prior setting, it directly learns image representation knowledge by reconstructing masked image contents. For instance, SimMIM [19], MAE [17], and ConvMAE [18] randomly mask the original RGB pixels with a high masking ratio and then utilize an encoder-decoder architecture to predict the masked pixel values. In addition, beyond masking RGB pixels, BEiT [16] and MaskFeat [20] explored masking unsupervised features and proved that these are also useful signals for representation learning. Moreover, some researchers are committed to distilling representation knowledge from large-scale pretrained models to advance MIM-based SSP. Kakogeorgiou et al. [52] exhibited the context of distillation-based MIM, where a teacher transformer encoder generates an attention map that is then used to guide masking for the student model. Then, Xue et al. [53] rethought the reconstruction of the MIM pretext task and proposed to learn the consistency between the visible patch features extracted by the student model and the intact image features extracted by the teacher model. Although the existing MIM-based SSP methods have made great progress by exploring effective MIM strategies to learn holistic scene representations for various downstream tasks, the dense object-level prediction characteristic of the object detection task has not been carefully considered. Thus, in this article, we are devoted to exploring a better MIM-based SSP strategy to capture object-level representations for remote sensing object detection.

III. PROPOSED METHOD

A. Overview
The overall framework of the proposed object-level MIM-based SSP strategy, called OCMIM, for the object detection task is illustrated in Fig. 2. It includes two parts, i.e., the OCDG and the AGMG. First, the OCDG is designed for automatically setting up targeted data according to the objects themselves for the pretraining step. Fig. 2(a) represents the OCDG, which includes multiscale object-centric cropping, tiny object splicing, and sample balancing. Through these specifically designed processing steps, used for task-related data generation in the second stage of CSPT [26], object-level representations involving full scales and multiple categories can be effectively learned at the pretraining step to align with the object detection task. Second, the AGMG is proposed to make MIM-based SSP pay more attention to foreground object regions, so that the influence of redundant background is avoided at the pretraining step; its details are presented in Fig. 2(c). Specifically, it first calculates the self-attention map between the cls token and the image patch tokens via the query-key product to reflect the highly attended regions. Then, as shown in the middle of Fig. 2(b), these regions are masked as much as possible, on the premise of retaining some visual clues, to achieve attention-guided reconstruction. Finally, the OCDG and AGMG are introduced into MAE [17] and utilized for model pretraining based on an autoencoder architecture. Then, the L1-norm loss is applied to the output of the autoencoder architecture, which clearly reveals whether the object-level features are effectively captured by the whole strategy. Next, the OCDG and AGMG modules of the OCMIM-based SSP strategy and the fine-tuning step are described in detail.

B. Object-Centric Data Generator (OCDG)
As mentioned in Section II-B, MIM is an effective pretext task for SSP, which has been used for setting up remote sensing foundation models for various downstream tasks. However, under a high masking ratio, scene-level MIM-based SSP tends to ignore object-level information at the pretraining step, especially for small-scale objects. This results in failing to provide a suitable feature representation capability for the backbone network of object detectors. Thus, at the pretraining step, we consider setting up object-level representation information acquisition and therefore propose the OCDG. In the proposed OCDG, given an original scene-level image $I_{\mathrm{scene}}$, a random scale factor $\alpha \in [3, 5]$ is adopted for cropping the GTs according to the object center point, width, and height, which can be expressed as follows:

$$I_{\mathrm{object}}^{i} = \mathrm{Crop}\big(I_{\mathrm{scene}};\, x_c^{i},\, y_c^{i},\, \alpha w^{i},\, \alpha h^{i}\big),\quad i = 1, \ldots, N \tag{1}$$

where Crop represents the image cropping operation, $N$ is the number of GTs, and $(x_c^i, y_c^i)$, $w^i$, and $h^i$ represent the center point coordinate, width, and height of the $i$th GT. Tiny objects in the object-level image data $I_{\mathrm{object}}$ possess only a few pixels but frequently appear in remote sensing scenes, and they still conflict with object-level MIM-based SSP: when these tiny objects are directly interpolated to a higher spatial resolution for model pretraining, the resulting distortion harms representation learning even under object-level MIM-based SSP. To this end, instead of uniformly interpolating $I_{\mathrm{object}}$ into 224 × 224 pixels for model pretraining, the object-level images whose size is less than 80 × 80 pixels are rescaled to a moderate scale (e.g., 112 × 112 pixels), and then four rescaled tiny object images are randomly spliced into an integrated image of 224 × 224 pixels, which ensures that the tiny object information is gracefully captured, as shown in Fig. 2(a) and (b). Thus, the proposed tiny object splicing and pretraining process can be expressed as follows:

$$f_L = \begin{cases} \mathrm{PE}\big(\mathrm{Splice}(\mathrm{Rescale}(I_{\mathrm{object}}))\big), & I_{\mathrm{object}} \in S\\ \mathrm{PE}(I_{\mathrm{object}}), & I_{\mathrm{object}} \notin S \end{cases} \tag{2}$$

$$I_R = D\big(E\big(f_L \odot (1 - M)\big),\; f_0 \odot M\big) \tag{3}$$

where, in (2), PE denotes the patch embedding that splits an image into $L$ patches, and the feature of the patches is represented as $f_L$; Splice denotes splicing four images randomly; Rescale means resizing $I_{\mathrm{object}} \in S$ into 112 × 112 pixels; and $S$ represents the set of small objects. By (2), whether the large-scale object-level images not belonging to $S$ or the small-scale spliced images belonging to $S$, they are flattened into $L$ patches to prepare for subsequent masking. In (3), $M$ denotes the randomly masked regions, $f_0$ is the learned vector that indicates the presence of masked patches to be predicted, $E$ and $D$ represent the encoder and decoder modules of the autoencoder, respectively, and $I_R$ represents the reconstructed image. In addition, as the object-level image data are acquired from GTs, the long-tailed category distribution of object detection datasets would cause the same category imbalance issue during model pretraining, which makes the pretrained model focus on categories with sufficient samples and ignore the others. Accordingly, a sample balancing method is adopted to keep the sample balance among different categories, as shown in the right of Fig. 2(a). In detail, let λ be the average sample number over all categories. If the sample number of a category is less than λ, this category is oversampled for OCMIM-based SSP, and if the sample number is much more than λ, this category is undersampled for OCMIM-based SSP.
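To make the three OCDG steps concrete, a minimal NumPy/OpenCV sketch is given below. The function names, the 2 × 2 splicing layout, and the exact over-/undersampling policy toward λ are illustrative assumptions rather than the authors' released implementation.

import random
import numpy as np
import cv2  # assumes opencv-python is available

def object_centric_crop(scene: np.ndarray, box: tuple, alpha_range=(3, 5)) -> np.ndarray:
    """Crop an alpha-enlarged window centered on one GT box, as in (1)."""
    xc, yc, w, h = box                                # center point, width, height of the GT
    alpha = random.uniform(*alpha_range)
    half_w, half_h = alpha * w / 2, alpha * h / 2
    H, W = scene.shape[:2]
    x1, y1 = max(int(xc - half_w), 0), max(int(yc - half_h), 0)
    x2, y2 = min(int(xc + half_w), W), min(int(yc + half_h), H)
    return scene[y1:y2, x1:x2]

def splice_tiny_objects(tiny_crops: list) -> np.ndarray:
    """Rescale four tiny-object crops (< 80 x 80 px) to 112 x 112 and splice them
    into one 224 x 224 image, avoiding the distortion of direct upsampling."""
    assert len(tiny_crops) == 4
    tiles = [cv2.resize(c, (112, 112)) for c in tiny_crops]
    top = np.concatenate(tiles[:2], axis=1)
    bottom = np.concatenate(tiles[2:], axis=1)
    return np.concatenate([top, bottom], axis=0)

def balance_samples(crops_by_category: dict) -> list:
    """Over-/undersample each category toward the average sample number lambda."""
    lam = int(np.mean([len(v) for v in crops_by_category.values()]))
    balanced = []
    for samples in crops_by_category.values():
        if len(samples) < lam:
            balanced += random.choices(samples, k=lam)   # oversample rare categories
        else:
            balanced += random.sample(samples, lam)      # undersample frequent categories
    return balanced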

C. Attention-Guided Mask Generator (AGMG)
For scene-level MIM-based SSP, the random masking strategy with a high masking ratio has been proven to be concise and effective in natural scenes. However, in remote sensing scenes, the foreground objects with few pixels are often surrounded by complex background. Randomly masking and reconstructing remote sensing images leads to overly attending to redundant background areas, thereby learning object-level representation information inefficiently. In order to better guide the model to capture object-level representation information at the pretraining step, it is desired that the foreground object regions be masked as much as possible while only a few visible tokens of the foreground object regions remain. This encourages the pretraining step to learn targeted object-level representations of foreground regions. To achieve this, the AGMG is designed, as shown in Algorithm 1. First, a ViT-based model pretrained on M-AID [37] is introduced as the attention generator $G$, which is used for roughly revealing the foreground object regions, as shown in Fig. 1(c). Formally, an input image $X \in \mathbb{R}^{H \times W \times C}$ is first fed into the ViT-based attention generator $G$, and then the query $Q$ and key $K \in \mathbb{R}^{(N+1) \times d}$ from the last block are utilized to calculate the self-attention map, where $N + 1$ represents the number of tokens (i.e., $N$ is the number of image patches and 1 denotes the cls token) and $d$ is the feature-embedding dimension. The formula is given as follows:

$$\mathrm{Attn} = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right) \tag{4}$$

where $T$ denotes the matrix transpose and Attn represents the generated self-attention map. Next, the attention vector $a$ of the cls token is taken from the self-attention map Attn, as shown in the following, which denotes the correlation between the cls token and the other tokens of image patches:

$$a = \big(a_{1,2},\; a_{1,3},\; a_{1,4},\; \ldots,\; a_{1,N+1}\big). \tag{5}$$
To capture the highly attended regions, we first sort $a$ in descending order and then select the top 80% of $a$ for masking the corresponding patches. Meanwhile, 5% of the tokens are randomly unmasked from the highly attended tokens, aiming to retain some visible cues. Finally, the mask $M$ with a 75% masking ratio is generated, which keeps consistent with MAE [17]. Notably, we freeze the pretrained weights of the attention generator $G$ to avoid network collapse at the pretraining step. The overall procedure is summarized in Algorithm 1.

Algorithm 1: Attention-Guided Mask Generation.
Input: Attention generator G and input image I
Output: Mask M
1: The attention generator G has P blocks, P = 12
2: ImagePatches ← PatchEmbed(I)
3: x ← Concat(ImagePatches, cls)
4: for block in blocks do
5:     x, qk_matrix ← block(x)
6: end for   ▷ qk_matrix ∈ R^{(N+1)×(N+1)}, where N + 1 denotes 1 cls token and N image patch tokens
7: [a_1, a_2, . . ., a_N] ← qk_matrix[0, 1:]
8: Attn ← Sort([a_1, a_2, . . ., a_N]) in descending order
9: M ← patches corresponding to the top 80% of Attn
10: Randomly unmask 5% of the tokens in M to keep visible cues
11: return M
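As a complement to Algorithm 1, the following minimal PyTorch sketch shows how such a mask could be built from the last block's query/key matrices. The function name, tensor shapes, and the exact bookkeeping of the 5% random unmasking are illustrative assumptions rather than the authors' released code.

import torch

def attention_guided_mask(q: torch.Tensor, k: torch.Tensor,
                          mask_ratio: float = 0.80, unmask_ratio: float = 0.05) -> torch.Tensor:
    """Build an attention-guided mask from the last-block query/key of a frozen ViT.

    q, k: (N + 1, d) token embeddings, index 0 being the cls token.
    Returns a boolean mask of shape (N,); True = masked patch (final ratio 0.80 - 0.05 = 0.75).
    """
    d = q.shape[-1]
    attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)   # (N + 1, N + 1)
    cls_to_patches = attn[0, 1:]                                        # cls-token attention to the N patches
    n = cls_to_patches.numel()

    order = torch.argsort(cls_to_patches, descending=True)              # most attended patches first
    mask = torch.zeros(n, dtype=torch.bool)
    mask[order[: int(mask_ratio * n)]] = True                           # mask the top 80% attended patches

    masked_idx = mask.nonzero(as_tuple=True)[0]
    keep = masked_idx[torch.randperm(masked_idx.numel())[: int(unmask_ratio * n)]]
    mask[keep] = False                                                   # randomly unmask 5% as visible cues
    return mask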

D. Fine-Tuning on Object Detection Task
After finishing the pretraining of the autoencoder architecture with the proposed strategy, the ViT-B [54] based encoder part is taken as the backbone of the detectors at the fine-tuning step. Moreover, for multiscale feature description, four layers of ViT-B [54] (i.e., the 8th, 9th, 10th, and 11th layers) are selected and fused to adapt to multiscale object detection from remote sensing imagery; the details are shown in Fig. 3. Finally, the loss function is represented as follows:

$$L = L_{\mathrm{class}} + L_{\mathrm{box}}. \tag{6}$$

Equation (6) contains the classification loss $L_{\mathrm{class}}$ and the bounding box regression loss $L_{\mathrm{box}}$. First, $L_{\mathrm{class}}$ adopts the cross-entropy loss, as shown in the following:

$$L_{\mathrm{class}} = -\sum_{i}\big[p_i^{*}\log p_i + (1 - p_i^{*})\log(1 - p_i)\big] \tag{7}$$

where $p_i$ denotes the prediction probability of the $i$th proposal box. When the ground truth of the $i$th proposal box is positive, $p_i^{*} = 1$; otherwise, $p_i^{*} = 0$. Second, the L1-norm loss is utilized for calculating the offset of the bounding box coordinates between the ground truth and the prediction result. It is formulated as follows:

$$L_{\mathrm{box}} = \sum_{i}\left\| t_i - t_i^{*}\right\|_{1} \tag{8}$$

where $t_i^{*}$ denotes the ground truth of the bounding box coordinates and $t_i$ denotes the predicted bounding box coordinates. Here, the bounding box coordinates include the center point (x, y), the width w, and the height h. Notably, at the fine-tuning step, we only focus on the effect of the pretrained models; thus, other model fine-tuning tricks, such as advanced data augmentation, feature enhancement, and multiscale training and testing, are not introduced.
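For illustration, a minimal PyTorch sketch of the combined loss in (6)-(8) is shown below. The proposal sampling, the reduction over proposals, and the restriction of the box loss to positive proposals are assumptions made here for clarity rather than the exact detector configuration.

import torch
import torch.nn.functional as F

def detection_loss(cls_logits, box_preds, labels, box_targets):
    """Illustrative combination of Eqs. (6)-(8).

    cls_logits: (M, num_classes) scores per proposal, labels: (M,) with 0 = background,
    box_preds / box_targets: (M, 4) offsets of (x, y, w, h).
    """
    # Eq. (7): cross-entropy classification loss over all proposals.
    loss_class = F.cross_entropy(cls_logits, labels)

    # Eq. (8): L1 loss on box coordinates; here only positive proposals contribute.
    positive = labels > 0
    if positive.any():
        loss_box = F.l1_loss(box_preds[positive], box_targets[positive])
    else:
        loss_box = box_preds.sum() * 0.0

    # Eq. (6): total loss.
    return loss_class + loss_box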

IV. EXPERIMENTS AND ANALYSIS
A. Datasets and Implementation Details

1) Dataset Description: To prove the effectiveness of the proposed OCMIM-based SSP strategy, six remote sensing object detection datasets (i.e., NWPUVHR10 [31], HRSC2016 [33], DIOR [32], UCAS-AOD [34], ITCVD [35], and HRSID [36]) are utilized for comparison. First, following CSPT [26], the large-scale natural scene dataset ImageNet-1K [15] (IN1K) is used for the first stage of model pretraining. Then, the training sets of the abovementioned six downstream datasets are used for the second stage of pretraining. Meanwhile, considering the large image size of NWPUVHR10 [31] and ITCVD [35], we split their images into 512 × 512 pixel tiles with a stride of 128. Furthermore, the images of the other datasets are all resized to 512 × 512 pixels.
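A minimal NumPy sketch of this tiling step is shown below; border handling (padding the last partial window) and annotation clipping are omitted for brevity, and the function name is an assumption for illustration.

import numpy as np

def tile_image(image: np.ndarray, tile: int = 512, stride: int = 128) -> list:
    """Split a large scene image (H, W, C) into overlapping tile x tile crops."""
    h, w = image.shape[:2]
    crops = []
    for y in range(0, max(h - tile, 0) + 1, stride):
        for x in range(0, max(w - tile, 0) + 1, stride):
            crops.append(image[y:y + tile, x:x + tile])
    return crops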
2) Pretraining Setting: A ViT-based autoencoder architecture is adopted for model pretraining. Specifically, ViT-B [54] is chosen as the encoder, and a lightweight transformer model with only eight self-attention blocks is adopted as the decoder. Then, all input images are resized to 224 × 224 pixels and divided into nonoverlapping 16 × 16 image patches (i.e., vision tokens). Subsequently, to ensure effectiveness, we keep the same pretraining setting as MAE [17] by utilizing a 75% mask ratio and reconstructing the masked patches with the decoder. Following the settings of CSPT [26], in the first stage of pretraining, the ViT-based model is pretrained on the large-scale unlabeled ImageNet-1K (IN1K) [15] dataset for 800 epochs, which yields a generalist model. In the second stage of pretraining, the training set of the downstream dataset is used for further pretraining the generalist model. For experimental details, the batch size is set to 64 on one RTX 3090. AdamW [58] with momentums β1 = 0.9 and β2 = 0.95 is employed for optimization. The learning rate schedule adopts cosine decay with a base learning rate of 3.

3) Fine-Tuning Setting: To reflect the effect of the proposed OCMIM-based SSP at the model fine-tuning step, we select several detectors (i.e., plain Mask-RCNN [55], RetinaNet [8], and the state-of-the-art (SOTA) ATSS [57] and AFPN-GAS [5]) in the MMDetection [59] framework as benchmark models, without introducing any model fine-tuning tricks. To implant the pretrained models into these detectors, the original backbone network of each benchmark model is replaced by the pretrained encoder network. Regarding the neck network, the feature pyramid network (FPN) is often used for constructing multiscale features in the object detection task. Thus, the 8th, 9th, 10th, and 11th blocks of the pretrained encoder network are chosen to output four feature vectors. Then, to fit the input of the FPN, these four feature vectors are first transformed from sequences into 2-D spatial maps by reshaping and permuting the feature dimensions. Then, the output of the 8th block is upsampled by a factor of 4 via two 2 × 2 transposed convolutions with stride = 2. The output of the 9th block is upsampled by a factor of 2 via a single 2 × 2 transposed convolution with stride = 2. The output of the 10th block remains unchanged. The output of the 11th block is downsampled by a factor of 2 via 2 × 2 max pooling with stride = 2. The other network modules keep their default settings. For training details, the input image size is set to 512 × 512 pixels, and the total number of epochs is set to 12 with a batch size of 8. Next, the initial learning rate is set to 0.02 for Mask-RCNN [55], 0.002 for RetinaNet [8], and 0.01 for ATSS [57] and AFPN-GAS [5], and the learning rate is reduced by a factor of 10 at the 8th and 11th epochs. The SGD optimizer is utilized with momentum = 0.9 and weight decay = 0.0001. Random flipping and random resizing are used for data augmentation. Finally, the mean average precision (mAP@0.5) is calculated to evaluate detection performance.
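The following PyTorch sketch illustrates how the four block outputs could be reshaped and rescaled before being passed to the FPN, under the layout described above. The module names, the assumption that the cls token has already been dropped, and the absence of extra normalization layers are simplifications, not the exact detector code.

import torch
import torch.nn as nn

class ViTMultiScaleAdapter(nn.Module):
    """Turn the 8th-11th ViT-B block outputs into a four-level pyramid for the FPN (sketch)."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.up4 = nn.Sequential(                          # 8th block: upsample x4
            nn.ConvTranspose2d(dim, dim, kernel_size=2, stride=2),
            nn.ConvTranspose2d(dim, dim, kernel_size=2, stride=2),
        )
        self.up2 = nn.ConvTranspose2d(dim, dim, 2, 2)      # 9th block: upsample x2
        self.identity = nn.Identity()                      # 10th block: keep resolution
        self.down2 = nn.MaxPool2d(kernel_size=2, stride=2) # 11th block: downsample x2

    def forward(self, feats):
        # feats: list of four (B, L, dim) patch-token sequences from blocks 8-11 (cls token removed)
        maps = []
        for f in feats:
            b, l, c = f.shape
            s = int(l ** 0.5)                              # tokens -> 2-D spatial map (B, C, s, s)
            maps.append(f.transpose(1, 2).reshape(b, c, s, s))
        return [self.up4(maps[0]), self.up2(maps[1]),
                self.identity(maps[2]), self.down2(maps[3])]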

B. Comparison Analysis
In this section, the widely used scene-level MIM-based SSP and the proposed OCMIM-based SSP are first compared in Table I. From Table I, we can see that the proposed OCMIM-based SSP strategy applied to the second stage of CSPT [i.e., object-level CSPT(IN1K → Train)] achieves better performance than the scene-level MIM-based SSP [i.e., scene-level CSPT(IN1K → Train)] on two benchmark detectors and reaches the best results on NWPUVHR10 [31], HRSC2016 [33], DIOR [32], UCAS-AOD [34], ITCVD [35], and HRSID [36], respectively. Moreover, different pretraining strategies are also compared in the 5th and 6th columns of Table I, and we can find that using ViT-B [54] as the backbone network and training from scratch leads to catastrophically poor detection performance. The reason is that ViT [54] has very little inductive bias and cannot obtain good performance under the condition of insufficient training data. Then, compared with training from scratch, using scene-level MIM-based SSP on the IN1K dataset [i.e., SSP(IN1K)] largely promotes detection performance, which proves the value of building a strong basic feature representation through model pretraining before the object detection task. In general, MIM-based SSP methods, such as SSP(IN1K), scene-level CSPT(IN1K → Train), and object-level CSPT(IN1K → Train), can evoke the basic feature representation ability of ViT-B [54] for remote sensing object detection. Besides, aligning the pretraining step with the fine-tuning step, as in the proposed OCMIM-based SSP strategy, is also important for further improving the performance of the object detection task.
Second, to fully verify the superiority of the proposed pretraining strategy, based on Mask-RCNN [55], several traditional SP methods and advanced SSP methods are selected for comparison on two multiclass remote sensing object detection datasets, i.e., DIOR [32] and NWPUVHR10 [31]. The comparison results on DIOR [32] and NWPUVHR10 [31] are reported in Tables II and III. First, from the second row of Tables II and III, SP(IN1K) with ViT-B [54] obtains the worst performance, which shows that the pretraining strategy based on the image classification task cannot fully stimulate the potential of ViT-based models, so it is hard to capture effective object-level representations for the object detection task. This is also confirmed by the fifth and sixth rows of Tables II and III. It can be observed that the advanced MIM-based SSP strategy [i.e., SSP(IN1K)] brings a performance gain over SP(IN1K); besides, SSP has the advantage of releasing the potential of large-scale unlabeled remote sensing data. Comparing different pretext tasks of SSP, the third and fourth rows of Tables II and III show that contrastive learning-based SSP performs worse than MIM-based SSP, which demonstrates the superiority of MIM-based SSP. Regarding the CNN-based model shown in the first row of Tables II and III, it can be found that SP(IN1K) with ResNet-101 performs better on NWPUVHR10 [31] than on DIOR [32] and even surpasses some SSP methods. We attribute this to the stronger local inductive bias of CNN-based models compared with ViT-based models, which allows them to quickly fit datasets with small data amounts. Nevertheless, from the seventh and eighth rows of Tables II and III, SSP(MAID) and CSPT(IN1K → Train) both achieve further detection performance gains by bridging the domain gap between the natural scene domain and the RSD. Finally, when employing our proposed OCMIM-based SSP strategy, as shown in the ninth row of Tables II and III, it obtains the highest mAP, i.e., 69.5% mAP on DIOR [32] and 89.3% mAP on NWPUVHR10 [31]. This demonstrates that our proposed method is very effective for advancing detection performance by learning stronger object-level representations at the pretraining step.
Third, to show the universality of the proposed pretraining strategy, two SOTA detectors (i.e., ATSS [57] and AFPN-GAS [5]) are introduced. The experimental results in Tables II and III demonstrate that, even with carefully designed modules, SOTA object detectors struggle to achieve satisfactory performance without pretraining. This finding emphasizes the importance of leveraging the basic feature representations of pretrained models for remote sensing object detection. Moreover, a comparison of the results in the 10th-15th and 16th-21st rows of Tables II and III indicates that simply replacing the backbone with the one obtained from our proposed pretraining strategy can further advance the performance of SOTA detectors without increasing any additional computational complexity at the fine-tuning step.

C. Ablation Analysis
In this section, ablation studies are conducted to clearly prove the effectiveness of the OCDG and AGMG. As reported in Table IV, each module is verified for its necessity. First, as shown in the first and second rows of Table IV, when only performing the multiscale object-centric cropping of the OCDG on the image data of the second stage of CSPT [26], it obtains 67.10% mAP, which is lower than using the scene-level image data for model pretraining.
We attribute this to the fact that the dominant number of small-scale objects and the severe sample imbalance among categories still exist under the object-level data condition. Specifically, directly interpolating the cropped images of tiny objects (e.g., less than 80 × 80 pixels) into 224 × 224 pixels leads the model to learn distorted information at the pretraining step and then degrades the detection performance. Thus, as shown in the third row of Table IV, tiny objects (e.g., less than 80 × 80 pixels) are filtered out to ensure that the model learns more effective object-level feature representations at the pretraining step; as a result, this brings a 0.4% mAP improvement. Furthermore, to avoid the influence of category sample imbalance, sample balancing is considered in the fourth row of Table IV, which brings a 0.6% mAP gain. Based on these findings, as reported in the fifth row of Table IV, tiny object splicing and sample balancing are both introduced and combined with multiscale object-centric cropping. This obtains a higher performance of 68.60% mAP on DIOR [32], which demonstrates that the tiny object splicing and sample balancing of the OCDG module facilitate learning a reasonable representation of full scales and multiple categories at the pretraining step and promote multicategory object detection performance at the fine-tuning step. Moreover, after equipping the OCDG, the AGMG is further added, as shown in the sixth row of Table IV. We can observe that it achieves the highest mAP of 69.50%, which illustrates that the attention-guided masking strategy provides a better object-level representation learning ability for the remote sensing object detection task than the normal random masking strategy.
Furthermore, regarding the comparison of scene-level MIM-based SSP and object-level OCMIM-based SSP, we visualize the reconstruction results in Fig. 4. For the large-scale ships from HRSC2016 [33] in Fig. 4(a), they can be well reconstructed after randomly masking the image, whether at the object level or the scene level. However, for the small-scale ships from NWPUVHR10 [31] in Fig. 4(b), it can be found from the red dotted line that these orange ships have been masked out and cannot be reconstructed by the scene-level MIM. By contrast, the object-level MIM can capture and learn the object information of the orange ships, which shows that object-level representations can be learned well via OCMIM-based SSP. From the perspective of the object detection fine-tuning step, some detection results are provided in Fig. 5. As can be seen, the results in column (d) recognize and localize more small-scale objects than the results in column (c), which shows that the proposed OCMIM-based SSP strategy is more suitable for the remote sensing object detection downstream task. Meanwhile, we also quantitatively discuss the fine-tuning performance of tiny object detection under the two different pretraining ways (i.e., scene-level MIM-based SSP and object-level OCMIM-based SSP). Here, the tiny objects (i.e., less than 32 × 32 pixels) of the airplane, bridge, ship, storage tank, vehicle, and windmill categories from DIOR [32] are chosen for performance comparison. From the P-R curves in Fig. 6, we can see that utilizing the proposed OCMIM-based SSP strategy for model pretraining yields better tiny object detection performance on the six categories of DIOR [32] than using the scene-level MIM-based SSP for model pretraining.

Fig. 7. Performance comparison of different scale factors α of the OCDG module on DIOR [32] and NWPUVHR10 [31].
In addition, an ablation analysis was conducted to evaluate the impact of the scale factor α of the OCDG, and the results are presented in Fig. 7. The results indicate that α ∈ [3, 5] leads to the best performance on the two multiclass object detection datasets. Meanwhile, a slight decline in performance was observed when the scale factor was set to α ∈ [5, 7]. However, compared with the abovementioned two settings, α ∈ [1, 3] brings obviously worse performance. We infer that an appropriate amount of contextual information surrounding objects helps the model better understand the object-level feature representation at the pretraining step, thereby effectively improving detection performance at the fine-tuning step.
To further clarify the effectiveness of the AGMG, several extra ablation discussions were also conducted. In order to select an appropriate pretrained model that can reflect the foreground object regions well, three ViT-B models pretrained by different methods are intuitively compared, as shown in Fig. 8. These figures represent the self-attention maps of the last block of the ViT-B [54] model, which are calculated by the query-key product between the cls token and the image patch tokens. Here, a warmer color represents a higher attention score. By comparing rows (b), (c), and (d) of Fig. 8, it can be seen that the ViT-B model pretrained on the large-scale remote sensing data of M-AID [37] [i.e., SSP(MAID)] pays more attention to foreground object regions, which shows that it is able to generate the guided mask for MIM. Subsequently, the ViT-B [54] model obtained from SSP(MAID) is employed for masking the object-level image data, aiming to verify whether the guided mask can promote the model fine-tuning performance of object detection. The ablation results are reported in Table V. One observation is that, when utilizing the AGMG with the ViT-B model of SSP(MAID) for masking the object-level image data from the OCDG, a significant performance gain (i.e., 0.3%-1.4%) is steadily obtained on the six datasets. We can conclude that attention-guided masking assists model pretraining in learning more beneficial object-level representations by attending to foreground regions more efficiently than random masking.

V. CONCLUSION

MIM-based SSP is a promising pretraining strategy in the RSD. However, its progress on the object detection task has been limited due to complex background and vast object-scale variance. Thus, we provide a novel OCMIM-based SSP strategy for the remote sensing object detection task. First, considering that the commonly used scene-level pretraining fails to learn local object representations, an OCDG is proposed for model pretraining, which facilitates the model to effectively capture object-level representation information with full scales and multiple categories. Second, because remote sensing objects are surrounded by complex background, an AGMG is designed to mask the foreground object regions as much as possible while retaining some visible clues, so as to distinguish more discriminative object information. Finally, through extensive experiments, the effectiveness of the proposed OCMIM-based SSP strategy is demonstrated on six public remote sensing object detection benchmark datasets.