The Segment Anything Model (SAM) for Remote Sensing Applications: From Zero to One Shot

Segmentation is an essential step for remote sensing image processing. This study aims to advance the application of the Segment Anything Model (SAM), an innovative image segmentation model by Meta AI, in the field of remote sensing image analysis. SAM is known for its exceptional generalization capabilities and zero-shot learning, making it a promising approach to processing aerial and orbital images from diverse geographical contexts. Our exploration involved testing SAM across multi-scale datasets using various input prompts, such as bounding boxes, individual points, and text descriptors. To enhance the model's performance, we implemented a novel automated technique that combines a text-prompt-derived general example with one-shot training. This adjustment resulted in an improvement in accuracy, underscoring SAM's potential for deployment in remote sensing imagery and reducing the need for manual annotation. Despite the limitations encountered with lower spatial resolution images, SAM exhibits promising adaptability to remote sensing data analysis. We recommend future research to enhance the model's proficiency through integration with supplementary fine-tuning techniques and other networks. Furthermore, we provide the open-source code of our modifications on online repositories, encouraging further and broader adaptations of SAM to the remote sensing domain.


Introduction
Remote sensing image analysis is an essential tool in various applications, including environmental monitoring, disaster management, urban planning, and many others [12,57]. Accurately segmenting surface objects within these images is crucial for extracting valuable information, enhancing the efficiency of the processing task [20]. Despite advancements in segmentation techniques, including the advances of artificial intelligence (AI) with deep learning-based methods [4,2], a key challenge remains: effective segmentation of images with minimal human input.
The Segment Anything Model (SAM), developed by Meta AI, is a groundbreaking approach to image segmentation that has demonstrated exceptional generalization capabilities across a diverse range of image datasets, requiring no additional training for unfamiliar objects [19]. This "zero-shot" approach enables it to make accurate predictions with little to no training data. However, its potential may be limited when facing specific domain conditions. To overcome this limitation, SAM can be modified by a "one-shot" learning approach [61], a novel aspect that we aim to explore with remote sensing imagery in this paper.
Zero-shot learning pertains to a model's capability to accurately process and act upon input data that it has not explicitly encountered during training [1,48]. This ability is derived from gaining a generalized understanding of the data rather than specific instances. Zero-shot learning systems can recognize objects or understand tasks they have never seen before by learning underlying concepts or relationships. In contrast, one-shot learning denotes a model's ability to interpret and make accurate inferences from just a single example of a new class of data [61]. By feeding SAM with a single example (or 'shot') of this new class, we can potentially enhance its performance, as it has more specific information to work with.
The best-known one-shot methods for SAM are named PerSAM and PerSAM-F [61]. Given a single image with a reference mask, PerSAM is a training-free personalization approach that localizes the target concept using a location prior as an initial estimate of where the object of interest is likely to be. The second method, PerSAM-F, is a variant of PerSAM that uses one-shot fine-tuning to reduce mask ambiguity. In this case, the entire SAM is frozen (i.e., its parameters are not updated during the fine-tuning process), and two learnable weights are introduced for multi-scale masks. This one-shot fine-tuning variant requires training only 2 parameters and can be done in as little as 10 seconds to enhance performance [61]. Both are capable of leveraging SAM and improving it, making it a flexible model.
Another important aspect relates to SAM's ability to perform segmentation with minimal input, requiring only a bounding box or a single point as a reference, or even a text prompt as guidance [19]. This capability has the potential to reduce human labor during the annotation process. Many existing techniques require intensive annotations for each new object of interest, resulting in significant computational overheads and potential delays in time-sensitive applications. SAM, on the other hand, presents an opportunity to alleviate this time-intensive task.

arXiv:2306.16623v1 [cs.CV] 29 Jun 2023
Since SAM's release in April 2023, the geospatial community has shown strong interest in adapting SAM for remote sensing image segmentation. However, a more in-depth investigation is needed. In this context, we present a first-of-its-kind evaluation of SAM, focusing on both its zero- and one-shot learning performance in segmenting remote sensing imagery. We adapted SAM to our data structure, benchmarked it against multiple datasets, and assessed its potential to segment multiscale images. We then evolved SAM's zero-shot characteristic into the one-shot approach and demonstrated that, with only one example of a new class of data, SAM's segmentation performance can be significantly improved. Our proposal's innovation lies in the one-shot technique, which involves using a prompt-text-based segmentation as a training sample (instead of a human-labeled sample), making it a fully automated process for refining SAM on remote sensing imagery. In this study, we also discuss the implications, limitations, and potential future directions of our findings. Understanding the effectiveness of SAM in this domain is of paramount importance for novel development. In short, with its promise of zero-shot and one-shot learning, SAM has the potential to transform current practices by significantly reducing the time and resources needed for training and annotating data, thereby enabling a quicker, more efficient approach.

Remote Sensing Image Segmentation: A Brief Summary
The remote sensing field has experienced impressive advancements in recent years, largely driven by improvements in aerial and orbital platform technologies, sensor capabilities, and computational resources [50,39]. One of the most critical tasks in remote sensing is image segmentation, which involves partitioning images into multiple segments or regions, each corresponding, ideally, to a specific object or class [20]. In this section, we focus on providing comprehensive information regarding segmentation processes, deep learning-based methods, and techniques, and explain the overall importance of conducting zero-to-one-shot learning.
Traditional image segmentation techniques in remote sensing often rely on pixel-based or object-based approaches. Pixel-based methods, such as clustering and thresholding, involve grouping pixels with similar characteristics, while object-based techniques focus on segmenting images based on properties of larger regions or objects [15,51]. However, these methods can be limited in their ability to handle the complexity, variability, and high spatial resolution of modern remote sensing imagery [20].
Segmentation involves various methods designed to separate or group portions of an image based on certain criteria. Each method has a unique approach and application. Interactive Segmentation, for example, is a process that relies on user input to enhance the accuracy of the segmentation [21,54]. The user may guide the algorithm by identifying foreground and background markers. Super Pixelization is another method that groups pixels in an image into larger units, or "superpixels," based on shared characteristics such as color or texture [11]. This grouping can simplify the image data while preserving the structural essence of the objects.
Object Proposal Generation goes a step further by suggesting potential object bounding boxes or regions within an image [15,47]. These proposals serve as a guide for a more advanced model to identify and classify the actual objects' pixels. Foreground Segmentation, also known as background subtraction, is a technique primarily used to separate the main subjects or objects of interest (the foreground) from the backdrop (the background) in an image sequence [63,31].
Semantic Segmentation is a more comprehensive approach where every pixel in an image is assigned to a specific class, effectively grouping regions of the image based on semantic interest [59,17]. Instance Segmentation builds upon semantic segmentation by not only classifying each pixel but also identifying distinct objects of the same class and recognizing the individual objects as separate entities or instances [10,30]. Panoptic Segmentation merges the concepts of semantic and instance segmentation, assigning every pixel in the image a class label and a unique instance identifier if it belongs to a specific class [16,7]. This method aims to give a complete understanding of the image by identifying and classifying every detail.
All these methods have been vastly studied, but one that surged in recent years, with the advancements of Natural Language Models (NLM), is known as "Promptable Segmentation," an approach that aims to create a versatile model capable of adapting to a variety of segmentation tasks [34,62]. This is achieved through "prompt engineering," where prompts are carefully designed to guide the model toward generating the desired output [29,48]. This concept is a departure from traditional multi-task systems where a single model is trained to perform a fixed set of tasks. The unique feature of a promptable segmentation model is its ability to take on new tasks at the time of inference, serving as a component in a larger system [48,34]. For instance, to perform instance segmentation, a promptable segmentation model could be combined with an existing object detector.
A state-of-the-art open-set object detector is Grounding DINO (GroundDINO) [26]. This system is an enhancement of the Transformer-based object detector called DINO [60], enriched with grounded pre-training to be able to identify a broader range of objects based on human inputs, such as category names or referring expressions. An open-set detector is meant to identify and classify objects that weren't part of the model's training data, as opposed to a closed-set detector that can only recognize objects it has been specifically trained on. The information from Grounding DINO can potentially be used to guide the segmentation process, providing class labels or object boundaries that the segmentation model could use.
Most NLMs incorporate deep-learning-based networks and, with the rise of these methods, more advanced segmentation techniques have been developed for remote sensing applications. Convolutional Neural Networks (CNNs), which emerged as a popular choice due to their ability to capture local and hierarchical patterns in images [39,33], have been widely used as the backbone for these tasks. CNNs consist of multiple convolutional layers that apply filters to learn increasingly complex features, making them well-suited for segmenting objects in many remote sensing images [58,4]. However, they are computationally intensive and may require substantial training data.
Generative Adversarial Networks (GANs) have also shown potential in the field of image processing. GANs consist of a generator and a discriminator network, where the generator tries to create synthetic data to fool the discriminator, and the discriminator aims to distinguish between real and synthetic data [18]. For image segmentation, GANs can be used to generate realistic images and their corresponding segmentations, which can supplement the training data and improve the robustness of the segmentation models [5].
Vision Transformers (ViTs), on the other hand, are a recent development in deep learning that has shown promise in image segmentation tasks. Unlike CNNs, which rely on convolutional operations, ViTs employ self-attention mechanisms that allow them to model long-range dependencies and global context within images [23,25]. This approach has demonstrated competitive performance in various computer vision tasks, including remote sensing image segmentation [2], and it is currently outperforming CNNs on remote sensing data [13].
Another key concept in deep learning that can enhance the segmentation process is transfer learning.
With transfer learning, a pre-trained model on a large dataset is adapted for a different but related task [49]. For instance, a CNN or ViT trained on a large-scale image recognition dataset like ImageNet can be fine-tuned for the task of remote sensing image segmentation [37,40]. The advantage of transfer learning is that it can leverage the knowledge gained from the initial task to improve performance on the new task, especially when the amount of labeled data for the new task is limited.
One of the main challenges in applying deep learning techniques to remote sensing image segmentation is the need for large volumes of labeled ground-truth data [6]. Acquiring and annotating this data can be time-consuming and labor-intensive, requiring expert knowledge and resources that may not be readily available. Furthermore, the variability and complexity of remote sensing imagery can make the labeling process even more difficult [3]. In light of these issues, it becomes imperative to develop robust, efficient, and accessible solutions that can aid in the processing and analysis of such data. A model that can perform segmentation with zero domain-specific information may offer an important advantage for this process.
In this sense, the Segment Anything Model (SAM) has emerged as a potential tool for assisting in the segmentation process of remote sensing images. SAM's design enables it to generalize to new image distributions and tasks effectively and has already resulted in numerous applications [19]. By using minimal human input, such as bounding boxes, reference points, or simply text-based prompts, SAM can perform segmentation tasks without requiring extensive ground-truth data. This capability can reduce the labor-intensive process of manual annotation and be incorporated into the image processing pipeline, potentially accelerating its workflow.
SAM has been trained on an enormous dataset of 11 million images and 1.1 billion masks, and it boasts impressive zero-shot performance on a variety of segmentation tasks [19]. Foundation models such as this, which have shown promising advancements in NLP and, more recently, in computer vision, can carry out zero-shot learning. This means they can learn from new datasets and perform new tasks, often by utilizing 'prompting' techniques, even with little to no previous exposure to these tasks. In the field of NLP, "foundation models" refer to large-scale models that are pre-trained on a vast amount of data and are then fine-tuned for specific tasks. These models serve as the "foundation" for various applications [32,34,55].
SAM's ability to generalize across a wide range of objects and images makes it particularly appealing for remote sensing applications. Since it can be retrained with a single example of each new class at the time of prediction [61], it demonstrates the model's high flexibility and adaptability. The implementation of a one-shot approach may assist in designing models that learn useful information from a small number of examples, in contrast to traditional models, which usually require large amounts of data to generalize effectively. This could potentially revolutionize how we process remote-sensing imagery. As such, by investigating SAM's innovative technology, we may be able to provide more interactive and adaptable remote sensing systems.

Materials and Methods
In this section, we describe how we evaluated the performance of the Segment Anything Model (SAM), for both zero- and one-shot approaches, in the context of remote sensing imagery. The method implemented in this study is summarized in Figure 1.
The data for this study consisted of multiple aerial and satellite datasets. These datasets were selected to ensure diverse scenarios and a better range of objects and landscapes. This helped in assessing the robustness of SAM and its adaptability to different situations and geographical regions.
The study then investigated SAM's segmentation capacity under different prompting conditions. First, we used the general segmentation approach, in which SAM was tasked to segment different objects and landscapes without any guided prompts. This provided a baseline for SAM's inherent zero-shot segmentation capabilities. For this approach, we only evaluated the visual quality of the output, since it segments every possible object in the image instead of just the ones with ground-truth labels. It is also not guided by any means, thus resulting in the segmentation of unknown classes and serving as just a traditional segmentation filter.
In the second scenario, bounding boxes were provided. These rectangular boxes, highlighting specific areas within the images, were used to restrict SAM's segmentation per object and assess its proficiency in recognizing and segmenting them. Next, we conducted segmentation using points as prompts. In this setup, a series of specific points within the images were provided to guide SAM's process. It allowed us to test the precision capabilities of SAM. Finally, we experimented with the segmentation process using only textual descriptions as prompts. This was conducted with an implementation of SAM alongside GroundingDINO's method [26]. This permitted an evaluation of these models' capabilities to understand, interpret, and transform textual inputs into precise segmentation outputs.
To measure SAM's adaptability and potential to deal with remote sensing imagery, we then performed a one-shot implementation.
For each of the datasets, we provided SAM with an example of the target class. For that, we adapted the model with a novel combination of the text-prompt approach and the one-shot learning method. Specifically, we selected the best possible example (highest logits) of the target object, using textual prompts to define the object for mask generation. This example was then presented to SAM as the sole representative of the class, effectively guiding its learning process. The rationale behind this combined approach was to leverage the context provided by the text prompts and the efficacy of the one-shot learning method, turning SAM's enhancement into a fully automated process.

Description of the Datasets
We begin by separating our datasets into three categories related to the platform used for capturing the images: 1. unmanned aerial vehicle (UAV); 2. airborne; and 3. satellite. Each of these categories provides unique advantages and challenges in terms of spatial resolution and coverage area. In our study, we aim to evaluate the performance of SAM across these different sources to understand its applicability and limitations in diverse contexts. Their characteristics are summarized in Table 1. We also provide examples from these datasets in Figure 2, illustrating how the data is tackled with bounding-box and point prompts.
The UAV category comprises data that have the advantage of very high spatial resolution, returning images and targets with fine details. This makes them particularly suitable for local-scale studies and applications that require high-precision data. However, the coverage area of UAV datasets is limited compared to other data sources. These images comprised mostly single-class objects per dataset, so these problems were tackled in binary form. In the case of linear objects, specifically continuous plantation crop cover, we used multiple points spread across the target, placed at its center and extremes, to ensure that the model was capable of understanding it better. For more condensed targets such as houses and trees, we used the centered position of the object as a point prompt.
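For illustration, the multi-point selection for elongated targets can be sketched as follows. This is a simplified example and not our production pipeline: it assumes the target is available as a binary mask, returns points in (row, column) order, and the helper name `linear_target_points` is ours.

```python
import numpy as np

def linear_target_points(mask):
    """Pick point prompts for an elongated target: the two extremes
    along its longer axis plus its centroid, in (row, col) order."""
    rows, cols = np.nonzero(mask)
    # Use the axis with the larger extent as the "length" of the object.
    axis = cols if np.ptp(cols) >= np.ptp(rows) else rows
    first, last = np.argmin(axis), np.argmax(axis)
    centroid = (int(round(rows.mean())), int(round(cols.mean())))
    return [(int(rows[first]), int(cols[first])),
            centroid,
            (int(rows[last]), int(cols[last]))]

# A horizontal crop row occupying row 1, columns 0-4 of a 3x5 mask.
mask = np.zeros((3, 5), dtype=np.uint8)
mask[1, :] = 1
points = linear_target_points(mask)  # [(1, 0), (1, 2), (1, 4)]
```

Condensed targets would use only the centroid, as described above.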
The second category is Airborne data, which includes data collected by manned aircraft. These datasets typically offer a good compromise between spatial resolution and coverage area. We processed these datasets with the same approach as the UAV images, since they also consisted of binary class problems. The total size of these datasets surpasses 90 Gigabytes, comprising more than 10,000 images and image patches. Part of the data, specifically the aerial portion (UAV and Airborne), is currently being made public at the following link for others to use: Geomatics and Computer Vision/Datasets. These datasets cover different area sizes, and their corresponding ground-truth masks were generated and validated by different specialists in the field.
The third category consists of Satellite data, which provides the widest coverage and is focused on multi-class problems. The spatial resolution of the satellite data used is generally lower than that of UAV and Airborne data. Furthermore, the quality of the images is more affected by atmospheric conditions, with illumination conditions varying between scenes, thus providing additional challenges for the model. These datasets consist of publicly available images from the LoveDA dataset [53] and from the SkySat ESA archive [9] and present a multi-class segmentation problem. To facilitate SAM's evaluation, specifically with the guided prompts (bounding box, point, and text), we conducted a one-against-all approach, in which we separated the classes into individual classifications ("specified class" versus "background").
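The one-against-all separation can be illustrated with a short sketch. This is a simplified example (the helper `one_vs_all_masks` is ours, not part of any dataset toolkit), assuming the ground truth is a raster of integer class labels:

```python
import numpy as np

def one_vs_all_masks(label_map, class_ids):
    """Split a multi-class label raster into binary masks,
    one per class ("specified class" vs. "background")."""
    return {c: (label_map == c).astype(np.uint8) for c in class_ids}

# Toy 3x3 label map with classes 0 (background), 1, and 2.
labels = np.array([[0, 1, 1],
                   [0, 2, 1],
                   [2, 2, 0]])
binary = one_vs_all_masks(labels, class_ids=[1, 2])
# binary[1] marks class-1 pixels; binary[2] marks class-2 pixels.
```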

Protocol for Promptable Image Segmentation
In this section, we explain how we adapted SAM to the remote sensing domain and how we conducted the promptable image segmentation with it. All of the implemented code, specifically designed for this paper, was made publicly available in an under-construction educational repository [42]. Also, as part of our work, we are focusing on developing the "segment-geospatial" package [46], which implements features that simplify the process of using SAM models for geospatial data analysis. This is a work in progress, but it is publicly available and offers a suite of tools for performing general segmentation on remote sensing images using SAM. The goal is to enable users to engage with this technology with a minimum of coding experience.
Our geospatial analysis was conducted with the assistance of a custom tool, namely "SamGeo", which is a component of the original module. SAM possesses different models to be used, namely ViT-H, ViT-L, and ViT-B [19]. These models have different computational requirements and are distinct in their underlying architecture. In this study, we used the ViT-H SAM model, which is the most advanced and complex model currently available, bringing most of the SAM capabilities to our tests.
To perform the general prompt, we used the generate method of the SamGeo instance. This operation is straightforward, since it segments the entire image and stores the result as an image mask file containing the segmentation masks. Each mask delineates the foreground of the image, with each distinct mask allocated a unique value. This allowed us to classify and segment different geospatial features. The result is a non-classified segmented image that can also be converted into a vector shape. As mentioned, we only evaluated this approach visually, since it was not possible to appropriately assign the segmented regions outside of our reference class.
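The value-encoded mosaic described above can be sketched as follows. This is a conceptual simplification of what the generate method produces, not the actual segment-geospatial code; resolving overlaps by letting later masks overwrite earlier ones is our own simplifying assumption:

```python
import numpy as np

def merge_instance_masks(masks):
    """Combine binary instance masks into one raster where each distinct
    mask receives a unique integer value (0 = unsegmented background).
    Later masks overwrite earlier ones where they overlap."""
    mosaic = np.zeros(masks[0].shape, dtype=np.int32)
    for value, mask in enumerate(masks, start=1):
        mosaic[mask.astype(bool)] = value
    return mosaic

# Two toy instance masks on a 2x2 image.
a = np.array([[1, 1], [0, 0]])
b = np.array([[0, 0], [1, 0]])
mosaic = merge_instance_masks([a, b])  # [[1, 1], [2, 0]]
```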
For the bounding box prompt, we used the SamGeo instance in conjunction with the objects' shapefile. This approach was used to extract bounding boxes from any multipart polygon geometry, which returned a list of geometric boundaries for our image data based on its coordinates. To efficiently process these boundaries, we initialized its predictor instance. In this process, the image was segmented and passed through the predictor along with a designated model checkpoint. Once established, the predictor processed each clip box, creating the masks for the segmented regions. This process enabled each bounding box's contents to be individually examined as instance segmentation masks. These binary masks were then merged and saved as a single mosaic raster to create a comprehensive visual representation of the segmented regions. Although not focused on remote sensing data, the official implementation of this combined approach is named Grounded-SAM [14].
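The bounding-box extraction step reduces to taking the extent of each polygon's vertices. The sketch below is illustrative only (the `polygon_to_bbox` helper is ours, not the package API):

```python
def polygon_to_bbox(coords):
    """Return an [x1, y1, x2, y2] bounding box for a polygon
    given as a list of (x, y) vertex pairs."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return [min(xs), min(ys), max(xs), max(ys)]

# A rectangular building footprint in image coordinates.
house = [(2.0, 3.0), (5.0, 3.0), (5.0, 7.0), (2.0, 7.0)]
bbox = polygon_to_bbox(house)  # [2.0, 3.0, 5.0, 7.0]
```

Each such box would then be passed to the predictor as a prompt.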
The single-point feature prompt was implemented similarly to the bounding-box method. For that, we first defined functions to convert the geodata frame into a list of coordinates [x, y] instead of the previous [x1, y1, x2, y2] ones. We utilized SamGeo again for model prediction, but with the distinction of setting its automatic parameter to 'False' and applying the predictor to individual coordinates instead of the bounding boxes. This approach was conducted by iterating through each point of the coordinate pairs, predicting its features in instances, and saving the resulting mask into a unique file per point (also resulting in instance segmentation masks). After the mask files were generated, we proceeded to merge these masks into a single mosaic raster file, giving us a complete representation of all the segmented regions from the single-point feature prompt.
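The coordinate conversion described above can be sketched as reducing each [x1, y1, x2, y2] box to its [x, y] center. The `bbox_to_point` helper below is our illustration, not the actual implementation:

```python
def bbox_to_point(bbox):
    """Convert an [x1, y1, x2, y2] bounding box into the [x, y]
    center point used as a single-point prompt."""
    x1, y1, x2, y2 = bbox
    return [(x1 + x2) / 2, (y1 + y2) / 2]

point = bbox_to_point([2.0, 2.0, 6.0, 4.0])  # [4.0, 3.0]
```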
The text-based prompt differs from the previous approaches since it required additional steps to be implemented. This method blends GroundingDINO's [26] capabilities for zero-shot visual grounding with SAM's object segmentation functionality, retrieving the pre-trained models. For instance, once Grounding DINO has detected and classified an object, SAM is used to isolate that object from the rest. As a result, we were able to identify and segment objects within our images based on a specified textual prompt. This procedure opens up a new paradigm in geospatial analysis, harnessing the power of state-of-the-art models to extract image features based only on natural language input.
Since remote sensing imagery often contains multiple instances of the same object (e.g., several 'houses', 'cars', 'trees', etc.), we added a looping procedure. The loop identifies the object with the highest probability in the image (i.e., logits), creates a mask for it, removes it from the image, and then restarts the process to identify the next most probable object. This process continues until the model reaches a defined minimum threshold for both detection and text prompt association. The precise balancing of these thresholds is crucial, with implications for the accuracy of the model, so we set them manually for each dataset based on trial and error. The segmented individual images and their corresponding boxes are subsequently generated, while the resulting segmentation mask is saved and mosaicked.
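The looping procedure can be sketched as a greedy selection over detection scores. The example below is a simplification: the dictionary of candidate objects stands in for GroundingDINO's detections, and deleting an entry stands in for masking the object out of the image:

```python
def greedy_text_prompt_loop(candidates, threshold=0.35):
    """Repeatedly pick the detection with the highest logit, 'segment'
    it, remove it, and continue until no remaining detection clears
    the minimum threshold. `candidates` maps an object id to its logit
    score (a stand-in for real detector output)."""
    remaining = dict(candidates)
    segmented = []
    while remaining:
        obj, score = max(remaining.items(), key=lambda kv: kv[1])
        if score < threshold:
            break  # minimum detection threshold reached
        segmented.append(obj)
        del remaining[obj]  # "mask out" the object and re-detect
    return segmented

order = greedy_text_prompt_loop({"house_1": 0.91, "house_2": 0.78,
                                 "shadow": 0.20})
# order: ["house_1", "house_2"] (the low-score "shadow" is rejected)
```

In practice the threshold values were tuned per dataset, as noted above.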

One-Shot Text-Based Approach
The one-shot training was conducted following the recommendations in [61], using its PerSAM and PerSAM-F approaches. We begin by adapting the text-based approach, which combines the GroundingDINO [26] and SAM [19] methods, to return the overall most probable object belonging to the class specified in its description. By doing so, we enable a fully automated process of identifying a single object and including it in a personalized pipeline for training SAM with this novel knowledge.
In this section, we describe the procedures involved in the one-shot training mechanism as well as the methods used for object identification and personalization. To summarize the whole process, we illustrate the main phases in Figure 3.
Following Figure 3, the initial phase of the one-shot training mechanism involves the model using the object with the highest logits calculated from the text-based segmentation. This ensures the object is accurately recognized and selected for further steps. It is at this point that the text-based approach comes into play, capitalizing on GroundingDINO's capabilities for zero-shot visual grounding combined with SAM's object segmentation for pre-trained model retrieval. As such, the selected object becomes the "sample" of the one-shot training process due to its high probability of belonging to the class specified by the text.
Once the object has been identified through this method, the next phase involves creating a single-segmented object mask. This mask is used for the retraining of SAM in a one-shot manner.
The text-based approach adds value by helping SAM distinguish between the different object instances present in the remote sensing imagery, such as multiple "houses", "cars", or "trees", for example. Each object is identified based on its individual likelihood, leading to the creation of a unique mask for retraining SAM. The third phase comes into play once the object with the highest probability has been identified and its mask has been used for SAM's one-shot training. The selected input object is removed from the original image, leaving the remaining objects ready for further segmentation.
The final phase involves a dynamic, interactive loop, where the remaining objects are continuously segmented until no more objects are detectable by the PerSAM approach [61]. This phase is critical, as it ensures that every potential object within the image is identified and segmented. Here again, the loop aids the process: it identifies the next most probable object, creates a mask for it, removes it from the image, and repeats. This cycle continues until a breakpoint is reached, when the detected position of the object is the same as the previous one.
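The loop and its breakpoint can be sketched as follows. This is a conceptual toy, not the PerSAM code: `predict` stands in for PerSAM-F inference, objects are reduced to (position, score) pairs, and the loop stops when the same position is detected twice in a row:

```python
def iterative_one_shot_segmentation(predict, objects):
    """Segment objects one by one until the predicted position repeats,
    signaling that no new object was found (the breakpoint)."""
    masks, previous = [], None
    while True:
        position, mask = predict(objects)
        if position == previous:  # breakpoint: same position twice
            break
        masks.append(mask)
        # Remove the segmented object from the working image.
        objects = [o for o in objects if o[0] != position]
        previous = position
    return masks

def make_toy_predictor():
    """Stand-in for inference: returns the highest-scoring object,
    or re-detects the previous spot once nothing is left."""
    last = [(None, None)]
    def predict(objects):
        if not objects:
            return last[0]
        position, score = max(objects, key=lambda o: o[1])
        last[0] = (position, f"mask@{position}")
        return last[0]
    return predict

masks = iterative_one_shot_segmentation(make_toy_predictor(),
                                        [(3, 0.9), (7, 0.8)])
# masks: ["mask@3", "mask@7"]
```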
Another important clarification of the one-shot approach regards the choice of the method for its training. An early exploration of both the PerSAM and PerSAM-F methods [61] was conducted to assess their utility in the context of remote sensing imagery. Our investigations have shown that PerSAM-F emerges as the more suitable choice for this specific domain.
PerSAM, in its original formulation, leverages one-shot data through a series of techniques such as target-guided attention, target-semantic prompting, and cascaded post-refinement, delivering favorable personalized segmentation performance for subjects in a variety of poses or contexts. However, there were occasional failure cases, notably where the subjects comprised hierarchical structures to be segmented.
Examples of such cases in traditional images are discussed in [61], where ambiguity provides a challenge for PerSAM in determining the scale of the mask as output (e.g. a "dog wearing a hat" may be segmented entirely, instead of just the "dog").
In the context of remote sensing imagery, such hierarchical structures are commonly encountered. An image may contain a tree over a house, a car near a building, a river flowing through a forest, and so forth. These hierarchical structures pose a challenge to the PerSAM method, as it struggles to determine the appropriate scale of the mask for the segmentation output.
An example of such a case, where a tree covers a car, can be seen in Figure 4.
To address this challenge, we used PerSAM-F, the fine-tuning variant of PerSAM. As previously mentioned, PerSAM-F freezes the entire SAM to preserve its pre-trained knowledge and only fine-tunes 2 parameters within a 10-second training window [61]. Crucially, it enables SAM to produce multiple segmentation results with different mask scales, thereby allowing for a more accurate representation of hierarchical structures commonly found in remote sensing imagery. PerSAM-F employs learnable relative weights for each scale, which adaptively select the best scale for varying objects. This strategy offers an efficient way to handle the complexity of segmentation tasks in remote sensing imagery, particularly when dealing with objects that exhibit a range of scales within a single image. This, in turn, preserves the characteristics of the segmented objects more faithfully.
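Conceptually, the scale selection can be sketched as a weighted combination of SAM's candidate masks. The parameterization below (two weights, with the first scale receiving the remainder) is our simplification and may differ from the exact formulation in [61]:

```python
import numpy as np

def fuse_multiscale_masks(masks, w2, w3):
    """Blend SAM's three candidate mask scales with two (learnable)
    weights and threshold the result. A conceptual sketch only."""
    fused = (1.0 - w2 - w3) * masks[0] + w2 * masks[1] + w3 * masks[2]
    return (fused > 0.5).astype(np.uint8)

# Toy 1x2 "image": scale 0 over-segments (car plus overhanging tree),
# scale 1 captures only the car, scale 2 captures nothing.
scales = [np.array([1.0, 1.0]), np.array([1.0, 0.0]), np.array([0.0, 0.0])]
result = fuse_multiscale_masks(scales, w2=0.9, w3=0.05)
# result favors the car-only scale
```

After fine-tuning, the weights would concentrate on whichever scale best matches the one-shot reference mask.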
As such, PerSAM-F exhibited better segmentation accuracy in our early experiments and was therefore the method chosen to be incorporated with the text-based approach. Regardless, to evaluate the performance and utility of the text-based one-shot learning method, we conducted a comparative analysis against a traditional one-shot learning approach. The traditional method used for comparison follows the typical one-shot learning procedure, providing the model with a single example from the ground-truth mask, manually labeled by human experts. To ensure fairness, we provided the model with multiple random samples from each dataset and mimicked the image inputs to return a direct comparison for both approaches. We calculated the evaluation metrics for each input and report their average value alongside the standard deviation. Since the text approach always uses the same input (i.e., the highest-logits object), we report a single measurement of its accuracy.
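This comparison protocol can be sketched as follows; the Dice values below are hypothetical placeholders, used only to show how the mean and standard deviation of the manually-sampled runs are compared against the single deterministic text-based measurement.

```python
import numpy as np

def summarize_runs(dice_scores):
    """Return mean and standard deviation over repeated one-shot runs."""
    scores = np.asarray(dice_scores, dtype=float)
    return scores.mean(), scores.std()

manual_runs = [0.81, 0.78, 0.84, 0.80]  # hypothetical Dice per random sample
mean, std = summarize_runs(manual_runs)

text_based = 0.79                       # single deterministic measurement
# The text-based variant is considered comparable when it falls within
# one standard deviation of the manually-sampled mean.
comparable = abs(text_based - mean) <= std
```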

Model Evaluation
The performance of both the zero-shot and one-shot models was measured by evaluating their prediction accuracy against ground-truth masks. For that, we used metrics such as Intersection over Union (IoU), Pixel Accuracy, and Dice Coefficient. These metrics are commonly used in evaluating image segmentation, as they provide a more nuanced understanding of model performance. For each evaluation, we compared pairs of predicted masks with the corresponding ground-truth masks.
Intersection over Union (IoU) is a common evaluation metric for object detection and segmentation problems. It measures the overlap between the predicted segmentation and the ground truth [45]. The IoU is the area of overlap divided by the area of the union of the predicted and ground-truth segmentation; a higher IoU means a more accurate segmentation. It is computed as:

IoU = TP / (TP + FP + FN)

Here, TP represents True Positives (the correctly identified positives), FP represents False Positives (the incorrectly identified positives), and FN represents False Negatives (the positives that were missed).
Pixel Accuracy is the simplest metric used; it measures the percentage of pixels that were accurately classified [35]. It is calculated by dividing the number of correctly classified pixels by the total number of pixels, and can be misleading when the classes are imbalanced:

Pixel Accuracy = (TP + TN) / (TP + TN + FP + FN)

Here, TN represents True Negatives (the correctly identified negatives).
Dice Coefficient (also known as the Sørensen-Dice index) is another metric used to gauge the performance of image segmentation methods. It is particularly useful for comparing the similarity of two samples. The Dice Coefficient is twice the area of overlap of the two segmentations divided by the total number of pixels in both images (the sum of the areas of both segmentations) [35]. It ranges from 0 (no overlap) to 1 (perfect overlap) and is computed as:

Dice = 2 × TP / (2 × TP + FP + FN)
We also utilized other metrics, such as the True Positive Rate (TPR) and False Positive Rate (FPR), to measure the effectiveness of SAM against the accurately labeled class from each dataset. The interpretation of these metrics is as per [43], where the True Positive Rate (TPR) denotes the fraction of TP cases among all actual positive instances, and the False Positive Rate (FPR) signifies the fraction of FP instances out of all actual negative instances. A model with a higher TPR is proficient at correctly pinpointing lines and edges, and performs better at avoiding incorrect detections when the FPR is lower. Both metrics are calculated as:

TPR = TP / (TP + FN)

FPR = FP / (FP + TN)

In light of the nature of SAM, a transformer network, we aimed to preserve the context of our images for the model's attention mechanism. Instead of cropping the images, specifically the aerial ones, into smaller patches, we chose to either use larger image crops or even entire orthomosaics for processing in one go. This implementation choice, however, substantially increased the amount of time required to perform inference on our aerial data. For the larger patches, the inference process took below 10 minutes of GPU time for most data, while entire datasets took around 1 to 2 hours to process. For inference, we used an NVIDIA RTX 3090 with 24 GB of GDDR6X video memory and 10,496 CUDA cores, running on the Ubuntu 22.04 operating system.
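As a sketch, all of these metrics can be computed directly from a pair of binary masks; the helper below is our own illustrative implementation, not code from SAM or a specific evaluation library.

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Compute IoU, Pixel Accuracy, Dice, TPR, and FPR from binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)     # correctly identified positives
    fp = np.sum(pred & ~gt)    # incorrectly identified positives
    fn = np.sum(~pred & gt)    # positives that were missed
    tn = np.sum(~pred & ~gt)   # correctly identified negatives
    return {
        "iou": tp / (tp + fp + fn),
        "pixel_accuracy": (tp + tn) / (tp + tn + fp + fn),
        "dice": 2 * tp / (2 * tp + fp + fn),
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
    }

# Tiny worked example: tp=1, fp=1, fn=1, tn=1.
pred = np.array([[1, 1], [0, 0]])
gt = np.array([[1, 0], [1, 0]])
m = segmentation_metrics(pred, gt)  # iou=1/3, dice=0.5, tpr=0.5, fpr=0.5
```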
Regardless, the results yielded detailed insight into the segmentation scores for each prompt (general, bounding box, point, text, PerSAM-F, and text-based PerSAM-F). This analysis helped us better evaluate the efficiency and accuracy of SAM's prompt-based segmentation against the ground-truth masks, providing a quantitative understanding of it.

Results and Discussion
Our exploration of the Segment Anything Model (SAM) for remote sensing tasks involved an evaluation of its performance across various datasets and scenarios. This section presents the results and discusses their implications for SAM's role in remote sensing image analysis. The process commenced with an investigation of SAM's general segmentation approach, which requires no prompts. By merely feeding SAM with remote sensing images, we aimed to observe its inherent ability to detect and distinguish objects on the surface. Examples at different scales are illustrated in Figure 5, where we converted the individual regions to vector format. This approach demonstrates SAM's adaptability and suitability for various applications. However, since this method is not guided by a prompt, it does not return specific segmentation classes, making it difficult to measure its accuracy against our available labels.
As depicted in Figure 5, the higher the spatial resolution of an image, the more accurately SAM segmented the objects. An interesting observation pertained to the processing of satellite images, where SAM encountered difficulties in demarcating the boundaries between contiguous objects (like large fragments of trees or roads). Despite this limitation, SAM exhibited an ability to distinguish between different regions when considering very-high spatial resolution imagery, indicative of an interesting segmentation capability that does not rely on any prompts. This approach offers value for additional applications that are based on object regions, such as classification algorithms. Moreover, SAM can expedite the process of object labeling for refining other models, thereby significantly reducing the time and manual effort required for this purpose.
Following this initial evaluation, we proceeded to test SAM's promptable segmentation abilities using bounding boxes, points, and text features. The resulting metrics for each dataset are summarized in Table 2. Having compiled a dataset across diverse platforms, including UAVs, airborne devices, and satellites with varying pixel sizes, we noted that SAM's segmentation efficacy is also quantitatively influenced by the image's spatial resolution. These findings underscore the significant influence of spatial resolution on the effectiveness of different prompt types.
For instance, on the UAV platform, text prompts showed superior performance for object segmentation tasks such as trees, with higher Dice and IoU values. However, bounding box prompts were more effective for delineating geometrically well-defined and larger objects like houses and buildings. The segmentation of plantation crops was a unique case. Point prompts performed well at a finer 0.01 m resolution for individual plants. However, as the resolution coarsened to 0.04 m and the plantation types changed, becoming denser with the plant canopy covering entire rows, bounding box prompts outperformed the others. This outcome suggests that, for certain objects, the type of input prompt can greatly influence detection and segmentation in the zero-shot approach.
With the airborne platform, point prompts were highly effective at segmenting trees and vehicles at a 0.20 m resolution. This trend continued for the segmentation of lakes at a coarser 0.45 m resolution. It raises the question of whether the robust performance of point prompts in these scenarios is a testament to their adaptability to very high-resolution imagery or a reflection of the target objects' specific characteristics. These objects primarily consist of very defined features (like cars and vehicles) or share similar characteristics (as in bodies of water).
In the context of satellite-based remote sensing imagery, point prompts proved most efficient for multi-class segmentation at the examined resolutions of 0.30 m and 0.50 m. This can be attributed to the fact that bounding box prompts tend to overshoot object boundaries, producing more false positives compared to point prompts. This finding indicates the strong ability of point prompts to manage a diverse set of objects and categories at coarser resolutions, making them a promising tool even for satellite remote sensing applications. The text-based approach was found to be the least effective, primarily due to the model's difficulty in associating low-resolution objects with words. Still, it is important to note that, of all the datasets, the satellite multi-class problem proved to be the most difficult task for the model, with generally lower metrics than the others.
Qualitatively, our observations also revealed that bounding boxes were particularly effective for larger objects (Figure 6). However, for smaller objects, SAM tended to overestimate the object size by including shadows in the segmented regions. Despite this overestimation, the bounding box approach still offers a useful solution for applications where an approximate estimate of such larger objects suffices. For these types of objects, a single point or central location does not suffice; they are defined by a combination of features within a particular area. Bounding boxes provide a more spatially comprehensive prompt, encapsulating the entire object, which makes them more efficient in these instances.
The point-based approach outperformed the others across our dataset, specifically for distinct objects. By focusing on a singular point, SAM was able to provide precise segmentation results, thus proving its capability to work in detail (Figure 7). In the plantation dataset with 0.01 m resolution, for instance, when considering individual small plants, the point approach returned better results than bounding boxes. This approach may hold particular relevance for applications requiring precise identification and segmentation of individual objects in an image. Also, when isolating entities like single trees and vehicles, these precise spatial hints might suffice for the model to accurately identify and segment the object.
The textual prompt approach also yielded promising results, particularly with very high-resolution images (Figure 8). While it was found to be relatively comparable in performance with the point and bounding box prompts for the aerial datasets, the text prompt approach had notable limitations when used with lower spatial resolution images. The text-based approach also returned worse predictions on the plantation dataset at 0.04 m. This may be associated with the model's limitations in understanding the characteristics of specific targets, especially when considering the top-down view of remote sensing images. Since it relies on GroundDINO to interpret the text, this may be more a limitation of GroundDINO than of SAM, mostly because, when applying the general segmentation, the results were visually better overall on these datasets (Figure 5).
Text prompts, though generally trailing behind in performance, still demonstrated commendable results, often closely following the top-performing prompt type. Text prompts offer ease of implementation as their primary advantage. They do not necessitate specific spatial annotations, which are often time-consuming and resource-intensive to produce, especially for extensive remote sensing datasets. However, their effectiveness hinges on the model's ability to translate text-to-image information. Currently, their key limitation is that the underlying models are typically not trained specifically on remote sensing images, leading to potential inaccuracies when encountering remote sensing-specific terms or concepts. The effectiveness of text prompts could be improved by fine-tuning models on remote sensing-specific datasets and terminologies. This could enable them to better interpret the nuances of remote sensing imagery, potentially enhancing their performance to match or even surpass spatial prompts like boxes and points.
Regarding our one-shot approach, we noticed that the model's performance improved in most cases, as evidenced by the segmentation metrics calculated on each dataset. Table 3 presents a detailed comparison of the different models' performance, providing a summary of the segmentation results. Figure 9 offers a visual illustration of example results obtained from both approaches, particularly highlighting the performance of the model. The metrics indicate that, while the PerSAM approach with a human-sampled example may be more appropriate than the proposed text-based approach, this is not always the case when considering the metrics' standard deviation. This opens up the potential for adopting the fully-automated process instead. However, in some instances, specifically where GroundDINO is not capable of identifying the object to begin with, human labeling provides more appropriate results.
In its zero-shot form, SAM tends to favor selecting shadows in some instances alongside its target, which can hinder its performance in tasks like tree detection. Segmenting objects with similar surrounding elements, especially when dealing with construction materials like streets and sidewalks, can be challenging for SAM, as noticed in our multi-class problem. Moreover, its performance with larger grouped instances, particularly when using the single-point mode, can be unsatisfactory. Also, the segmentation of smaller and irregular objects poses difficulties for SAM independently of the given prompt. SAM may generate disconnected components that do not correspond to actual features, specifically in satellite imagery where the spatial resolution is lower.
The text-based one-shot learning approach, on the other hand, automates the process of selecting the example. It uses the text-based prompt to choose the object with the highest probability (highest logits) from the image as the training example. This not only reduces the need for manual input but also ensures that the selected object is highly representative of the specified class due to its high probability. Additionally, the text-based approach is capable of handling multiple instances of the same object class in a more streamlined manner, thanks to the looping mechanism that iteratively identifies and segments objects based on their probabilities. The one-shot approach, however, excluded some of the objects in the image, favoring only the objects similar to the given sample.
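The selection step described above can be sketched as follows: from a set of text-prompted detections (such as those a detector like GroundDINO would return), pick the box with the highest logit as the one-shot example, falling back to manual labeling when nothing clears the threshold. The detection structure and threshold value here are hypothetical, for illustration only.

```python
import numpy as np

def select_one_shot_example(boxes, logits, threshold=0.25):
    """Return the highest-confidence box, or None if all fall below threshold."""
    logits = np.asarray(logits, dtype=float)
    keep = logits >= threshold
    if not keep.any():
        return None  # no confident detection: fall back to manual labeling
    best = int(np.argmax(np.where(keep, logits, -np.inf)))
    return boxes[best]

# Hypothetical detections for the prompt "tree": (x_min, y_min, x_max, y_max).
boxes = [(10, 10, 50, 50), (60, 20, 90, 80), (5, 5, 20, 20)]
logits = [0.42, 0.87, 0.15]
example = select_one_shot_example(boxes, logits)  # picks the 0.87 detection
```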
In summary, upon comparing these two methods, we found that the traditional one-shot learning approach outperforms the zero-shot learning approach in all datasets. Additionally, the combination of text-based and one-shot learning, even when it does not improve on the traditional approach, comes close in most cases. This comparison underscores the benefits and potential of integrating state-of-the-art models with natural language processing capabilities for efficient and accurate geospatial analysis. Nevertheless, it is important to remember that the optimal choice between these methods may vary depending on the specific context and requirements of a given task.
Future Perspectives on SAM for Remote Sensing

SAM has several advantages that make it an attractive option for remote sensing applications. First, it offers zero-shot generalization to unfamiliar objects and images without requiring additional training [19]. This capability allows SAM to adapt to the diverse and dynamic nature of remote sensing data, which often consists of varying land cover types, resolutions, and imaging conditions. Second, SAM's interactive input process can significantly reduce the time and labor required for manual image segmentation. The model's ability to generate segmentation masks with minimal input, such as a text prompt, a single point, or a bounding box, accelerates the annotation process and improves the overall efficiency of remote sensing data analysis. Lastly, the decoupled architecture of SAM, comprising a one-time image encoder and a lightweight mask decoder, makes it computationally efficient. This efficiency is crucial for large-scale remote sensing applications, where processing vast amounts of data in a timely manner is of utmost importance.
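The computational benefit of this decoupled design can be illustrated with a toy sketch: the expensive encoding runs once per image, while each prompt only triggers a cheap decode against the cached embedding. The class, its methods, and the thresholds below are our own stand-ins, not SAM's real modules.

```python
import numpy as np

class DecoupledSegmenter:
    """Toy illustration of a one-time encoder plus lightweight decoder."""

    def encode(self, image):
        # Expensive step: run once per image and cache the result.
        self.embedding = image - image.mean()
        return self.embedding

    def decode(self, point):
        # Cheap step: run once per prompt, reusing the cached embedding.
        seed = self.embedding[point]
        return np.abs(self.embedding - seed) < 0.1

model = DecoupledSegmenter()
image = np.array([[0.2, 0.2], [0.9, 0.9]])
model.encode(image)                # heavy work happens once
mask_a = model.decode((0, 0))      # each prompt is then a cheap decode
mask_b = model.decode((1, 0))
```

The design choice mirrors the interactive workflow discussed above: an analyst can probe the same scene with many prompts while paying the encoding cost only once.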
However, our study constitutes an initial exploration of this model, and there is still much to be investigated. In this section, we discuss future perspectives on SAM and how it can be improved upon. Despite its potential, SAM has some limitations when applied to remote sensing imagery. One challenge is that remote sensing data often come in different formats, resolutions, and spectral bands. SAM, which has been trained primarily on RGB images, may not perform optimally with multispectral or hyperspectral data, which are common in remote sensing applications. A possible approach to this issue consists of either adapting SAM to read multiple bands by performing rotated 3-band combinations or performing fine-tuning for domain adaptation. In our early experiments, a simple run on different multispectral datasets demonstrated that, although the model has the potential to segment different regions or features, it still needs further exploration. This is something that we intend to explore in future research, but we expect that others may look into it as well.
Regardless, the current model can be effectively used in various remote sensing applications. For instance, we verified that SAM can be easily employed for land cover mapping, where it can segment forests, urban areas, and agricultural fields. It can also be used for monitoring urban growth and land use changes, enabling policymakers and urban planners to make informed decisions based on accurate and up-to-date information. Furthermore, SAM can be applied in a pipeline process to monitor and manage natural resources. Its efficiency and speed make it suitable for real-time monitoring, providing valuable information to decision-makers. This is also a feature that could be explored by research going forward with its implementation.
The one-shot technique of SAM, which is the capacity to generate accurate segmentation from a single example [61], could be further expanded into a few-shot learning scenario. Our experimental results indicated an improvement in performance across most investigated datasets when this approach was utilized, especially considering the borders of the objects. However, it is essential to note that one-shot learning may pose challenges to the generalization capability of the model, especially when dealing with remote sensing data that often exhibit a high degree of heterogeneity and diversity. For instance, a "healthy" tree can be a good sample for the model, but it can bias the model to ignore "unhealthy" trees or canopies with different structures.
Expanding the one-shot learning to a few-shot scenario could potentially improve the model's adaptability to different environments or tasks by enabling it to learn from multiple examples (1 to 10) instead of a single one. This would involve using a small set of labeled objects for each land cover type during the training process [48,24]. On the other hand, a more robust learning approach, which uses a larger number of examples for each class, could further enhance the model's ability to capture the nuances and variations within each class. This approach, however, may require more computational resources and training data, and thus may not be suitable for all applications.
Additionally, while SAM is a powerful tool for image segmentation, its effectiveness can be boosted when combined with other techniques. For example, integrating SAM into another ViT framework in a weakly-supervised manner could potentially improve the segmentation result, better handling the spatial-contextual information. However, it is worth noting that such integration might also bring new challenges [52]. One potential issue could be the increased model complexity and computational requirements, which might limit its feasibility. Nevertheless, as the training of transformers typically requires large amounts of data, SAM can provide fast and relatively accurate labeled regions for it.
Furthermore, one of the key challenges to tackle would be improving SAM's performance when applied to low spatial resolution imagery. As noted in our early experiments, SAM's accuracy tends to decrease when the pixel size is above 1 or 2 meters. This shortcoming could be addressed by coupling SAM with a Super-Resolution (SR) technique [56], creating a two-step process: the first step uses an SR model to increase the spatial resolution of the imagery, and the second step uses the enhanced-resolution image as input to SAM. Given that SAM performs better with high-resolution images, this process could improve SAM's overall performance with remote sensing imagery that has a lower native resolution. However, it should be noted that SR techniques are not perfect, as they introduce errors in the high-resolution images they create [56], which, in turn, impact SAM's performance. Regardless, this approach could be tested and validated rigorously in future research.
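The proposed two-step pipeline can be sketched as below. Both models are stubbed out: in practice, `super_resolve` would be a trained SR network and `segment` a SAM call; these placeholder names and the nearest-neighbor upsampling are our own assumptions, chosen only to make the data flow concrete.

```python
import numpy as np

def super_resolve(image, factor=4):
    """Stand-in SR step: naive nearest-neighbor upsampling."""
    return np.kron(image, np.ones((factor, factor)))

def segment(image, threshold=0.5):
    """Stand-in segmentation step returning a binary mask."""
    return image > threshold

def sr_then_segment(low_res_image, factor=4):
    """Step 1: enhance resolution. Step 2: feed the result to the segmenter."""
    high_res = super_resolve(low_res_image, factor)
    return segment(high_res)

# A coarse 2x2 "satellite" patch becomes an 8x8 input for segmentation.
lowres = np.array([[0.1, 0.9], [0.2, 0.3]])
mask = sr_then_segment(lowres)
```

Any artifacts the SR step introduces would propagate into the segmentation, which is why the text notes this coupling still requires rigorous validation.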
Lastly, as we explored the integration of SAM with other types of methods, such as GroundDINO [26], we noticed both strengths and limitations, which were already discussed in the previous section. This combination demonstrates a high degree of versatility and accuracy in tasks such as instance segmentation, where GroundDINO's object detection and classification guided SAM's segmentation process. However, the flexibility of this approach extends beyond these specific models. Any similar models could be swapped in as required, expanding the applications and robustness of the system. Alternatives such as GLIP [22] or CLIP [27] may replace GroundDINO, allowing for further experimentation and optimization [64]. Furthermore, integrating language models like ChatGPT [36] could offer additional layers of interaction, nuance, and understanding, demonstrating the far-reaching potential of combining these expert models. This modular approach underpins a potent and adaptable workflow that could reshape our capabilities in handling remote sensing tasks.
Also, when integrated with Geographical Information Systems (GIS), the combined power of models like SAM and others can substantially enhance the user experience and capabilities of these systems. The promptability and modularity of this approach allow for the integration of other models that could offer complementary capabilities. For instance, incorporating natural language models like ChatGPT could facilitate easier and more intuitive interaction between users and the GIS. This could make GIS more accessible to non-expert users, as they could interact with the system using natural language prompts instead of complicated technical inputs [41]. Overall, this integration could revolutionize the way users interact with and utilize GIS, making these systems more user-friendly, efficient, and versatile. It offers a vision of a new generation of GIS that is more adaptable and intuitive, able to handle diverse tasks and provide richer insights into geographical data.
In short, our study focused on demonstrating the potential of SAM's adaptability to the remote sensing domain, as well as presenting a novel, fully-automated approach to retrain the model with one example derived from the text-based approach. While there is much to be explored, it was important to understand how the model works and how it could be improved upon. To summarize this discussion, there are many potential research directions and implementations for SAM in remote sensing applications, which can be condensed as follows:
• Examining the most effective approaches and techniques for adapting SAM to cater to a variety of remote sensing data forms, including multispectral and hyperspectral data.
• Analyzing the potential of coupling SAM with few-shot or multi-shot learning to enhance its adaptability and generalization capability across diverse remote sensing scenarios.
• Investigating potential ways to integrate SAM with prevalent remote sensing tools and platforms, such as Geographic Information Systems (GIS), to augment the versatility and utility of these systems.
• Assessing the performance and efficiency of SAM in real-time or near-real-time remote sensing applications to understand its capabilities for timely data processing and analysis.
• Exploring how domain-specific knowledge and expertise can be integrated into SAM to enhance its ability to understand and interpret remote sensing data.
• Evaluating the potential use of SAM as an alternative to traditional labeling processes, and its integration with other image classification and segmentation techniques in a weakly-supervised manner to boost its accuracy and reliability.
• Integrating SAM with an SR approach to enhance its capability to handle low-resolution imagery, thereby expanding the range of remote sensing imagery it can effectively analyze.

Conclusions
In this study, we conducted a comprehensive analysis of both the zero-shot and one-shot capabilities of the Segment Anything Model (SAM) in the domain of remote sensing imagery processing, benchmarking it against aerial and satellite datasets. We innovated by presenting a fully-automated one-shot operation for SAM based on a text-prompt example, a practice that further enhanced its segmentation capabilities in most of our tests. However, it is essential to note that this constitutes an early phase.
In this sense, more frameworks and larger, diverse datasets, will be crucial for further refining the model and solidifying these findings.
Our data also indicated that SAM delivers notable performance when contrasted with the ground-truth masks, thereby underscoring its potential efficacy as a significant resource for remote sensing applications. Our evaluation reveals that the prompt capabilities of SAM (text, point, box, and general), combined with its ability to perform object segmentation with minimal human supervision, can also contribute to a significant reduction in annotation workload. This decrease in human input during the labeling phase may lead to expedited training schedules for other methods, thus promoting more streamlined and cost-effective workflows.
Nevertheless, despite the demonstrated generalization, there are certain limitations to be addressed. Under complex scenarios, the model faces challenges, leading to less optimal segmentation outputs by overestimating most of the objects' boundaries. Additionally, SAM's performance metrics display variability contingent on the spatial resolution of the input imagery (i.e., it is prone to more mistakes as the spatial resolution of the imagery is lowered). Consequently, identifying and rectifying these constraints is essential for further enhancing SAM's applicability within the remote sensing domain.
In conclusion, our analysis provided insights into the operational performance and efficacy of SAM in the sphere of remote sensing segmentation tasks. While SAM exhibits notable promise, there is a tangible scope for improvement, specifically in managing its limitations and refining its performance for task-specific implementations. Future research should be oriented towards improving SAM's functional capabilities and exploring its potential integration with other methods to address a broader array of complex and challenging remote sensing scenarios.

Supplementary
Here, we provide an open-access repository designed to facilitate the application of the Segment Anything Model (SAM) within the domain of remote sensing imagery. The incorporated codes and packages provide users the means to implement point- and bounding box-based shapefiles in combination with SAM. The repositories also include notebooks that demonstrate how to apply the text-based prompt approach, alongside one-shot modifications of SAM. These resources aim to bolster the usability of the SAM approach in diverse remote sensing contexts, and can be accessed via the following online repositories: GitHub: AI-RemoteSensing [42]; and GitHub: Segment-Geospatial [46].

Figure 1 :
Figure 1: Schematic representation of the step-by-step process undertaken in this study to evaluate the efficacy of SAM's approach in remote sensing image processing tasks.

Figure 2 :
Figure 2: Collection of image samples utilized in our research. The top row features UAV-based imagery with bounding boxes and point labels, serving as prompts for SAM. The middle row displays airborne-captured data representing larger regions, with both points and rectangular polygon shapes provided as model inputs. The bottom row reveals satellite imagery, again with bounding boxes and points as prompt inputs, offering a trade-off between lower spatial resolution and wider area coverage.

Figure 3 :
Figure 3: Visual representation of the one-shot-based text segmentation process in action. The figure provides a step-by-step illustration of how the model identifies and segments the most probable object based on a text prompt, with "car" and "tree" as examples.

Figure 4 :
Figure 4: Comparative illustration of tree segmentation using PerSAM and PerSAM-F. On the left, the PerSAM model segments not only the tree but also its shadow and a part of the car underneath it. On the right, the PerSAM-F model, fine-tuned for hierarchical structures and varying scales, accurately segments only the tree, demonstrating its improved ability to discern and isolate the target object in remote sensing imagery.

Figure 5 :
Figure 5: Examples of segmented objects using SAM's general segmentation method, drawn from diverse datasets based on their platforms. Objects are represented in random colors. As the model operates without any external inputs, it deduces object boundaries leveraging its zero-shot learning capabilities.

Figure 6 :
Figure 6: Illustrations of images processed using bounding-box prompts. The first column consists of the RGB image, while the second column demonstrates how the prompt was handled. The ground-truth mask is presented in the third column and the prediction result from SAM in the fourth. The last column indicates the false positive (FP) pixels from the prediction.

Figure 9 :
Figure 9: Visual illustration of the segmentation results using PerSAM and text-based PerSAM. The final column highlights the difference in pixels from the text-based PerSAM prediction to its ground truth. The graphic compares the range of the Dice values of both PerSAM and text-based PerSAM, illustrating how the proposed approach remains within the standard deviation of the traditional PerSAM approach, underscoring the potential for most practices to adopt the fully-automated process in such cases.

Table 1 :
Overview of the distinct attributes and specifications of the datasets employed in this study.

Table 2 :
Summary of metrics for the image segmentation task across different platforms, targets, and resolutions, using different prompts for SAM in zero-shot mode. The values in red indicate the best performance for a particular target under specific conditions.

Table 3 :
Comparison of segmentation results on different platforms and targets when considering both the one-shot and the text-based one-shot approaches. The baseline values refer to the best metric obtained in the previous zero-shot investigation, be it from a bounding box, a point, or a text prompt. The red colors indicate the best result for each scenario.