Aerial scene understanding in the wild: Multi-scene recognition via prototype-based memory networks

a multi-head attention-based memory retrieval module. To be more specific, we first learn the prototype representation of each aerial scene from single-scene aerial image datasets and store it in an external memory. Afterwards, a multi-head attention-based memory retrieval module is devised to retrieve scene prototypes relevant to query multi-scene images for final predictions. Notably, only a limited number of annotated multi-scene images are needed in the training phase. To facilitate the progress of aerial scene recognition, we produce a new multi-scene aerial image (MAI) dataset. Experimental results on various dataset configurations demonstrate the effectiveness of our network. Our dataset and code are publicly available.

In recent years, many efforts [19], e.g., developing novel network architectures [20,21,22,23,24,25] and pipelines [26,27,28,29], publishing large-scale datasets [30,31], and introducing multi-modal and multi-temporal data [32,33,34,35], have been devoted to this task, and most of them treat it as a single-label classification problem. A common assumption shared by these studies is that an aerial image belongs to only one scene category, whereas in real-world scenarios a single image more often contains several scenes (cf. Figure 1). Furthermore, we notice that aerial images used to learn single-label scene classification models are usually well cropped so that target scenes are centered and account for the majority of an aerial image. Unfortunately, this might be infeasible for practical applications. Therefore, in this paper, we aim to deal with a more practical and challenging problem, multi-scene classification in a single image, which refers to inferring multiple scene-level labels for a large-scale, unconstrained aerial image. Figure 1 shows an example image, where we can see that multiple scenes, e.g., residential, parking lot, and commercial, co-exist in one aerial image. We note that there is another research branch of aerial image understanding, multi-label object classification, which refers to the process of inferring multiple objects present in an aerial image. These studies [36,37,38,39,40,41,42] mainly focus on recognizing object-level labels, while in our task, an image is classified into multiple scene categories, which provides a more comprehensive understanding of large-scale aerial images at the scene level. To the best of our knowledge, multi-scene recognition in unconstrained aerial images remains underexplored in the remote sensing community.
To achieve this task, huge quantities of well-annotated multi-scene images are needed for training models. However, we note that such annotations are not easy to obtain in the remote sensing community. This could be attributed to the following two reasons. On the one hand, the visual interpretation of multiple scenes is more arduous than that of a single scene in an aerial image, and therefore, labeling multi-scene images requires more work. On the other hand, low-cost annotation techniques, e.g., resorting to crowdsourced OpenStreetMap (OSM) data through keyword searching [30,31,43], perform poorly in yielding multi-scene datasets owing to the incompleteness and incorrectness of certain OSM data. Examples of erroneous OSM data are shown in Figure 2. In addition, manually rectifying annotations generated from crowdsourced data is inevitable because such data are error-prone. Such a procedure is quite labor-consuming, as every scene must be checked in case present ones are mislabeled as absent.
Aiming to solve the aforementioned limitations, in this work, we propose to train a network for recognizing complex multi-scene aerial images by using only a small number of labeled multi-scene images but a huge amount of existing, annotated single-scene data. Our motivation is based on an intuitive observation about how humans learn to perceive complex scenes composed of multiple entities [44,45,46]: we first learn and memorize individual objects (through flash cards, for example) when we are babies and then gain the capability of understanding complex scenarios by learning from only a limited number of hard instances (cf. Figure 1). We believe that this learning process also applies to the interpretation of multi-scene aerial images. Driven by this observation, we propose a novel network, termed prototype-based memory network (PM-Net), which is inspired by recent successes of memory networks in natural language processing (NLP) tasks [47,48] and video analysis [49,50,51]. To be more specific, we first learn the prototype representation of each aerial scene from single-scene aerial images and then store these prototypes in the external memory of PM-Net. Afterwards, for a given query multi-scene image, a multi-head attention-based memory retrieval module is devised to retrieve scene prototypes that are associated with the query image from the external memory for inferring multiple scene labels.
The contributions of this work are fourfold.
• We take a step forward to a more practical and challenging task in aerial scene understanding, namely multi-scene classification in single images, which aims to recognize multiple scenes present in a large-scale, unconstrained aerial image. Such a task is in line with real-world scenarios and capable of providing a comprehensive picture of a given geographic area.
• Given that labeling multi-scene images is very labor-intensive and time-consuming, we propose a PM-Net that can be trained for our task by leveraging large numbers of existing single-scene aerial images and a small number of labeled multi-scene images.
• In order to facilitate the progress of multi-scene recognition in single aerial images, we create a new dataset, the multi-scene aerial image (MAI) dataset. To the best of our knowledge, this is the first publicly available dataset for aerial multi-scene interpretation. Compared to existing single-scene aerial image datasets, images in our dataset are unconstrained and contain multiple scenes, which is more in line with reality.
• We carry out extensive experiments with different configurations. Experimental results demonstrate the effectiveness of the proposed network.
The remaining sections of this paper are organized as follows. Section 2 reviews studies in memory networks and prototypical networks, and the architecture of the proposed prototype-based memory network is introduced in Section 3. Section 4 describes experimental configurations and analyzes results. Finally, conclusions are drawn in Section 5.

Related Work
Since very few efforts have been devoted to this task in the remote sensing community, we only review literature related to our algorithm in this section.

Memory Networks
A memory network takes as input a query and retrieves complementary information from the external memory. In [47], the memory network is first proposed and utilized to address question-answering tasks, where questions are regarded as queries, and statements are stored in the external memory.
To retrieve statements for predicting answers, the authors compute relative distances between queries and the external memory through dot product. In the following work, Miller et al. [48] improve the efficiency of retrieving large memories by pre-selecting small subsets with key hashing. Moreover, the memory network is further applied in video analysis [49,50,51] and image captioning [52]. In [49], the authors devise a dual augmented memory network to memorize both target and background features of a video, and use a Long Short-Term Memory (LSTM) to communicate with previous and next frames. In [50], the authors propose a memory network to memorize normal patterns for detecting anomalies in a video. As an attempt in image captioning, Cornia et al. [52] devise a learnable memory to learn and memorize a priori knowledge for encoding relationships between image regions. Inspired by these works, we devise a memory network and store scene prototypes in the memory for recognizing scenes present in multi-scene images.

Prototypical Networks
Prototypical networks are characterized by classifying images according to their distances from class prototypes. In learning with limited training samples, such networks are popular and have achieved many successes recently [53,54,55,56,57,58]. To be specific, Snell et al. [53] propose to first learn a prototype representation for each category and then identify images by finding their nearest category prototypes. Guerriero et al. [54] aim to alleviate the heavy expense of learning prototypes by initializing and updating prototypes with those learned in previous training epochs. Yang et al. [55] propose to combine prototypical networks and CNNs for tackling the open world recognition problem and improving the robustness and accuracy of networks. Similarly, Huang et al. [56] propose to integrate prototypical networks and graph convolutional neural networks for learning relational prototypes. Although these methods differ, most existing works share a common way to extract prototypes, which is taking the average of samples belonging to the same category. Therefore, we follow this prototype extraction strategy in our work.

Overview
The proposed PM-Net consists of three essential components: a prototype learning module, an external memory, and a memory retrieval module. Specifically, the prototype learning module is devised to encode prototype representations of aerial scenes, which are then stored in the external memory. The memory retrieval module is responsible for retrieving scene prototypes related to query images through a multi-head attention mechanism. Eventually, retrieved scene prototypes are utilized to infer the existence of multiple scenes in the query image.

Figure caption: Particularly, we first learn scene prototypes $p_s$ from well-annotated single-scene aerial images and then store them in the external memory $M$ of PM-Net. Afterwards, given a query multi-scene image, a multi-head attention-based memory retrieval module is devised to retrieve scene prototypes that are relevant to the query image, yielding $z$ for the prediction of multiple labels. $f_\phi$ denotes the embedding function, and its output is a $D$-dimensional feature vector. $S$ and $H$ represent numbers of scenes and heads, respectively. $L$ and $U$ denote channel dimensions of the key and value in the memory retrieval module.

Scene Prototype Learning and Writing
Following the observation introduced in Section 1, we propose to learn and memorize scene prototypes with the support of single-scene aerial images. The procedure consists of two stages. We first employ an embedding function to learn semantic representations of all single-scene images. Then, feature representations belonging to the same scene category are encoded into a scene prototype and stored in the external memory.
Formally, let $X_i^s$ denote the $i$-th single-scene image belonging to scene $s$, where $i$ ranges from 1 to $N_s$ and $N_s$ is the number of samples annotated as $s$. The embedding function $f_\phi$ can be learned via the following objective function:

$$\mathcal{L}(\theta, \phi) = -\sum_{s} \sum_{i=1}^{N_s} y^s \cdot \log g_\theta\big(f_\phi(X_i^s)\big), \tag{1}$$

where $\phi$ represents learnable parameters of $f_\phi$, and $y^s$ is a one-hot vector denoting the scene label of $X_i^s$. $g_\theta$ is a multilayer perceptron (MLP) with parameters $\theta$, and its outputs are activated by a softmax function to predict probability distributions. Following the prevailing trend in deep learning, here we employ a deep CNN, e.g., ResNet-50 [59], as the embedding function $f_\phi$ and learn its parameters on public single-scene aerial image datasets. After sufficient training, $f_\phi$ is expected to be capable of learning discriminative representations for different aerial scenes.
Once $f_\phi$ is learned, the scene prototype can be computed by averaging representations of all aerial images belonging to the same scene [53,54,55]. Let $p_s$ be the prototype representation of scene $s$. We calculate $p_s$ with the following equation:

$$p_s = \frac{1}{N_s} \sum_{i=1}^{N_s} f_\phi(X_i^s). \tag{2}$$

By doing so, in single-scene classification, an image lying close to $p_s$ in the common embedding space is supposed to belong to scene $s$. Similarly, in the multi-scene scenario, the representation of an aerial image comprising scene $s$ should show high relevance to $p_s$. After encoding all scene prototypes, the external memory $M$ can be formulated as follows:

$$M = [p_1; p_2; \ldots; p_S] \in \mathbb{R}^{S \times D}, \tag{3}$$

where $S$ denotes the number of scenes.
Note that D varies when using different backbone CNNs as embedding functions.
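The prototype extraction and memory writing steps of Eqs. 2 and 3 can be sketched in a few lines. The snippet below is only an illustrative, dependency-free sketch: the function names `scene_prototypes` and `build_memory` are hypothetical, and the toy 2-dimensional vectors stand in for the outputs of $f_\phi$, which in practice would come from a trained CNN.

```python
from collections import defaultdict


def scene_prototypes(embeddings, labels):
    """Average the embeddings of each scene to obtain one prototype per scene (Eq. 2)."""
    sums = {}
    counts = defaultdict(int)
    for vec, scene in zip(embeddings, labels):
        if scene not in sums:
            sums[scene] = [0.0] * len(vec)
        sums[scene] = [a + b for a, b in zip(sums[scene], vec)]
        counts[scene] += 1
    return {s: [v / counts[s] for v in total] for s, total in sums.items()}


def build_memory(prototypes, scene_order):
    """Stack the prototypes row by row into an S x D memory (Eq. 3)."""
    return [prototypes[s] for s in scene_order]


# Toy embeddings standing in for f_phi outputs (hypothetical values).
embeddings = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
labels = ["beach", "beach", "forest", "forest"]

protos = scene_prototypes(embeddings, labels)
memory = build_memory(protos, ["beach", "forest"])
print(memory)  # [[2.0, 0.0], [0.0, 3.0]]
```

Fixing the row order of the memory (here via `scene_order`) matters because the relevance vector produced by the retrieval module is indexed by the same scene order.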

Multi-head Attention-based Memory Retrieval
Inspired by successes of the multi-head self-attention mechanism [60] in natural language processing tasks [61,62,63,64], we develop a multi-head attention-based memory retrieval module to retrieve relevant scene prototypes from the memory $M$ for a given query multi-scene aerial image $X$. In particular, we first extract the feature representation of $X$ through the same embedding function $f_\phi$ and linearly project it to an $L$-dimensional query $Q(f_\phi(X))$. Similarly, we transform the external memory $M$ into key $K(M)$ and value $V(M)$, and both are implemented as MLPs. The channel dimension of the key is $L$, while that of the value is $U$. The relevance between $X$ and each scene prototype $p_s$ can be measured by dot product similarity and a softmax function as follows:

$$R(X, M) = \mathrm{softmax}\big(Q(f_\phi(X))\, K(M)^{\top}\big). \tag{4}$$

The output is an $S$-dimensional vector, where each component represents the probability that a specific scene prototype is relevant to the query image. Subsequently, the retrieved scene prototype is computed as a weighted sum of all values with the following equation:

$$z = R(X, M)\, V(M). \tag{5}$$

Since the memory retrieval is designed in a multi-head fashion, the final retrieved prototype is reformulated as follows:

$$z = \mathrm{concat}(z_1, z_2, \ldots, z_H), \tag{6}$$

where $H$ denotes the number of heads, and each head yields a retrieved prototype $z_h$ by transforming $X$ and $M$ to its own query $Q_h(f_\phi(X))$, key $K_h(M)$, and value $V_h(M)$. Eventually, the output $z$ is fed into a fully-connected layer followed by a sigmoid function for inferring the presence of aerial scenes.
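To make the retrieval step concrete, the following is a minimal sketch of one attention head and of the multi-head combination, written in plain Python so that it runs without any tensor library. Several simplifications are assumptions made here for illustration: the query, key, and value transforms are single linear maps with random weights (the text implements key and value as MLPs), the heads are combined by concatenation, and all function names (`retrieve`, `multi_head_retrieve`, `rand_matrix`) and toy dimensions are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix W (list of rows) by a vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieve(feat, memory, Wq, Wk, Wv):
    """One head: relevance over the S prototypes, then a weighted sum of values."""
    q = matvec(Wq, feat)                      # L-dimensional query
    keys = [matvec(Wk, p) for p in memory]    # S keys, each L-dimensional
    values = [matvec(Wv, p) for p in memory]  # S values, each U-dimensional
    relevance = softmax([sum(a * b for a, b in zip(q, k)) for k in keys])
    z = [sum(r * v[u] for r, v in zip(relevance, values)) for u in range(len(Wv))]
    return relevance, z


def multi_head_retrieve(feat, memory, heads):
    """Concatenate the prototypes retrieved by all heads (assumed combination)."""
    z = []
    for Wq, Wk, Wv in heads:
        z.extend(retrieve(feat, memory, Wq, Wk, Wv)[1])
    return z


def rand_matrix(rows, cols):
    """Random weights standing in for learned projections."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
D, L, U, S, H = 4, 3, 2, 3, 2  # toy sizes, far smaller than in the experiments
memory = rand_matrix(S, D)     # stands in for the S x D prototype memory
feat = [random.uniform(-1.0, 1.0) for _ in range(D)]  # stands in for f_phi(X)
heads = [(rand_matrix(L, D), rand_matrix(L, D), rand_matrix(U, D)) for _ in range(H)]

relevance, _ = retrieve(feat, memory, *heads[0])
z = multi_head_retrieve(feat, memory, heads)
print(len(relevance), round(sum(relevance), 6), len(z))  # 3 1.0 4
```

Under these assumptions, the relevance vector has one entry per scene and sums to one, while the concatenated output has $H \times U$ entries that a final fully-connected layer with sigmoid activation would map to per-scene scores.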

Implementation Details
For a comprehensive assessment of our PM-Net, we implement the embedding function with various backbone CNNs. Specifically, we conduct experiments on four CNN architectures, and details are as follows:
• PM-VGGNet: $f_\phi$ is built on VGG-16 [65] by replacing all layers after the last max-pooling layer in block5 with a global average pooling layer.
• PM-Inception-V3: Inception-V3 [66] is utilized, and layers before and including the global average pooling layer are employed as $f_\phi$.
• PM-ResNet: We modify ResNet-50 [59] by discarding layers after the global average pooling layer and using the remaining layers as $f_\phi$.
• PM-NASNet: The backbone of $f_\phi$ is mobile NASNet [67]. As with the modification in PM-ResNet, only layers before and including the global average pooling layer are used.
In our experiments, we train the original deep CNNs on single-scene aerial image datasets and then take them as the embedding function $f_\phi$ following the aforementioned points. Subsequently, we compute scene prototypes $p_s$ and concatenate all of them along the first axis to form $M$.

Experiments and Discussion
In this section, we introduce a newly produced multi-scene aerial image dataset, the MAI dataset, and two single-scene datasets, i.e., the UCM and AID datasets, which are used in experiments. Then network configurations and training schemes are detailed in Subsection 4.2. The remaining subsections discuss and analyze the performance of the proposed network thoroughly.

Dataset Description and Configuration

MAI dataset
To facilitate the progress of aerial scene interpretation in the wild, we build a new dataset, the MAI dataset, by collecting and labeling 3923 large-scale images from Google Earth imagery that covers the United States, Germany, and France. The size of each image is 512 × 512, and spatial resolutions vary from 0.3 m/pixel to 0.6 m/pixel. After capturing aerial images, we manually assign each image multiple scene-level labels from a total of 24 scene categories, including apron, baseball, beach, commercial, farmland, woodland, parking lot, port, residential, river, storage tanks, sea, bridge, lake, park, roundabout, soccer field, stadium, train station, works, golf course, runway, sparse shrub, and tennis court. Notably, OSM data associated with the collected images cannot be directly employed as reference owing to the problems presented in Section 1. Such a labeling procedure is extremely time- and labor-consuming, and annotating one image costs around 20 seconds, which is ten times more than labeling a single-scene image. Several example multi-scene images are shown in Figure 4. Numbers of aerial images related to various scenes are reported in Figure 5. Among existing datasets, BigEarthNet [68] is one of the most relevant datasets, which consists of Sentinel-2 images acquired over the European Union with spatial resolutions ranging from 10 m/pixel to 60 m/pixel. Spatial sizes of images vary from 20 × 20 pixels to 120 × 120 pixels, and each is assigned multiple land-cover labels provided by the CORINE Land Cover map. Compared to BigEarthNet, our dataset is characterized by its high-resolution large-scale aerial images and worldwide coverage.

UCM dataset
The UCM dataset [69] is a commonly used single-scene aerial image dataset produced by Yang and Newsam from the University of California Merced. This dataset comprises 2100 aerial images cropped from aerial ortho imagery provided by the United States Geological Survey (USGS) National Map, and the spatial resolution of the collected images is one foot. The size of each image is 256 × 256 pixels, and all image samples are classified into 21 scene-level classes: overpass, forest, beach, baseball diamond, building, airplane, freeway, intersection, harbor, golf course, runway, agricultural, storage tank, mobile home park, medium residential, sparse residential, chaparral, river, tennis courts, dense residential, and parking lot. The number of aerial images collected for each scene is 100, and several example images are shown in Figure 6. To learn scene prototypes from these single-scene images, we randomly choose 80% of image samples per scene category to train and validate the embedding function and utilize the rest for testing.

AID dataset
The AID dataset [30] is another popular single-scene aerial image dataset, which consists of 10000 aerial images with a size of 600 × 600 pixels. These images are captured from Google Earth imagery taken over China, the United States, England, France, Italy, Japan, and Germany, and spatial resolutions of the collected images vary from 0.5 m/pixel to 8 m/pixel. In total, there are 30 scene categories, including viaduct, river, baseball field, center, farmland, railway station, meadow, bare land, storage tanks, beach, mountain, park, bridge, playground, church, commercial, desert, forest, parking, industrial, square, sparse residential, pond, medium residential, port, resort, airport, school, stadium, and dense residential. The number of images in different classes ranges from 220 to 420. Similar to the data split in the UCM dataset, 20% of images are chosen from each scene as test samples, while the remaining images are utilized to train and validate the embedding function. Some example images of the AID dataset are exhibited in Figure 7.

Dataset configuration
To evaluate the performance of our method broadly, we utilize two dataset configurations, UCM2MAI and AID2MAI, based on common scene categories shared by UCM/AID and MAI. Specifically, the UCM2MAI configuration consists of 1600 single-scene aerial images from the UCM dataset and 1649 multi-scene images from our MAI dataset. The 16 aerial scenes included in both datasets are considered in UCM2MAI, and numbers of their associated images are listed in Table 1. Besides, the AID2MAI configuration is composed of 7050 and 3239 aerial images from the AID and MAI datasets, respectively. A total of 20 common scene categories are taken into consideration, and the number of images related to each scene is presented in Table 1. Although such configurations might limit the number of recognizable scene classes, we believe this limitation can be addressed by collecting more single-scene images by crawling OSM data and producing large-scale multi-scene aerial image datasets. We select only 90 and 120 multi-scene aerial images from UCM2MAI and AID2MAI as training instances, respectively, and test networks on the remaining multi-scene images. For rare scenes (e.g., port and train station), we select all associated training images, while for common scenes, we randomly select several of their training samples. It is noteworthy that we obtain the scene prototype of residential by taking an average of high-level representations of aerial images belonging to the scenes medium residential and dense residential. Besides, although the UCM and AID datasets do not contain images for sea, their images for beach often comprise both sea and beach (cf. (c) in Figure 7). Therefore, we make use of training samples labeled as beach to compute the prototype representation of sea.

Training Details
The training procedure consists of two phases: 1) learning the embedding function $f_\phi$ on large quantities of single-scene aerial images and 2) training the entire PM-Net on a limited number of multi-scene images in an end-to-end manner. Different training strategies are applied to each phase and are detailed as follows.
In the first training phase, the embedding function $f_\phi$ is initialized with the corresponding deep CNNs pretrained on ImageNet [70], and weights in $g_\theta$ are initialized by a Glorot uniform initializer. Eq. (1) is employed as the loss of the network, and Nesterov Adam [71] is chosen as the optimizer, whose parameters are set as recommended: $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. The learning rate is set to $2 \times 10^{-4}$ and decayed by a factor of $\sqrt{0.1}$ when the validation loss fails to decrease for two epochs.
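The learning rate decay rule described above can be illustrated with a short sketch. It assumes that "fails to decrease" means no improvement over the best validation loss seen so far, which is one plausible reading rather than a detail confirmed by the text; the function name `decayed_learning_rates` and the toy loss values are hypothetical.

```python
import math


def decayed_learning_rates(val_losses, lr=2e-4, factor=math.sqrt(0.1), patience=2):
    """Return the learning rate in effect after each epoch.

    The rate is multiplied by `factor` once the validation loss has not
    improved on its best value for `patience` consecutive epochs.
    """
    best = float("inf")
    stale = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                lr *= factor
                stale = 0
        history.append(lr)
    return history


# Toy validation losses: two epochs without improvement trigger one decay.
rates = decayed_learning_rates([1.0, 0.9, 0.95, 0.92, 0.8])
print(rates)
```

With these toy losses the rate stays at $2 \times 10^{-4}$ for three epochs and drops to roughly $6.3 \times 10^{-5}$ from the fourth epoch onward.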
In the second training phase, we initialize $f_\phi$ with parameters learned in the previous training stage and employ the Glorot uniform initializer to initialize all weights in $Q_h$, $V_h$, $K_h$, and the last fully-connected layer. $L$ and $U$ are both set to 256, and the number of heads is set to 20. Notably, all weights are trainable, and the embedding function is tuned during the second training phase as well. Multiple scene-level labels are encoded as multi-hot vectors, where 0 indicates the absence of the corresponding scene and 1 indicates its presence. Accordingly, the loss is defined as binary cross-entropy. The optimizer is the same as that in the first training phase, but here we make use of a relatively large learning rate, $5 \times 10^{-4}$. The network is implemented in TensorFlow and trained on one NVIDIA Tesla P100 16GB GPU for 100 epochs. We set the training batch size to 32 for both training phases.
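As a small illustration of the label encoding and loss used in the second phase, the sketch below builds a multi-hot vector and evaluates the binary cross-entropy averaged over scenes. The helper names, the four-scene label set, and the predicted probabilities are hypothetical and chosen only for the example; clipping with `eps` is a common numerical safeguard assumed here.

```python
import math


def multi_hot(present_scenes, scene_order):
    """Encode the set of present scenes as a 0/1 vector in a fixed scene order."""
    return [1.0 if s in present_scenes else 0.0 for s in scene_order]


def binary_cross_entropy(targets, probs, eps=1e-7):
    """Mean binary cross-entropy between multi-hot targets and sigmoid outputs."""
    total = 0.0
    for t, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(targets)


scenes = ["residential", "parking lot", "commercial", "river"]
y = multi_hot({"residential", "parking lot"}, scenes)
loss = binary_cross_entropy(y, [0.9, 0.8, 0.2, 0.1])
print(y, round(loss, 4))  # [1.0, 1.0, 0.0, 0.0] 0.1643
```

Because each scene contributes an independent binary term, several scenes can be predicted as present at once, which is what distinguishes this loss from the softmax cross-entropy used in the first phase.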

Evaluation Metrics
For the purpose of evaluating the performance of networks quantitatively, we utilize example-based $F_1$ [72] and $F_2$ [73] scores as evaluation metrics and calculate them with the following equation:

$$F_\beta = \frac{(1 + \beta^2)\, p_e\, r_e}{\beta^2 p_e + r_e}, \quad \beta \in \{1, 2\}, \tag{7}$$

where the example-based precision $p_e$ and recall $r_e$ are computed as

$$p_e = \frac{TP_e}{TP_e + FP_e}, \quad r_e = \frac{TP_e}{TP_e + FN_e}, \tag{8}$$

and $FN_e$, $FP_e$, and $TP_e$ represent numbers of false negatives, false positives, and true positives in an example, respectively. In our case, an example is a multi-scene aerial image, and by averaging scores of all examples in the test set, the mean example-based F scores, precision, and recall can be computed. In addition to example-based evaluation metrics, we also calculate label-based precision $p_l$ and recall $r_l$ with Eq. 8 but replace $FN_e$, $FP_e$, and $TP_e$ with numbers of false negatives, false positives, and true positives with respect to each scene category. The mean $p_l$ and $r_l$ can then be calculated. Note that the principal indices are the mean $F_1$ and $F_2$ scores.

Table note: * indicates that the number of images is not counted in total amounts, as the scene prototypes of beach and sea are learned from the same images.
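The example-based scores can be computed with a few lines of code. The sketch below follows the standard $F_\beta$ definition in terms of per-example true positives, false positives, and false negatives; the function names are hypothetical, and returning 0 when an example has no true positives is a convention assumed here to avoid division by zero.

```python
def example_f_beta(true_labels, pred_labels, beta):
    """Example-based F_beta score for one image given 0/1 label vectors."""
    pairs = list(zip(true_labels, pred_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # assumed convention when nothing is correctly predicted
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def mean_f_beta(all_true, all_pred, beta):
    """Average the example-based F_beta score over all test images."""
    scores = [example_f_beta(t, p, beta) for t, p in zip(all_true, all_pred)]
    return sum(scores) / len(scores)


# One image with three present scenes, of which only one is predicted.
truth = [1, 1, 1, 0]
pred = [1, 0, 0, 0]
f1 = example_f_beta(truth, pred, beta=1)
f2 = example_f_beta(truth, pred, beta=2)
print(round(f1, 4), round(f2, 4))  # 0.5 0.3846
```

In this toy case precision is 1 and recall is 1/3, so $F_2$ is lower than $F_1$: the $F_2$ score weights recall more heavily and therefore penalizes missed scenes more strongly.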

Results on UCM2MAI
For a comprehensive evaluation, we compare the proposed PM-Net with two baselines, CNN* and CNN. The former is initialized with parameters pretrained on ImageNet, and the latter is pretrained on single-scene datasets. Besides, we compare our network with a memory network, Mem-N2N [47]. Since Mem-N2N was proposed for the question answering task, we adapt it to our task by replacing its inputs, i.e., embeddings of questions and statements, with query image representations $f_\phi(X)$ and scene prototypes $p_s$, respectively. To be more specific, we feed $X$ to a CNN backbone and take its output as the input of Mem-N2N. Scene prototypes are stored in the memory of Mem-N2N and retrieved according to $f_\phi(X)$. The initialization of $f_\phi$ is the same as that of our network, and the entire Mem-N2N is trained in an end-to-end manner. Various backbones of embedding functions are tested, and quantitative results are reported in Table 3. Besides, we also compare [...] Here we analyze results from the following three perspectives.
[...] This demonstrates that employing NASNet as the embedding function can enhance the robustness of PM-Net. Comparisons between PM-Inception-V3 and Inception-V3 show that the external memory module contributes to improvements of 4.60% and 6.78% in the mean $F_1$ and $F_2$ scores, respectively. To summarize, memorizing and leveraging scene prototypes learned from huge quantities of single-scene images can improve the performance of the network in multi-label scene recognition when limited training samples are available. For deeper insight, we further conduct ablation studies on the prototype modality and embedding function.
Single- vs. multi-prototype representations. We note that images collected over various countries show high intra-class variability, and therefore, we wonder whether learning multi-prototype scene representations could improve the effectiveness of PM-Net. Specifically, instead of computing scene prototypes via Eq. 2, we partition representations of single-scene aerial images belonging to the same scene into several clusters and take cluster centers as multi-prototype representations of each scene. In our experiments, we test two clustering methods, K-Means [75] and Agglomerative [76], with PM-ResNet on both UCM2MAI and AID2MAI, and results are shown in Figure 9. We can see that the performance of PM-ResNet decreases as the number of cluster centers increases, with either the K-Means or the Agglomerative clustering algorithm. An explanation could be that there are no obvious subclusters within each scene category (cf. Figure 13), and thus PM-Net does not benefit from fine-grained multi-prototype representations.
Frozen vs. trainable embedding function. The embedding function plays a key role in both scene prototype learning and memory retrieval. In the former, we train the embedding function on single-scene images, while in the latter, the function is fine-tuned on multi-scene images. To explore the effectiveness of fine-tuning, we conduct experiments on freezing the embedding function when learning the memory retrieval module. The comparisons between PM-Net learned with frozen and trainable embedding functions are shown in Figure 10. It can be observed that PM-Net with a trainable embedding function shows higher performance on both UCM2MAI and AID2MAI configurations. The reason could be that the sources of single- and multi-scene images differ, and fine-tuning can narrow the gap between them.
Triplet vs. cross-entropy loss. Triplet loss [77] is known for learning discriminative representations by minimizing distances between embeddings of the same class while pushing away those of different classes. To study its performance in our task, we train the embedding function by replacing Eq. 1 with the following equation:

$$\mathcal{L}(\phi) = \max\big(\|f_\phi(X_i^s) - f_\phi(X_{pos}^s)\|_2^2 - \|f_\phi(X_i^s) - f_\phi(X_{neg}^s)\|_2^2 + \alpha,\ 0\big), \tag{9}$$

where $X_{pos}^s$ and $X_{neg}^s$ denote positive and negative samples, i.e., images belonging to the same class as $X_i^s$ and to a different class, respectively, and $\alpha$ is set to the default value of 0.5. The trained embedding function is then utilized to extract scene prototypes and initialize $f_\phi$ in the phase of learning the memory retrieval module. All other setups remain the same. We compare the performance of PM-Net using embedding functions trained through different loss functions in Figure 11. It can be seen that training embedding functions with the triplet loss decreases the network performance. This can be attributed to the fact that limited numbers of positive and negative samples in each batch can lead to a local optimum. More specifically, the size of training batches is 32, and the numbers of scenes are 16 and 20 in UCM2MAI and AID2MAI, respectively. Thus, it is highly probable that only a subset of scenes is included in one batch, and comprehensively modeling relations between embeddings of samples from all scenes is infeasible. This may also explain the larger performance decay on UCM2MAI compared to AID2MAI.
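For reference, the margin-based triplet loss can be sketched for a single triplet as follows. The sketch assumes squared Euclidean distances between embeddings, as is common for this loss, and uses hypothetical 2-dimensional embeddings; the function names are illustrative only.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, alpha=0.5):
    """Hinge on (anchor-positive distance) - (anchor-negative distance) + margin."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(gap + alpha, 0.0)


anchor = [0.0, 0.0]
positive = [0.1, 0.0]
easy_negative = [1.0, 0.0]   # far from the anchor: margin already satisfied
hard_negative = [0.5, 0.0]   # close to the anchor: margin violated

easy = triplet_loss(anchor, positive, easy_negative)
hard = triplet_loss(anchor, positive, hard_negative)
print(easy, round(hard, 2))  # 0.0 0.26
```

Only triplets that violate the margin contribute a non-zero loss, so the training signal depends strongly on which positives and negatives happen to be available in a batch, which is consistent with the batch-composition argument above.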
The effectiveness of our multi-head attention-based memory retrieval module
As a key component of the proposed PM-Net, the multi-head attention-based memory retrieval module is designed to retrieve scene prototypes from the external memory, and we evaluate its effectiveness by comparing PM-Net with Mem-N2N. As shown in Table 3, PM-Net outperforms Mem-N2N with various embedding functions. Specifically, PM-VGGNet increases the mean $F_1$ and $F_2$ scores by 2.26% and 0.23%, respectively, compared to Mem-N2N-VGGNet. When taking ResNet as the embedding function, the improvement reaches 2.58% in the mean $F_1$ score. Besides, the highest increments of the mean $F_1$ and $F_2$ scores, 4.96% and 6.52%, are achieved by PM-NASNet. These observations demonstrate that our memory retrieval module plays a key role in inferring multiple aerial scenes. An explanation could be that, compared to the memory reader in Mem-N2N, our module comprises multiple heads, and each of them focuses on encoding a specific relevance between the query image and the scene prototypes. In this case, more comprehensive scene-related memories can be used for inferring multiple scene labels. Moreover, we analyze the influence of the number of heads in the memory retrieval module. Figure 8 shows mean $F_1$ scores achieved by PM-Net with different head numbers on both UCM2MAI and AID2MAI. We can observe that the network performance is first boosted with an increasing number of heads and then decreases gradually when the number exceeds 20.
Moreover, we also conduct experiments on directly utilizing relevances for inferring multiple scene labels. Specifically, we set the number of heads to 1 and replace the softmax activation in Eq. 4 with the sigmoid function. Relevances between the query image and scene prototypes can then be interpreted as the existence of each scene. We compare it with our memory retrieval module on various backbones, and results are shown in Figure 12. We can see that utilizing relevances $R(X, M)$ as weights for aggregating scene prototypes leads to higher network performance.

The benefit of exploiting single-scene training samples
Let's start with the conclusion: exploiting single-scene images significantly contributes to our task.To analyze its benefit, we mainly compare CNNs* and CNNs.It can be observed that even with identical network architectures, the performance of CNN is superior to that of CNN*.More specifically, VGGNet achieves the highest improvement of the mean F 1 scores, 19.26%, in comparison with VGGNet*.NASNet shows higher performance  in all metrics compered to ResNet*, while other CNNs perform poorly in only the mean example-based precision with respect to their corresponding CNNs*.Besides, we visualize features of single-scene images learned by VG-GNet on UCM and AID datasets via t-SNE, respectively.As shown in Figure 13, extracted features are discriminative and separable in the embedding space, which demonstrates the effectiveness of learning the embedding function on single-scene aerial image datasets.To summarize, except for learning scene prototypes, single-scene training samples can also benefit multi-label scene interpretation by pretraining CNNs which are further utilized to initialize the embedding function.
We show several example predictions of PM-ResNet trained on UCM2MAI in Table 4. False positives are marked in red, while false negatives are marked in blue. As shown in the fourth example in the top row, PM-Net can accurately perceive aerial scenes even in complex contexts, but unseen scene appearances (e.g., apron and runway in snow) can influence its prediction.

Results on AID2MAI
Table 5 reports numerical results on the AID2MAI configuration. It can be seen that PM-Net is superior to all competitors in the mean F1 score. Compared to Mem-N2N-VGGNet, the proposed PM-VGGNet increases the mean F1 and F2 scores by 6.70% and 7.56%, respectively, while the improvements reach 6.07% and 0.64% in comparison with VGGNet. PM-ResNet achieves the best mean F1 score and example-based precision, 57.42% and 70.62%, respectively. With NASNet as the backbone, exploiting the proposed memory retrieval module contributes to increments of 1.03% and 1.71% in the mean F1 and F2 scores compared to directly learning NASNet on a small number of multi-scene samples.
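For readers unfamiliar with the reported metrics, the following sketch computes example-based precision, recall, and F-beta scores for multi-label predictions. It uses the standard definitions (precision and recall from the overlap between true and predicted label sets, and F_beta = (1 + beta^2) p r / (beta^2 p + r)) and assumes that the per-example scores are averaged over the test set; the paper's own equations are not reproduced in this section, so the averaging order and the function names are assumptions.

```python
def example_based_scores(true_labels, predicted_labels, beta=1.0):
    """Precision, recall, and F-beta for a single multi-label example."""
    true_set, pred_set = set(true_labels), set(predicted_labels)
    hits = len(true_set & pred_set)
    precision = hits / len(pred_set) if pred_set else 0.0
    recall = hits / len(true_set) if true_set else 0.0
    denom = beta ** 2 * precision + recall
    f_beta = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return precision, recall, f_beta


def mean_scores(samples, beta=1.0):
    """Average the per-example scores over (true, predicted) label pairs."""
    scores = [example_based_scores(t, p, beta) for t, p in samples]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))


# Two toy examples: one missed scene, then one spurious scene.
samples = [
    (["residential", "parking lot", "commercial"], ["residential", "parking lot"]),
    (["apron", "runway"], ["apron", "runway", "river"]),
]
mean_p, mean_r, mean_f1 = mean_scores(samples, beta=1.0)
_, _, mean_f2 = mean_scores(samples, beta=2.0)
```

With beta = 2, recall is weighted more heavily than precision, so the example with a false negative is penalized more in F2 than the example with a false positive.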
We present some example predictions of PM-ResNet in Table 6. As shown in the top row, PM-ResNet learned with a limited number of annotated multi-scene images can accurately identify various aerial scenes even when the image context is complicated. The bottom row shows some inaccurate predictions. It can be observed that although bridge and parking lot account for relatively small areas in the last two examples in the top row, the proposed PM-Net can successfully detect them. Similar observations can be made for the first and third examples in the bottom row, where residential and parking lot are recognized by our network even though they are located at the corner. In conclusion, quantitative results illustrate the effectiveness of our network in learning to perform unconstrained multi-scene classification, and example predictions further demonstrate it.

Conclusion
In this paper, we propose a novel multi-scene recognition network, namely PM-Net, to tackle both the problem of aerial scene classification in the wild and the scarcity of training samples. To be more specific, our network consists of three key elements: 1) a prototype learning module for encoding prototype representations of various aerial scenes, 2) a prototype-inhabiting external memory for storing high-level scene prototypes, and 3) a multi-head attention-based memory retrieval module for retrieving associated scene prototypes from the external memory to recognize multiple scenes in a query aerial image. To facilitate progress in this area and to evaluate our method, we introduce a new dataset, the MAI dataset, and experiment with two dataset configurations, UCM2MAI and AID2MAI, based on two single-scene aerial image datasets, UCM and AID. In scene prototype learning, we train the embedding function on most of the single-scene images, as we aim to simulate the real-life scenario in which massive single-scene samples can be collected at low cost by resorting to OSM data. To learn memory retrieval, our network is fine-tuned on only around 100 training samples from the MAI dataset. Experimental results on both UCM2MAI and AID2MAI illustrate that learning and memorizing scene prototypes with our PM-Net can significantly improve the classification accuracy. The best performance is achieved by employing ResNet as the embedding function, and the best mean F1 score reaches nearly 0.6. We hope that our work can open a new door for further research on a more complicated and challenging task, multi-scene interpretation in single images. Looking into the future, we intend to apply the proposed network to the recovery of weakly supervised scenes.

Figure 2: Examples of incomplete (red) and incorrect (yellow) OSM data. Red: the commercial area is not annotated in OSM data. Yellow: the orchard is mislabeled as residential.

Figure 3: Architecture of the proposed PM-Net. We first learn scene prototypes p_s from well-annotated single-scene aerial images and then store them in the external memory M of PM-Net. Afterwards, given a query multi-scene image, a multi-head attention-based memory retrieval module is devised to retrieve scene prototypes that are relevant to the query image, yielding z for the prediction of multiple labels. f_φ denotes the embedding function, and its output is a D-dimensional feature vector. S and H represent the numbers of scenes and heads, respectively. L and U denote the channel dimensions of the key and value in the memory retrieval module.

Figure 4: Example images in our MAI dataset. Each image is 512 × 512 pixels, and their spatial resolutions range from 0.3 m/pixel to 0.6 m/pixel. We list their scene-level labels here: (a) farmland and residential; (b) baseball field, woodland, parking lot, and tennis court; (c) commercial, parking lot, and residential; (d) woodland, residential, river, and runway; (e) river and storage tanks; (f) beach, woodland, residential, and sea; (g) farmland, woodland, and residential; (h) apron and runway; (i) baseball field, parking lot, residential, bridge, and soccer field.

Figure 8: The influence of the number of heads on both dataset configurations. Blue and yellow dotted lines represent mean F1 scores on UCM2MAI and AID2MAI, respectively. The red line indicates their average.

Figure 9: The influence of the number of cluster centers on both dataset configurations. K-Means (turquoise and orange dashed lines) and Agglomerative (blue and red lines) clustering algorithms are tested with PM-ResNet on UCM2MAI and AID2MAI, respectively.

Figure 10: Comparisons between freezing and fine-tuning embedding functions on (a) UCM2MAI and (b) AID2MAI, respectively. Blue bars represent the performance of PM-Net with frozen embedding functions, and brown bars denote the performance of PM-Net with trainable embedding functions.

Figure 11: Comparisons of different loss functions on (a) UCM2MAI and (b) AID2MAI, respectively. Green bars denote the performance of PM-Net using embedding functions trained by the triplet loss, and brown bars denote the performance of PM-Net with the cross-entropy loss as L.

Figure 12: Comparisons between taking relevance R(X, M) as predictions and as prototype weights on (a) UCM2MAI and (b) AID2MAI, respectively. Gray and brown bars represent the performance of PM-Net making predictions from relevances and from aggregated scene prototypes, respectively.

Figure 13: t-SNE visualization of image representations and scene prototypes learned by VGGNet on (a) UCM and (b) AID datasets, respectively. Dots in the same color represent features of images belonging to the same scene, and stars denote scene prototypes.

Table 1: The Number of Images Associated with Each Scene.

Table 2: Differences between Two Training Phases.

p_e and r_e denote example-based precision and recall [74]. We calculate p_e and r_e as follows:

Table 4: Example Images and Predictions on UCM2MAI. Blue predictions are false negatives, while red predictions indicate false positives.

inferring multiple labels. In our experiments, K is set to its default value, 3, and the input sizes of the three branches are 224 × 224, 112 × 112, and 56 × 56, respectively.

Table 6: Example Images and Predictions on AID2MAI.