Recurrently Exploring Class-wise Attention in A Hybrid Convolutional and Bidirectional LSTM Network for Multi-label Aerial Image Classification

Aerial image classification is of great significance in the remote sensing community, and many studies have been conducted over the past few years. Most of these studies focus on categorizing an image into one semantic label, while in the real world, an aerial image is often associated with multiple labels, e.g., multiple object-level labels in our case. Besides, a comprehensive picture of the objects present in a given high-resolution aerial image can provide a more in-depth understanding of the studied region. For these reasons, aerial image multi-label classification has been attracting increasing attention. However, one common limitation shared by existing methods in the community is that the co-occurrence relationship of various classes, the so-called class dependency, is underexplored, which leads to ill-considered decisions. In this paper, we propose a novel end-to-end network, namely the class-wise attention-based convolutional and bidirectional LSTM network (CA-Conv-BiLSTM), for this task. The proposed network consists of three indispensable components: 1) a feature extraction module, 2) a class attention learning layer, and 3) a bidirectional LSTM-based sub-network. In particular, the feature extraction module is designed for extracting fine-grained semantic feature maps, while the class attention learning layer aims at capturing discriminative class-specific features. As the most important part, the bidirectional LSTM-based sub-network models the underlying class dependency in both directions and produces structured multiple object labels. Experimental results on the UCM multi-label dataset and the DFC15 multi-label dataset validate the effectiveness of our model quantitatively and qualitatively.


Introduction
With the boom in remote sensing techniques in recent years, a huge volume of high-resolution aerial imagery is now accessible and benefits a wide range of real-world applications, such as urban mapping [1,2,3,4], ecological monitoring [5,6], geomorphological analysis [7,8,9,10], and traffic management [11,12,13]. As a fundamental bridge between aerial images and these applications, image classification, which aims at categorizing images into semantic classes, has received wide attention, and many studies have been conducted recently [14,15,16,17,18,19,20,21,22,23]. However, most existing studies assume that each image belongs to only one label (e.g., scene-level labels in Fig. 1), while in reality, an image is usually associated with multiple labels [24]. Furthermore, a comprehensive picture of the objects present in an aerial image offers a holistic understanding of that image. With this intention, numerous studies on semantic segmentation [25,26,27] and object detection [25,28,29,30] have emerged recently. Unfortunately, it is extremely labor- and time-consuming to acquire ground truths for these studies (i.e., pixel-wise segmentation masks and bounding-box-level annotations). Compared to these expensive labels, image-level labels (cf. multiple object-level labels in Fig. 1) come at a fairly low cost and are readily accessible. To this end, multi-label classification, which aims at assigning multiple object labels to an image, is emerging in both the remote sensing [31,32,33,34] and computer vision communities [35,36,37]. In this paper, we devote our efforts to exploring an efficient multi-label classification model.

The Challenges of Multi-label Classification
Benefiting from fast-growing remote sensing technology, large quantities of high-resolution aerial images are available and widely used in many visual tasks. Along with such huge opportunities, challenges have inevitably come up.
On one hand, it is difficult to extract high-level features from high-resolution images. Given the complex spatial structure of such images, conventional hand-crafted features and mid-level semantic models [15,38,39,40,41] capture holistic semantic features poorly, which leads to an unsatisfactory classification ability.
On the other hand, underlying correlations between dependent labels need to be unearthed for an efficient prediction of multiple object labels. For example, the existence of ships implies a highly probable co-occurrence of the sea, while the presence of buildings is almost always accompanied by pavement. However, the recently proposed multi-label classification methods [31,32,33,34] assume that classes are independent and employ a set of binary classifiers [31] or a regression model [32,33,34] to infer the existence of each class separately.
To summarize, a well-performing multi-label classification system requires powerful capabilities of learning holistic feature representations and should be capable of harnessing the implicit class dependency.

The Motivation of Our Work
As the survey of related work above shows, recent approaches make little effort to exploit the high-order class dependency, which constrains the performance in multi-label classification. Besides, direct utilization of CNNs pre-trained on natural image datasets [32,33,34] leads to a partial interpretation of aerial images due to their diverse visual patterns. Moreover, most state-of-the-art methods decompose multi-label classification into separate stages, which cuts off their inter-correlations and makes end-to-end training infeasible.
To tackle these problems, in this paper, we propose a novel end-to-end network architecture, the class attention-based convolutional and bidirectional LSTM network (CA-Conv-BiLSTM), which integrates feature extraction and high-order class dependency exploitation for multi-label classification. The contributions of our work to the literature are detailed as follows:
• We regard the multi-label classification of aerial images as a structured output problem instead of a simple regression problem. In this manner, labels are predicted in an ordered procedure, and the prediction of each label is dependent on others. As a consequence, the implicit class relevance is taken into consideration, and structured outputs are more reasonable and closer to the real-world case as compared to regression outputs.
• We propose an end-to-end trainable network architecture for multi-label classification, which consists of a feature extraction module (e.g., a modified network based on VGG-16), a class attention learning layer, and a bidirectional LSTM-based sub-network. These components are designed for extracting features from input images, learning discriminative class-specific features, and exploiting class dependencies, respectively. Besides, such a design makes it feasible to train the network in an end-to-end fashion, which enhances the compactness of our model.
• Considering that class dependencies differ in the two directions, a bidirectional analysis is required for modeling such correlations. Therefore, we employ a bidirectional LSTM-based network, instead of a one-way recurrent neural network, to dig out class relationships.
• We build a new challenging dataset, the DFC15 multi-label dataset, derived from a semantic segmentation dataset, GRSS DFC 2015 (DFC15) [42]. The proposed dataset consists of aerial images at a spatial resolution of 5 cm and can be used to evaluate the performance of networks for multi-label classification.
The following sections further introduce and discuss our network. Specifically, Section 2 provides an intuitive illustration of the class dependency and then details the structure of the proposed network in terms of its three fundamental components. Section 3 describes the setup of our experiments, and experimental results are discussed from quantitative and qualitative perspectives. Finally, the conclusion of this paper is drawn in Section 4.

An Observation
Current aerial image multi-label classification methods [32,33,34] treat the problem as a regression issue, where models are trained to fit a binary sequence, and each digit indicates the existence of its corresponding class. Unlike one-hot vectors, a binary sequence is allowed to contain more than one 'hot' value to indicate the joint existence of multiple candidate classes in one image. Besides, several studies [31] decompose multi-label classification into several single-label classification tasks and thus train a set of binary classifiers, one for each class. Notably, one common assumption of these studies is that classes are independent of each other, and classifiers predict the existence of each category independently. However, this assumption is too strong and does not accord with reality. As illustrated in Fig. 1, although images obtained in diverse scenes are assigned multiple different labels, there are still common classes, e.g., car and pavement, coexisting in each image. This is because, in the real world, some classes have a strong correlation; for example, cars are often driven or parked on pavements. To further demonstrate the class dependency, we calculate conditional probabilities for each pair of categories. Let $C_r$ denote the referenced class, and $C_p$ denote the potential co-occurrence class. The conditional probability $P(C_p \mid C_r)$, which depicts the probability that $C_p$ appears in an image where the existence of $C_r$ is known a priori, can be computed with Eq. 1,

$$P(C_p \mid C_r) = \frac{P(C_p, C_r)}{P(C_r)}, \tag{1}$$

where $P(C_p, C_r)$ indicates the joint occurrence probability of $C_p$ and $C_r$, and $P(C_r)$ refers to the prior probability of $C_r$. Conditional probabilities of all class pairs in the UCM multi-label dataset are listed in Fig. 2, and it is evident that some classes have strong dependencies. For instance, it is highly probable that there are pavements in images that contain airplanes, buildings, cars, or tanks. Moreover, it is notable that class dependencies are not symmetric due to their particular properties. For example, $P(\text{water} \mid \text{ship})$ is twice $P(\text{ship} \mid \text{water})$, because the occurrence of ships always implies the co-occurrence of water, but not vice versa. Therefore, to thoroughly dig out the correlation among various classes, it is crucial to model class probabilistic dependencies bidirectionally in a classification method.
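The conditional probabilities described above can be estimated directly from multi-hot image labels. The following is a minimal sketch, assuming labels are stored as lists of 0/1 values; the function name and the toy label matrix are hypothetical and are not taken from the UCM multi-label dataset.

```python
def conditional_probabilities(labels):
    """Estimate probs[r][p] = P(C_p | C_r) from multi-hot label rows."""
    num_classes = len(labels[0])
    # count[r] = number of images that contain class r
    count = [sum(row[r] for row in labels) for r in range(num_classes)]
    probs = [[0.0] * num_classes for _ in range(num_classes)]
    for r in range(num_classes):
        for p in range(num_classes):
            # number of images that contain both class r and class p
            joint = sum(1 for row in labels if row[r] and row[p])
            # P(C_p | C_r) = P(C_p, C_r) / P(C_r); the image count cancels out
            probs[r][p] = joint / count[r] if count[r] else 0.0
    return probs


# Hypothetical toy labels; columns are [ship, water]
toy_labels = [[1, 1], [0, 1], [0, 1], [0, 0]]
probs = conditional_probabilities(toy_labels)
print(probs[0][1], probs[1][0])  # P(water | ship), P(ship | water)
```

In this toy example every ship image also contains water, while only one of three water images contains a ship, so the resulting matrix is asymmetric in the same way as the dependencies discussed above.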
To this end, we boil the multi-label classification down into a structured output problem, instead of a simple regression issue, and employ a unified framework of a CNN and a bidirectional RNN to 1) extract semantic features from raw images and 2) model image-label relations as well as bidirectional class dependencies, respectively.

Network Architecture
The proposed CA-Conv-BiLSTM, as illustrated in Fig. 3, is composed of three components: a feature extraction module, a class attention learning layer, and a bidirectional LSTM-based recurrent sub-network. More specifically, the feature extraction module employs a stack of interleaved convolutional and pooling layers to extract high-level features, which are then fed into a class attention learning layer to produce discriminative class-specific features. Afterwards, a bidirectional LSTM-based recurrent sub-network is attached to model both probabilistic class dependencies and underlying relationships between image features and labels.
Section 2.2.1 details the architecture of the feature extraction module, and Section 2.2.2 describes the explicit design of the class attention learning layer. Finally, Section 2.2.3 introduces how to produce structured multi-label outputs from class-specific features via a bidirectional LSTM-based recurrent sub-network.

Dense High-level Feature Extraction
Learning efficient feature representations of input images is crucial for the image classification task. To this end, a popular modern trend is to employ a CNN architecture to automatically extract discriminative features, and many recent studies [43,11,16,44,17,23] have achieved great progress in a wide range of classification tasks. Inspired by this, our model adapts VGG-16 [45], one of the most popular CNN architectures for its effectiveness and elegance, to extract high-level features for our task.
Specifically, the feature extraction module consists of 5 convolutional blocks, and each of them contains 2 or 3 convolutional layers (as illustrated in the left of Fig. 3). Notably, the number of filters is the same within a convolutional block and doubles after each pooling layer, which is utilized to reduce the spatial dimension of feature maps. The purpose of such a design is to enable the feature extraction module to learn diverse features at a lower computational expense. The receptive field of all convolutional filters is 3×3, which increases nonlinearities inside the feature extraction module. Besides, the convolution stride is 1 pixel, and the spatial padding of each convolutional layer is set to 1 pixel as well. Among these convolutional blocks, max-pooling layers are interleaved to reduce the size of feature maps while retaining only local representative values, i.e., the maximum in each 2×2-pixel region. The size of pooling windows is 2×2 pixels, and the pooling stride is 2 pixels, which halves feature maps in width and length.
Features directly learned from a conventional CNN (e.g., VGG-16) have been shown to be high-level and semantic, but their spatial resolution is significantly reduced, which is not favorable for generating high-dimensional class-specific features in the subsequent class attention learning layer. To address this, max-pooling layers following the last two convolutional blocks are discarded in our model, and atrous convolutional filters with dilation rate 2 are employed in the last convolutional block to preserve original receptive fields. Consequently, our feature extraction module is capable of learning high-level features with finer spatial resolution, so-called "dense" features, compared to VGG-16, and it is feasible to initialize our model with pre-trained VGG-16, considering that all filters have equivalent receptive fields.
Moreover, it is noteworthy that other popular CNN architectures can be taken as prototypes of the feature extraction module, and thus, we extend our study to GoogLeNet [46] and ResNet [47] for a comprehensive evaluation of CA-Conv-BiLSTM. Regarding GoogLeNet, i.e., Inception-v3 [48], the stride of convolutional and pooling layers after "mixed7" is reduced to 1 pixel, and the dilation rate of convolutional filters in "mixed9" is 2. For ResNet (we use ResNet-50), the convolution stride in the last two residual blocks is set to 1 pixel, and the dilation rate of filters in the last residual block is 2. Besides, the global average pooling layer and all layers after it are removed to ensure dense high-level feature maps.

Class Attention Learning Layer
Although features extracted from pre-trained CNNs are high-level and can be directly fed into a fully connected layer for generating multi-label predictions, it is infeasible to learn high-order probabilistic dependencies by recurrently feeding an RNN with identical features. Therefore, extracting discriminative class-wise features plays a key role in discovering class dependencies and effectively bridging CNN and RNN for multi-label classification tasks.
Here, we propose a class attention learning layer to explore features with respect to each category, and the proposed layer, illustrated in the middle of Fig. 3, consists of the following two stages: 1) generating class attention maps via a 1×1 convolutional layer with stride 1, and 2) vectorizing each class attention map to obtain class-specific features. Formally, let $X$ denote the feature maps extracted from the feature extraction module, with a size of $W \times W \times K$, and let $w_l$ represent the $l$-th convolutional filter in the class attention learning layer. The attention map $M_l$ for class $l$ can be obtained with the following formula:

$$M_l = X * w_l, \tag{2}$$

where $l$ ranges from 1 to the number of classes, and $*$ represents the convolution operation. Considering that the size of convolutional filters is 1×1, a class attention map $M_l$ is intrinsically a linear combination of all channels in $X$.
With this design, the proposed class attention learning layer is capable of learning discriminative class attention maps. Some examples are shown in Fig. 4. An aerial image (cf. Fig. 4a) in the UCM multi-label dataset is first fed into the feature extraction module, adapted from VGG-16, and outputs of its last convolutional block are considered as the feature maps $X$ in Eq. 2. Thus, $X$ is abundant in high-level semantic information, and the size of $X$ is 14×14×512. Afterwards, a class attention learning layer, where the number of filters is equal to that of classes, is appended to generate class-specific feature representations with respect to all categories. With sufficient training, these filters are expected to produce class-wise attention maps. It is observed that class attention maps highlight discriminative areas for different categories and exhibit almost no activations with respect to absent classes (as shown in Fig. 4c). Subsequently, class attention maps $M_l$ are transformed into class-wise feature vectors $v_l$ of $W^2$ dimensions by vectorization. Instead of fully connecting class attention maps to each hidden unit in the following layer, we construct class-wise connections between class attention maps and their corresponding hidden units, i.e., corresponding time steps in an LSTM layer in our network. In this way, features fed into different units remain class-specific and discriminative and significantly contribute to the exploitation of the dynamic class dependency in the subsequent bidirectional LSTM layer.
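The two stages of the class attention learning layer can be illustrated with a small NumPy sketch. It assumes a bias-free 1×1 convolution, so that each attention map is exactly a linear combination of the channels of X; the function name, the random inputs, and the choice of 17 classes (the UCM multi-label case) are illustrative only.

```python
import numpy as np


def class_attention_features(X, w):
    """X: (W, W, K) feature maps; w: (K, L) weights of L 1x1 filters.

    Returns the class attention maps M of shape (W, W, L) and the
    class-wise feature vectors V of shape (L, W*W).
    """
    # A 1x1 convolution with stride 1 is a per-pixel linear combination
    # of the K input channels, computed here with einsum.
    M = np.einsum("ijk,kl->ijl", X, w)
    width = X.shape[0]
    # Vectorize each attention map: row l of V is M[:, :, l] flattened.
    V = M.reshape(width * width, -1).T
    return M, V


rng = np.random.default_rng(0)
X = rng.standard_normal((14, 14, 512))  # size of X reported for VGG-16
w = rng.standard_normal((512, 17))      # one hypothetical filter per class
M, V = class_attention_features(X, w)
print(M.shape, V.shape)  # (14, 14, 17) (17, 196)
```

Each row of V then serves as the input of one time step in the recurrent sub-network, which keeps the connection between attention maps and time steps class-wise.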

Class Dependency Learning via a BiLSTM-based Sub-network
As an important branch of neural networks, the RNN is widely used in dealing with sequential data, e.g., textual data and temporal series, due to its strong capabilities of exploiting implicit dependencies among inputs. Unlike a CNN, an RNN is characterized by its recurrent neurons, whose activations depend on both current inputs and previous hidden states. However, conventional RNNs suffer from the vanishing gradient problem and are found difficult to train on long-term dependencies. Therefore, in this work, we seek to model class dependencies with an LSTM-based RNN, which was first proposed in [49] and has shown great performance in processing long sequences [50,51,52,53,54].
Instead of directly summing up inputs as in a conventional recurrent layer, an LSTM layer relies on specifically designed hidden units, LSTM units, where information, such as the class dependency between category $l$ and $l-1$, is "memorized", updated, and transmitted with a memory cell and several gates. Specifically, let a class-specific feature $v_l$ obtained from the class attention learning layer be the input of the LSTM memory cell $c_l$ at time step $l$, and let $h_l$ represent the activation of $c_l$. New memory information $\tilde{c}_l$, learned from the previous activation $h_{l-1}$ and the present input feature $v_l$, is obtained as follows:

$$\tilde{c}_l = \tanh(W_{cv} v_l + W_{ch} h_{l-1} + b_c),$$

where $W_{cv}$ and $W_{ch}$ denote the weight matrix from input vectors to the memory cell and the hidden-memory coefficient matrix, respectively, and $b_c$ is a bias term. Besides, $\tanh(\cdot)$ is the hyperbolic tangent function. In contrast to conventional recurrent units, where $\tilde{c}_l$ is directly used to update the current state $h_l$, an LSTM unit employs an input gate $i_l$ to control the extent to which $\tilde{c}_l$ is added, and meanwhile, partially omits uncorrelated prior information from $c_{l-1}$ with a forget gate $f_l$. The two gates are computed by the following equations:

$$i_l = \sigma(W_{iv} v_l + W_{ih} h_{l-1} + b_i), \qquad f_l = \sigma(W_{fv} v_l + W_{fh} h_{l-1} + b_f),$$

where $\sigma(\cdot)$ is the sigmoid function, and the weights and biases are defined analogously to those of the memory update. Consequently, the memory cell $c_l$ is updated by

$$c_l = i_l \odot \tilde{c}_l + f_l \odot c_{l-1},$$

where $\odot$ represents element-wise multiplication. Afterwards, an output gate $o_l$, formulated by

$$o_l = \sigma(W_{ov} v_l + W_{oh} h_{l-1} + b_o),$$

is designed to determine the proportion of memory content to be exposed, and eventually, the memory cell $c_l$ at time step $l$ is activated by

$$h_l = o_l \odot \tanh(c_l).$$

It is not difficult to see that the activation of the memory cell at each time step depends on both the input class-specific feature vector and previous cell states. However, since the class dependency is bidirectional, as demonstrated in Section 2.1, a single-directional LSTM-based RNN is insufficient to draw a comprehensive picture of inter-class relevance. Therefore, a bidirectional LSTM-based RNN, composed of two identical recurrent streams but with reversed directions, is introduced in our model, and the hidden units are updated based on signals from not only their preceding states but also subsequent ones.
In order to practically adapt a bidirectional LSTM-based RNN to modeling the class dependency, we set the number of time steps in our bidirectional LSTM-based sub-network equal to that of classes under the assumption that distinct classes are predicted at respective time steps. As validated in Sections 3.3 and 3.4, such a design has two notable characteristics: on one hand, the LSTM memory cell at time step $l$, $c_l$, focuses on learning the dependency between class $l$ and others in both directions (cf. Fig. 5), and on the other hand, the occurrence probability of class $l$, $P_l$, can be predicted from outputs $[h_l, h'_l]$ with a single-unit fully connected layer:

$$P_l = \sigma(W_p [h_l, h'_l] + b_p),$$

where $h'_l$ denotes the activation of $c_l$ in the other direction, $W_p$ and $b_p$ are the weights and bias of the fully connected layer, and $\sigma$ is used as the activation function.
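To make the recurrence concrete, the sketch below runs two LSTM streams over class-wise feature vectors, one time step per class, and predicts one probability per class from the concatenated activations. It assumes the standard LSTM gate equations (tanh candidate memory, sigmoid input, forget, and output gates), zero-initialized biases, and a single prediction layer shared across time steps; these choices, the helper names, and the small hidden size are assumptions of this illustration and not a description of the authors' implementation.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_stream(input_dim, hidden_dim, rng):
    """Parameters (W_v, W_h, b) for the c, i, f, o parts of one stream.

    Weights are uniform in [-0.1, 0.1]; zero biases are an assumption.
    """
    return {
        gate: (rng.uniform(-0.1, 0.1, (hidden_dim, input_dim)),
               rng.uniform(-0.1, 0.1, (hidden_dim, hidden_dim)),
               np.zeros(hidden_dim))
        for gate in ("c", "i", "f", "o")
    }


def affine(params, gate, v, h):
    W_v, W_h, b = params[gate]
    return W_v @ v + W_h @ h + b


def lstm_stream(V, params):
    """Run one directional stream over V of shape (L, D); returns (L, H)."""
    hidden_dim = params["c"][2].shape[0]
    h = np.zeros(hidden_dim)
    c = np.zeros(hidden_dim)
    states = []
    for v in V:  # one time step per class
        c_tilde = np.tanh(affine(params, "c", v, h))  # new memory information
        i = sigmoid(affine(params, "i", v, h))        # input gate
        f = sigmoid(affine(params, "f", v, h))        # forget gate
        o = sigmoid(affine(params, "o", v, h))        # output gate
        c = i * c_tilde + f * c                       # memory cell update
        h = o * np.tanh(c)                            # activation of the cell
        states.append(h)
    return np.stack(states)


def bilstm_predict(V, fwd, bwd, w_p, b_p):
    """Per-class probabilities from forward and backward activations."""
    H_fwd = lstm_stream(V, fwd)
    H_bwd = lstm_stream(V[::-1], bwd)[::-1]  # reversed stream, re-aligned
    H = np.concatenate([H_fwd, H_bwd], axis=1)
    return sigmoid(H @ w_p + b_p)


rng = np.random.default_rng(0)
num_classes, feat_dim, hidden_dim = 17, 196, 8  # toy hidden size
V = rng.standard_normal((num_classes, feat_dim))
fwd = init_stream(feat_dim, hidden_dim, rng)
bwd = init_stream(feat_dim, hidden_dim, rng)
w_p = rng.uniform(-0.1, 0.1, 2 * hidden_dim)
P = bilstm_predict(V, fwd, bwd, w_p, 0.0)
print(P.shape)  # (17,)
```

Because of the backward stream, the probability predicted for the first class depends on the features of every later class, which a single-directional stream cannot provide.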

Experiments and Discussion
In this section, two high-resolution aerial datasets of different resolution used for evaluating our network are first described in Section 3.1, and then, the training strategies are introduced in Section 3.2. Afterwards, the performance of the proposed network on the two datasets is quantitatively and qualitatively evaluated in the following sections.

Data description

UCM Multi-label Dataset
The UCM multi-label dataset [55] is derived from the UCM dataset [15] by reassigning its images multiple object labels. Specifically, the UCM dataset consists of 2100 aerial images of 256×256 pixels, and each of them is categorized into one of 21 scene labels: airplane, beach, agricultural, baseball diamond, building, tennis courts, dense residential, forest, freeway, golf course, mobile home park, harbor, intersection, storage tank, medium residential, overpass, sparse residential, parking lot, river, runway, and chaparral. For each scene label, there are 100 images with a spatial resolution of one foot, collected by manual cropping from aerial ortho imagery provided by the United States Geological Survey (USGS) National Map.
In contrast, images in the UCM multi-label dataset are relabeled by assigning each image sample one or more labels based on their primitive objects. The total number of newly defined object classes is 17: airplane, sand, pavement, building, car, chaparral, court, tree, dock, tank, water, grass, mobile home, ship, bare soil, sea, and field. It is notable that several labels, namely, airplane, building, and tank, are defined in both datasets but at different levels. In the UCM dataset, they are scene-level labels, since they are predominant objects in an image and used to depict the whole image, while in the UCM multi-label dataset, they are object-level labels, regarded as candidate objects in a scene. The numbers of images related to each object category are listed in Table 1, and examples from each scene category are shown in Fig. 6, as well as their corresponding object labels. To train and test our network on the UCM multi-label dataset, we select 80% of sample images evenly from each scene category for training and use the rest as the test set.

DFC15 Multi-label Dataset
Considering that images collected from the same scene may share similar patterns, which makes the task easier, we build a new multi-label dataset, the DFC15 multi-label dataset, based on a semantic segmentation dataset, DFC15 [42], which was published and first used in the 2015 IEEE GRSS Data Fusion Contest. The DFC15 dataset was acquired over Zeebrugge with an airborne sensor flying 300 m above the ground. In total, 7 tiles are collected in the DFC15 dataset, and each of them is 10000 × 10000 pixels with a spatial resolution of 5 cm. Unlike the UCM dataset, where images are assigned image-level labels, all tiles in the DFC15 dataset are labeled at the pixel level, and each pixel is categorized into one of 8 distinct object classes: impervious, water, clutter, vegetation, building, tree, boat, and car. Notably, vegetation refers to low vegetation, such as bushes and grasses, and has no overlap with trees. Impervious indicates impervious surfaces (e.g., roads) excluding building rooftops.
Considering our task, the following processes are conducted: First, we crop large tiles into images of 600 × 600 pixels with a 200-pixel-stride sliding window. Afterwards, images containing unclassified pixels are ignored, and labels of all pixels in each image are aggregated into image-level multi-labels. An important characteristic of images in the DFC15 multi-label dataset is lower inter-image similarity, because they are cropped from vast regions consecutively without specific preferences, e.g., seeking images belonging to a specific scene. Moreover, the extremely high resolution makes the dataset more challenging as compared to the UCM multi-label dataset. The numbers of images containing each object label are listed in Table 2, and example images with their image-level object labels are shown in Fig. 7. To conduct the evaluation, 80% of images are randomly selected as the training set, while the others are utilized to test our network.
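The cropping and label aggregation procedure described above can be sketched as follows. The encoding of the label map is an assumption of this example: class ids are taken to be 1..num_classes and 0 is taken to mark unclassified pixels, which is not necessarily how the DFC15 ground truth is stored. The function name and the toy tile are hypothetical.

```python
import numpy as np

UNCLASSIFIED = 0  # assumed id of unclassified pixels (hypothetical encoding)


def crop_multilabel(tile_labels, num_classes, size=600, stride=200):
    """Slide a size x size window over a pixel-wise label map.

    Returns a list of (row, col, multi_hot) for every window that contains
    no unclassified pixel; multi_hot marks the classes present in the window.
    """
    height, width = tile_labels.shape
    samples = []
    for r in range(0, height - size + 1, stride):
        for c in range(0, width - size + 1, stride):
            patch = tile_labels[r:r + size, c:c + size]
            if (patch == UNCLASSIFIED).any():
                continue  # ignore images containing unclassified pixels
            multi_hot = np.zeros(num_classes, dtype=int)
            # aggregate pixel labels into image-level labels (ids start at 1)
            multi_hot[np.unique(patch) - 1] = 1
            samples.append((r, c, multi_hot))
    return samples


# Toy 6x6 label map with 3 classes, cropped with a 4x4 window and stride 2
tile = np.ones((6, 6), dtype=int)
tile[0, 0] = 3
tile[5, 5] = UNCLASSIFIED
samples = crop_multilabel(tile, num_classes=3, size=4, stride=2)
for row, col, multi_hot in samples:
    print(row, col, multi_hot.tolist())
```

With the default window of 600 pixels and stride of 200 pixels, a 10000 × 10000 tile yields 48 window positions along each axis before windows with unclassified pixels are discarded.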

Training details
The proposed CA-Conv-BiLSTM is initialized with separate strategies for its three main components: 1) the feature extraction module is initialized with CNNs pre-trained on the ImageNet dataset [56], 2) convolutional filters in the class attention learning layer are initialized with a Glorot uniform initializer, and 3) all weights in the bidirectional 2048-d LSTM layer are randomly initialized in the range of [−0.1, 0.1] with a uniform distribution. Notably, weights in the feature extraction module are trainable and fine-tuned during the training phase of our network.
Regarding the optimizer, we chose Adam with Nesterov momentum [57], claimed to converge faster than stochastic gradient descent (SGD), and set parameters of the optimizer as recommended: $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon$ = 1e-08. The learning rate is set to 1e-04 and decayed by a factor of 0.1 when the validation accuracy saturates. The loss of the network is defined as the binary cross entropy. We implement the network in TensorFlow and train it on one NVIDIA Tesla P100 16GB GPU for 100 epochs. The size of the training batch is 32 as a trade-off between GPU memory capacity and training speed. To avoid overfitting, we stop the training procedure when the loss fails to decrease for five epochs. Concerning ground truths, multiple labels of an image are encoded into a multi-hot binary sequence, whose length is equal to the number of all candidate labels. For each digit, 1 indicates the existence of its corresponding label, while 0 denotes its absence.
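As an illustration of the ground-truth encoding and the loss, the sketch below encodes a label set as a multi-hot sequence and evaluates the binary cross entropy for one image. Averaging the loss over labels and clipping probabilities with a small constant are assumptions of this example; the helper names are hypothetical, and the class order follows the DFC15 class list given above.

```python
import math

DFC15_CLASSES = ["impervious", "water", "clutter", "vegetation",
                 "building", "tree", "boat", "car"]


def multi_hot(labels, classes):
    """Encode a set of label names as a 0/1 sequence over all classes."""
    return [1 if name in labels else 0 for name in classes]


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross entropy over the labels of one image."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


target = multi_hot({"water", "boat"}, DFC15_CLASSES)
print(target)  # [0, 1, 0, 0, 0, 0, 1, 0]
uninformed = binary_cross_entropy(target, [0.5] * len(target))
print(round(uninformed, 4))  # 0.6931, i.e. ln 2
```

A prediction of 0.5 for every label gives a loss of ln 2 regardless of the target, which is a convenient reference value when monitoring training.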

Results on UCM Multi-label Dataset

Quantitative Results
To evaluate the performance of CA-Conv-BiLSTM for multi-label classification of high-resolution aerial imagery, we calculate both the $F_1$ [58] and $F_2$ [59] scores as follows:

$$F_\beta = (1 + \beta^2)\,\frac{p_e\, r_e}{\beta^2 p_e + r_e}, \qquad \beta = 1, 2,$$

where $p_e$ is the example-based precision [60] of predicted multiple labels, and $r_e$ indicates the example-based recall. They are computed by:

$$p_e = \frac{TP_e}{TP_e + FP_e}, \qquad r_e = \frac{TP_e}{TP_e + FN_e},$$

where $TP_e$, $FP_e$, and $FN_e$ indicate the numbers of labels correctly predicted as positive (true positives), labels incorrectly predicted as positive (false positives), and labels incorrectly predicted as negative (false negatives) in an example (i.e., an image with multiple object labels in our case), respectively. Then, the $F_2$ scores of all examples are averaged to assess the overall accuracy of multi-label classification tasks. Besides, example-based mean precision and mean recall are calculated to assess the performance from the perspective of examples, while label-based mean precision and mean recall can help us understand the performance of the network from the perspective of object labels:

$$p_l = \frac{TP_l}{TP_l + FP_l}, \qquad r_l = \frac{TP_l}{TP_l + FN_l},$$

where $TP_l$, $FP_l$, and $FN_l$ represent the numbers of correctly predicted positive images, incorrectly predicted positive images, and incorrectly predicted negative images with respect to each label.
For a fair validation of CA-Conv-BiLSTM, we decompose the evaluation into two components: we compare 1) CA-Conv-LSTM with standard CNNs to validate the effectiveness of employing an LSTM-based recurrent sub-network, and 2) CA-Conv-BiLSTM with CA-Conv-LSTM to further assess the significance of the bidirectional structure. The detailed configurations of these competitors are listed in Table 3. For standard CNNs, we substitute the last softmax layers, which are designed for single-label classification, with sigmoid layers to predict multi-hot binary sequences, where each digit indicates the probability of the presence of its corresponding category. To calculate evaluation metrics, we binarize outputs of all models with a threshold of 0.5 to produce binary sequences. Besides, our model is compared with a relevant existing method [32] for a comprehensive evaluation of its performance.
Table 4 exhibits results on the UCM multi-label dataset, and it can be seen that compared to directly applying standard CNNs to multi-label classification, the CA-Conv-LSTM framework performs better, as expected, because it takes class dependencies into consideration. CA-VGG-LSTM increases the mean $F_1$ score by 1.03% with respect to VGGNet, while for CA-ResNet-LSTM, an increment of 1.68% is obtained compared to ResNet. Benefiting most from this framework, CA-GoogLeNet-LSTM achieves the best mean $F_1$ score among the CA-Conv-LSTM models, 81.78%, which is an increment of 1.10% over GoogLeNet. Moreover, CA-ResNet-LSTM shows an improvement of 3.08% in the mean $F_2$ score in comparison with ResNet, while CA-GoogLeNet-LSTM obtains the best $F_2$ score of 85.16%. To summarize, all comparisons demonstrate that, rather than directly applying a standard CNN as a regression model, exploiting class dependencies plays a key role in multi-label classification.
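The example-based scores used in this evaluation can be sketched in a few lines. The sketch assumes the standard definition F_beta = (1 + beta^2) p r / (beta^2 p + r), treats outputs greater than or equal to the threshold as positive, and returns 0 when a denominator is zero; these conventions and the function names are assumptions of the example.

```python
def f_beta(precision, recall, beta):
    """F-beta score; returns 0.0 when precision and recall are both zero."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


def example_scores(y_true, y_prob, threshold=0.5):
    """Example-based precision, recall, F1, and F2 for one image."""
    y_pred = [1 if q >= threshold else 0 for q in y_prob]  # binarize outputs
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f_beta(precision, recall, 1), f_beta(precision, recall, 2)


# Hypothetical image with four candidate labels and one false positive
p_e, r_e, f1, f2 = example_scores([1, 1, 0, 0], [0.9, 0.8, 0.7, 0.1])
print(round(p_e, 4), r_e, round(f1, 4), round(f2, 4))  # 0.6667 1.0 0.8 0.9091
```

In this toy case recall is perfect but precision is not, and the F2 score (about 0.91) is higher than the F1 score (0.8), since F2 weights recall more heavily than precision. Mean scores are obtained by averaging the per-example values over the test set.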
Concerning the significance of employing a bidirectional structure, CA-Conv-BiLSTM performs better than CA-Conv-LSTM in the mean $F_1$ score, and compared to Conv-RBFNN, our models achieve higher mean $F_1$ and $F_2$ scores, increased by at most 0.98% and 2.80%, respectively. Another important observation is that our proposed model has higher example-based recall but lower example-based precision, which leads to a relatively higher mean $F_2$ score. Notably, the $F_2$ score is an evaluation index used in the Kaggle Amazon contest [59] to assess the performance of recognizing challenging rare objects in aerial images, and a higher score indicates a stronger capability. Table 5 exhibits several example predictions in the UCM multi-label dataset. Although our model successfully predicts most object labels, it is observed that grass and tree are prone to be misclassified due to their similar appearances. In the 4th image, grass is a false positive when trees exist, while in the 5th image, tree is a false positive when grass is present. Likewise, the bare soil in the 5th image is missed because its visual patterns are similar to those of grass.

Qualitative Results
In addition to validating the classification capability of the network by computing the mean $F_2$ score, we further explore the effectiveness of the class-specific features learned by the proposed class attention learning layer and try to "open" the black box of our network through feature visualization. Example class attention maps produced by the proposed network on the UCM multi-label dataset are shown in Fig. 8, where column (a) shows the original images and columns (b)-(i) show class attention maps for different objects: (b) bare soil, (c) building, (d) car, (e) court, (f) grass, (g) pavement, (h) tree, and (i) water. As we can see, these maps highlight discriminative regions for positive classes, while presenting almost no activations when the corresponding objects are absent from the original images. For example, the object labels of the image in the first row of Fig. 8 are building, grass, pavement, and tree, and its class attention maps for these categories are strongly activated. For the image in the fourth row of Fig. 8, the regions of grassland, forest, and river are highlighted in their corresponding class attention maps, leading to positive predictions, while no discriminative areas are strongly activated in the other maps.

Following the evaluation on the UCM multi-label dataset, we assess our network on the DFC15 multi-label dataset by calculating the mean $F_1$ and $F_2$ scores as well as the mean example- and label-based precision and recall. Table 6 shows the experimental results on this dataset, from which we conclude that modeling class dependencies with a bidirectional structure contributes significantly to multi-label classification. Specifically, the mean $F_1$ score achieved by CA-ResNet-BiLSTM is 4.87% and 5.55% higher than those of CA-ResNet-LSTM and ResNet, respectively. CA-VGG-BiLSTM obtains the best mean $F_1$ score of 76.25% in comparison with VGGNet and CA-VGG-LSTM, and the mean $F_1$ score of CA-GoogLeNet-BiLSTM is 78.25%, higher than those of its competitors. In comparison with Conv-RBFNN, CA-Conv-BiLSTM exhibits an improvement of at most 5.29% and 4.18% in terms of the mean $F_1$ and $F_2$ scores, respectively.

[...] 9 show that the network pays high attention to impervious regions, such as parking lots, while the maps in column (i) highlight regions of cars. However, some class attention maps for negative object labels exhibit unexpectedly strong activations. For instance, the class attention map for car in the third row of Fig. 9 is not supposed to highlight any region, since no cars are present in the image. This can be explained by the highlighted regions sharing similar patterns with cars, which also illustrates why the network made wrong predictions (cf. the wrongly predicted car label in Fig. 9). Overall, the visualization of class attention maps demonstrates that the features captured by the proposed class attention learning layer are discriminative and class-specific.

Besides, we note that there exist strong border artifacts in the maps, especially those in column (b) of Fig. 9, which raises the question of whether improving the quality of class attention maps benefits the effectiveness of the BiLSTM-based sub-network. We therefore experimented with a skip connection scheme to refine the class attention maps. Experimental results demonstrated that this provides negligible improvements.
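As a point of reference for the mean $F_1$ and $F_2$ scores discussed above, the following is a minimal sketch of the standard example-based $F_\beta$ score, $F_\beta = (1+\beta^2)\,p\,r / (\beta^2 p + r)$, averaged over images. It assumes this standard definition is the one used here; the function names, the convention of returning 0 when there are no true positives, and the toy labels are illustrative and are not taken from the paper or either dataset.

```python
def f_beta(true_labels, pred_labels, beta):
    """Example-based F-beta score for a single image (labels given as sets)."""
    true_labels, pred_labels = set(true_labels), set(pred_labels)
    tp = len(true_labels & pred_labels)
    if tp == 0:
        # Assumed convention: no correct label means a score of zero.
        return 0.0
    precision = tp / len(pred_labels)
    recall = tp / len(true_labels)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def mean_f_beta(all_true, all_pred, beta):
    """Average the per-image F-beta scores over all images."""
    scores = [f_beta(t, p, beta) for t, p in zip(all_true, all_pred)]
    return sum(scores) / len(scores)


# Toy labels, made up for illustration only.
truth = [{"building", "grass", "pavement", "tree"}, {"car", "pavement"}]
preds = [{"building", "grass", "tree"}, {"car", "pavement", "tree"}]

mean_f1 = mean_f_beta(truth, preds, beta=1)
mean_f2 = mean_f_beta(truth, preds, beta=2)
print(f"mean F1 = {mean_f1:.4f}, mean F2 = {mean_f2:.4f}")
```

With $\beta = 2$, recall is weighted more heavily than precision, so a missed label (first toy image) lowers $F_2$ more than a spurious one (second toy image).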

Conclusion
In this paper, we propose a novel network, CA-Conv-BiLSTM, for the multi-label classification of high-resolution aerial imagery. The proposed network is composed of three indispensable elements: 1) a feature extraction module, 2) a class attention learning layer, and 3) a bidirectional LSTM-based sub-network. Specifically, the feature extraction module is responsible for capturing fine-grained high-level feature maps from raw images, while the class attention learning layer is designed for extracting discriminative class-specific features. Afterwards, the bidirectional LSTM-based sub-network is used to model the underlying class dependency in both directions and predict multiple object labels in a structured manner. With this design, the prediction of multiple object-level labels is performed in an ordered procedure, and the outputs are structured sequences instead of discrete values. We evaluate our network on two datasets, the UCM multi-label dataset and the DFC15 multi-label dataset, and the experimental results validate the effectiveness of our model from both quantitative and qualitative perspectives. On the one hand, the mean $F_2$ score is increased by at most 0.0446 compared to other competitors. On the other hand, visualized class attention maps, in which discriminative regions for existing objects are strongly activated, demonstrate that the features learned by this layer are class-specific and discriminative. Looking into the future, the application of our network can be extended to fields such as weakly supervised semantic segmentation and object localization.

Figure 1: Example high-resolution aerial images with their scene labels and multiple object labels. Common label pairs are highlighted. (a) Freeway: bare soil, car, grass, pavement, and tree. (b) Intersection: building, car, grass, pavement, and tree. (c) Parking lot: car and pavement.

Figure 2: The co-occurrence matrix of labels in the UCM multi-label dataset. Notably, all images are taken into consideration when calculating this matrix. Labels on the Y-axis represent referenced classes $C_r$, while labels on the X-axis are potential co-occurrence classes $C_p$. The conditional probability $P(C_p \mid C_r)$ of each class pair is presented in the corresponding block.
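The conditional probabilities in such a co-occurrence matrix can be estimated by counting: $P(C_p \mid C_r)$ is the number of images containing both classes divided by the number of images containing $C_r$. The sketch below illustrates this under that assumption; the function name, the handling of classes that never occur, and the toy label sets are hypothetical and do not come from the UCM annotations.

```python
def cooccurrence_matrix(label_sets, classes):
    """Estimate P(C_p | C_r) for every class pair from per-image label sets.

    Returns a nested dict indexed as matrix[referenced_class][potential_class].
    """
    matrix = {}
    for ref in classes:
        n_ref = sum(1 for labels in label_sets if ref in labels)
        row = {}
        for pot in classes:
            n_both = sum(
                1 for labels in label_sets if ref in labels and pot in labels
            )
            # Assumed convention: a class that never occurs gets a zero row.
            row[pot] = n_both / n_ref if n_ref else 0.0
        matrix[ref] = row
    return matrix


# Toy annotations, made up for illustration only.
images = [
    {"car", "pavement"},
    {"car", "pavement", "tree"},
    {"tree", "grass"},
    {"pavement"},
]
classes = ["car", "grass", "pavement", "tree"]
P = cooccurrence_matrix(images, classes)

print(P["car"]["pavement"])   # every toy image with a car also has pavement
print(P["pavement"]["car"])   # the reverse direction is weaker
```

The matrix is generally asymmetric, since $P(C_p \mid C_r)$ and $P(C_r \mid C_p)$ are normalized by different class counts.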

Figure 3: The architecture of the proposed CA-Conv-BiLSTM for the multi-label classification of aerial images.

Figure 4: Example class attention maps of (a) an aerial image with respect to different classes: (b) bare soil, (c) building, and (d) water.

Figure 5: Illustration of the bidirectional structure. The direction of the upper stream is opposite to that of the lower stream. Notably, $h_{l-1}$ and $c_{l-1}$ denote the activation and memory cell in the upper stream at the time step, which corresponds to class $l-1$ for convenience (considering that the subsequent time step is usually denoted as $l+1$).

Figure 6: Example images from each scene category and their corresponding multiple object labels in the UCM multi-label dataset. Each image is 256 × 256 pixels with a spatial resolution of one foot. The scene and object labels are as follows: (a) Agricultural: field and tree. (b) Airplane: airplane, bare soil, car, grass, and pavement. (c) Baseball diamond: bare soil, building, grass, and pavement. (d) Beach: sand and sea. (e) Building: building, car, and pavement. (f) Chaparral: bare soil and chaparral. (g) Dense residential: building, car, grass, pavement, and tree. (h) Forest: building, grass, and tree. (i) Freeway: bare soil, car, grass, pavement, and tree. (j) Golf course: grass, pavement, sand, and tree. (k) Harbor: dock, ship, and water. (l) Intersection: building, car, grass, pavement, and tree. (m) Medium residential: building, car, grass, pavement, and tree. (n) Mobile home park: bare soil, car, grass, mobile home, pavement, and tree. (o) Overpass: bare soil, car, and pavement. (p) Parking lot: car, grass, and pavement. (q) River: grass, tree, and water. (r) Runway: grass and pavement. (s) Sparse residential: bare soil, building, car, grass, pavement, and tree. (t) Storage tank: bare soil, pavement, and tank. (u) Tennis court: bare soil, court, grass, and tree.

Figure 8: Example class attention maps of (a) images in the UCM multi-label dataset with respect to (b) bare soil, (c) building, (d) car, (e) court, (f) grass, (g) pavement, (h) tree, and (i) water. Red indicates strong activations, while blue represents no activation. Besides, normalization is performed per row for a fair comparison among class attention maps of the same image.
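The per-row normalization mentioned in this caption can be read as rescaling all class attention maps of one image with a shared minimum and maximum, so that colors remain comparable across classes. The sketch below assumes min-max scaling to [0, 1]; the caption does not state the exact normalization, and the function name and toy values are hypothetical.

```python
def normalize_row(maps):
    """Jointly min-max scale all class attention maps of a single image.

    `maps` is a list of 2-D lists, one per class. A shared minimum and
    maximum is used (an assumption), so values stay comparable across classes.
    """
    values = [v for m in maps for row in m for v in row]
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Degenerate case: all activations equal, so map everything to zero.
        return [[[0.0 for _ in row] for row in m] for m in maps]
    return [[[(v - lo) / span for v in row] for row in m] for m in maps]


# Two toy 2x2 maps for one image (values made up for illustration).
maps = [
    [[0.0, 2.0], [4.0, 8.0]],   # strongly activated class
    [[1.0, 1.0], [0.0, 2.0]],   # weakly activated class
]
scaled = normalize_row(maps)
```

Scaling each map independently would instead stretch the weak map to the full color range, which would make absent classes look as active as present ones.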

Figure 9: Example class attention maps of (a) images in the DFC15 dataset with respect to (b) impervious, (c) water, (d) clutter, (e) vegetation, (f) building, (g) tree, (h) boat, and (i) car. Red indicates strong activations, while blue represents no activation. Besides, normalization is performed per row for a fair comparison among class attention maps of the same image.

Table 1: The Number of Images in Each Object Class

Table 2: The Number of Images in Each Object Class

Table 4: Quantitative Results on UCM Multi-label Dataset (%)
Columns: Model, m.$F_1$, m.$F_2$, m.$P_e$, m.$R_e$, m.$P_l$, m.$R_l$.
m.$F_1$ and m.$F_2$ indicate the mean $F_1$ and $F_2$ scores. m.$P_e$ and m.$R_e$ indicate the mean example-based precision and recall. m.$P_l$ and m.$R_l$ indicate the mean label-based precision and recall.
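To make the distinction from the example-based metrics concrete, the following sketch computes mean label-based precision and recall: each is computed per class over all images and then averaged over classes. It assumes this usual macro-averaged definition; the function name, the zero-division convention, and the toy labels are illustrative rather than taken from the paper.

```python
def mean_label_based_pr(all_true, all_pred, classes):
    """Mean label-based precision and recall, averaged over classes."""
    precisions, recalls = [], []
    for c in classes:
        pairs = list(zip(all_true, all_pred))
        tp = sum(1 for t, p in pairs if c in t and c in p)
        fp = sum(1 for t, p in pairs if c not in t and c in p)
        fn = sum(1 for t, p in pairs if c in t and c not in p)
        # Assumed convention: an undefined ratio counts as zero.
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return sum(precisions) / len(classes), sum(recalls) / len(classes)


# Toy labels, made up for illustration only.
truth = [{"car", "pavement"}, {"car", "tree"}, {"tree"}]
preds = [{"car", "pavement"}, {"pavement", "tree"}, {"car", "tree"}]
classes = ["car", "pavement", "tree"]

m_pl, m_rl = mean_label_based_pr(truth, preds, classes)
print(f"m.P_l = {m_pl:.4f}, m.R_l = {m_rl:.4f}")
```

Because every class contributes equally to the average, rare classes influence the label-based scores as much as frequent ones, unlike the example-based scores, which weight every image equally.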

Table 5: Example Predictions on UCM and DFC15 Multi-label Dataset