Spatial Context-Aware Object-Attentional Network for Multi-Label Image Classification


Abstract-Multi-label image classification is a fundamental but challenging task in computer vision. To tackle the problem, the label-related semantic information is often exploited, but the background context and spatial semantic information of related objects are not fully utilized. To address these issues, a multi-branch deep neural network is proposed in this paper. The first branch is designed to extract the discriminant information from regions of interest to detect target objects. In the second branch, a spatial context-aware approach is proposed to better capture the contextual information of an object in its surroundings by using an adaptive patch expansion mechanism. It helps the detection of small objects that are easily lost without the support of context information. The third one, the object-attentional branch, exploits the spatial semantic relations between the target object and its related objects, to better detect partially occluded, small or dim objects with the support of those easily detectable objects. To better encode such relations, an attention mechanism jointly considering the spatial and semantic relations between objects is developed. Two widely used benchmark datasets for multi-label classification, MS COCO and PASCAL VOC, are used to evaluate the proposed framework. The experimental results demonstrate that the proposed method outperforms the state-of-the-art methods for multi-label image classification.
Index Terms-Multi-label image classification, adaptive patch expansion, spatial context-aware object detection, object clustering, spatial semantic attention.

I. INTRODUCTION
The performance of single-label image classification has been greatly improved with the advancement of deep neural networks [1]. An image often contains multiple objects, as shown in Fig. 1, and the difficulty of modeling rich semantic information and higher-order label co-occurrence makes multi-label image classification very challenging [2]. Some examples of challenging objects are shown in the green rectangles in Fig. 1.

Fig. 1. Due to the dim environment, small scales or partial occlusions, some objects may be difficult to recognize by existing solutions that overemphasize the features extracted from the region containing the target object. Additional background contextual information (examples shown in the first row) and object-to-object spatial semantic information (examples shown in the second row) may help to recognize these challenging objects. (Best viewed in color.)
Transforming the multi-labeling problem into multiple binary classification tasks is one of the common strategies [3], which extends the single-label algorithms to the multi-labeling task by training one binary classifier for each label. But this strategy ignores the important label interdependency in multi-label image classification [4], [5], [6], [7], [8], [9]. Many algorithms have been developed to capture the label correlation. Some researchers adapted the pairwise ranking mechanism to capture the label dependency [10], [11], or embedded labels into a well-designed latent space to learn the label correlation [12], [13], [14]. Recurrent neural networks (RNNs) [4], [5], [6] and graph neural networks [15], [16] have also been utilized to capture the label dependency. However, these approaches do not make full use of the spatial and context information of objects.
The spatial information of objects is important for multi-label image classification [7], [8], [17], [18], [19], [20], [21], [22], [23], [24]. A two-stage pipeline is often utilized to first derive regions of interest under the guidance of ground-truth bounding boxes, and then label them by using well-trained deep neural networks in the second stage [18], [19], [20], [21]. The spatial relationship between objects can be exploited to better detect objects with strong spatial correspondence. In [25], [26], [27], and [28], both label and spatial relations are utilized for multi-label image classification. But these methods overemphasize features extracted from regions of interest, while ignoring the useful background context and the spatial semantic attention from nearby objects.
Multiple objects in an image may be semantically related and spatially clustered. It is hence beneficial to handle these objects as a cluster, especially for objects that are difficult to detect individually. As shown in the second row of Fig. 1, by forming object clusters with easily recognizable objects such as 'umbrella', 'computer' and 'bus', challenging objects such as 'keyboard' and 'person' become easier to recognize. The spatial semantic relations among the objects within the same cluster can help to label these challenging objects. Some researchers utilized the cluster concept to recognize sets of small, similar objects in aerial images [39], [40], [41]. However, spatially close and semantically related objects may also form clusters with diversified semantic relations and spatial differences, such as varying object sizes and shapes. Such spatial semantic relations are much more difficult to model.
To address these challenges, the following two techniques are developed in this paper. First, to make use of the complementary information from the background, a dynamic patch-expansion mechanism is designed. The region containing an object is adaptively enlarged to capture its surrounding background context, carefully balancing the enrichment of supporting context information against the increase of intra-class variations caused by the patch expansion.
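As a concrete illustration, enlarging an object region about its centre while keeping it inside the image can be sketched as follows. The centre-based box format `(x, y, w, h)` and the clamping behavior are assumptions for illustration, not the paper's exact expansion function.

```python
def expand_patch(box, scale, img_w, img_h):
    """Expand a tight bounding box (x, y, w, h) about its centre by
    `scale`, clamping the result to the image bounds. Illustrative
    sketch only; the paper's expansion function may differ in details
    such as how boxes near the border are handled."""
    x, y, w, h = box  # (x, y) is the box centre
    new_w, new_h = w * scale, h * scale
    # Clamp the expanded box so it stays inside the image.
    x0 = max(0.0, x - new_w / 2)
    y0 = max(0.0, y - new_h / 2)
    x1 = min(float(img_w), x + new_w / 2)
    y1 = min(float(img_h), y + new_h / 2)
    # Return the clamped box in the same centre-based format.
    return ((x0 + x1) / 2, (y0 + y1) / 2, x1 - x0, y1 - y0)
```

A box far from the border simply scales up; a box near the border is trimmed, so less (but still some) background context is added.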
Second, to better utilize the object-to-object spatial semantic information, an object-attentional mechanism based on the concept of object clusters is designed. The spatially close and semantically related objects are first grouped into clusters. Unlike Vision Transformer [42], which embeds the pixel locations as features, the proposed object-attentional technique explicitly models the spatial relations between objects as the overlap between objects' bounding boxes. Intuitively, the closer the objects, the stronger the spatial co-occurrence between them. The extracted spatial relations are aggregated with the semantic attentions between objects through the proposed object-attentional network. Benefiting from this design, the proposed model can capture the spatial semantic dependencies between the target object and its surrounding objects, and hence can correctly label dim or partially occluded objects that previous models cannot recognize.
The proposed Spatial Context-aware Object-attentional Deep Convolutional Neural Network (SCO-DCNN) is compared with nineteen state-of-the-art methods for multi-label image classification. The experimental results on two large benchmark datasets, MS COCO [43] and PASCAL VOC [44], demonstrate the effectiveness of the proposed approach, improving the previous best mAP from 85.1% to 87.7% on MS COCO and from 94.8% to 96.3% on PASCAL VOC.
The contributions of this paper can be summarized as follows:
1) A spatial context-aware branch with an adaptive patch-expansion mechanism is proposed to make good use of the background contextual information to better recognize small objects.
2) An object-attentional branch is proposed to model spatially close and semantically related objects as object clusters, and to explicitly model and aggregate the spatial and semantic co-occurrences of objects within each cluster.
3) The two proposed techniques are integrated into a three-branch network for multi-label image classification, which shows better performance than state-of-the-art methods.

II. RELATED WORK

A. Multi-Label Image Classification
The widespread use of multi-object images in real-world applications makes the problem of multi-label image classification increasingly attractive. In the early days, the multi-labeling problem was often transformed into a series of binary classification tasks, by considering one possible label at a time [3]. This formulation ignores the inter-dependencies among labels, which have been proven useful in [4] and [12]. Many algorithms have been developed to exploit label dependencies [45], [46], [47], e.g., Sun et al. learned the label dependencies through the sparse factor representation [45]. Jing et al. adapted dictionary learning to capture the label co-occurrences [46]. Nguyen et al. developed a Bayesian approach to consider the label correlations in pairs [47].

B. Spatial Context for Image Classification
The spatial relationships of objects have been utilized for multi-label image classification [8], [17], [18], [19], [20], [21], [26]. Researchers devised a two-stage pipeline: region proposal generation and region labeling. Specifically, Wei et al. [18] designed a Hypotheses-CNN-Pooling framework to first generate object segment hypotheses, and then aggregate the label score for each object hypothesis. Yang et al. [19] employed the selective search strategy to generate region proposals, and encoded the generated proposals and corresponding ground-truth bounding-box annotations into a Fisher vector in a multi-view pipeline. Sun et al. [7] developed a cascaded neural network focusing on the areas that may contain objects. Attention maps [8], [14], [22], [26] have been developed to focus on promising object-specific areas. The two-stage pipeline effectively focuses on the image regions containing objects and encodes the spatial correlations among objects.
The background context information has been shown helpful for object detection [48], [49], [50], [51], [52]. Jeon and Landgrebe [49] developed a cascaded spatio-temporal contextual classifier to exploit both spatial and temporal inter-pixel dependency and aggregated the objects using Gibbs random fields and Markov random fields. Melgani and Serpico [50] utilized both the spectral information and spatial-temporal contextual information for classifying multi-temporal multi-sensor remote-sensing images. Yang et al. [51] developed a context-transformer to handle the scarcity of annotations in detecting objects with few shots. The background context is also useful in salient object detection [29], [31], [32], [33], [34] and image segmentation tasks [30], [35], [36], [37], [38]. In [35] and [38], a Context Contrasted Local (CCL) model is designed to leverage the useful context and spotlight the local information in contrast to the context. In [30], a shape-adaptive convolutional layer is designed to confine the contextual region dynamically, thereby aggregating the context information in scale- and shape-variant regions.

C. Object-Attentional Information for Image Classification
The object-to-object information has been utilized in various recognition tasks. Heterogeneous cues such as semantic, spatial and scale contexts of objects are jointly integrated to construct a higher-order descriptor (HOOD) in [53]. Oh et al. adapted HOOD to detecting abnormal objects by using fully-connected conditional random fields [54]. Researchers also attempted to capture the interactions between humans and objects for semantic action recognition [55], [56]. Gupta et al. designed a Bayesian video interpretation framework that integrates various visual perceptual tasks to better understand human-object interactions [57]. Wang et al. developed an attention framework to capture subtle interactions between humans and objects [58]. To dynamically attain the discriminant features, Zou et al. designed a transformer architecture and a quintuple matching loss to capture the relations between humans and objects from a global image context [59].
Objects in multi-label images may be spatially clustered. LaLonde et al. developed a two-stage framework comprising ClusterNet and FoveaNet for detecting small objects in large scenes [39], where the ClusterNet is designed to generate a set of object clusters and the FoveaNet is designed to recognize objects within each cluster. Following a similar design, Yang et al. developed an end-to-end framework named ClusDet to detect objects in aerial images [40]. Tang et al. developed a coarse anchor-free detector to adaptively generate clusters for small objects, and removed overlapping clusters by a position refinement strategy [41]. The cluster concept in [39], [40], and [41] only considers a collection of sub-regions containing a series of small similar objects, but does not consider clusters formed by spatially close and semantically related objects. In this paper, these objects are modeled as object clusters, and a spatial semantic attention mechanism is designed to aggregate the complicated spatial semantic relations among objects.

III. PROPOSED METHOD

A. Overview of Proposed SCO-DCNN
In this paper, a Spatial Context-aware Object-attentional Deep Convolutional Neural Network (SCO-DCNN) is proposed to solve the multi-label classification problem. The overall architecture of the proposed model is shown in Fig. 2. A feature extractor integrating ResNeXt-101 [60] with Feature Pyramid Networks (FPN) [61] is utilized to extract features. A multi-branch multi-object classification framework is designed to make full use of both the discriminant information of the object itself and the spatial contextual information around the object for better recognition. While the Object-focus Branch extracts the discriminant information of objects from the tightly cropped image patches, the other two branches exploit the spatial context around objects, one from the background and the other from related objects. The Spatial Context-aware Branch is designed to effectively exploit the background context. The adaptive patch-expansion mechanism dynamically expands the image patch to capture the contextual information from the surrounding background while avoiding unnecessary increase of intra-class variations. To make use of the object-to-object information, the Object-attentional Branch is designed to first group the spatially close and semantically related objects into clusters, and then encode the spatial semantic information within the same cluster by using the proposed object-attentional network. Finally, the classification results from these three branches are combined for the final prediction.

B. Object-Focus Branch
This branch extracts the discriminant information of objects from tightly cropped image patches. Two network heads are designed to process the patches: a patch regressor to determine the object's bounding box and a score predictor to predict the probability of the object belonging to a certain class.
Fig. 2. ResNeXt-101 with feature pyramid network is used as the backbone to generate the feature maps. Then, in the Object-focus Branch, a set of bounding boxes are generated for regions containing objects. In the Spatial Context-aware Branch, an adaptive patch-expansion mechanism is designed to expand tightly cropped patches to include the background contextual information, while avoiding excessive inclusion of unnecessary background that would increase intra-class variations. In the Object-attentional Branch, the spatially close and semantically related objects are grouped to form object clusters, and the spatial semantic information within clusters is encoded using the proposed spatial semantic attentional network.

1) Patch Regressor: Denote the input image as $I$. The feature pyramid $X$ is obtained as $X = F(I; \theta)$, where $\theta$ represents the parameters of the feature extractor $F$. The estimated bounding boxes $\hat{B}$ are obtained as:

$$\hat{B} = F_R(X; \theta_R),$$

where $F_R$ is the patch regressor with model parameters $\theta_R$. $F_R$ is designed to locate the objects accurately to make good use of the discriminant information of the object, guided by a location regression loss $L_l$ during training. Similarly as in [62], the objective here is to maximize the intersection over union (IOU) between the generated bounding boxes $\hat{B}$ and the ground-truth bounding boxes. The location regression loss is defined as:

$$L_l = \frac{1}{N_B}\sum_{i=1}^{N_B}\sum_{t \in \{x, y, w, h\}} \varphi(\hat{t}_i - t_i),$$
where $N_B$ denotes the total number of bounding boxes and $i$ indicates the index of a bounding box; $x$, $y$, $w$, $h$ correspond to the centre coordinates, width and height of the patch, respectively; and $\varphi(\cdot)$ is a non-sensitive outlier-removal function for enhancing the training robustness.

2) Score Predictor: After determining the object's bounding box, a score predictor is utilized to predict the confidence scores of all possible labels for each generated patch. The predictor is composed of a dense network following the design principle of multi-label image classification. The cross-entropy loss is adopted to optimize the predictor. More specifically, denote the ground-truth label vector corresponding to the $n$-th patch as $y_n = [y_{n,1}, y_{n,2}, \ldots, y_{n,C}] \in \mathbb{R}^C$, where $C$ represents the number of possible labels, and $y_{n,l} = 1$ if the patch is annotated with the class label $l$, and 0 otherwise. Denote the predicted score vector of the $n$-th patch as $y^O_n = [y^O_{n,1}, y^O_{n,2}, \ldots, y^O_{n,C}]$. The predictor loss $L^O_p$ is calculated as:

$$L^O_p = -\frac{1}{N_B}\sum_{n=1}^{N_B}\sum_{l=1}^{C}\left(y_{n,l}\log y^O_{n,l} + (1-y_{n,l})\log(1-y^O_{n,l})\right),$$

where $Y^O$ and $Y$ represent the predicted score vectors and the ground-truth label vectors for the $N_B$ patches, respectively.
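The predictor loss is a standard multi-label binary cross-entropy, averaged over patches. A minimal NumPy sketch (score clipping for numerical safety is an implementation detail added here):

```python
import numpy as np

def multilabel_bce(Y_pred, Y_true, eps=1e-12):
    """Binary cross-entropy over N_B patches and C labels, summed over
    labels and averaged over patches, as in the branch losses above.
    Y_pred and Y_true are (N_B, C) arrays; sketch only."""
    Y_pred = np.clip(Y_pred, eps, 1.0 - eps)  # avoid log(0)
    per_term = Y_true * np.log(Y_pred) + (1 - Y_true) * np.log(1 - Y_pred)
    return float(-per_term.sum() / Y_pred.shape[0])
```

The same form is reused (with each branch's own predicted scores) for the Spatial Context-aware and Object-attentional branches.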

C. Spatial Context-Aware Branch
In the Object-focus Branch, a set of object patches are generated to cover the discriminant information of the objects. However, a tightly cropped patch only contains the information of the object itself, but ignores the surrounding context. An intuitive example in Fig. 3 shows how the background context helps detect the target object. With additional surrounding context information, the object 'train' becomes easier to recognize.
To exploit the background context around objects, a Spatial Context-aware Branch is designed in this paper. Tightly cropped bounding boxes are first expanded into larger ones, to cover both the object and the background context. However, over-expanding the patch will increase the intra-class variations, i.e., various background scenes may be included. To balance the trade-off between adding contextual information and increasing intra-class variations, an adaptive patch-expansion mechanism is designed. Specifically, the tightly cropped bounding boxes are expanded with different scales:

$$B^i = F_E(\hat{B}, \lambda_i),$$

where $F_E$ represents the expansion function, and $\lambda_i$ and $B^i$ represent the expansion scale and the expanded bounding boxes for scale $i$, respectively. The score vector $y^i_j \in \mathbb{R}^C$ of the $j$-th expanded bounding box $B^i_j$ can be obtained as follows:

$$y^i_j = F_S(X, B^i_j; \theta_S),$$

where $F_S$ represents the score predictor for the Spatial Context-aware Branch and $\theta_S$ represents its network parameters. To determine the most suitable scale, an intuitive but effective method is designed. The scale selection strategy is based on the entropy of the score vector. Intuitively, the entropy of the ground-truth score vector is low, whereas any uncertainty in the predicted score vector will increase the entropy. The optimal scale is hence chosen to minimize the uncertainty of $y^i_j$. Its entropy is calculated as follows:

$$E(y^i_j) = -\sum_{l=1}^{C} y^i_{j,l} \log y^i_{j,l},$$

where $y^i_{j,l}$ represents the score of the $l$-th class of the $j$-th score vector for scale $\lambda_i$. The total entropy for the $N_B$ expanded patches of scale $\lambda_i$ can be obtained as:

$$E_i = \sum_{j=1}^{N_B} E(y^i_j).$$

The scale with the smallest total entropy is hence selected as the most suitable expanding scale:

$$i^* = \arg\min_i E_i.$$

In this way, the patches can be optimally expanded to include as much background context as possible without adding too many intra-class variations. The cross-entropy loss is employed to optimize this branch.
Suppose that $N_B$ patches are generated for each scale, and denote the predicted score vectors of the optimal expanding scale $\lambda_{i^*}$ and the ground-truth label vectors as $Y^S$ and $Y$, respectively. The multi-label classification loss used to guide the optimization of this branch is defined as follows:

$$L^S_p = -\frac{1}{N_B}\sum_{j=1}^{N_B}\sum_{l=1}^{C}\left(y_{j,l}\log y^{i^*}_{j,l} + (1-y_{j,l})\log(1-y^{i^*}_{j,l})\right).$$
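The entropy-based scale selection above can be sketched directly: compute the total entropy of the predicted scores at every scale and keep the scale with the smallest total. The input format, a mapping from scale index to an (N_B, C) score array, is an assumption for illustration.

```python
import numpy as np

def select_scale(scores_per_scale, eps=1e-12):
    """Pick the expansion scale whose predicted score vectors have the
    lowest total entropy. `scores_per_scale` maps a scale index i to an
    (N_B, C) array of per-patch class scores y^i. Sketch of the
    selection strategy; score normalization details are assumptions."""
    totals = {}
    for i, Y in scores_per_scale.items():
        Y = np.clip(Y, eps, 1.0)              # avoid log(0)
        totals[i] = float(-(Y * np.log(Y)).sum())  # E_i: sum of patch entropies
    return min(totals, key=totals.get)        # i* = argmin_i E_i
```

Confident (low-entropy) predictions indicate the expansion captured helpful context without flooding the patch with distracting background.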

D. Object-Attentional Branch
Besides the background context around objects, the spatial context from nearby objects is also important. To utilize the spatial and semantic relations among objects, a spatial semantic attention network based on object clustering is designed to better recognize poorly illuminated or partially occluded objects. As shown in Fig. 4, the proposed Object-attentional Branch consists of four main stages: 1) group spatially close and semantically related objects into object clusters; 2) measure the appearance composability of objects in clusters to model the semantic attention; 3) calculate the geometric composability of objects in clusters to model the spatial object-to-object attention; 4) aggregate the spatial semantic attention and generate a set of object-attentional features $F_A$. The generated features are then passed to the score predictor to derive the score vectors $Y^A$ for this branch.

Fig. 4. The spatial semantic attentional information of objects within the cluster is extracted using the proposed spatial semantic attention blocks, each of which consists of three stages: 1) semantic attention modeling, 2) spatial attention modeling and 3) spatial semantic aggregation. Finally, the object-attentional features of objects are derived using residual layers and feed-forward layers.

Fig. 5. Two examples illustrating the importance of object clustering. The objects bounded in column (a) are recognized by previous solutions [62], [63], while the 'person' bounded with the green boxes in the middle column is not recognized due to occlusion or poor illumination. With the additional spatial semantic support provided by the 'person-umbrella' and 'person-horse' clusters shown in column (c), the 'person' can be better recognized.
1) Object Clustering: Spatially close and semantically related objects form object clusters. As shown in Fig. 5, the persons in both images are difficult to detect using previous solutions [62], [63]. Object clusters can be formed for the 'person' and 'umbrella' in the first image, and for the 'person' and 'horse' in the second image, as they are spatially close and semantically related. Both 'umbrella' and 'horse' are large and easily recognizable objects, while the persons are small or largely occluded. With the support of those easily recognizable objects in the cluster, challenging objects can be better recognized. The object-attentional blocks are then developed to exploit the object correlations in appearance and geometry within the same cluster.
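One simple way to realize the grouping step is to treat clusters as connected components under a pairwise closeness relation. The sketch below takes an arbitrary `overlaps(i, j)` predicate (e.g. positive box IOU); this criterion is a simplification for illustration, since the paper groups objects by spatial closeness combined with semantic relatedness.

```python
def cluster_objects(objects, overlaps):
    """Group objects into clusters as connected components of a pairwise
    relation. `objects` is any sequence (only its length is used here);
    overlaps(i, j) returns True when objects i and j should share a
    cluster. Union-find keeps the grouping near-linear."""
    n = len(objects)
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a

    for i in range(n):
        for j in range(i + 1, n):
            if overlaps(i, j):
                parent[find(i)] = find(j)  # union the two components

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```

Each resulting cluster is then processed by the object-attentional blocks described next.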
2) Semantic Attention Modeling: For a given target object, its semantic attention to other objects in the same cluster is obtained by measuring the pairwise appearance correlation between them. The higher the semantic similarity to the target object, the more attention it receives. More specifically, to enhance the feature representations, the input features for all objects are first packed as $P$, and then projected into three embedding spaces as the query matrix $Q$, key matrix $K$ and value matrix $V$, respectively:
$$Q = F_Q(P; \theta_Q), \quad K = F_K(P; \theta_K), \quad V = F_V(P; \theta_V),$$

where $F_Q$, $F_K$ and $F_V$ represent the transformation networks, and $\theta_Q$, $\theta_K$ and $\theta_V$ represent the learnable parameters of these three networks, respectively. The objects' appearance attention $S(Q, K)$ is modeled as:

$$S(Q, K) = \frac{QK^{\top}}{\sqrt{d}},$$

where $d$ is the feature dimension. To ensure a stable gradient, the attention map is normalized by $\sqrt{d}$. The higher the score in the appearance attention map, the more related the two corresponding objects are in the semantic space. In this way, correlated objects are weighted more heavily to improve the discriminant power of the target object. The value matrix $V$ will be used later for aggregating the spatial semantic information.
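A minimal sketch of the appearance attention, with plain linear projections standing in for the transformation networks F_Q and F_K:

```python
import numpy as np

def semantic_attention(P, Wq, Wk):
    """Scaled dot-product appearance attention S(Q, K) = Q K^T / sqrt(d).
    P is the (N, d_in) packed object-feature matrix; Wq and Wk are
    (d_in, d) projection matrices standing in for F_Q and F_K (sketch)."""
    Q, K = P @ Wq, P @ Wk       # project into query/key embedding spaces
    d = Q.shape[1]              # feature dimension used for normalization
    return Q @ K.T / np.sqrt(d)
```

Each entry S[i, j] scores how semantically related object j is to target object i; sharing the projection for query and key would make the map symmetric.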
3) Spatial Attention Modeling: In Natural Language Processing (NLP) [64] and computer vision [5], [25], [27], [65], it is common to measure feature attention by modeling the geometric relations between different elements to achieve weighted information aggregation. While the spatial attention in NLP tasks [64] models the relative word orders in a sentence, the spatial locations of patch pixels in an image are often utilized [5], [25], [27]. Different from these approaches, to model the spatial relations between different objects, the positional embedding of the proposed attention mechanism in this paper is defined as the intersection over union (IOU) between objects. Denote the spatial relation between two object patches $P_i$ and $P_j$ as $R(\hat{B}_i, \hat{B}_j)$, where $\hat{B}_i$ and $\hat{B}_j$ are the two corresponding bounding boxes:

$$R(\hat{B}_i, \hat{B}_j) = \frac{A_i \cap A_j}{A_i \cup A_j},$$

where $A_i$ and $A_j$ denote the areas of object patches $P_i$ and $P_j$, and $\cap$ and $\cup$ represent the intersection and union operations between two patches, respectively. Intuitively, the greater the value of $R(\hat{B}_i, \hat{B}_j)$, the closer the two object patches $P_i$ and $P_j$ are in the spatial space, and hence the more spatial attention they can provide for each other. When $R(\hat{B}_i, \hat{B}_j) = 0$, the two objects are disjoint. The proposed pairwise 2D positional embedding models the spatial relations of objects while also constraining the object cluster to a local image patch. As shown shortly, such spatial relations among objects are integrated with the semantic relations to model the spatial semantic relations of objects in the cluster.
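The IOU-based positional embedding can be computed as a pairwise matrix over the cluster's boxes; corner-format boxes (x0, y0, x1, y1) are assumed here for simplicity:

```python
import numpy as np

def iou_matrix(boxes):
    """Pairwise IOU R(B_i, B_j) for boxes given as (x0, y0, x1, y1)
    corners. Disjoint objects get R = 0, matching the positional
    embedding described above. O(n^2) sketch."""
    n = len(boxes)
    R = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            # Intersection rectangle (empty when boxes are disjoint).
            x0 = max(boxes[i][0], boxes[j][0])
            y0 = max(boxes[i][1], boxes[j][1])
            x1 = min(boxes[i][2], boxes[j][2])
            y1 = min(boxes[i][3], boxes[j][3])
            inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
            a_i = (boxes[i][2] - boxes[i][0]) * (boxes[i][3] - boxes[i][1])
            a_j = (boxes[j][2] - boxes[j][0]) * (boxes[j][3] - boxes[j][1])
            R[i, j] = inter / (a_i + a_j - inter)  # union = a_i + a_j - inter
    return R
```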

4) Spatial Semantic Aggregation:
We now show how to aggregate the spatial and semantic attentions. The spatial semantic aggregated features $F_A$ are derived by utilizing the value matrix $V$, the semantic appearance attention $S(Q, K)$ and the spatial embedding $R$ defined above, as follows:

$$F_A = f_s\big(S(Q, K) \odot R\big) V,$$

where $f_s(\cdot)$ is the softmax function and $\odot$ denotes the element-wise product. The product of $S(Q, K)$ and $R$ ensures that both appearance and geometric information is incorporated to better recognize the target object. The weights associated with the appearance and geometric aggregation are normalized using the softmax function.
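A sketch of the aggregation step; combining the attention map and the spatial embedding by element-wise product before the softmax is an interpretation of the description above:

```python
import numpy as np

def aggregate(S, R, V):
    """Spatial semantic aggregation F_A = softmax(S * R) V: weight the
    semantic attention S (N, N) by the spatial embedding R (N, N),
    normalize per target object with a row-wise softmax, and aggregate
    the value matrix V (N, d). Sketch; the element-wise combination is
    an assumption."""
    W = S * R                                  # combine appearance and geometry
    W = np.exp(W - W.max(axis=1, keepdims=True))   # numerically stable softmax
    W = W / W.sum(axis=1, keepdims=True)       # rows sum to 1 (f_s)
    return W @ V                               # weighted aggregation of values
```

Objects that are both semantically related and spatially overlapping contribute the largest weights to the target object's aggregated feature.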
The effectiveness of weighted aggregation in the attention mechanism has propagated from NLP to the vision field. Following the design in [64], a series of methods have been designed, collectively referred to as Vision Transformers (ViT) [42]. These solutions reshape a global image into a sequence of flattened 2D patches and capture the long-range dependencies between patches using the attention mechanism. In contrast, the proposed spatial semantic attention mechanism generates a set of object clusters, models the semantic attention between objects using the object-attentional mechanism, models the spatial attention using the IOU between objects, and aggregates the spatial semantic information through an aggregation network, which effectively exploits the object-to-object attentional information within the object cluster.
After the spatial semantic attention network, a feed-forward network is designed to better encode the object-attentional information. It consists of two fully connected layers with a ReLU activation in between. In addition, two residual layers are used to alleviate the gradient degradation problem: one is added after the attention layer, and the other after the feed-forward layer, as shown in Fig. 4. These feed-forward layers and the preceding subnetworks for semantic attention modeling, spatial attention modeling and spatial semantic aggregation form the object-attentional block. This block is stacked several times (denoted by ×N in Figs. 2 and 4) to form the Object-attentional Branch, which produces the cluster-based object-attentional features $F_A$. A score predictor is then applied on the generated features to derive the score vectors $Y^A$ of the Object-attentional Branch. This branch is trained using the traditional multi-label classification loss defined as:

$$L^A_p = -\frac{1}{N_B}\sum_{n=1}^{N_B}\sum_{l=1}^{C}\left(y_{n,l}\log y^A_{n,l} + (1-y_{n,l})\log(1-y^A_{n,l})\right),$$

where $y^A_{n,l}$ is the predicted confidence score that the $n$-th patch contains the label $l$ in the Object-attentional Branch. Finally, the total loss function $L$ is calculated as follows:

$$L = L_l + \alpha L^O_p + \beta L^S_p + \gamma L^A_p,$$

where $\alpha$, $\beta$ and $\gamma$ are the weights balancing the losses. The outputs of the three branches are then fused for the final prediction, i.e., the labeling results of the input image.
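Putting the pieces together, one object-attentional block (attention sub-layer with a residual connection, then a two-layer feed-forward network with a second residual connection) can be sketched as follows; the abstract `attn` callable and the weight shapes are illustrative assumptions:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def object_attentional_block(F, W1, W2, attn):
    """One object-attentional block (sketch). F is the (N, d) object
    feature matrix; attn(F) stands for the spatial semantic aggregation
    sub-layer; W1 (d, d_ff) and W2 (d_ff, d) are the feed-forward
    weights. Normalization layers are omitted for brevity."""
    A = F + attn(F)         # residual connection around the attention layer
    H = relu(A @ W1) @ W2   # feed-forward: two fully connected layers, ReLU between
    return A + H            # residual connection around the feed-forward layer
```

Stacking this block N times yields the cluster-based object-attentional features F_A fed to the branch's score predictor.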

IV. EXPERIMENTAL RESULTS
A. Experimental Settings

1) Datasets: PASCAL VOC (Visual Object Classes) 2007 dataset [44] has been widely used for multi-label image classification. It covers 20 different categories and contains 9963 images in total. As defined in [44], 5011 images are used for training and 4952 images for testing.
MS COCO (Microsoft Common Objects in Context) 2017 dataset [43] is a large-scale dataset, primarily built for scene context learning tasks such as image recognition, segmentation and captioning. It contains 80 categories, with an average of approximately 2.9 labels per image. It has been widely used as the benchmark dataset for image multi-labeling. Officially, the dataset is split into 118K, 5K and 41K images for training, validation and testing, respectively [43]. As the ground-truth labels of the testing set are not provided, following the common setting in other models [14], [16], [26], the validation set is adopted as the testing set for evaluation.
2) Implementation Details: The ResNeXt-101 [60] integrated with feature pyramid networks (FPN) [61] is pre-trained on the ImageNet [66] dataset and used as the backbone of our model. The former aggregates a set of transformations to improve the classification capabilities of deep neural networks, and the latter employs the pyramid representations to extract a rich visual semantic abstract. The last pooling and classification layers in ResNeXt-101 are removed and the feature maps from the last convolutional layer are used as the input of FPN. A 4-stage feature pyramid from high resolution to low resolution is constructed in FPN.
Four input image sizes are commonly used in literature, e.g., 288 × 288 [6], [22], 448 × 448 [14], [15], [16], [26], 576 × 576 [23], [24], and the multi-scale strategy [27] that resizes the input image to a size in {288, 448, 608, 768, 928}. The proposed method is evaluated on all these four input image settings. The default loss weights α, β and γ are set to 1, unless otherwise stated. The stochastic gradient descent strategy is employed with a weight decay rate of 0.01, a momentum rate of 0.9, a dropout rate of 0.5, and an initial learning rate of 0.002. To speed up the training convergence, the transfer-learning strategy is employed to optimize the proposed model, i.e., load and freeze the pre-trained weights of the backbone, and then train the remaining modules. Once converged, the whole model is then jointly optimized in an end-to-end manner. For the PASCAL dataset, the pretrained model is fine-tuned on the COCO dataset first before conducting an end-to-end training on the PASCAL dataset. Our empirical studies show that such a training strategy could slightly boost the classification accuracy compared with directly fine-tuning on the PASCAL dataset.
3) Compared Methods: The proposed method is compared against nineteen state-of-the-art models. Depending on whether spatial and semantic label dependencies are utilized, they are grouped into the following categories: a) Single-label image classification: ResNet101 [1] was initially designed for single-label image classification but performs well in the multi-labeling task when appropriate loss functions are used, and hence it is selected as the baseline method. b) Exploiting semantic label dependencies only: Many models [4], [6], [14], [15], [16], [23] in the literature exploit the semantic label dependencies to tackle the multi-labeling problem, e.g., CNN-RNN [4] adapts the recurrent neural network, PLA (Predicted Label Alignment) [6] is utilized in orderless recurrent models to dynamically determine the label ordering of the sequence, ML-GCN [15] and KSSNet [16] utilize graph neural networks, and ResNet-CRL [14] and C-Tran [23] construct a latent vector space to explicitly model the complex dependencies among visual features and labels. c) Exploiting spatial semantic label dependencies: Many approaches [5], [8], [9], [18], [19], [20], [21], [22], [24], [25], [26], [27] model both the spatial and semantic label co-occurrences. Through exploiting the dependencies of attentive regions, RNN-Attention [25] and ResNet101-SRN [8] are capable of capturing both semantic and spatial relations for objects. In ResNet101-ACfs [22], an attention consistency loss is designed to measure the differences between the attentional regions of the original image and those of the affine-transformed image. RNN-frequency [5] and WSL-GCN [27] utilize the recurrent neural network and graph neural network, respectively, to embed the labels on the basis of locating attentive regions. In MS-CMA [26], a cross-modality attention mechanism with graph embedding is designed to extract class-specific features and capture spatial semantic label dependencies.
As some labels are associated with the entire image, DELTA [9] and MCAR [24] utilize both local and global attentive information concurrently. Some approaches utilize bounding-box annotations as additional supervision information [18], [19], [20], [21], e.g., Hypotheses-CNN-Pooling (HCP) [18], which aggregates the label scores of the generated object-segment hypotheses, FeV-LV [19], which leverages visual features and label dependencies in a multi-view pipeline, RLSD+ft-RPN [20], which utilizes the Regional Latent Semantic Dependencies (RLSD) model to characterize the latent semantic dependencies, and TSGCN [21], which simultaneously characterizes object spatial relationships and semantic label correlations using two respective graph streams.

4) Evaluation Metrics:
The same evaluation metrics used in other models [14] are employed to evaluate the proposed method, i.e., precision (P-C and P-O), recall (R-C and R-O) and F1 score (F1-C and F1-O), where '-C' denotes the 'per-category' metric, obtained by averaging the corresponding values over the categories, and '-O' denotes the 'overall' metric, which calculates the average value directly over all samples. In general, the former is susceptible to rare categories, whereas the latter is easily dominated by the majority classes. Precision is the ratio of correctly annotated positive labels among all predicted positive labels, and recall is the ratio of correctly annotated positive labels among all ground-truth positive labels. The F1 score is the harmonic mean of precision and recall, evaluating the overall performance.
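A minimal sketch of the per-category ('-C') and overall ('-O') metrics, assuming binary label vectors per image; note that F1-C is computed here from the averaged P-C and R-C, while some works instead average per-class F1 scores:

```python
def multilabel_metrics(y_true, y_pred):
    """Per-category (-C) and overall (-O) precision, recall and F1.

    y_true, y_pred: lists of equal-length 0/1 label vectors, one per image.
    Assumes at least one predicted and one ground-truth positive label.
    """
    n_classes = len(y_true[0])
    tp = [0] * n_classes    # true positives per class
    pred = [0] * n_classes  # predicted positives per class
    gt = [0] * n_classes    # ground-truth positives per class
    for t, p in zip(y_true, y_pred):
        for c in range(n_classes):
            tp[c] += t[c] * p[c]
            pred[c] += p[c]
            gt[c] += t[c]
    # Per-category: average the per-class ratios (sensitive to rare classes).
    p_c = sum(tp[c] / pred[c] for c in range(n_classes) if pred[c]) / n_classes
    r_c = sum(tp[c] / gt[c] for c in range(n_classes) if gt[c]) / n_classes
    # Overall: pool the counts over all classes first (dominated by majority classes).
    p_o = sum(tp) / sum(pred)
    r_o = sum(tp) / sum(gt)

    def f1(p, r):
        return 2 * p * r / (p + r) if p + r else 0.0

    return {"P-C": p_c, "R-C": r_c, "F1-C": f1(p_c, r_c),
            "P-O": p_o, "R-O": r_o, "F1-O": f1(p_o, r_o)}
```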
Table II: Classification performance in terms of mAP using different combinations of backbones and input image sizes. OB and SCB stand for the Object-focus Branch and the Spatial Context-aware Branch, respectively. The performance gain of OB-SCB over OB illustrates the benefit of incorporating the background context through the Spatial Context-aware Branch; the gain of the proposed method over OB-SCB illustrates the benefit of adding the supportive information from nearby objects through the Object-attentional Branch.

The average precision (AP) and mean average precision (mAP) are also widely used as evaluation metrics [4], [16], [26], where AP calculates the mean of the precision scores obtained after recognizing each relevant label (sorted by confidence score from high to low), and mAP is the mean of the class-wise APs. mAP, F1-C and F1-O provide a one-value evaluation of the model, whereas P and R need to be analyzed in pairs; hence these three metrics are mainly used for the analysis of results in this paper.
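The AP and mAP definitions above can be sketched as follows (a simplified illustration; `scores` holds the per-image confidence scores for one class and `labels` the corresponding ground truth):

```python
def average_precision(scores, labels):
    """AP for one class: mean of the precision values measured at each
    true positive, with predictions ranked by confidence high to low."""
    ranked = sorted(zip(scores, labels), key=lambda x: -x[0])
    hits, precisions = 0, []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(per_class):
    """mAP: mean of the class-wise APs; per_class is a list of
    (scores, labels) pairs, one per class."""
    return sum(average_precision(s, l) for s, l in per_class) / len(per_class)
```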
B. Ablation Study

1) Analysis of Proposed Multi-Branch Structure: An ablation study is carried out to evaluate the proposed multi-branch structure. The compared methods are summarized as follows.
OB: Only the Object-focus Branch is used, while the other two branches are removed. It serves as the baseline method here.
OB-SCB: Both the Object-focus Branch and the Spatial Context-aware Branch are used. This set of results demonstrates the performance gain from adding the Spatial Context-aware Branch.
Proposed SCO-DCNN: All three branches, the Object-focus Branch, the Spatial Context-aware Branch and the Object-attentional Branch, are used.
By using the Spatial Context-aware Branch and the Object-attentional Branch, the proposed model consistently improves the mAP of the baseline model that uses only the Object-focus Branch, under six different combinations of backbones and input image sizes. These results illustrate the benefits of adding the background context and the object-to-object spatial semantic information into the model. The performance gains on the MS COCO dataset are larger than those on the PASCAL VOC dataset because there are more labels per image in MS COCO, i.e., on average 2.9 labels in MS COCO versus 1.4 labels in PASCAL VOC. Hence there is a higher chance that multiple objects in an MS COCO image could form an object cluster, so that the spatial semantic relations among objects can be better exploited. The results also show that significantly better performance can be achieved by using ResNeXt-FPN as the backbone than by using ResNet-101, and that the multi-scale strategy outperforms the input image sizes of 448 × 448 and 576 × 576. Owing to this better performance, ResNeXt-FPN is utilized as the backbone and the multi-scale input strategy is adopted for the proposed model in the remaining experiments, unless otherwise stated.
2) Effect of α, β and γ: Ablation studies are conducted on the three hyper-parameters, the weights α, β and γ of the total training loss. ResNeXt-FPN is used as the backbone and the multi-scale input strategy is employed for this set of ablation studies. Two of the parameters are fixed to their default values while the remaining one is varied; the results are reported in Table II.
Experimental results show that the model achieves a similar performance when adjusting only one loss weight in the range of 0.3 to 3, indicating that the model is insensitive to the choice of the loss weights. It achieves the best performance when all three branches share the same loss weight of 1.
3) Effect of the Number of Objects in an Image: The proposed method deals with the problem of multi-label image classification. Intuitively, if there are more objects in an image, it is more likely that the proposed method can make use of the spatial, contextual and semantic relations among them. To validate this conjecture, the proposed model is evaluated on the subsets obtained by dividing the test set of the MS COCO dataset according to the number of objects in each image, and the mAP is reported. The comparison results are summarized in Table III. As shown in Table III, the mAP of the proposed method exhibits an upward trend as the number of objects in an image increases; with more objects, the objects are more likely to form object clusters, and hence the benefit of the proposed Object-attentional Branch becomes more significant. These results show the importance of utilizing the spatial, contextual and semantic information among objects, validate the proposed formulation of object clusters, and reflect the effectiveness of the designed object-attentional mechanism in exploiting the spatial semantic relations between objects in a cluster. Since most images in the PASCAL VOC dataset contain only one object, such analysis is not performed on this dataset.
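The subset construction used in this experiment can be sketched as follows, assuming each test image is annotated with its ground-truth object count (the pair format is illustrative, not the dataset's actual layout):

```python
from collections import defaultdict

def split_by_object_count(test_set):
    """Divide a test set into subsets according to the number of
    ground-truth objects in each image; the per-subset mAP can then
    be computed on each subset independently.

    test_set: list of (image_id, n_objects) pairs.
    """
    subsets = defaultdict(list)
    for image_id, n_objects in test_set:
        subsets[n_objects].append(image_id)
    return dict(subsets)
```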

4) Time Complexity Analysis:
An ablation study on the computational cost and the empirical testing time per image has been carried out, with ResNeXt-FPN as the backbone and the multi-scale input strategy. Table IV presents the number of model parameters for the backbone, OB, OB-SCB, and the full model, together with the approximate inference time per image for these four models. The proposed full model roughly doubles the parameter size and inference time of the backbone.

C. Comparison With State-of-the-Art Models
The proposed method is compared with fourteen state-of-the-art methods on the MS COCO dataset [43] and the PASCAL VOC dataset [44].
1) Results on MS COCO: The comparison results against the state-of-the-art approaches on the MS COCO dataset are summarized in Table V. Clearly, the proposed model significantly outperforms all the state-of-the-art models in terms of the key evaluation metrics such as mAP and F1 scores under the same experimental settings, i.e., using the same backbone and the same image size. Specifically, under the most common experimental setting, i.e., ResNet-101 as the backbone and an input image size of 448 × 448, the proposed method outperforms the state-of-the-art MS-CMA [26] and MCAR [24] by 1.7% in terms of mAP. Compared to the previously best-performing method, C-Tran [23], the proposed method increases the mAP from 85.1% to 86.0% under the same settings, i.e., ResNet-101 as the backbone and an input image size of 576 × 576. Compared to the previously second-best performing approach, WSL-GCN [27], which utilizes ResNet-101 as the backbone and the multi-scale input strategy, the proposed method boosts the mAP by 1.6%. With ResNeXt-FPN as the backbone and the multi-scale input strategy, the proposed method further increases the mAP to 87.7%. These results demonstrate that the proposed approach effectively exploits the background context and the spatial semantic information among objects for multi-label image classification.
2) Results on PASCAL VOC: The comparisons to the state-of-the-art methods on the PASCAL VOC dataset are summarized in Table VI, in terms of the average precision for each of the 20 classes and the mAP. The results of all the previous solutions are taken directly from their original papers.
As shown in Table VI, in terms of mAP, the proposed model outperforms the previously best-performing method MCAR [24] by 1.0% when using the same backbone ResNet-101 and the same input size 576 × 576. Compared with the previously second-best method, WSL-GCN [27], which utilizes ResNet-101 as the backbone and the multi-scale input strategy, the proposed model boosts the mAP by 1.2%. The proposed method achieves an mAP of 96.3% when using ResNeXt-FPN as the backbone and the multi-scale input strategy, which is significantly better than the state-of-the-art methods. In terms of the AP values, the proposed method performs the best in most categories. Notably, the performance gains on some categories are significant, e.g., 'bottle' and 'plant': such relatively small objects can be better recognized with the help of the background context information and/or semantically related objects nearby. Benefiting from the easily recognizable objects nearby, small, dim, or occluded objects can be better recognized. The experimental results on the PASCAL VOC dataset once again validate the effectiveness of the proposed method in recognizing multiple objects in an image by utilizing the background context and the spatial semantic information among objects.
3) Visualization of Multi-Label Image Classification Results: To visualize how the proposed model performs, some of the multi-label image classification results are shown in Fig. 6. The green bounding boxes indicate the objects recognized by the proposed method but not by C-Tran [23], the previously best-performing method on the MS COCO dataset. The blue bounding boxes represent the objects detected by both methods. The yellow bounding boxes mark the failure cases where the objects are detected by neither method.

Fig. 6: Visualization of multi-label image classification results. The green bounding boxes indicate the objects detected by the proposed method but not by C-Tran [23], the blue bounding boxes indicate the objects detected by both, and the yellow bounding boxes indicate the failure cases of both. The proposed method makes good use of the background context and the attentional information among objects, so that small, partially occluded or dim objects can be better recognized with the support of the background context and/or the easily recognizable objects nearby.

From Fig. 6, it can be seen that the proposed model plays a key role in recognizing small objects that are easily neglected, e.g., the 'keyboard' in the workspace and the 'person' in the 'train'. This is achieved partially through the Spatial Context-aware Branch, which incorporates the background context information into the model, and partially through the support from nearby easily recognizable objects via the Object-attentional Branch. Some objects are hard to recognize due to partial occlusion or a dim environment, e.g., the 'bottles' or 'chairs' occluded by a 'person', and the 'person' or 'tie' in the dark. Some objects are difficult to detect due to shape changes or only part of the object being visible, e.g., the 'kite' and the 'apple' pieces, or due to severe occlusion, e.g., the 'backpack', 'chair', 'tie' and 'suitcase'. These objects are difficult even for humans to recognize.
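The color-coding rule of Fig. 6 can be expressed as a simple function of which method recognizes an object (a sketch; note that the figure does not define a color for objects recognized by C-Tran only, and such cases fall back to 'yellow' here):

```python
def box_color(detected_by_ours, detected_by_ctran):
    """Color code used in Fig. 6 for each ground-truth object:
    green  - recognized by the proposed method but not by C-Tran,
    blue   - recognized by both methods,
    yellow - missed by both (failure case)."""
    if detected_by_ours and detected_by_ctran:
        return "blue"
    if detected_by_ours:
        return "green"
    return "yellow"
```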
The proposed method makes good use of the background context and the spatial semantic information among objects to better recognize these challenging objects.

V. CONCLUSION AND FUTURE WORK
In this paper, a Spatial Context-aware Object-attentional Deep Convolutional Neural Network is designed for multi-label image classification to make use of the spatial and context information among objects. The Object-focus Branch extracts the discriminant information of objects. The adaptive patch expansion mechanism in the Spatial Context-aware Branch enables the model to leverage the background contextual information while avoiding an excessive increase of intra-class variations. The object cluster designed in the Object-attentional Branch embeds the spatial and semantic dependencies among objects. The proposed spatial attention mechanism explicitly models the spatial relations among objects and effectively aggregates both the spatial and semantic relations among them. The proposed method is compared with nineteen state-of-the-art approaches on the MS COCO dataset and the PASCAL VOC dataset. The experimental results demonstrate that the proposed method consistently outperforms all the compared models.
The spatial and context information jointly extracted by the proposed method is important in various computer vision applications such as object segmentation and detection. As demonstrated in [30], [31], [35], [36], [37], and [38], the background context could help to spotlight the local information and enhance the contrast between foreground and background, thereby recovering image details and refining the segmentations and saliency maps. Moreover, as demonstrated in [29], [39], [40], and [41], grouping objects that are spatially close and semantically related into clusters can help to better recognize small objects in a large scene. We will explore the potential of our method in these applications in the future. This work could also be extended to the spatial embedding of objects in 3D space by integrating techniques such as depth estimation, to better understand the spatial relations of objects.