Multi-label Image Classification using Adaptive Graph Convolutional Networks: from a Single Domain to Multiple Domains

This paper proposes an adaptive graph-based approach for multi-label image classification. Graph-based methods have been widely exploited in the field of multi-label classification, given their ability to model label correlations. Specifically, their effectiveness has been proven not only when considering a single domain but also when taking into account multiple domains. However, the topology of the used graph is not optimal, as it is pre-defined heuristically. In addition, consecutive Graph Convolutional Network (GCN) aggregations tend to destroy the feature similarity. To overcome these issues, an architecture for learning the graph connectivity in an end-to-end fashion is introduced. This is done by integrating an attention-based mechanism and a similarity-preserving strategy. The proposed framework is then extended to multiple domains using an adversarial training scheme. Numerous experiments are reported on well-known single-domain and multi-domain benchmarks. The results demonstrate that our approach achieves competitive results in terms of mean Average Precision (mAP) and model size as compared to the state-of-the-art. The code will be made publicly available.


Introduction
Multi-label image classification has been widely investigated by the computer vision community, given its practical relevance in numerous application fields, such as human attribute recognition (Li et al., 2016), scene classification (Shao et al., 2015) and deepfake detection (Singh et al., 2023). In contrast to traditional image classification approaches that associate a single label with an input image, multi-label image classification aims to detect the presence of a set of objects.
Thanks to the latest progress in deep learning, most recent methods rely on single-stream Deep Neural Networks (DNN), including Convolutional Neural Networks (CNN) (Simonyan and Zisserman, 2015; He et al., 2016; Zhu et al., 2017; Ge et al., 2018; Gao and Zhou, 2021; Ridnik et al., 2021b), Recurrent Neural Networks (RNN) (Wang et al., 2016) and Transformers (Lanchantin et al., 2021; Cheng et al., 2022; Wang et al., 2022a; Chen et al., 2020). Nevertheless, their impressive performance comes at the cost of very large architectures that are unsuitable for memory-constrained environments. As an alternative, another line of research has tried to integrate priors related to label correlations (Chen et al., 2019b,a; Li et al., 2019; Wang et al., 2020b; Singh et al., 2022b; Wang et al., 2022b; Sun et al., 2022). As demonstrated in (Singh et al., 2022b), such an approach contributes to improving scalability. In other words, fewer parameters are required to achieve performance comparable with traditional DNNs.
Graph-based approaches (Chen et al., 2019b; Singh et al., 2022b) are among the most popular multi-label classification methods that aim at modeling label correlations. In addition to a standard DNN that extracts image features, a second stream based on a Graph Convolutional Network (GCN) is used for generating inter-dependent label classifiers. In particular, the input graph is used to model label dependencies. Each node represents a label, and each edge is characterized by the probability of co-occurrence of a label pair. Graph-based methods have also been shown to be successful in the challenging context of cross-domain multi-label classification (Lin et al., 2021). Standard multi-label image classification methods such as (Chen et al., 2019b; Singh et al., 2022b) usually assume that unseen images and training data are drawn from the same distribution, i.e. the same domain, hence ignoring a possible domain shift problem (Lin et al., 2021). This leads, therefore, to poor generalization capabilities under cross-domain settings. Unsupervised Domain Adaptation (UDA) is a plausible solution to overcome this challenge without relying on costly annotation efforts (Ganin and Lempitsky, 2015). UDA aims at learning domain-invariant features to bridge the gap between a source domain and a target domain without access to the associated labels. In particular, one of the most successful UDA approaches for multi-label classification, namely DA-MAIC (Lin et al., 2021), leverages graph representations to model label inter-dependencies and couples them with an adversarial training approach (Ganin et al., 2016).

Fig. 1: Comparison of our approach (ML-AGCN) without (top) and with UDA (bottom) to recent state-of-the-art methods in terms of number of parameters (millions) and mean Average Precision (mAP) on MS-COCO and Clipart → VOC. The considered state-of-the-art methods are: MlTr-m (Cheng et al., 2022), TResNet-L (Ridnik et al., 2021b), ML-Decoder (Ridnik et al., 2023), ML-GCN (Chen et al., 2019b), ResNet101 (He et al., 2016), DA-MAIC (Lin et al., 2021), and DANN (Ganin et al., 2016).

Despite their usefulness in both single- and cross-domain settings, graph-based methods (Chen et al., 2019b,a; Li et al., 2019; Wang et al., 2020b; Singh et al., 2022b; Wang et al., 2022b; Sun et al., 2022; Lin et al., 2021) are unfortunately subject to three major limitations, namely: (1) The graph structure is heuristically defined. In particular, it is computed based on the co-occurrence of labels in the training data. Hence, this topology might not be ideal for the specific task of multi-label image classification; (2) A threshold is empirically fixed for discarding edges with a low co-occurrence probability. This means that infrequent co-occurrences are assumed to be noisy. Although this might be true in many cases, assuming that any rare event corresponds to noise does not always hold; and (3) It has been proven in (Jin et al., 2021) that successive aggregation operations in the GCN usually dissipate the node similarity in the original feature space, hence potentially leading to a decrease in terms of performance.
Herein, we posit that by integrating adequate mechanisms in graph-based approaches for addressing the aforementioned issues, it should be possible to reduce the network size even further while achieving competitive performance in both single-domain and cross-domain settings.
In this paper, we propose an adaptive graph-based multi-label classification method called Multi-Label Adaptive Graph Convolutional Network (ML-AGCN) for both single-domain and cross-domain contexts. Our idea consists in: (1) learning two additional adjacency matrices in an end-to-end manner instead of solely relying on a heuristically defined graph topology. Note that no threshold is applied, avoiding the loss of weak yet relevant connections. In particular, the first learned graph topology computes the importance of each node pair. This is carried out by employing an attention mechanism similar to Graph Attention Networks (GAT) (Velickovic et al., 2017). The second is built based on the similarity between node features and overcomes the information loss occurring through successive convolutions; (2) integrating the proposed adaptive graph-based architecture in an adversarial domain adaptation framework for aligning a labeled source domain with an unlabeled target domain. As shown in Fig. 1, the results suggest that our method is competitive with the state-of-the-art in terms of both mean Average Precision (mAP) and network size under single-domain and cross-domain settings.
This paper is an extended version of (Singh et al., 2022a). In comparison to (Singh et al., 2022a), the main contributions of this article are the extension of the proposed adaptive graph-based framework from the single-domain setting to multiple domains through an adversarial training scheme, as well as additional experiments on well-known multi-domain benchmarks.

The remainder of this paper is organized as follows. Section 2 reviews the state-of-the-art on multi-label image classification and domain adaptation for multi-label classification. Section 3 formulates the problem of graph-based multi-label image classification and its applicability to different domains. In Section 4 and Section 5, the proposed method is detailed. Section 6 presents the experimental results and analysis. Finally, Section 7 concludes this work and highlights interesting future directions.

Related Works
In this section, we discuss the state-of-the-art approaches for multi-label image classification in both single and cross-domain settings.
As discussed in Section 1, most multi-label classification methods (Razavian et al., 2014; Wei et al., 2015; Lanchantin et al., 2021; Cheng et al., 2022; Wang et al., 2022a; Ridnik et al., 2021b,a) employ a single-stream DNN. More specifically, they mainly take inspiration from successful architectures proposed in the context of single-label image classification such as CNN, RNN and transformers. For example, Razavian et al. (2014) have used a pre-trained OverFeat (Sermanet et al., 2014) model and have adapted it to multi-label image classification. Wei et al. (2015) have leveraged the predictions of multiple CNN architectures pretrained on ImageNet (Deng et al., 2009) such as AlexNet (Krizhevsky et al., 2012) and VGG-16 (Simonyan and Zisserman, 2015). Ridnik et al. (2021a) have employed TResNet (Ridnik et al., 2021b) with a novel loss called Asymmetric Loss (ASL) that focuses more on positive labels than negative ones. TResNet, introduced in (Ridnik et al., 2021b), is based on a ResNet architecture with a series of modifications for optimizing GPU utilization while maintaining performance.
Recently, Lanchantin et al. (2021) attempted to leverage transformers for modeling complex dependencies among visual features and labels. Similarly, MlTr (Cheng et al., 2022) combines pixel attention with attention among image patches to better exploit the transformer's potential in multi-label image classification. More recently, ML-Decoder (Ridnik et al., 2023) proposed a transformer-based classification head, instead of the standard Global Average Pooling (GAP), for improving the generalization capability.
Going deeper into the network enables the model to learn more abstract features from the image. However, this comes at the cost of high memory requirements. Moreover, these single-stream methods do not explicitly model the relationships between labels, which are an important semantic element to consider (Chen et al., 2019b).
In order to incorporate the information of label correlations, a second class of methods has used an additional subnetwork alongside the main backbone. For instance, Zhu et al. (2017) have introduced a Spatial Regularization Net (SRN) to learn the underlying relationships between labels by generating label-wise attention maps. A similar direction has been followed by Qu et al. (2023). Graphs can also be an interesting way of modeling label correlations (Chen et al., 2019b; Singh et al., 2022b; Wang et al., 2020a, 2022a), while keeping the network size reasonable. Graph-based methods (Chen et al., 2019b; Singh et al., 2022b; Wang et al., 2020a) usually employ a GCN for learning inter-dependent label-wise classifiers and combine it with a standard DNN that learns discriminative image features. The input graph is heuristically predefined based on label co-occurrences in the training set, where low values are ignored. However, following such an empirical strategy might lead to sub-optimal performance. Additionally, destroying the similarity of node features through the GCN layers might lead to a decrease in performance, as discussed in (Jin et al., 2021).
Moreover, similar to any deep learning method, the aforementioned multi-label classification methods heavily depend on the availability of a large amount of annotated data. For mitigating the huge cost caused by data annotation, the field of unsupervised domain adaptation (Li et al., 2020; Zhang et al., 2019; Long et al., 2017; Ganin et al., 2016; Long et al., 2018) has been widely investigated over the last decade. It aims at making use of an existing labeled dataset from a related domain, called the source domain, to enhance the model performance on a domain of interest, termed the target domain, for which only unlabeled data are provided. Unsupervised domain adaptation methods can be separated into two main categories. The first one (Li et al., 2020; Zhang et al., 2019; Long et al., 2017) aims at explicitly reducing the domain gap by minimizing statistical discrepancy measures between the two domains. Alternatively, the second class of methods implicitly minimizes this domain gap by adopting an adversarial training approach (Ganin et al., 2016; Long et al., 2018). The main idea consists in using a domain classifier to play a min-max two-player game with the feature generator. This strategy is designed to enforce the generation of domain-invariant features that are sufficiently discriminative.
Nevertheless, most existing techniques focus on the task of single-label image classification. In fact, very few papers have considered domain adaptation for multi-label image classification (Li et al., 2021; Pham et al., 2021; Chen et al., 2019b).
Among these rare references, we can mention ML-ANet (Li et al., 2021), which explicitly minimizes the domain gap by optimizing multi-kernel maximum mean discrepancies (MK-MMD) in a Reproducing Kernel Hilbert Space (RKHS). More recently, an adversarial approach has been adopted in (Pham et al., 2021), where a condition-based domain discriminator similar to conditional GANs (Mirza and Osindero, 2014) has been employed. However, similar to the first category of methods for traditional multi-label image classification, these two approaches (Li et al., 2021; Pham et al., 2021) neglect the important information of label dependencies. A graph-based approach called DA-MAIC has then been proposed as an alternative (Lin et al., 2021). As in ML-GCN (Chen et al., 2019b), the authors have proposed to build a graph for modeling label correlations based on label co-occurrences. Additionally, to reduce the domain shift between the source and target domains, they have used a domain classifier that is trained in an adversarial manner. Unfortunately, DA-MAIC is impacted by the same drawbacks affecting ML-GCN (Chen et al., 2019b), as detailed in Section 1. More details regarding these issues are given in the next section.

Background and Problem Formulation
In this section, Graph Convolutional Networks (GCN) are first reviewed. Then, the problem of multi-label image classification in both single and cross-domain settings is formulated.

Background: Graph Convolutional Networks (GCN)
CNNs are defined on a regular grid and are, therefore, not directly applicable to non-Euclidean structures such as graphs. GCNs (Kipf and Welling, 2017) have been proposed as the generalization of traditional CNNs to graphs. They have been very successful in numerous computer vision applications, including human pose estimation (Cai et al., 2019) and human action recognition (Papadopoulos et al., 2020a,b). Let us denote a graph by G = (V, E, F). The set V = {v_1, v_2, ..., v_N} is formed by N nodes, while E = {e_1, e_2, ..., e_M} refers to the set of M edges connecting the nodes. Finally, F = {f_1, f_2, ..., f_N} represents the node features, such that f_i ∈ R^d corresponds to the features of the node v_i.
Let us assume that F_l is the input node feature matrix of the l-th layer and A ∈ R^{N×N} the adjacency matrix of the graph G. Each GCN layer can be seen as a non-linear function h(·) that computes the node features F_{l+1} ∈ R^{N×d'} of the (l+1)-th layer as follows,

F_{l+1} = h(A F_l W_l),     (1)

where W_l ∈ R^{d×d'} is the weight matrix. Note that A is normalized before using Eq. (1).
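As an illustration, one GCN layer of Eq. (1) can be sketched in a few lines of NumPy. The symmetric normalization with self-loops and the choice of ReLU for h are the standard ones from Kipf and Welling (2017); they are assumptions here, not specifics of this section:

```python
import numpy as np

def normalize_adjacency(A):
    # Symmetric normalization D^{-1/2} (A + I) D^{-1/2}; self-loops keep
    # each node's own features in the aggregation.
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def gcn_layer(F_l, A_norm, W_l):
    # One propagation step F_{l+1} = h(A F_l W_l), with ReLU as the
    # non-linearity h (any non-linear h fits Eq. (1)).
    return np.maximum(0.0, A_norm @ F_l @ W_l)

rng = np.random.default_rng(0)
N, d, d_out = 4, 8, 16
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)        # toy undirected graph
F0 = rng.standard_normal((N, d))                 # node features of layer 0
F1 = gcn_layer(F0, normalize_adjacency(A), rng.standard_normal((d, d_out)))
```

Stacking such layers gives the L-layer GCN used throughout this section.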

Problem Formulation 3.2.1. Graph-based multi-label image classification
The goal of multi-label image classification is to predict the presence or absence of a set of objects O = {1, 2, ..., N} in a given image I. This can be done by learning a function f such that,

f: R^{w×h} → [[0, 1]]^N,  I ↦ y = [y_1, y_2, ..., y_N],

where w and h define the pixel-wise width and height of the image, respectively, and y_i = 1 indicates the presence of the label i in I, in contrast to y_i = 0. Graph-based multi-label image classification methods such as ML-GCN (Chen et al., 2019b) and IML-GCN (Singh et al., 2022b) usually involve two subnetworks: (1) a feature generator denoted by f_g, and (2) an estimator of N inter-dependent binary classifiers denoted by f_c. The generator mostly corresponds to an off-the-shelf CNN network which produces a d_f-dimensional image feature representation as described below,

X = f_g(I) ∈ R^{d_f}.

For instance, ML-GCN integrates a ResNet101 (He et al., 2016), while IML-GCN makes use of TResNet-M (Ridnik et al., 2021b).
On the other hand, the second subnetwork f_c is a GCN formed by L layers which takes a fixed graph G = (V, E, F) as input, where card(V) = N. In fact, each node v_i ∈ V refers to the label i ∈ {1, 2, ..., N} and each f_i ∈ F corresponds to its associated label embedding. The adjacency matrix A ∈ R^{N×N} of G is usually pre-computed based on the co-occurrence probabilities that are estimated over the training set (Singh et al., 2022b; Chen et al., 2019b). In addition, rare co-occurrences are discarded as they are assumed to be noisy. In other words, given an empirically fixed threshold τ, the entries of A are binarized such that A_ij = 1 if p_ij ≥ τ and A_ij = 0 otherwise, with p_ij = p(l_j | l_i) being the probability of co-occurrence of label j and label i in the same image.
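A minimal sketch of this heuristic graph construction, assuming a binary annotation matrix over the training set; the conditional-probability estimate is the usual count ratio, and the threshold value below is illustrative:

```python
import numpy as np

def cooccurrence_adjacency(labels, tau):
    # labels: (num_images, N) binary annotation matrix from the training set.
    # p_ij = p(l_j | l_i) estimated as (#images with both i and j) / (#images with i).
    counts = labels.T @ labels                     # pairwise co-occurrence counts
    occurrences = np.diag(counts).astype(float)    # per-label occurrence counts
    P = counts / np.maximum(occurrences[:, None], 1.0)
    A = (P >= tau).astype(float)                   # binarize: rare pairs discarded
    np.fill_diagonal(A, 0.0)
    return A

# Toy annotations: 4 images, 3 labels.
labels = np.array([[1, 1, 0],
                   [1, 1, 1],
                   [1, 0, 0],
                   [0, 0, 1]])
A = cooccurrence_adjacency(labels, tau=0.5)        # tau chosen for illustration
```

Note that A is asymmetric in general, since p(l_j | l_i) ≠ p(l_i | l_j).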
Given F_l ∈ R^{N×d_l} as the input node features of the l-th layer, the node features F_{l+1} of the (l+1)-th layer are therefore computed by following Eq. (1).
Finally, the vertex features produced by the last GCN layer, i.e., F_L ∈ R^{N×d_f}, form the N inter-dependent classifiers.
In summary, f_c can be defined as follows,

f_c: G(N) → R^{N×d_f},  G ↦ F_L = [F_L^1, ..., F_L^N],

where G(N) represents the set of graphs with N nodes, and F_L^i for i ∈ {1, ..., N} is the generated inter-dependent binary classifier associated with the label i.
Hence, f, which returns the final prediction, can be defined as follows,

f(I) = sig(f_c(G) f_g(I)) = sig(F_L X),

where sig(x) = 1 / (1 + e^{-x}) is the sigmoid activation function. However, as explained in Section 1, three main limitations can be noted in graph-based methods: (1) the computation of the adjacency matrix A is made heuristically and is decoupled from the training process; (2) a threshold τ is empirically fixed for completely ignoring rare co-occurrences; and (3) as shown in (Jin et al., 2021), successively aggregating node features in a graph may induce the loss of the similarity/dissimilarity information present in the initial feature space.
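The resulting two-stream prediction reduces to a matrix-vector product followed by a sigmoid. A sketch with random stand-ins for the learned classifiers F_L and feature X (shapes only; the 80-label setting mirrors MS-COCO):

```python
import numpy as np

def predict(F_L, x):
    # F_L: (N, d_f) inter-dependent label classifiers from the GCN stream.
    # x:   (d_f,)  image feature vector X = f_g(I) from the CNN stream.
    logits = F_L @ x
    return 1.0 / (1.0 + np.exp(-logits))        # sig(x) = 1 / (1 + e^{-x})

rng = np.random.default_rng(0)
N, d_f = 80, 2048                               # e.g. 80 labels, 2048-dim features
scores = predict(rng.standard_normal((N, d_f)), rng.standard_normal(d_f))
```

Each entry of `scores` is the estimated probability of the corresponding label being present.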

Graph-based unsupervised domain adaptation for multilabel image classification
Let us consider a source dataset for multi-label image classification {(I_s^k, y_s^k)}_{k=1}^{n_s} drawn from a distribution D_s and a similar unlabelled target dataset {I_t^k}_{k=1}^{n_t} drawn from a different distribution D_t. The matrix I_s^k ∈ R^{w×h} refers to the k-th image sample of D_s and y_s^k ∈ [[0, 1]]^N is its associated label vector, while I_t^k ∈ R^{w×h} denotes the k-th image sample of D_t. The variables n_s and n_t refer respectively to the total number of samples in D_s and D_t. The goal of unsupervised domain adaptation is to train a model using labeled source domain data sampled from D_s and unlabelled target domain samples from D_t for making accurate predictions on the target domain. Despite its relevance in real-world applications, few domain adaptation methods have been proposed for multi-label image classification. Among the most recent and accurate approaches, one can mention DA-MAIC (Lin et al., 2021), which also takes advantage of the graph representation for modeling label correlations. It directly extends ML-GCN (Chen et al., 2019b) by integrating an adversarial training for domain adaptation. Consequently, DA-MAIC is subject to the same limitations induced by ML-GCN (Chen et al., 2019b) mentioned in Section 3.2.1.

Multi-Label Adaptive Graph Convolutional Network (ML-AGCN)
To handle the challenges mentioned in Section 3, a novel graph-based approach called Multi-Label Adaptive Graph Convolutional Network (ML-AGCN) is introduced.

Overview of the Proposed Architecture
Similar to (Chen et al., 2019b) and (Singh et al., 2022b), a network formed by two subnets is adopted, as illustrated in Fig. 2. The first is a CNN that extracts a discriminative representation from a given input image, while the second is a GCN-based network that learns N inter-dependent classifiers. As in (Singh et al., 2022b), TResNet-M, which represents a small version of TResNet (Ridnik et al., 2021b), is employed as the CNN subnetwork. TResNet is a direct extension of ResNet, which fully exploits the GPU capabilities to boost the model efficiency. The graph-based subnet, Adaptive Graph Convolutional Network (AGCN), uses the same image embeddings proposed in (Singh et al., 2022b) as node features. Nevertheless, in contrast to (Singh et al., 2022b) and (Chen et al., 2019b), it relies on an end-to-end learned graph topology. More details regarding this subnetwork are provided in Section 4.2. Similar to (Singh et al., 2022b), the Asymmetric Loss (ASL) (Ridnik et al., 2021a), denoted by L_c, is used for optimizing ML-AGCN such that,

L_c = − Σ_{i=1}^{N} [ y_s^(i) (1 − p^(i))^{γ+} log(p^(i)) + (1 − y_s^(i)) (p_m^(i))^{γ−} log(1 − p_m^(i)) ],

where γ+ and γ− are focusing parameters for positive and negative samples, respectively, y_s^(i) and p^(i) are the respective ground truth and predicted probability with respect to the label i, and p_m^(i) is the shifted probability given by max(p^(i) − m, 0), where m is a threshold used for reducing the effect of easy negative samples (Ridnik et al., 2021a).
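A compact NumPy sketch of this loss over one image's label vector; the values chosen below for γ+, γ− and m are illustrative defaults, not taken from this paper:

```python
import numpy as np

def asymmetric_loss(y, p, gamma_pos=0.0, gamma_neg=4.0, m=0.05, eps=1e-8):
    # y: (N,) binary ground truth; p: (N,) predicted probabilities.
    p_m = np.maximum(p - m, 0.0)                   # shifted probability for negatives
    pos = y * (1.0 - p) ** gamma_pos * np.log(p + eps)
    neg = (1.0 - y) * p_m ** gamma_neg * np.log(1.0 - p_m + eps)
    return -np.sum(pos + neg)

y = np.array([1.0, 0.0, 1.0])
p = np.array([0.9, 0.2, 0.6])
loss = asymmetric_loss(y, p)
```

With γ− > γ+ and the shift m, well-classified negatives contribute almost nothing, which is the asymmetry the loss is named after.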

Graph-based Subnet: Adaptive Graph Convolutional Network (AGCN)
Our intuition is that by integrating a suitable mechanism, it should be possible to boost the classification performance and reduce at the same time the size of graph-based methods. Hence, as illustrated in Fig. 3, we propose to adaptively learn the graph topology by reformulating Eq. (1) as below,

F_{l+1} = σ((A + B^(l) + C^(l)) F_l W_l),     (2)

where σ is a LeakyReLU activation function.
Instead of relying solely on the adjacency matrix A defined in (Chen et al., 2019b), two additional parameterized graphs called attention-based and similarity adjacency graphs, respectively denoted by B^(l) and C^(l), are defined. In this case, no threshold is applied to A for ignoring rare co-occurrences. It is also important to note that A is fixed, while B^(l) and C^(l) vary from one layer to another. In the following, we detail how B^(l) and C^(l) are computed.
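One adaptive layer can be sketched as follows: it aggregates over the sum of the fixed co-occurrence graph A and the two learned, layer-specific graphs B^(l) and C^(l). The LeakyReLU slope below is an assumption, as the paper only names the activation:

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    # slope value is illustrative; the text only specifies LeakyReLU.
    return np.where(x > 0.0, x, slope * x)

def agcn_layer(F_l, A, B_l, C_l, W_l):
    # Aggregation over the summed graph A + B_l + C_l: the fixed
    # co-occurrence topology complemented by the attention graph B_l
    # and the similarity-preserving graph C_l.
    return leaky_relu((A + B_l + C_l) @ F_l @ W_l)

rng = np.random.default_rng(1)
N, d, d_out = 5, 6, 8
A, B_l, C_l = (rng.random((N, N)) for _ in range(3))
F_next = agcn_layer(rng.standard_normal((N, d)), A, B_l, C_l,
                    rng.standard_normal((d, d_out)))
```

In training, B_l and C_l would be recomputed at every layer from the current node features, while A stays fixed.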

Attention-based adjacency matrix
Instead of ignoring rare co-occurrences in A, the matrix B^(l) = (b_ij^(l))_{i,j∈O} is defined based on an attention mechanism where the importance of each edge is quantified. To that aim, inspired by (Velickovic et al., 2017), an attention score denoted by e_ij^(l) is calculated for each pair of vertices (v_i, v_j) as follows,

e_ij^(l) = LeakyReLU(a^(l)T [W^(l) f_i^(l) || W^(l) f_j^(l)]),

where W^(l) ∈ R^{d^(l+1)×d^(l)} represents a learnable weight matrix, a^(l) ∈ R^{2d^(l+1)} is a vector of learnable attention coefficients and || refers to the concatenation operation. A softmax function is then applied to obtain the normalized attention scores,

α_ij^(l) = exp(e_ij^(l)) / Σ_{k∈N(i)} exp(e_ik^(l)),

with N(i) defining the neighborhood of the node i and α_ij^(l) being the obtained normalized attention score. We recall that the goal of the GCN subnet is to generate inter-dependent label classifiers. This means that each classifier must be predominated by the information related to the label it belongs to. Hence, the attention score of the node in question should be maximal. For this purpose, an additional step called the self-importance mechanism is proposed for computing the attention-based adjacency matrix B^(l) = (b_ij^(l))_{i,j∈O}, which enforces that each diagonal entry b_ii^(l) carries the maximal score of its row.
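A sketch of this attention-based graph, with the GAT-style score and row-wise softmax computed over all node pairs. The self-importance step is realized here by assigning each diagonal entry the maximal score of its row, which is one plausible reading of the mechanism described above rather than the paper's exact equation:

```python
import numpy as np

def attention_adjacency(F, W, a, slope=0.2):
    # e_ij = LeakyReLU(a^T [W f_i || W f_j]), then a row-wise softmax as in GAT.
    H = F @ W.T                                   # (N, d') projected features
    N = H.shape[0]
    e = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            z = np.concatenate([H[i], H[j]]) @ a  # a^T [W f_i || W f_j]
            e[i, j] = z if z > 0.0 else slope * z
    e -= e.max(axis=1, keepdims=True)             # numerically stable softmax
    alpha = np.exp(e) / np.exp(e).sum(axis=1, keepdims=True)
    B = alpha.copy()
    np.fill_diagonal(B, alpha.max(axis=1))        # self-importance on the diagonal
    return B

rng = np.random.default_rng(2)
N, d, d_out = 4, 6, 3
B = attention_adjacency(rng.standard_normal((N, d)),
                        rng.standard_normal((d_out, d)),
                        rng.standard_normal(2 * d_out))
```

After the self-importance step, no off-diagonal entry of a row can exceed its diagonal, so each label classifier is dominated by its own node.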

Similarity-based adjacency matrix (C)
As illustrated in Fig. 4, the GCN aggregation tends to modify the node similarity in the original feature space (Jin et al., 2021). To overcome this issue, a node-similarity preserving matrix C^(l) = (c_ij^(l))_{i,j∈O} is proposed. It is obtained by calculating the cosine similarity c_ij^(l) between each pair of vertices (v_i, v_j) as follows,

c_ij^(l) = (f_i^(l) · f_j^(l)) / (‖f_i^(l)‖ ‖f_j^(l)‖),

where ‖·‖ denotes the L2 Euclidean norm. Finally, the output of the final layer L, denoted by F_L, is used in the prediction function defined in Section 3.2.1 for predicting the labels.
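The pairwise cosine similarity reduces to one matrix product after row normalization; a sketch (the epsilon guard against zero-norm features is an implementation assumption):

```python
import numpy as np

def similarity_adjacency(F, eps=1e-8):
    # c_ij = <f_i, f_j> / (||f_i|| ||f_j||): pairwise cosine similarity
    # of the current layer's node features.
    norms = np.linalg.norm(F, axis=1, keepdims=True)
    F_hat = F / np.maximum(norms, eps)            # L2-normalize each row
    return F_hat @ F_hat.T

rng = np.random.default_rng(3)
C = similarity_adjacency(rng.standard_normal((5, 7)))
```

C is symmetric with unit diagonal, so aggregating through it re-injects the original feature similarity at every layer.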

Domain Adaptation for Multi-label Image Classification using ML-AGCN
The proposed architecture for cross-domain multi-label image classification, called DA-AGCN, is illustrated in Fig. 5. Similar to ML-AGCN, we make use of a TResNet-based backbone f_g to extract discriminative image features and an Adaptive Graph Convolutional Network f_c to learn N inter-dependent classifiers. Given source and target input images from D_s and D_t, respectively, the goal of f_g is to generate d_f-dimensional domain-invariant image features. The latter are used to predict the image labels, as described in Section 3.2.1. The label classification loss L_c is therefore defined as in Eq. (6).
For obtaining domain-invariant representations, a domain classifier f_d: R^{d_f} → [[0, 1]] is employed and trained in an adversarial manner. Given the features X = f_g(I) extracted from a given image I, f_d predicts the domain of the input image as follows,

d̂ = f_d(X),

where d̂ is the predicted domain label of I. Note that the ground-truth domain label d = 0 if I is from the source domain and d = 1 if it is sampled from the target domain.
As in (Ganin et al., 2016), a domain loss is defined as below,

L_d = −[ d log(d̂) + (1 − d) log(1 − d̂) ].

Given L_c defined in Eq. (6), the final objective function used to optimize the network is E(θ_g, θ_d, θ_c), defined by

E(θ_g, θ_d, θ_c) = L_c − λ L_d,

where θ_g, θ_d and θ_c refer, respectively, to the parameters of f_g, f_d and f_c, and λ is a hyper-parameter defining the weight of L_d. The network is trained in an adversarial manner using a Gradient Reversal Layer (GRL) for obtaining the optimal parameters (θ̂_g, θ̂_c, θ̂_d) such that,

(θ̂_g, θ̂_c) = argmin_{θ_g, θ_c} E(θ_g, θ̂_d, θ_c),  θ̂_d = argmax_{θ_d} E(θ̂_g, θ_d, θ̂_c).

Fig. 5: Architecture of the proposed DA-AGCN for multi-label image classification (best viewed in color). Images from both source and target datasets are given as input to the CNN subnet that generates image features. The AGCN subnet learns in an end-to-end manner the attention- and similarity-based adjacency matrices B^(l) and C^(l), respectively, and generates accordingly inter-dependent label classifiers using only labeled source images. In addition, a domain classifier is considered.
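The GRL behaves as the identity in the forward pass and scales the gradient by −λ in the backward pass, so that f_g is pushed to maximize the domain loss while f_d minimizes it. A dependency-free sketch of this behavior; the class name and interface are illustrative:

```python
import numpy as np

class GradientReversal:
    # Identity forward; gradient multiplied by -lam on the way back,
    # which implements the min-max game without alternating optimizers.
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x

    def backward(self, grad_output):
        return -self.lam * grad_output

grl = GradientReversal(lam=0.5)
x = np.array([1.0, -2.0, 3.0])
out = grl.forward(x)
grad = grl.backward(np.array([2.0, -4.0, 0.5]))
```

In an autograd framework this would be a custom function with the same forward/backward pair, inserted between f_g and f_d.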

Experiments
In this section, the experimental settings and results are presented and discussed for: (1) single-domain multi-label image classification (Section 6.1); and (2) domain-adaptation for multi-label image classification (Section 6.2).

Implementation details
As discussed in Section 4, TResNet-M (Ridnik et al., 2021b) is used as the CNN backbone. In particular, the fully connected layer after the Global Average Pooling (GAP) layer is removed. The GAP layer produces a 2048-dimensional latent image representation. The dimension of the AGCN output has also been set to 2048. For the node features, we use the image-based embeddings proposed in (Singh et al., 2022b) instead of the usual word-based embeddings used in (Chen et al., 2019b). The model is trained for 40 epochs using Adam, with a maximum learning rate of 1e-4 and a cosine decay schedule.

Datasets
In the following, the datasets that have been used for the experiments are presented.

MS-COCO. MS-COCO (Lin et al., 2014) is a widely used large-scale multi-label image dataset. It contains 80K training images and 40K testing images. Each image is annotated with multiple object labels from a total of 80 categories.
VG-500. The VG-500 dataset (Chen et al., 2020) is a well-known dataset for multi-label image classification. It includes 500 different object categories. The dataset comes with a training set of 98,249 images and a testing set of 10,000 images.
PASCAL-VOC 2007. The PASCAL Visual Object Classes Challenge (Everingham et al., 2010), introduced in 2007, is one of the most commonly used multi-label image classification datasets. It contains about 10K image samples, with 5,011 and 4,952 images in the training and testing sets, respectively. The images show 20 different object categories, with an average of 2.5 categories per image.

Quantitative analysis
Comparison with state-of-the-art methods in terms of mAP and model size. We compare the performance of ML-AGCN with current state-of-the-art methods by reporting the mean Average Precision (mAP) as well as the number of model parameters. Additionally, similar to (Singh et al., 2022b; Chen et al., 2019b), we report the following evaluation metrics on the MS-COCO dataset: average per-Class Precision (CP), average per-Class Recall (CR), average per-Class F1-score (CF1), average Overall Precision (OP), average Overall Recall (OR) and average Overall F1-score (OF1).
Tables 1, 2 and 3 report the quantitative comparison of the proposed approach with respect to state-of-the-art methods on the MS-COCO, VG-500, and VOC datasets, respectively. It can be clearly seen that our method achieves competitive results as compared to existing methods in terms of mAP while considerably reducing the model size. More specifically, similar to SSGRL, we achieve the best mAP performance on VOC-2007 while reducing the number of parameters from 92.2 to 35.8 million. Moreover, ML-AGCN reaches the second-best mAP performance on MS-COCO and VG-500. On MS-COCO, the proposed approach presents a slightly lower mAP as compared to ML-Decoder, with 86.9% against 88.1%, but requires only 35.9 million parameters against 51.3. Similarly, ML-AGCN only registers a decrease of 0.5% on VG-500 in terms of mAP as compared to the best-performing approach C-Tran, with a considerably lower number of parameters (almost a quarter).

Method | #P (M) | mAP | CP | CR | CF1 | OP | OR | OF1
ResNet101 (He et al., 2016) | 44.5 | 30.9 | 39.1 | 25.6 | 31.0 | 61.4 | 35.9 | 45.4
ML-GCN (Chen et al., 2019b) | 44.9 | 32.6 | 42.8 | 20.2 | 27.5 | 66.9 | 31.5 | 42.8
TResNet-M (Ridnik et al., 2021a) | 29.5 | 33.6 | - | - | - | - | - | -
IML-GCN (Singh et al., 2022b) | 32.1 | 34.5 | - | - | - | - | - | -
SSGRL (Chen et al., 2019a) | 92.2 | 36.6 | - | - | - | - | - | -
KGGR (Chen et al., 2020) | 45.0 | 37.4 | 47.4 | 24.7 | 32.5 | 66.9 | 36.5 | 47.

Moreover, it is worth mentioning that ML-AGCN outperforms the ML-GCN (Chen et al., 2019b) baseline by 3.9%, 6.3%, and 0.5% in terms of mAP while keeping a comparable number of parameters.
In addition, we also report the performance obtained when including only one layer in the graph subnet of ML-GCN (Chen et al., 2019b), IML-GCN (Singh et al., 2022b) and ML-AGCN. The obtained results confirm the relevance of the proposed adaptive learning. In fact, it can be seen that our method outperforms existing graph-based methods under this setting. More specifically, ML-AGCN achieves an improvement of 5.7% and 5.3% in terms of mAP when compared to ML-GCN and IML-GCN, respectively, on the MS-COCO dataset. Similarly, on the VG-500 benchmark, a significant increase of 20.7% is recorded in terms of mAP in comparison to IML-GCN.
Ablation study. In order to analyze the impact of each adjacency matrix used in our approach, namely, the attention-based matrix B and the similarity-based matrix C, an ablation study is carried out, as shown in Table 4. It can be noted that by considering B in addition to A (without threshold), an improvement of 5.1%, 20.5%, and 0.04% in terms of mAP is obtained, respectively, on MS-COCO, VG-500, and VOC-2007. The magnitude of improvement seems dependent on the number of classes contained in the considered dataset. In fact, while VG-500 is formed by 500 classes, MS-COCO and VOC-2007 are respectively composed of 80 and 20 categories. This highlights the importance of adaptively modeling label correlations, especially when dealing with a large number of classes, which is likely to occur in a practical scenario. An additional mAP improvement of 0.1%, 0.4%, and 0.17% can be seen when including C in the AGCN subnet. In conclusion, it can be noted that B contributes more significantly to the resulting enhancement. This could be explained by the fact that C is less needed, as only two layers are considered in the graph subnet. This would confirm that the initial feature similarity is not completely lost through the layers.

Qualitative analysis
In Fig. 6, the Gradient-weighted Class Activation Mapping (Grad-CAM) (Selvaraju et al., 2017) is visualized for some examples from the VOC dataset. While the first column of images represents the input images, the second, third, and fourth columns show, respectively, the Grad-CAM visualization from a model trained using only A, then A and B, and finally A, B and C. These qualitative results are in line with the quantitative ones. As shown in Fig. 6, the use of B allows the regions of interest in the image to be activated more precisely, while the contribution of C in this refinement is less pronounced but remains visible.

Implementation details
We reproduce the results of current state-of-the-art methods due to the limited availability of DA approaches for multi-label image classification. In particular, we first consider standard multi-label image classification methods (without DA) and refer to them as MLIC. Since no target images were employed during the learning process, this is equivalent to source-only training. Second, we reproduce the outcomes of two state-of-the-art domain adaptation methods, namely DANN (Ganin et al., 2016) and DA-MAIC (Lin et al., 2021), which we refer to as DA in our experiments. Moreover, we replace the multi-label soft-margin loss by the traditional cross-entropy loss when training DANN. We generate GloVe-based word embeddings (Pennington et al., 2014) as node features for training DA-MAIC. Additionally, in order to showcase the effectiveness of the proposed AGCN subnet, we present the results of DANN and DA-MAIC obtained with the same backbone as ours: an additional experiment based on TResNet-M instead of the traditional ResNet101 is carried out. The images are resized to 224 × 224, unless stated otherwise. Our domain classifier includes one hidden layer of dimension 1024. A maximum learning rate of 1e-4 with a cosine decay is used. The model is trained for a total of 40 epochs or until convergence.
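The adversarial training scheme follows DANN (Ganin et al., 2016), whose key component is a gradient reversal layer placed between the feature extractor and the domain classifier: the identity in the forward pass, a negated and scaled gradient in the backward pass, so that the features are driven to confuse the domain classifier. A minimal, framework-agnostic conceptual sketch (real implementations hook this into the autograd engine):

```python
import numpy as np

class GradReverse:
    """Gradient Reversal Layer (GRL) used in DANN-style adversarial
    training: identity in the forward pass; in the backward pass the
    gradient is multiplied by -lambda, reversing the optimization
    direction for everything upstream of the domain classifier.
    """
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        # Features pass through unchanged.
        return x

    def backward(self, grad_output):
        # Reversed, scaled gradient flows back to the feature extractor.
        return -self.lam * grad_output
```

In our setting the domain classifier on top of the GRL is a small MLP (one hidden layer of dimension 1024, as stated above) trained to distinguish source from target samples.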

Datasets
Similar to DA-MAIC (Lin et al., 2021), we use three multi-label aerial image datasets in our experiments, namely, AID (Hua et al., 2020), UCM (Chaudhuri et al., 2018) and DFC15 (Hua et al., 2019). Additionally, due to the limited number of suitable datasets for the task of DA in MLIC, we convert two well-known object detection datasets, initially used for DA in the context of object detection, to multi-label annotations, namely PASCAL VOC 2007 (Everingham et al., 2010) and Clipart1k (Inoue et al., 2018).
AID multi-label aerial dataset. The original AID dataset (Xia et al., 2017) contains 10000 high-resolution aerial images covering a total of 30 categories. A multi-label version of this dataset was produced in (Hua et al., 2020), where 3000 aerial images from the original AID dataset were selected and assigned multiple object labels. In total, they include 17 labels: airplane, sand, pavement, buildings, cars, chaparral, court, trees, dock, tank, water, grass, mobile-home, ship, baresoil, sea, and field. 80% and 20% of the images are used respectively for training and testing (Hua et al., 2020).
UCM multi-label aerial dataset. The UCM multi-label dataset (Chaudhuri et al., 2018) is derived from the UCM dataset (Yang and Newsam, 2010). It consists of images showing 21 land-use classes, with a resolution of 256 × 256 pixels. Later in (Chaudhuri et al., 2018), 2100 of these aerial images were annotated with multiple tags in order to generate a multi-label aerial image dataset. This dataset shares the same number of labels as the AID multi-label dataset (Hua et al., 2020), i.e., 17 labels. In our experiments, 80% and 20% of the data are respectively used for training and testing.
DFC15 multi-label aerial dataset. The DFC15 multi-label dataset (Hua et al., 2019) was initially introduced in 2015. It contains a total of 3342 high-resolution image samples and includes 8 object labels. In our experiments, the 6 categories in common with the UCM and AID datasets, namely water, grass, building, tree, ship, and car, are considered. 80% and 20% of the samples are respectively used for training and testing.
VOC and Clipart1k datasets. The Clipart1k dataset (Inoue et al., 2018) contains 20 object categories, identical to those of PASCAL-VOC 2007 (Everingham et al., 2010). We create a multi-label annotation for each image by considering the category of each object bounding box. Clipart1k provides a total of 1000 image samples, split evenly (50%/50%) between training and testing.
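The conversion from detection annotations to multi-label annotations amounts to marking a class as present whenever at least one bounding box of that class occurs in the image. A minimal sketch for VOC-style XML annotations, using only the Python standard library (the class list shown is illustrative, not the full 20-category VOC vocabulary):

```python
import xml.etree.ElementTree as ET

def detection_to_multilabel(xml_string, classes):
    """Convert a VOC-style detection annotation into a multi-hot label
    vector: a class is positive if at least one bounding box of that
    class appears in the image. Illustrative sketch.
    """
    root = ET.fromstring(xml_string)
    labels = [0] * len(classes)
    # Each <object> element carries one bounding box and its class name.
    for obj in root.iter("object"):
        name = obj.findtext("name")
        if name in classes:
            labels[classes.index(name)] = 1
    return labels
```

For example, an annotation containing two "dog" boxes and one "person" box yields a single positive label for each of the two classes, regardless of the box count.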

Quantitative analysis
Comparison with state-of-the-art methods in terms of mAP and model size. The proposed domain adaptation approach for multi-label image classification is compared with state-of-the-art methods. The same metrics described in Section 6.1.3 are used, including mAP, CP, CR, CF1, OP, OR and OF1. In addition to the four protocols followed in DA-MAIC (Lin et al., 2021), i.e., AID → UCM, UCM → AID, AID → DFC, and UCM → DFC, two more combinations, VOC → Clipart and Clipart → VOC, are provided. Two categories of methods are considered in our evaluation: (1) conventional Multi-Label Image Classification (MLIC) methods; and (2) Domain Adaptation-based (DA) methods that aim at explicitly reducing the gap between source and target datasets.
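As a reminder, the per-class metrics (CP, CR, CF1) average precision and recall over classes before combining them, while the overall metrics (OP, OR, OF1) pool the true-positive counts over all classes first. A minimal sketch of both families of metrics, assuming binary indicator matrices of predictions and ground truth:

```python
import numpy as np

def multilabel_metrics(y_true, y_pred):
    """Per-class (CP/CR/CF1) and overall (OP/OR/OF1) precision, recall
    and F1 for multi-label predictions.

    y_true, y_pred : binary arrays of shape (n_samples, n_classes).
    """
    eps = 1e-8
    tp = (y_true * y_pred).sum(axis=0).astype(float)   # per-class true positives
    pred_pos = y_pred.sum(axis=0).astype(float)        # per-class predicted positives
    true_pos = y_true.sum(axis=0).astype(float)        # per-class actual positives
    # Per-class: average precision/recall over classes, then F1.
    cp = (tp / (pred_pos + eps)).mean()
    cr = (tp / (true_pos + eps)).mean()
    cf1 = 2 * cp * cr / (cp + cr + eps)
    # Overall: pool counts over all classes before dividing.
    op = tp.sum() / (pred_pos.sum() + eps)
    orr = tp.sum() / (true_pos.sum() + eps)
    of1 = 2 * op * orr / (op + orr + eps)
    return cp, cr, cf1, op, orr, of1
```

mAP is computed separately from the continuous scores (as the mean of the per-class average precision) and is therefore not shown in this sketch.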
Table 5 reports the results obtained on the AID and UCM datasets. Two different settings are followed: AID → UCM, where AID is the source dataset and UCM the target dataset, and UCM → AID, where UCM is the source dataset and AID the target dataset. It can be clearly seen that, in both DA settings, the proposed DA-AGCN outperforms existing state-of-the-art methods in terms of mAP while requiring a lower number of parameters. More specifically, an improvement in mAP of 10.65% and 4.59% is recorded for AID → UCM and UCM → AID, respectively, as compared with the DA-MAIC (Lin et al., 2021) baseline. This suggests that learning the graph topology in an end-to-end manner helps improve the performance in the presence of a domain shift.
In Table 6, DA-AGCN is also compared with existing methods under the settings UCM → DFC and AID → DFC. It can be clearly noticed that DA-AGCN outperforms existing works in terms of mAP and model size for both UCM → DFC and AID → DFC. More precisely, an improvement of 1.4% and 6.2% is achieved in terms of mAP compared to the second-best performing methods.
Table 7 also confirms the superiority of the proposed approach as compared to other state-of-the-art techniques. However, it should be noted that the improvement brought by the adaptive graph remains limited. This might be explained by the fact that the datasets used for DA in MLIC include a relatively low number of classes (8 to 20 categories). Hence, the benefit of adaptively modeling label correlations might not be entirely visible. This demonstrates the necessity of creating benchmarks for DA in an MLIC context that include a larger number of classes.
Ablation study. Table 8 reports the mAP obtained when considering each module of DA-AGCN. The results confirm that adding both an AGCN subnet and an adversarially trained domain classifier (DC) enhances the performance. Additionally, in Table 9, the performance when discarding the proposed adjacency matrices B and C is reported using the UCM, AID and DFC datasets. The obtained results confirm the relevance of using the proposed attention-based and similarity-based adjacency matrices for modeling label correlations.

Qualitative analysis
Grad-CAM (Selvaraju et al., 2017) visualizations in the presence of a domain shift are shown in Fig. 7. It can be clearly seen that adaptively learning the graph topology through B and C helps activate the most relevant areas of interest, leading to better classification performance.

Failure cases
In Fig. 8, some failure cases are presented using the Grad-CAM visualization with only A, with A and B, and with A, B and C. It can be noticed that when the targeted object occupies most of the image, the use of B and C makes the model confuse the object with the background. A good illustration of this can be seen in the first row of Fig. 8, where the water is confused with the background. In future work, modeling object occupancy will be investigated to handle these failure cases.

Conclusion
Existing graph-based methods have shown strong performance for multi-label image classification in both single-domain and cross-domain contexts. However, these methods mostly fix the graph topology heuristically while discarding edges with rare co-occurrences. Furthermore, it has been demonstrated in (Jin et al., 2021) that successive GCN aggregations tend to destroy the initial feature similarity. Hence, as a solution, an adaptive strategy for learning the graph in an end-to-end manner is proposed. In particular, attention-based and similarity-preserving mechanisms are adopted. The proposed framework for multi-label classification in a single domain is then extended to multiple domains. For that purpose, an adversarial domain adaptation strategy is employed. The results obtained in both single-domain and cross-domain settings support the effectiveness of the proposed method in terms of model performance and size as compared to recent state-of-the-art methods.

Fig. 2 :
Fig. 2: Architecture of ML-AGCN: On the one hand, the CNN subnet learns relevant image features from an input image. On the other hand, the GCN subnet estimates interdependent label classifiers by taking into account one fixed adjacency matrix A and two adaptive adjacency matrices B(l) and C(l). Finally, the classifiers are applied to the CNN features for predicting the labels.
Fig. 3: (a) An example of a fixed label graph with a threshold set to τ = 0.1. Dashed (red) edges indicate the ignored edges; (b) The proposed parameterized graph topology considering all the edges.

Fig. 4 :
Fig. 4: Features of graph nodes: the similarity between the node features is lost after GCN aggregations.

Fig. 6 :
Fig. 6: Grad-CAM visualization of the predictions using samples from the VOC dataset using only A, then A and B, and finally A, B and C.


Table 1 :
Comparison with the state-of-the-art methods on the MS-COCO dataset. (Best and second-best performances are indicated in bold and underline, respectively).

Table 2 :
Comparison with the state-of-the-art methods on the VG-500 dataset. (Best and second-best performances are indicated in bold and underline, respectively).

Table 3 :
Comparison with the state-of-the-art methods on the VOC-2007 dataset. (Best and second-best performances are indicated in bold and underline, respectively).

Table 4 :
Ablation study: effect of adaptively learning B and C for multi-label image classification.

Table 5 :
Comparison with the state-of-the-art in terms of mAP and number of model parameters using two settings, i.e., AID → UCM and UCM → AID. (Best performance is indicated in bold and second-best performance is underlined).

Table 8 :
Ablation study: impact of using an adversarial strategy and learning an adaptive graph topology.

Table 9 :
Ablation study: effect of adaptively learning B and C in the presence of a domain shift.