A Hybrid BERT Model That Incorporates Label Semantics via Adjustive Attention for Multi-Label Text Classification

The multi-label text classification task aims to tag a document with a series of labels. Previous studies usually treated labels as symbols without semantics and ignored the relations among labels, which caused information loss. In this paper, we show that explicitly modeling label semantics can improve multi-label text classification. We propose a hybrid neural network model that simultaneously takes advantage of label semantics and fine-grained text information. Specifically, we utilize the pre-trained BERT model to compute context-aware representations of documents. Furthermore, we incorporate label semantics in two stages. First, a novel label graph construction approach is proposed to capture the label structures and correlations. Second, we propose a novel attention mechanism, adjustive attention, to establish the semantic connections between labels and words and to obtain label-specific word representations. The hybrid representation that combines the context-aware feature and the label-specific word feature is fed into a document encoder for classification. Experimental results on two publicly available datasets show that our model is superior to other state-of-the-art classification methods.


I. INTRODUCTION
Multi-label text classification (MLTC) is a fundamental and challenging task in natural language processing. The purpose of MLTC is to assign multiple labels to a given text. MLTC has been widely applied in many fields such as sentiment analysis [1], intent recognition [2], and recommendation systems [3]. With the development of deep learning, single-label classification has achieved great success [4], [5], [6]. Single-label text classification can be naively extended to the MLTC task by treating the problem as a series of independent single-label classification tasks [7]. However, such oversimplified extensions often bring poor performance. Unlike conventional single-label classification (where labels are independent), there are semantic dependencies among labels. Besides, label relationships can provide implicit and supplementary information, especially when some labels do not have enough training examples. For example, academic literature tagged with ''artificial intelligence'' is usually accompanied by ''deep learning''. In addition, in single-label classification the prior knowledge that labels are mutually exclusive has been modeled in the final classification, whereas the relationships among labels in multi-label classification are more complicated and cannot be modeled so straightforwardly. Exploiting label correlations has therefore become the primary impetus for improving classification performance [8].
Compared with shallow models, deep neural networks have achieved satisfactory performance on the MLTC task [9], [10], [11]. However, they mainly depend on document-level representations. The semantic relationships between labels and the word-level information of documents are not modeled explicitly. In other words, the fine-grained document information that would provide clear classification clues is ignored. For example, in information retrieval, related terms such as ''missiles'' and ''tanks'' may be useful to distinguish ''military'' from ''technical'' documents. Obviously, the words in one document make different contributions to each label. From the above analysis, MLTC needs to address the following two points: 1) how to adequately mine and make use of the correlations among labels, and 2) how to extract the discriminative information of the corresponding labels from the original documents. Our intuition is to first model the labels from a global perspective and then use the semantic information of labels as guidance to capture important fine-grained document information.
Recently, a new research direction called graph embedding representation has attracted wide attention [12]. Of all the methods of embedding, the Graph Convolutional Network (GCN) [13], [14] is very beneficial for tasks with a rich relational structure. GCN can retain the global structure information of the graph, thus capturing the semantic dependencies among multiple labels from the perspective of spatial context.
In this paper, we propose a Hybrid BERT model that incorporates Label semantics via Adjustive attention (HBLA), which identifies semantic dependencies in the label space and the text space simultaneously. Firstly, we model the label correlations with a label graph built from adjacency-based similarity and then encode the label graph using GCN, which captures the structure information and the rich semantic correlations among labels. Moreover, to capture label-related discriminative information from each document, we use Bidirectional Encoder Representations from Transformers (BERT) [15] to obtain an implicit representation of each word in context. An innovative attention mechanism, adjustive attention, is proposed to explicitly calculate the semantic relationship between words and labels and, based on it, to generate label-specific word representations. Compared with normal attention methods, the focus areas of the adjustive attention mechanism become more meaningful and discriminative.
To sum up, we make the following contributions in this paper: • We propose HBLA, a model built on label graph embedding that can simultaneously model documents and labels and obtain a hybrid word representation.
• We design a novel attention mechanism called adjustive attention to measure the semantic relation between words and labels. The adjustive attention learned from word-label pairs weights the important fine-grained semantic information in a document.
• Experimental results on two widely used benchmark datasets show superior performance over previous state-of-the-art methods. Extensive validation experiments demonstrate the effectiveness of the label graph embedding and the attention mechanism.
The remainder of this paper is organized as follows: Section II introduces related work on multi-label classification and label embedding methods. Section III describes the HBLA model in detail. Section IV presents extensive experiments to validate the effectiveness of our approach. Experimental results and discussion are given in Section V and Section VI, respectively. In Section VII, we explore the application of HBLA in the medical field. The conclusion and future work are summarized in Section VIII.

II. RELATED WORK
A. MULTI-LABEL CLASSIFICATION
The current models for the multi-label classification task can be categorized into three methods: problem transformation, algorithm adaptation, and neural network.
Problem transformation methods are algorithm-independent; they transform the multi-label classification task into multiple single-label learning tasks by decomposing the sample set. The most common approach is the Binary Relevance (BR) algorithm, whose core idea is to treat each label as a separate binary classification problem and train a binary classifier for each class, without considering label correlations [16]. To fully capture the relationships among labels, the Label Powerset (LP) algorithm treats each combination of labels as a new class. This method is highly complex to train due to the exponential increase in the number of label combinations [17]. The Classifier Chain (CC) algorithm connects labels in a ''chain'' manner, so the prediction of one label can help predict another label to a certain extent. However, these methods only capture low-order correlations [18].
Algorithm adaptation methods, as the name suggests, adapt existing algorithms to deal with the multi-label problem. Rank-SVM adapts the maximum margin strategy to multi-label data by minimizing a ranking loss [19]. ML-KNN determines the label set of each sample using the k-nearest-neighbour algorithm together with the maximum a posteriori principle [20]. RAKEL uses random label subsets as the training set for each LP classifier and finally integrates the predictions of multiple LP classifiers by voting [21].
With the development of deep learning, neural network models have performed well on multi-label classification tasks. The BP-MLL algorithm captures the characteristics of multi-label learning by replacing the error function with a pairwise ranking loss [22]. Considering that labels tend to be correlated, the CNN-RNN model combines CNN and RNN to capture local and global semantic information and models high-order label correlations with lower complexity [23]. SGM [10] uses a seq2seq structure to model the relationships between multiple labels and a gate mechanism to consider global label information. Ashutosh et al. explored fine-tuning BERT for document classification [24].
Different from those approaches, which are based on document-level features, we propose the HBLA model with adjustive attention to build label-specific word representations, which sufficiently exploits both document content and global label semantics.

B. LABEL EMBEDDING METHODS
The effectiveness of label embedding has been proved in various multi-label learning tasks. The goal of label embedding is to map the label space to low-dimensional vectors and preserve label dependencies.
In computer vision, there is much research on label embedding in image node classification [25] and image recognition [26]. In natural language processing, [27] proposed joint word and label embedding to learn text representation. Reference [28] used label distribution sequences to capture potential long-range label dependencies to improve the performance of sequence labelling.
Graphs have been proved more effective for modeling label structure. Reference [29] is the first to propose using a label graph to deal with MLTC, but a shallow neural network is used for graph representation, which limits its ability to learn the complicated relationships between graph nodes.
Recently, GCN, a neural network operating directly on graphs, has witnessed prevailing success in modeling relationships among the vertices of a graph and is well suited to modeling syntactic dependency structures. In this paper, we construct a label graph based on the distribution of labels in the dataset and use GCN to map the nodes of the label graph into the same space. Moreover, a novel loss function is designed to constrain the nodes in that space. Concretely, labels with more similar distributions are closer in space, while dissimilar labels are farther apart. By separating non-adjacent nodes, it is possible to capture high-order semantic correlations among labels using the network topology.

III. MODEL
In this section, we introduce the HBLA model in detail, leveraging the attention mechanism to incorporate label representation and fine-grained word-level representation of documents. As can be seen from Fig. 1, HBLA mainly contains four components: • Word embedding module projects the input word of a document into context-aware representation.
• Label graph embedding module takes the label graph as input to learn label embeddings which encode the semantic correlations among labels.
• Adjustive attention module calculates the attention scores between word and labels to generate label-specific word representation.
• Aggregation layer integrates the proper information from two aspects (context-aware word representation and label-specific word representation) and uses the hybrid word representation for classification.

A. PROBLEM DEFINITION
Let D = {(d_1, y_1), . . . , (d_N, y_N)} denote the set of documents and their corresponding targets, and let L = {λ_1, . . . , λ_c} be a finite set of predefined labels. The MLTC task can be modeled as learning a function f that maps an input document d to a binary vector ŷ (assigning a value of 0 or 1 to each label in ŷ).
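For concreteness, the mapping from label sets to binary target vectors ŷ can be illustrated with a short Python sketch; the documents and label names below are hypothetical, and scikit-learn's MultiLabelBinarizer is just one convenient way to perform this step (the paper does not prescribe a specific tool).

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical documents and their label sets (not taken from AAPD/RCV1-V2).
docs = ["deep residual learning for image recognition",
        "secure key exchange over noisy channels"]
label_sets = [{"cs.CV", "cs.LG"}, {"cs.IT", "cs.CR"}]

# L = {lambda_1, ..., lambda_c}: the finite set of predefined labels.
mlb = MultiLabelBinarizer(classes=sorted({l for s in label_sets for l in s}))
Y = mlb.fit_transform(label_sets)  # binary target vector y for each document

print(mlb.classes_)  # ['cs.CR' 'cs.CV' 'cs.IT' 'cs.LG']
print(Y)             # [[0 1 0 1], [1 0 1 0]]
```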

B. WORD EMBEDDING
The first part of our model is the word embedding module, which embeds the original words into low-dimensional vectors. Conventional methods such as Word2Vec [30] and Glove [31] are fixed word vector representation methods: they assign a word the same representation regardless of the context in which it appears. However, polysemy makes context-independent word embedding difficult to use for the classification task. For example, the word ''apple'' would have the same context-independent representation in ''APPLE Inc.'' and ''apple juice''. To better represent the text content, we compute context-aware representations for each word using the pre-trained BERT model, which is based on a multi-layer bidirectional Transformer [32] and generates different embeddings for the same word in different contexts. BERT takes as input a sequence of no more than 512 tokens and outputs the representation of the sequence.
Let d be an input document consisting of k words, denoted as [w_1, w_2, . . . , w_i, . . . , w_k], where w_i refers to the i-th word in the text. A visualization of BERT's architecture is shown in Fig. 1 (bottom left). The arrows indicate the information flow from one layer to the next. The output sequence H = [h_1, h_2, . . . , h_k] gives the contextualized representation of each input word.
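As an illustrative sketch (not the authors' exact pipeline, which builds on AllenNLP/PyTorch), the per-token contextual representations H = [h_1, . . . , h_k] could be obtained with the HuggingFace transformers library as follows.

```python
import torch
from transformers import BertTokenizer, BertModel

# Minimal sketch: load the pre-trained uncased BERT-base model.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

text = "graph convolutional networks capture label correlations"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = bert(**inputs)

# H: contextualized representation of each input token, shape (1, k, 768).
H = outputs.last_hidden_state
print(H.shape)
```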

C. LABEL GRAPH EMBEDDING
We focus on using the label graph to reflect the label structure and on representing the label graph in a low-dimensional latent space. In this paper, an adjacency-based similarity label graph construction method is proposed to model the interdependencies among labels. We treat each label as a node, and each node gathers features from all of its neighbours to form its representation. Each edge reflects the semantic correlation between two nodes: if two labels co-occur, there is an edge between them. This is a flexible way to capture the topology of the label space. The co-occurrence of labels can be described as a joint probability, which is suitable for modeling the relationships among labels.
More specifically, the label graph contains non-negative weights between any two nodes, and we build its adjacency matrix in a data-driven way. Firstly, we count the co-occurrences of all label pairs using the label annotations of the training set and obtain the co-occurrence matrix C ∈ R^{C×C}, where C_ij is the number of training samples in which labels λ_i and λ_j co-occur. Using this label co-occurrence matrix, we obtain the adjacency matrix by

A'_ij = C_ij / N,  (1)
A = A' + I,  (2)

where N is the number of training samples and I is the identity matrix, which means that every node is assumed to be connected to itself.
We also construct a word-label adjacency matrix B in the same way as in (1) and (2), where B_ij reflects the relationship between w_i and λ_j; the counts here are the co-occurrences of w_i and λ_j in the training samples.
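A minimal NumPy sketch of this data-driven construction is given below; the joint-probability normalization and the variable names are our assumptions based on the description above, and the paper's Eqs. (1) and (2) may normalize differently.

```python
import numpy as np

def build_label_adjacency(label_matrix: np.ndarray) -> np.ndarray:
    """Sketch of the adjacency-based similarity label graph.

    label_matrix: (N, C) binary matrix of training-set label annotations.
    Returns an adjacency matrix A whose entry A_ij reflects how often labels
    lambda_i and lambda_j co-occur, with self-loops added via the identity I.
    """
    n_samples = label_matrix.shape[0]
    counts = label_matrix.T @ label_matrix        # C_ij: co-occurrence counts
    adjacency = counts / n_samples                # joint-probability estimate
    np.fill_diagonal(adjacency, 0.0)
    return adjacency + np.eye(label_matrix.shape[1])  # every node connects to itself

# The word-label matrix B can be built analogously from a (N, K) binary word
# indicator matrix: B = word_matrix.T @ label_matrix / n_samples.
```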
The label embedding is determined from the label co-occurrence graph and captures the label semantic information defined by the graph structure. We introduce GCN to propagate messages through the graph and learn contextualized label embeddings. GCN aggregates the values of all neighboring nodes to update the current node. Each convolutional layer processes only first-order neighborhood information; multi-order neighborhood information can be captured by stacking several convolutional layers. Our goal is to represent the labels in a low-dimensional latent space so that two nearby labels in the graph have similar representations while non-adjacent nodes are mutually exclusive. For each node v_i ∈ V, we first initialize its representation with a one-hot vector e_i^(0). Then, the label embedding is updated layer by layer as

e_i^(k+1) = ρ( Σ_{j ∈ N(i)} Ã_ij W^(k) e_j^(k) ),

where Ã is the normalized symmetric adjacency matrix, W^(k) ∈ R^{C×C} is a trainable weight, N(i) denotes the neighbor nodes of i, and ρ is a ReLU activation function. In this paper, we consider a two-layer GCN [13], [33] for label embedding, which means k is set to 2. Finally, we obtain the label embedding set E = [e_1, e_2, . . . , e_c].
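The two-layer label GCN can be sketched in plain PyTorch as follows; this is a simplified stand-in for the authors' PyTorch Geometric implementation, with symmetric normalization and one-hot initialization assumed from the description above.

```python
import torch
import torch.nn as nn

class LabelGCN(nn.Module):
    """Two-layer GCN mapping one-hot label nodes to label embeddings E."""

    def __init__(self, num_labels: int):
        super().__init__()
        self.w1 = nn.Linear(num_labels, num_labels, bias=False)
        self.w2 = nn.Linear(num_labels, num_labels, bias=False)
        self.relu = nn.ReLU()

    @staticmethod
    def normalize(adj: torch.Tensor) -> torch.Tensor:
        # Symmetric normalization: D^{-1/2} A D^{-1/2} (self-loops keep degrees > 0).
        deg = adj.sum(dim=1)
        d_inv_sqrt = torch.diag(deg.pow(-0.5))
        return d_inv_sqrt @ adj @ d_inv_sqrt

    def forward(self, adj: torch.Tensor) -> torch.Tensor:
        a_norm = self.normalize(adj)
        x = torch.eye(adj.size(0))             # one-hot initialization e_i^(0)
        x = self.relu(self.w1(a_norm @ x))     # first graph convolution
        return self.relu(self.w2(a_norm @ x))  # E = [e_1, ..., e_c]
```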

D. ADJUSTIVE ATTENTION
The attention mechanism is able to capture the global importance of word tokens [34]. As mentioned above, words carry fine-grained information for classification; e.g., the word ''missiles'' is strongly indicative of the class ''military''. Our attention module explicitly introduces rich label semantics so that the attended regions are more meaningful and discriminative.
However, there is a semantic gap between the label space and the word space. Therefore, we first project the word space into the label space, employing a fully connected layer φ to re-encode the word representation, H* = φ(H),
where H* ∈ R^{K×C}. We adopt an attention operator to calculate the attention scores between the target word t and each label. A simple way is to calculate the dot product between H*_t and E, i.e. I_{t,j} = H*_t · e_j, and the attention vector I_t ∈ R^C is normalized by softmax.
For documents that relate to only a few labels, the remaining labels can be regarded as redundant information; in this case, filtering out the unnecessary information plays a relatively essential role.
In order to focus on fine-grained classification clues and mitigate the irrelevance and redundancy of document content, we propose adjustive attention based on the dot-product attention mechanism. The model dynamically assigns label weights to each word through adjustive attention.
Since the degree of association between a word token and a class label may affect their attention score, the adjustive attention is divided into two stages. The task of the first stage is to judge the correlation between a word and a label. We regard this task as a binary classification task, so the sigmoid function is adopted. If some of the correlation scores are less than a threshold τ, we consider the word irrelevant to those labels.
In the second stage, the attention scores are normalized by softmax as above. Therefore, the weights of irrelevant labels are reduced, and the weights of relevant labels are enlarged.
The overall operation can be summarized as follows: correlation scores r_{t,j} = σ(H*_t · e_j) are computed first; scores below τ are masked out; and the remaining dot-product scores are normalized by softmax to produce the adjustive attention weights α_{t,j}. Then, the adjustive attention is used to compute the weighted average of the label embeddings for word t,

h^l_t = Σ_j α_{t,j} e_j,
where h^l_t ∈ R^C is the label-specific word representation; this reflects the idea that different labels have inherent, distinguishable characteristics. Finally, the label-specific word sequence can be represented as H^l = [h^l_1, h^l_2, . . . , h^l_k]. The label graph embedding module encodes a label graph into label embeddings. The combination of the attention module and the label graph embedding module can be regarded as a process of clustering and aggregating: the purpose is to learn a prototype representation for each class and then, based on it, to generate the label-specific word representation, which aggregates the label semantics.
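A sketch of the two-stage adjustive attention for a single word token is given below; the masking strategy, the threshold handling, and the variable names are assumptions based on the description above rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def adjustive_attention(h_star_t: torch.Tensor,
                        label_emb: torch.Tensor,
                        tau: float = 0.5) -> torch.Tensor:
    """Sketch of adjustive attention for one word token.

    h_star_t : (C,) word representation projected into the label space.
    label_emb: (C, C) label embeddings E = [e_1, ..., e_c] from the GCN.
    Returns the label-specific word representation h^l_t.
    """
    scores = label_emb @ h_star_t          # dot-product score per label
    relevance = torch.sigmoid(scores)      # stage 1: word-label relevance
    mask = relevance >= tau                # drop labels judged irrelevant
    masked = scores.masked_fill(~mask, float("-inf"))
    alpha = F.softmax(masked, dim=-1)      # stage 2: renormalize the rest
    alpha = torch.nan_to_num(alpha)        # guard against all labels masked
    return alpha @ label_emb               # weighted average of label embeddings
```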

E. AGGREGATION LAYER
After the above steps, we obtain two kinds of word representations, H and H^l. The former captures the meaning of each word in context, while the latter focuses on the semantic relation between words and labels. This layer is designed to aggregate the information from the two aspects. For simplicity, the embeddings H and H^l are merged by concatenation,

Ĥ_t = [h_t ; h^l_t],  (13)

where Ĥ_t ∈ R^{C+D} is the final hybrid word embedding, which is then provided as input to the document encoder. We use a bidirectional long short-term memory network (Bi-LSTM) [35] as the document encoder to generate the document representation. The Bi-LSTM learns the representation of each input word from both the forward and backward directions. At time t, the hidden states can be formulated as

h→_t = LSTM→(Ĥ_t, h→_{t-1}),  h←_t = LSTM←(Ĥ_t, h←_{t+1}).

We use the final hidden state h_k to represent the whole document. Finally, we feed h_k into a classifier to predict the confidence score of each label for the document. The classifier consists of a fully connected layer and a sigmoid function,

ŷ = σ(W h_k),

where W ∈ R^{C×(C+D)} is the trainable parameter of the fully connected layer and D is the word vector dimension.
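A minimal PyTorch sketch of the aggregation layer and document encoder is shown below; the hidden size of 100 follows the reported setting, while the remaining dimensions and the use of the last Bi-LSTM output as h_k are simplifying assumptions.

```python
import torch
import torch.nn as nn

class AggregationClassifier(nn.Module):
    """Concatenate H and H^l, encode with a Bi-LSTM, predict label scores."""

    def __init__(self, bert_dim: int = 768, num_labels: int = 54,
                 hidden: int = 100):
        super().__init__()
        self.encoder = nn.LSTM(bert_dim + num_labels, hidden,
                               batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_labels)

    def forward(self, h_ctx: torch.Tensor, h_label: torch.Tensor) -> torch.Tensor:
        # h_ctx: (B, k, bert_dim) context-aware words; h_label: (B, k, num_labels).
        hybrid = torch.cat([h_ctx, h_label], dim=-1)  # hybrid word embedding
        outputs, _ = self.encoder(hybrid)
        doc_repr = outputs[:, -1, :]                  # final hidden state h_k
        return torch.sigmoid(self.classifier(doc_repr))  # per-label confidences
```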

F. LOSS
Similar to previous studies [10], we use the binary cross-entropy loss as the classification loss for the MLTC task:

L_cls = − Σ_{i=1}^{N} Σ_{c=1}^{C} [ y_{i,c} log ŷ_{i,c} + (1 − y_{i,c}) log(1 − ŷ_{i,c}) ].

Besides, we constrain the label graph embedding so that similar labels are closer together in the label semantic space and non-adjacent labels are mutually exclusive. One way to encode this property is to make the cosine similarity Φ(e_i, e_j) close to the corresponding edge weight A_ij for all i, j. The loss of the label graph embedding can be formulated as

L_graph = Σ_{i,j} ( Φ(e_i, e_j) − A_ij )^2.

As mentioned above, we regard the label embedding module and the attention module as a clustering process, which requires the label-specific word representation to be close to the centre of its category. Hence, we design another loss function, L_cluster, to measure the quality of this clustering. Finally, we define our overall loss function as the combination of these three terms.
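The classification and graph-embedding terms can be sketched as follows; the relative weighting of the terms is an assumption, and the clustering loss on the label-specific word representations is omitted here for brevity since the paper gives its own formulation.

```python
import torch
import torch.nn.functional as F

def total_loss(y_pred: torch.Tensor, y_true: torch.Tensor,
               label_emb: torch.Tensor, adjacency: torch.Tensor,
               lambda_graph: float = 1.0) -> torch.Tensor:
    """Sketch of the training objective (term weights are assumed)."""
    # Binary classification loss over all labels.
    cls_loss = F.binary_cross_entropy(y_pred, y_true.float())

    # Graph-embedding loss: cosine similarities of label pairs should match A_ij.
    e = F.normalize(label_emb, dim=-1)
    cosine = e @ e.t()
    graph_loss = F.mse_loss(cosine, adjacency)

    return cls_loss + lambda_graph * graph_loss
```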

IV. EXPERIMENTS
In this section, we evaluate the proposed model on two standard benchmark datasets AAPD [10] and RCV1-V2 [36] to verify the performance.

A. DATASETS
1) ARXIV ACADEMIC PAPER DATASET (AAPD)
The AAPD dataset is a large dataset for MLTC. It consists of 55,840 abstracts of computer science papers from arXiv. An academic paper may have multiple subjects, and there are 54 subjects in total.

2) REUTERS CORPUS VOLUME I (RCV1-V2)
This dataset consists of over 800,000 manually annotated newswire stories made available by Reuters Ltd. for research purposes. Multiple topics can be assigned to each news story, and there are 103 topics in total. Table 1 shows the descriptive statistics of the datasets used in our experiments.

B. DETAILS
For pre-processing, we use the WordPiece tokenizer to tokenize the text and lowercase all characters. Each text is limited to 510 tokens.
For model details, our model is implemented with the deep learning frameworks PyTorch Geometric (PyG) [37] and AllenNLP [38] and trained on a single GTX 1080 Ti GPU. The label embeddings are initialized with one-hot vectors. The hidden size of the LSTM is 100. We use Adam [39] as the optimizer with an initial learning rate of 2e-5, and the batch size is set to 8. We set the dropout to 0.3 to prevent overfitting and clip the gradients to a maximum norm of 5. The other parameters in our model are initialized randomly.
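These reported hyper-parameters can be wired up as in the following sketch; the model used here is a small stand-in rather than the actual HBLA network, and the data is dummy data.

```python
import torch
import torch.nn as nn

# Stand-in model illustrating the reported settings (dropout 0.3, sigmoid output).
model = nn.Sequential(nn.Linear(768, 100), nn.ReLU(),
                      nn.Dropout(p=0.3),
                      nn.Linear(100, 54), nn.Sigmoid())
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)   # initial lr 2e-5

features = torch.randn(8, 768)                               # batch size 8
targets = torch.randint(0, 2, (8, 54)).float()

optimizer.zero_grad()
loss = nn.functional.binary_cross_entropy(model(features), targets)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # clip to 5
optimizer.step()
```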

C. EVALUATION METRICS
To fairly compare the results of our model with those baselines, we adopt the Hamming Loss, Micro-Precision, Micro-Recall and Micro-F1 score as our main evaluation metrics, which are defined as below.
• Hamming-Loss measures the fraction of mismatches between the true and predicted labels:

HL = (1 / (N · C)) Σ_{i=1}^{N} Σ_{c=1}^{C} 1( y_{i,c} ≠ z_{i,c} ),

where y_{i,c} is the target, z_{i,c} is the prediction, C is the number of labels, N is the sample size, and 1(·) is an indicator function.
• Micro-Precision is the fraction of relevant instances among the retrieved instances, while Micro-Recall is the fraction of the total number of relevant instances that are actually retrieved, and the Micro-F1 score is the harmonic mean of Micro-Precision and Micro-Recall:

Micro-P = ΣTP / (ΣTP + ΣFP),  Micro-R = ΣTP / (ΣTP + ΣFN),  Micro-F1 = 2 · Micro-P · Micro-R / (Micro-P + Micro-R),

where TPs, FNs and FPs denote the numbers of true positives, false negatives and false positives summed over all labels, respectively.
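These metrics can be computed, for example, with scikit-learn, as in the following sketch with dummy predictions and targets.

```python
import numpy as np
from sklearn.metrics import hamming_loss, precision_recall_fscore_support

# Toy predictions/targets for three samples and four labels (dummy values).
y_true = np.array([[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 1]])
y_pred = np.array([[1, 0, 0, 0], [0, 1, 0, 1], [1, 1, 0, 1]])

hl = hamming_loss(y_true, y_pred)   # fraction of mismatched label assignments
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="micro")
print(f"HL={hl:.3f}  micro-P={p:.3f}  micro-R={r:.3f}  micro-F1={f1:.3f}")
```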

D. BASELINES
To prove the effectiveness of our proposed model, we compare it with the following baselines on the two aforementioned datasets: • BR decomposes the multi-label classification task into multiple independent binary classification problems [16].
• CC considers the previously predicted label values and links binary classifiers in a given sequential order [18].
• LP treats each combination of labels in the label set as a new category and transforms the multi-label problem into a multi-class problem [17].
• CNN uses a convolutional network to extract text features and then inputs the features into a fully connected layer. Finally, the sigmoid function is used to output the probability distribution over the label space [40].
• Seq2Seq Attn adapts the deep Seq2Seq model with attention mechanism to perform multi-label classification [41].
• SGM views the multi-label classification task as a sequence generation problem, and applies a novel decoder structure and an attention mechanism to solve it [10].

V. RESULTS
The quantitative results of our model and all the baselines are shown in Table 2 and Table 3. In each case, the best result is shown in boldface. The results show that the HBLA model outperforms most of the existing methods and reaches a new state-of-the-art performance on the main evaluation metrics. Taking the experimental results on the AAPD dataset as an example, we observe the following: • The deep learning models achieve better performance than the conventional models on most metrics. Conventional methods are inadequate to cope with increasingly complex data, so deep learning approaches are becoming increasingly popular. They can take full advantage of the supervision from the training set and capture more precise features and deeper semantic information from texts.
• Compared with CNN, which only considers the document content, HBLA reduces the Hamming Loss by 12.9% and improves the Micro-F1 score by 12.1%. This shows that modeling label semantic correlations can bring performance improvements.
• We can conclude that the models with attention outperform the other baselines by a large margin. The adjustive attention we propose incorporates vital fine-grained classification clues to generate label-specific word representations. Compared with the other attention models, HBLA achieves a Hamming Loss reduction of 5.5% to 14.6%. We discuss the effect of the attention mechanism in detail below. We can also see that the Micro-Recall of CNN is relatively low; in addition, its limited hyper-parameter tuning and static word vectors may be another reason for its inferior performance on the classification tasks. Table 3 presents the results of HBLA and the baselines on the RCV1-V2 test set. It is clear that the proposed model is superior to the other baseline models on the main evaluation metrics. Different from these methods, HBLA incorporates label semantics to learn better label-specific feature representations, leading to notable performance improvements on all metrics.

VI. ANALYSIS AND DISCUSSION
HBLA contains two critical modules that work cooperatively, i.e., the label graph embedding module and attention module. It is necessary to demonstrate the contributions of each part for the MLTC task. In this section, we conduct some further explorations and analyses of the proposed model on the AAPD dataset.

A. EFFECT OF LABEL GRAPH EMBEDDING
To validate the contribution of the label graph embedding module, we compare our model with the following settings: 1) The BERT model: we choose the pre-trained uncased BERT-base model for fine-tuning. 2) Removing the attention module and the label graph embedding module (namely HBLA-A): this model can be seen as a combination of BERT and BiLSTM, where only the semantic features of the document are used for classification, without considering the labels. 3) Removing the GCN layers from the label graph embedding module (namely HBLA-B): the semantic feature of each label is represented by a one-hot vector, a common category encoding that assumes labels are independent of each other. To isolate the effect of word embedding, we also use the pre-trained Glove word vectors to encode the words of documents (words that do not appear in Glove are initialized randomly following a uniform distribution). Similar to the above settings, we remove the related modules respectively (namely Glove-A and Glove-B); the Glove model means using only the Glove word vectors to encode documents, while the rest of the HBLA modules are unchanged.
As shown in Table 4, we notice the gap between the HBLA-family and the Glove-family model. Compared with Glove-A, HBLA-A achieves a Hamming Loss reduction of 13.3% and a Micro-F1 score improvement of 11.2%. HBLA-B and HBLA have a similar trend, which indicates that BERT could better represent the word semantics than Glove, and thus it is easier to capture the correlations between words and labels.
Although a simple one-hot vector is applied to represent the label feature, the type B models (Glove-B and HBLA-B) perform better than the type A models (Glove-A and HBLA-A) on the main evaluation metrics. That is to say, the one-hot vector can partly represent label features, even if the labels are independent of each other. Meanwhile, the attention module can still aggregate such label features to generate label-specific word representations.
These observations further validate that label embedding and the attention mechanism play central roles in the MLTC task. Both type A models only focus on the representation of the document and ignore the effect of labels on classification. In addition, adding label embedding yields an evident gain over the basic BERT model, providing a significant improvement for classification on this dataset. Our label graph embedding module ensures that HBLA can capture the correlations among labels, including label structure and semantic information.

B. EFFECT OF ATTENTION MECHANISM
We conduct two types of attention experiments to investigate the effect of the attention mechanism. First, we design the HBLA_average model without the attention mechanism (the corresponding semantic vectors are simply averaged) to explore whether it is necessary to establish a relationship between labels and the word-level information of a document. Then, to further verify the effectiveness of adjustive attention, we replace it in the HBLA model with the normal ''dot-product attention'' and name the resulting model HBLA_dot-attn. Table 5 reports the results under the various attention mechanisms. The results of HBLA_dot-attn and HBLA show that extracting the semantic relation between each word and label is essential and profoundly meaningful. Besides, the adjustive attention of HBLA can focus on fine-grained text information and further improve the performance. By contrast, the HBLA_average model directly averages the label embeddings instead of aggregating different label information for each word, which causes the attention module and the label embedding module to introduce noise into the model; nevertheless, relying on the powerful pre-trained BERT model, HBLA_average still achieves a passable result.
To better demonstrate that adjustive attention can reasonably allocate attention weights across different labels, we pick an example from the AAPD dataset and visualize the attention. As shown in Fig. 2 and Fig. 3, the attention weights of the correlative labels against words are illustrated; a darker colour indicates a higher weight on the keywords. Compared to dot-product attention, we can observe that the adjustive attention model reduces the influence of other label items on the weight distribution of the current words and enhances the weights between label items and their corresponding words. In this case, the words coding, applications and cryptography support tagging this document with cs.it.

VII. APPLICATIONS IN THE MEDICAL FIELD
The novel Coronavirus Disease 2019 (COVID-19) presents an urgent threat to global health [42]. Applying deep learning technology in the medical field has become an essential trend for mitigating the burden on the healthcare system. To demonstrate the practical value of our model, we apply HBLA to a real health care scenario: predicting ICD-9 codes (i.e., disease diagnoses) from a patient's electronic medical records (EMRs). Since an EMR usually contains multiple disease codes, we treat predicting diagnostic codes as an MLTC task.
We evaluate HBLA on the publicly available MIMIC-III dataset [43], which contains 58,976 ICU EMRs. Each EMR includes the clinical text of one patient, and we only focus on the discharge summaries, which record information about a hospital stay.
To compare with previous works [44], [45], we use the top 50 codes for experiments which results in 8,066 EMRs for training and 1,729 for testing. The baseline methods include Logistic Regression, CNN, Bi-GRU, Attentive LSTM [44] and Convolutional Attention (CAML) [45]. The last two baseline methods specialize in ICD-9 code prediction task. We apply the F1 score and AUC (area under the ROC curve) to validate performance.
The results are shown in Table 6; the Logistic Regression baseline performs worse than all of the deep learning methods. HBLA provides excellent performance on most metrics, which shows that our model is universal and practical. However, the average length of the discharge summary free text is more than 2,000 tokens, while the word embedding module of HBLA limits the sequence length to 512 tokens, so some information is lost due to truncation. At the same time, because of the particularities of the medical domain, the general pre-trained model is not fully satisfactory here. If a biomedical domain corpus is used to learn the word representations (e.g., http://evexdb.org/pmresources/vec-space-models), they would serve better for the classification of domain-specific text documents.

VIII. CONCLUSION
The application scenarios of multi-label classification are very broad, and it is a hot topic in the field of natural language processing. Our model introduces pre-trained BERT to obtain word context as well as in-depth semantic information. Moreover, capturing label semantics plays a crucial role in MLTC. To better model this information, we propose an adjacency-based similarity method to construct the label graph and obtain the semantic embeddings of the labels by using GCN. We propose a novel adjustive attention mechanism to explicitly calculate the semantic relationship between labels and documents, capturing useful label-specific information and suppressing noise. The final hybrid word representation is used for classification. We also explore the effect of label embedding and adjustive attention. Experimental results on two multi-label classification datasets demonstrate the superiority of HBLA.
In the real world, imbalanced data is ubiquitous and may deteriorate the performance of conventional classification algorithms; imbalanced multi-label data often exhibits a more skewed distribution than single-label data. In future work, we will verify the improvement of the model on single-label tasks and investigate its performance on imbalanced datasets.
LINKUN CAI was born in 1996. She is currently pursuing the master's degree in software engineering with Zhengzhou University, China. Her research interests include artificial intelligence and natural language processing.
YU SONG was born in 1969. He received the M.S. degree in applied mathematics from the Huazhong University of Science and Technology, Wuhan, China, in 2001.
He is currently an Associate Professor and a Ph.D. Supervisor with Zhengzhou University. His main research interests include artificial intelligence and machine learning.
TAO LIU was born in 1996. He is currently pursuing the master's degree in computer technology with Zhengzhou University, China. His research interests include artificial intelligence and natural language processing.
KUNLI ZHANG was born in 1977. She received the Ph.D. degree in software engineering from Zhengzhou University, Zhengzhou, China, in 2019.
She is currently a Lecturer with Zhengzhou University. Her main research interests include artificial intelligence and natural language processing.