PML-ED: A partial multi-label learning method using the Encoder-Decoder framework and exploring label correlations



Introduction
Traditional multi-label learning (MLL) deals with the problem of assigning multiple labels to each instance simultaneously by assuming that each instance in the training set is accurately labeled with its relevant labels. However, this assumption is too strict to hold in most application scenarios. In practice, due to the ambiguity of data features or the oversight of annotators, irrelevant labels often exist in real-world data. For example, an annotator may not be sure whether the dog in Fig. 1 is an Alaskan Malamute or a Husky. Omitting the label would lose information, while annotating at random would mislead the classifier. As a result, annotating all possible labels may be a practical choice. Weakly supervised data of this kind is very common in the real world [14,20,36], and partial multi-label learning (PML) serves as a framework for handling such situations within weakly supervised learning, where each instance is assigned a set of candidate labels of which only a portion are relevant. The core task of PML is thus to mitigate the effect of noisy labels so as to train an effective multi-label classifier.
PML, first proposed in 2018 [36], has become a cutting-edge research issue in machine learning. It can be considered as a fusion of MLL [10,48] and partial label learning (PLL) [2,4]. Specifically, while both MLL and PML models perform multi-label classification, the former learns from the true label set whereas the latter learns from candidate label sets containing pseudo-positive labels. Both PLL and PML models train a classifier from datasets with noisy labels. However, while PLL has exactly one ground-truth label in each candidate label set, PML can have multiple ground-truth labels. The PML task is therefore more challenging than either MLL or PLL.
To solve PML problems, an intuitive strategy is to treat the candidate label set as the true label set and use off-the-shelf MLL methods to derive a multi-label classifier. However, the classifier may then be misled by the noisy labels in the candidate label set, greatly deteriorating its performance. In recent years, many PML methods have been proposed, which can be roughly divided into two categories: pipeline methods (PPMs) [3,13,34,46] and joint recognition methods (JRMs) [11,12,16,17,19,21,27-30,33,36-45,49]. A PPM is a two-stage model: it first identifies highly credible labels from the candidate label set, through methods such as an iterative annotation matrix [46] or KNN minimum-error reconstruction [3,13,34], and then applies existing MLL algorithms to the classification task. However, most PPMs adopt a one-hot style to express labels. This not only fails to reflect the complex correlations among labels, but also propagates the label prediction errors generated in the first stage, thus degrading the overall model performance.
JRMs presuppose a specific relationship between sample features and labels and optimize PML models under this constraint. For example, some works [16,19,21,22,36,45] assume that label co-occurrence and feature similarity should be consistent. Others [28,29,37] assume sparse label noise and low-rank true labels. It has also been assumed that sample features and labels can be decomposed into a shared subspace [27,44,49]. While these methods [3,13,16,19,21,22,28,29,34,36,37,45,46,49] have made great progress in handling PML problems, their strong prior assumptions limit their applicability to diverse application scenarios with many noisy labels, and prevent them from effectively exploiting high-order label correlations to improve classification performance.
In this work, we propose a novel PML method based on the universal Encoder-Decoder framework (PML-ED), which sets out to address the above-mentioned drawbacks of PPMs and JRMs simultaneously. The main contributions of this paper are summarized as follows.
1. A universal PML model under the Encoder-Decoder framework is proposed, where fewer prior assumptions and less inductive bias make it applicable to different scenarios with many noisy labels.
2. An exploration method for high-order label correlations is proposed. It introduces the Transformer-Block and Conditional Layer Normalization (CLN) to extract high-order label correlations, which enhances classification performance.
3. A method for generating a credible label distribution is proposed, using a KNN label attention mechanism in the label space, which makes label prediction more accurate than the one-hot style.
4. An in-depth experimental analysis is carried out on twenty-eight datasets across five evaluation criteria, providing a comprehensive comparison among nine PML methods.
To our knowledge, this work is the first to introduce the Encoder-Decoder framework to handle PML problems. It not only extracts the label probability distribution and explores high-order label correlations by using the CLN model and the KNN label attention method, but also relaxes the prior assumptions so as to handle various PML application scenarios with many noisy labels.
Fig. 1. An example of a PML dataset.
Z. Wang et al.
The rest of this paper is organized as follows. The related work on PML is described in Section 2 and the principles of the proposed PML-ED model are described in Section 3. In Section 4, we describe the experimental environment, the evaluation criteria, and the benchmark datasets. We also present a comprehensive comparison of experimental results, an analysis of time complexity, a discussion of model interpretability, a convergence analysis, and a statistical test of the experimental results. Limitations and discussion are detailed in Section 5. We conclude in Section 6 with an indication of our contribution to this research area and our future work.

Pipeline methods (PPMs)
For the PPMs, He et al. [13] proposed an approach to discriminatively relabel for PML (DR-PML), which adopted the idea of Bootstrap to identify the true labels iteratively. Specifically, DR-PML initially took the candidate labels as the true labels, used them to train a linear classifier, and applied a soft-sign thresholding operator to enlarge the differences among label confidences. It then re-predicted the true labels from the candidate label set and iterated the above process. However, a simple threshold operator cannot handle complex label noise. The PARTICLE model [46] first adopted iterative label propagation to estimate the labeling confidence of the candidate labels of each PML training example. Using high-confidence labels, it then induced an MLL classifier via pairwise label ranking coupled with virtual label splitting or maximum a posteriori (MAP) reasoning. However, PARTICLE only extracted second-order label correlations, and this insufficient exploration of label couplings weakened classifier performance. To further explore label correlations, Wang et al.
[34] presented a two-stage discriminative and correlative PML (DRAMA) algorithm, which first utilized the feature manifold to learn a confidence value for each label and then adopted a gradient-boosting model to fit the label confidences. To explore label couplings, DRAMA augmented the feature space with the previously elicited labels on each boosting round, but it rarely considered noise in the features, which may mislead label identification in the second stage. GRADIS [3] utilized a strategy similar to PARTICLE [46], the difference lying in the second stage, which was built on the concept of multi-view learning and integrated clustering information. PAMB [18] utilized binary decomposition to handle PML training examples, adopting error-correcting output codes (ECOC) to transform the PML problem into a number of binary learning problems.
In summary, all the above PPMs [3,13,18,34,46] adopt a one-hot style to characterize label information and are therefore unable to explore the complex correlations among labels. In addition, errors in the first stage of a PPM easily propagate to the second stage, especially when the dataset contains many noisy labels and the model cannot correctly estimate label confidences.

Consistency assumption of label co-occurrence and feature similarity
The first category [16,21,22,36,45,49] is based on the consistency assumption of label co-occurrence and feature similarity, attempting to model the correlations between instance features and labels. Xie et al. [36] formulated the PML problem for the first time, combining label ranking with confidence to identify true labels. They proposed a PML-fp method that calculates label confidence using the label ranking loss and obtains label correlations through the confidence and label co-occurrence relationship. Another model, PML-lc, assumed that each label has a feature prototype and utilized it to extract label correlations, where the feature prototype is the average of all feature vectors associated with a specific label. The HALE model [22] constructed a graph structure using the similarities between features and labels, expanding the traditional probabilistic graph matching algorithm from one-to-one constraints to many-to-many constraints. The PML-LFC model [45] assumed that noise comes from samples with low feature similarity and low label similarity. It estimated the confidence values of the relevant labels of each instance using similarity from both the label and feature spaces and trained the classifier with the estimated confidence values. The NATAL model [21] assumed that all labels are true and that features are incomplete, transforming PML into a feature completion problem. NATAL also constrained the missing feature matrix to be low-rank, the completion matrix to be sparse, and label co-occurrence to be consistent with feature similarity. To address the defect of co-occurrence, the PML-SALC model [49] recognized that label correlation is asymmetric; it obtained label confidence using a semi-symmetric matrix of label correlation to mitigate the negative impact of noisy labels. Li et al.
[16] considered that the similarity between two instances may be different in different label spaces and tried to extract the label-specific features for disambiguation.
Thus, the first category of JRMs [16,21,22,36,45,49] performs label disambiguation by assuming consistency between the label co-occurrence relationship and feature similarity. However, such assumptions often mislead the models under the influence of feature noise and label noise. For example, label co-occurrence is extremely low in a highly sparse label space, so these models can overlook correct samples. They therefore struggle to effectively utilize complex label correlations and also fail to handle large amounts of noise in the samples.

Noise sparse assumption
The second category [28,29,33,37] assumes that the label noise is sparse. The PML-LRS model [29] pointed out that the matrix of irrelevant labels is sparse while the matrix of relevant labels is low-rank. Specifically, PML-LRS, based on linear regression, regarded the candidate label matrix as the sum of the true label and noise label matrices, minimizing the nuclear norm of the true label matrix, the 1-norm of the noise label matrix, and the nuclear norm of the parameter matrix. The PML-NI model [37] considered the parameter matrix as the sum of a multi-label classifier and a noise recognizer, minimizing the nuclear norm of the classifier matrix and the 1-norm of the noise recognizer matrix. To improve noise identification, the PML-NSI model [28] decomposed the feature matrix into true feature and noise feature matrices, constraining the former to be low-rank and the latter to be sparse. A manifold representation was obtained through KNN minimum-error reconstruction, and the similarity between adjacent manifolds was minimized to keep the internal structure unchanged. Wang et al. [33] addressed the issue of high-dimensional feature spaces, which previous methods ignored, by proposing the feature selection method PMLFS. The method projected the feature space into true label and noise label matrices, constraining the former to be low-rank and the latter to be sparse.
As mentioned above, the second category [28,29,33,37] can handle sparse noise but cannot handle a large number of noisy labels: when noisy labels are numerous, the noise label matrix and its sparsity constraint may fail to work. In addition, since these methods cannot utilize high-order label correlations, they may incorrectly identify label noise.

Low rank assumption
The third category of JRMs [11,17,19,27,30,41,44] assumes that the relationship between labels and features can be captured in a low-rank subspace, thereby filtering out label noise. The fPML model [44] assumed that labels and features can be represented in a unified space in which the noise vanishes under a low-rank constraint; it thus converted the instance-label and instance-feature matrices into low-rank matrices in the same space, turning PML into a low-rank matrix decomposition problem. The IMVPML model [19] treated PML as a problem of incomplete views and redundant labels, employing non-negative matrix factorization (NMF) to learn shared subspaces from incomplete views and using the true label matrix to identify true labels. Considering that a single low-rank subspace (as in IMVPML) cannot handle data from multiple subspaces, the GLC model [27] jointly utilized global and local labels to enhance classifier performance. Specifically, from a global perspective, GLC represented the candidate labels by combining label coefficient matrices over multiple subspaces with a noise matrix, where the label coefficient matrices are low-rank and the noise matrix is sparse; from a local perspective, it obtained label correlations by learning the consistency between the label coefficient matrices and the prediction model. The PML-LMNNE model [11] projected features and labels into a low-dimensional embedding space, pushing the features of an instance closer to its own labels and further from those of its nearest-neighbor instances. However, PML-LMNNE only considered local structure when computing nearest-neighbor similarity, which leads to poor performance on data with complex distributions. The PML-LCD model [41] computed the label matrix under a low-rank constraint. It abandoned the sparsity assumption on the noise matrix and showed better robustness to label noise than sparsity-based methods such as PML-LRS [29]. The PML-LCom model [30] first modeled the sparse noise label matrix and then reduced the label dimensions to a low-rank space; however, compressing the labels can lose label information. Considering noise in the feature space, Li et al. proposed the MUSER model [17], which mapped the original noisy feature space to a feature subspace by exploiting correlations between features, thereby reducing feature noise and generating discriminative features.
Thus, it can be inferred from the above that the subspace models of the third category [11,17,19,27,30,41,44] are only adaptable to independent data within each subspace of the high-dimensional space, and they have difficulty recognizing certain nonlinear low-dimensional manifolds [4].

Smoothness assumption
Under the smoothness assumption, similar samples in the feature space should have similar labels. Most PML models adopt the smoothness assumption alongside other prior assumptions; only the label distribution learning methods [38,39] rely on this assumption exclusively. The PML-LD model [38] reconstructed the feature similarity matrix using locally similar features and then adopted a Laplacian algorithm to constrain the correlation between the label distribution and label similarity matrices. After obtaining the label distribution, the probability of each label was extracted using multiple-output regression (MOR). However, PML-LD did not utilize the label distribution to explore label correlations or combine them with the MOR stage. To make the label distribution differentiable, the PENAD model [39] mapped the local linear relationships among labels to the label distribution, transferring the local topological structure of the feature space to the label distribution space. However, the MOR method in PENAD is too simple to handle complex noise.

Deep learning based JRMs
Recently, some deep learning (DL)-based PML methods have been proposed. For example, the PML-SE method [43] adopted the Mean Teacher approach [31] to train student and teacher networks to reduce label prediction bias. The PML-MT method [40] was also based on the teacher-student model; compared with PML-SE, it added output-label consistency between the student and teacher networks. However, the teacher networks of PML-SE and PML-MT are easily affected by the gradients of noisy labels. The PML-GAN model [42] improved prediction performance by fitting the bidirectional mapping between input features and output labels. However, the generator of PML-GAN may generate noisy features, misleading the discriminator and prediction networks and causing the entire Generative Adversarial Network (GAN) to misclassify labels.

Summary
As discussed in Sections 2.2.1-2.2.3, the JRMs [11,16,17,19,21,22,27-30,33,36,37,41,44,45,49] construct PML models under strong prior assumptions, such as consistency between label co-occurrence and feature similarity [16,21,22,36,45,49], low-rank subspaces [11,17,19,27,30,41,44], and sparse noise [28,29,33,37]. These assumptions make it difficult for them to explore high-order label correlations and narrow their application scenarios. In addition, almost all the aforementioned JRMs adopt a one-hot style to express the authenticity of labels; such absolute label predictions are prone to error propagation. These drawbacks hinder the JRMs' ability to explore high-order label correlations and limit their applicability to various scenarios, such as identifying nonlinear low-dimensional manifolds [15]. The JRMs based on the smoothness assumption [38,39] can use local features to filter label noise, but the label distribution has not been utilized to extract label correlations that may enhance classification performance in PML. Among the recently proposed deep learning-based JRMs [40,42,43], the teacher-student models [40,43] do not filter noise in the label space and their performance is affected by outliers. Without using correlations to filter noise in the feature and label spaces, PML-GAN [42] may mislead the discriminator and prediction networks.
In summary, current PML studies have two drawbacks: (1) strong prior assumptions make PML methods suitable only for specific application scenarios; (2) insufficient exploration of label correlations weakens the effectiveness of label disambiguation. To address these drawbacks, we propose a novel PML method based on the Encoder-Decoder (ED) framework (PML-ED), where ED is a common DL framework without strong prior assumptions. Specifically, PML-ED first derives the label probability distribution (rather than a one-hot representation) using the KNN label attention mechanism in the label space. It then explores high-order label correlations through CLN feature enhancement. It further introduces and implements a highly versatile ED framework to reduce the prior assumptions on label noise and handle a wide range of application scenarios. Section 3 discusses the principle of PML-ED in detail.

The principle of the PML-ED method
We denote X = R^d as a d-dimensional instance feature space and Y = {l_1, l_2, ..., l_q} as a label space with q labels. The training dataset of PML is defined as D = {(x_i, ŷ_i) | 1 ≤ i ≤ n}, where x_i ∈ X represents the i-th instance with d-dimensional features and ŷ_i ∈ R^{1×q} represents its candidate label set. We also assume that y_i ∈ {0, 1}^{1×q} (y_i ⊆ ŷ_i) is the true label set of x_i. The goal of PML is to obtain a multi-label classifier f : X → Y from the training dataset D.
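To make this setting concrete, the following toy numpy sketch builds a candidate label matrix that is a superset of the true labels. All values here (the matrices, the 30% flip rate) are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, q = 4, 5, 3                       # instances, feature dim, label count
X = rng.normal(size=(n, d))             # instance features, rows x_i in R^d

# True label matrix: y_i in {0,1}^{1 x q} (unknown at training time).
Y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])

# Candidate labels ŷ_i: a superset of the true labels, obtained by
# flipping some irrelevant labels to 1 (false positives).
noise = (rng.random(Y_true.shape) < 0.3).astype(int)
Y_cand = np.clip(Y_true + noise, 0, 1)

# Every true label is contained in the candidate set: y_i ⊆ ŷ_i.
assert np.all(Y_cand >= Y_true)
```

The classifier must recover Y_true from (X, Y_cand) alone.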

Overall Description of the PML-ED method
Encoder-Decoder is a deep learning (DL) framework. As the basis of classic DL models, it is commonly used for sequence-to-sequence learning tasks, such as NMT [1], the Transformer [32], and T5 [23]. In this work, we innovatively implement and apply the Encoder-Decoder framework to extract label correlations and execute label disambiguation in both the feature and label spaces.
Fig. 2. The architecture of the PML-ED method.
The proposed PML-ED method consists of two stages. In the first stage, a credible label distribution is learned from the candidate labels (Section 3.2). In the second stage, the Encoder-Decoder framework extracts high-order label correlations, and the coordinated information of the feature and label spaces is used to derive the true labels of test instances (Sections 3.3 and 3.4). The overall architecture of the PML-ED method is depicted in Fig. 2, and each step is explained in detail in the sections that follow. In addition, the process by which PML-ED handles noisy labels is discussed in Section 3.5.

Learning of credible label distribution
In the classic Transformer model [32], the attention mechanism is applied only in the feature space. In this work, it is used in the label space, for the first time, to extract a credible label distribution. In PML applications, most of the feature information of the samples is credible and the corresponding feature noise is sparse. We therefore exploit the similarity relationships among sample features to derive the label distribution, reducing the interference of noisy labels.
The proposed PML-ED model uses the KNN label attention mechanism [35] to assign label weights based on the feature similarity among samples and to calculate the credible label distribution from the labels of neighboring samples. For each sample x_i in the training set D, its k nearest neighbors (KNN) are chosen from D according to the Euclidean distance between x_i and the other samples in D.
Suppose that N_i represents the set of KNN samples of x_i and x_t ∈ N_i. As defined in Eq. (1), the feature vectors of the samples in N_i constitute K_i and their corresponding labels form V_i. The credible label distribution p_i of x_i is then defined in Eq. (2), where softmax(·) denotes the softmax function.
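Since Eqs. (1)-(2) are not reproduced above, the following numpy sketch assumes a standard dot-product attention form, softmax(K_i x_i) V_i, which matches the description in spirit; the function name and all numeric values are illustrative.

```python
import numpy as np

def credible_label_distribution(X, Y_cand, i, k=3):
    """Sketch of KNN label attention: weight the candidate labels of the
    k nearest neighbours of x_i by feature similarity (softmax over the
    dot products), in the spirit of Eqs. (1)-(2)."""
    x_i = X[i]
    dist = np.linalg.norm(X - x_i, axis=1)
    dist[i] = np.inf                      # exclude the sample itself
    nbrs = np.argsort(dist)[:k]           # indices of N_i
    K_i = X[nbrs]                         # neighbour features (k x d)
    V_i = Y_cand[nbrs].astype(float)      # neighbour labels   (k x q)
    s = K_i @ x_i                         # attention logits
    w = np.exp(s - s.max())               # numerically stable softmax
    w = w / w.sum()
    return w @ V_i                        # credible distribution p_i

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 4))
Y = (rng.random((10, 6)) < 0.4).astype(int)
p = credible_label_distribution(X, Y, i=0, k=3)
```

Each entry of p lies in [0, 1]: it is a convex combination of the neighbours' binary labels, so a label supported by many similar neighbours receives a high score.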

Exploration of label correlations under the Encoder-Decoder framework
In the second stage, p_i is input into the Encoder-Decoder framework to extract high-order label correlations: the Encoder module obtains the label semantic embedding of the sample features, and the Decoder module is responsible for decoding the label correlations.

Encoder
The Encoder module converts x_i into the label semantic matrix L_i ∈ R^{q×h}, where h is the size of the semantic embedding. Specifically, a multi-view feature extractor, implemented by a convolution layer, is used by the Encoder for semantic extraction. As indicated in Eq. (3), a residual connection is made between the original features x_i and the local features derived by convolution and maximum pooling, to avoid losing the original information. In Eq. (3), e (1 ≤ e ≤ d) is the index of the sliding window and j (1 ≤ j ≤ q) is the index of the view. x_i[a:b] represents the vector truncated by the sliding window over the subscript range [a, b], w is the size of the sliding window of the convolution layer, W_j ∈ R^{w×d} is the convolution parameter, b_j ∈ R^{1×d} is the bias parameter, and ReLU(·) is the ReLU function. max(T) denotes the function that takes the maximum value of each column of the matrix T, yielding the salient features under each view. The q feature extractors acquire features from q different views and project them onto the original feature space. As shown in Eq. (4), the semantic vector of label l_j is denoted l_j, and the semantic vectors of all q labels form the label semantic matrix L_i ∈ R^{q×h}, where W_l ∈ R^{d×h} and b_l ∈ R^{1×h} are learnable parameters.
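Since Eqs. (3)-(4) appear only in outline here, the following numpy sketch shows one plausible reading of the multi-view Encoder. As a simplifying assumption, each view's convolution parameter is a vector of length w rather than the W_j ∈ R^{w×d} stated above; all shapes and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
d, q, h, w = 8, 4, 6, 3        # feature dim, views/labels, embedding, window

def relu(z):
    return np.maximum(z, 0.0)

def encoder(x, Wc, bc, Wl, bl):
    """Sketch of the multi-view Encoder: each of the q views slides a
    width-w window over x, applies convolution + ReLU, max-pools over
    positions, adds a residual connection to x, and projects the result
    to an h-dimensional semantic vector (cf. Eqs. (3)-(4))."""
    L = np.zeros((q, h))
    for j in range(q):
        # all window positions e, stacked into a (d-w+1, w) matrix
        windows = np.stack([x[e:e + w] for e in range(d - w + 1)])
        conv = relu(windows @ Wc[j] + bc[j])   # per-position responses
        c_j = conv.max()                       # max-pooled salient feature
        z = x + c_j                            # residual connection
        L[j] = z @ Wl + bl                     # semantic vector of label l_j
    return L                                   # label semantic matrix L_i

Wc = rng.normal(size=(q, w))       # per-view convolution parameters
bc = rng.normal(size=(q,))
Wl = rng.normal(size=(d, h))       # shared projection W_l, b_l
bl = rng.normal(size=(h,))
x = rng.normal(size=(d,))
L_i = encoder(x, Wc, bc, Wl, bl)   # shape (q, h)
```
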

Decoder
The Decoder is responsible for further decoding the label semantics extracted by the Encoder module and exploring the high-order correlations in the label semantic embedding. Firstly, the Decoder module uses maximum pooling to extract the salient semantic features as the label representation vector I_c ∈ R^{1×h} (Eq. (5)).
Conditional Layer Normalization (CLN) [26] and the Transformer-Block [32] are then used to extract the label semantics layer by layer, where CLN transforms the label distribution derived by the Transformer-Block module to obtain the high-order correlations among labels. The recursive update of the Decoder output H^l is given in Eq. (6); it terminates when H^dec is obtained, where dec is the total number of Decoder layers.
Specifically, CLN dynamically generates a conditional gain γ_c and bias β_c from the input vectors, changing the distribution of the original input through the learned mean and variance. In this way, the given label representation vector I_c is encoded into γ_c and β_c and then integrated into the label semantic representation. The detailed process of CLN is defined in Eq. (7), where ⊙ denotes the element-wise product. The Transformer-Block consists of an attention layer and a fully connected feed-forward network (FFN) layer. For the recursive input label semantic representation L ∈ R^{q×h}, the attention layer reduces the attention paid to unimportant labels while maintaining the attention on the current labels, as defined in Eq. (8), where W_q, W_k, and W_v ∈ R^{h×h} are the weight matrices of Q, K, and V. Residual connection and layer normalization, defined in Eq. (9), are then used to improve the stability of the attention module, and the result is fed to the feed-forward network of the Transformer-Block. The FFN layer provides the capability of nonlinear transformation: for an input matrix L ∈ R^{q×h}, the FFN is defined in Eq. (10), where the weight matrices and the bias b_2 ∈ R^{1×h} are learnable parameters. Residual connection and layer normalization are also applied to the FFN layer, as defined in Eq. (11). Thus, the Transformer-Block can be defined as in Eq. (12),
which can not only obtain the label weights via the attention mechanism, but also address long-distance dependencies between feature representations. In addition, the low inductive bias of the Transformer-Block enables the PML-ED model to adapt to various application scenarios.
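The Decoder steps described above can be sketched in numpy as follows. This is a simplified reading, assuming a single attention head and linear maps for the CLN gain and bias; the exact forms of Eqs. (7)-(12) are not reproduced, and all parameter shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
q, h = 4, 6

def layer_norm(M, eps=1e-5):
    mu = M.mean(axis=-1, keepdims=True)
    sd = M.std(axis=-1, keepdims=True)
    return (M - mu) / (sd + eps)

def cln(L, I_c, Wg, Wb):
    """Conditional Layer Normalization sketch (cf. Eq. (7)): the condition
    I_c is encoded into a gain γ_c and a bias β_c that modulate the
    normalized label semantics."""
    gamma = I_c @ Wg                  # conditional gain γ_c (1 x h)
    beta = I_c @ Wb                   # conditional bias β_c (1 x h)
    return gamma * layer_norm(L) + beta

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def transformer_block(L, Wq, Wk, Wv, W1, b1, W2, b2):
    """Single-head attention + FFN with residual connections and layer
    normalization (cf. Eqs. (8)-(12), one head for simplicity)."""
    Q, K, V = L @ Wq, L @ Wk, L @ Wv
    A = softmax(Q @ K.T / np.sqrt(h)) @ V
    L = layer_norm(L + A)                         # residual + layer norm
    F = np.maximum(L @ W1 + b1, 0) @ W2 + b2      # FFN
    return layer_norm(L + F)                      # residual + layer norm

L = rng.normal(size=(q, h))                       # label semantic matrix
I_c = L.max(axis=0, keepdims=True)                # max-pooled label repr.
Wg, Wb = rng.normal(size=(h, h)), rng.normal(size=(h, h))
Wq, Wk, Wv = (rng.normal(size=(h, h)) for _ in range(3))
W1, b1 = rng.normal(size=(h, h)), np.zeros(h)
W2, b2 = rng.normal(size=(h, h)), np.zeros(h)
H = transformer_block(cln(L, I_c, Wg, Wb), Wq, Wk, Wv, W1, b1, W2, b2)
```

One Decoder layer maps L to H; stacking dec such layers yields H^dec as in Eq. (6).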

Model training and label prediction
Finally, the label semantic matrix H^dec is projected onto the label space R^{1×q}, where W_proj ∈ R^{1×h} and b_proj ∈ R^{1×q} are the learnable parameters. To train the PML-ED model, a global optimization function J(Θ) is minimized, where θ_1 and θ_2 (Θ ≜ {θ_1, θ_2}) denote the learnable parameters of the Encoder and Decoder, respectively.
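The exact form of J(Θ) appears as an equation in the paper and is not reproduced here. One plausible instantiation, shown below purely as a hypothetical sketch, is a binary cross-entropy between the projected label scores q_i and the credible label distribution p_i from the first stage:

```python
import numpy as np

def bce_loss(q_pred, p_target, eps=1e-7):
    """Hypothetical objective: cross-entropy between projected label
    scores q_i and the credible distribution p_i. This is an assumed
    stand-in for the paper's J(Θ), not its exact definition."""
    q_pred = np.clip(q_pred, eps, 1 - eps)        # numerical stability
    return -np.mean(p_target * np.log(q_pred)
                    + (1 - p_target) * np.log(1 - q_pred))

loss = bce_loss(np.array([0.8, 0.2, 0.6]), np.array([1.0, 0.0, 0.5]))
```

Because p_i is a soft distribution rather than one-hot, this kind of target penalizes confident predictions on uncertain labels less harshly.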
The flowchart and the implementation steps of PML-ED are outlined in Fig. 3 and Algorithm 1, respectively.
In the prediction process, a test sample x_t is input into the trained PML-ED model and its label scores q_t are calculated following steps 4-6 of Algorithm 1. Each element of q_t that is greater than or equal to the threshold α is predicted as a true label; otherwise, it is predicted as a false label.
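The thresholding step is a one-liner; the scores and the value of α below are illustrative only.

```python
import numpy as np

def predict_labels(q_scores, alpha=0.5):
    """Threshold the projected label scores q_t: entries >= alpha are
    predicted as true labels (1), the rest as false labels (0)."""
    return (np.asarray(q_scores) >= alpha).astype(int)

pred = predict_labels([0.9, 0.3, 0.5, 0.1], alpha=0.5)
# pred is [1, 0, 1, 0]
```
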

Discussion of handling noisy labels
Robust handling of noisy labels is essential in PML scenarios. In the credible-label-distribution learning stage (Section 3.2), PML-ED integrates a label reconstruction process that identifies and corrects label inconsistencies. Specifically, the proposed method averages the candidate labels of each instance using a similarity-based weighting and adopts the KNN label attention mechanism to derive the credible label distribution.
Subsequently, we introduce a noise-adaptive Encoder layer into PML-ED to mitigate the effects of noisy labels. This layer assesses the contribution of each label based on its consistency with the label semantics encoded by the Encoder. The low prior assumptions of the Transformer make it suitable for handling scenarios with a large number of noisy labels. As a result, PML-ED learns to prioritize reliable labels while reducing the influence of outliers or noise.
Fig. 3. Flowchart of the PML-ED method.
In the Decoder layer of PML-ED, we introduce conditional layer normalization and a Transformer-Block. Specifically, the layer uses maximum pooling to extract significant features from the semantic representation matrix as the overall label representation. The Decoder layer then utilizes conditional layer normalization and the Transformer-Block to extract label semantics layer by layer. Conditional layer normalization normalizes the label semantic vector output by each Encoder layer, enhancing the correlation between the overall label representation and the current label. The Transformer-Block, with its multi-head attention mechanism, reduces the weight of irrelevant labels while maintaining the weight of the current label representation, which enables the model to focus on the label representations related to the label features.
In addition to the above methods for handling noisy labels, the Encoder-Decoder framework inherently possesses a degree of robustness to noise, thanks to the label embeddings' ability to capture semantic relationships between labels.

Experimental results and analysis
We start this section by describing the compared PML methods, the experimental environment, the evaluation criteria, and the datasets. We then conduct a comprehensive experimental comparison using five evaluation criteria and 28 benchmark datasets.

Description of compared PML methods and experimental environment
Nine state-of-the-art PML methods were chosen for the comparison of classification performance. The compared methods fall into three categories: (1) PPMs, e.g., P-VLS, P-MAP, and PAMB; (2) JRMs, such as PML-LCom, PML-LFC, PML-NI, and NATAL; and (3) deep learning-based methods, including PML-MT and PML-GAN. The parameters of the compared methods are consistent with those reported in their respective original papers and are summarized in Table 1. Ten-fold cross-validation was adopted to enhance the objectivity of the comparison. The mean and standard deviation of each evaluation criterion are given in the experimental results.
To comprehensively compare PML-ED with other state-of-the-art PML methods, the experimental comparison is divided into two parts: a comparison with classic PML methods (Section 4.4.2) and a comparison with DL-based methods (Section 4.4.3). The former covers 28 datasets, while the latter covers 11 datasets. For the comparison with DL-based methods, the experimental results of PML-MT and PML-GAN are sourced from their respective original papers [40,42].

Description of evaluation criteria
In this work, five MLC criteria were used to evaluate the PML methods: Hamming Loss (HL), Ranking Loss (RL), One Error (OE), Coverage (Cove), and Average Precision (AP). In Eq. (16), f is the predicted probability that the test instance belongs to each label, with the values sorted in descending order; rank_f(x_i, l) denotes the resulting rank of label l, and |·| denotes the number of elements in a set. Specifically, Hamming Loss computes the average number of times that labels are misclassified, where Δ denotes the symmetric difference between two sets. Ranking Loss computes the average frequency with which irrelevant labels are ranked before relevant labels, where ȳ_i is the complement of y_i in L. One Error calculates the average number of times that the top-ranked label is irrelevant to the test instance.

Table 1
Parameters of compared PML methods.
Coverage calculates the average number of steps down the ranked label list needed to find all the relevant labels of the test instance:

$$\mathrm{Cove} = \frac{1}{p}\sum_{i=1}^{p}\max_{y \in y_i}\mathrm{rank}_f(x_i,y) - 1$$

Average Precision evaluates the average fraction of labels ranked above a relevant label that are themselves relevant:

$$\mathrm{AP} = \frac{1}{p}\sum_{i=1}^{p}\frac{1}{|y_i|}\sum_{y \in y_i}\frac{\left|\left\{y' \in y_i \mid \mathrm{rank}_f(x_i,y') \le \mathrm{rank}_f(x_i,y)\right\}\right|}{\mathrm{rank}_f(x_i,y)}$$
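As a concrete reference, the five criteria can be sketched in NumPy as follows. This is a minimal illustration of the standard definitions, not the authors' evaluation code; `Y` is the 0/1 ground-truth matrix, `F` the predicted score matrix, and `H` a binarized prediction:

```python
import numpy as np

def hamming_loss(Y, H):
    """Average fraction of misclassified label slots, i.e. |h(x_i) Δ y_i| / q."""
    return float(np.mean(Y != H))

def rank_of(scores):
    """Rank of each label (1 = highest predicted score)."""
    order = np.argsort(-scores)
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(scores) + 1)
    return ranks

def ranking_loss(Y, F):
    """Average fraction of (relevant, irrelevant) pairs ranked in the wrong order."""
    vals = []
    for y, f in zip(Y, F):
        pos, neg = f[y == 1], f[y == 0]
        if len(pos) and len(neg):
            vals.append(np.mean(pos[:, None] <= neg[None, :]))
    return float(np.mean(vals))

def one_error(Y, F):
    """Average number of times the top-ranked label is irrelevant."""
    top = np.argmax(F, axis=1)
    return float(np.mean(Y[np.arange(len(Y)), top] == 0))

def coverage(Y, F):
    """Average number of ranking steps needed to cover all relevant labels."""
    return float(np.mean([rank_of(f)[y == 1].max() - 1 for y, f in zip(Y, F)]))

def average_precision(Y, F):
    """Average fraction of labels ranked at or above a relevant label that are relevant."""
    vals = []
    for y, f in zip(Y, F):
        rel = rank_of(f)[y == 1]
        vals.append(np.mean([np.sum(rel <= r) / r for r in rel]))
    return float(np.mean(vals))
```

Lower values are better for the first four criteria, while higher is better for AP, which is why the tables report them separately.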

Description of experimental datasets
Twenty-eight benchmark datasets, comprising 22 composite datasets and 6 real-world datasets, were chosen to evaluate the performance of the methods. The basic information of these 28 datasets is summarized in Table 2 (for detailed information about these benchmark and real-world datasets, please visit https://mulan.sourceforge.net/datasets-mlc.html, http://www.uco.es/kdis/mllresources/ and https://palm.seu.edu.cn/zhangml/).
For the comparison with classic PML methods (Section 4.4.2), we used the 22 composite datasets and 6 real-world datasets. We constructed the PML composite datasets by randomly adding noise labels to MLL datasets. Specifically, for each instance x_i in the MLL datasets [48], a number of noise labels equal to a% of the number of relevant labels of x_i was randomly introduced. Because real application scenarios often contain a large number of noise labels, a was randomly selected from the set {50, 100, 150, 200}, implying that the 22 composite datasets contain more label noise than those used by the compared PML methods.
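The construction step can be sketched as follows. This is a hypothetical minimal implementation; the function name and the exact rounding and capping rules are our assumptions, not the authors' code:

```python
import random

def add_noise_labels(true_labels, label_space, a, rng=random):
    """Build a PML candidate label set from an MLL instance's true labels.

    The number of injected noise labels is a% of the number of relevant
    labels of the instance, capped by the number of available irrelevant
    labels; the noise labels are drawn uniformly at random.
    """
    irrelevant = sorted(set(label_space) - set(true_labels))
    n_noise = min(round(a / 100 * len(true_labels)), len(irrelevant))
    return set(true_labels) | set(rng.sample(irrelevant, n_noise))
```

For example, with a = 100 each instance's candidate set roughly doubles in size, which matches the high noise ratios reported in Table 3.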
For the real-world datasets, YeastBP, YeastCC, and YeastMF are obtained from the task of protein-protein interaction prediction [44], and Mirflickr, Music_style, and Music_emotion are from the image retrieval task [14]. All of them have false positive labels in the real world. Specifically, for YeastBP, YeastCC, and YeastMF, the candidate labels correspond to the biological process annotations of yeast proteins, archived from different periods, in the Gene Ontology (https://www.geneontology.org). Annotations that were available historically but are absent in recent periods are considered false positive labels. For Mirflickr, Music_style, and Music_emotion, the candidate labels are collected from web users and further examined by human labelers to specify the ground-truth labels.
Additionally, as detailed in Table 2, several large-scale datasets were selected for the experiments: 11 of the datasets contain more than 5,000 instances, 12 have over 1,000 features, and 8 include more than 100 labels. As outlined in Table 3, the compiled PML datasets are characterized by a large number of noisy labels, with over 70% of them exhibiting a noise ratio exceeding 30%.
For the comparison with DL-based methods (Section 4.4.3), we use 8 composite datasets and 3 real-world datasets. To ensure a fair comparison, the composite datasets were constructed following the approach described in [40,42], thereby maintaining consistency with the original references in terms of the average number of candidate labels per instance.

Analysis of experimental results
In this section, we first analyze the parameter sensitivity (Section 4.4.1) and comprehensively compare PML-ED with the other nine state-of-the-art PML models on 28 datasets (Sections 4.4.2 and 4.4.3). We then give the time complexity comparison for the studied PML methods (Section 4.4.4). Finally, we discuss interpretability (Section 4.4.5) and analyze convergence (Section 4.4.6).

Analysis of parameter sensitivity
In order to analyze the sensitivity of the threshold parameter α (see Algorithm 1), the datasets scene, genbase, CAL500, and delicious were chosen because they represent different sizes of the true label set, namely 6, 27, 174, and 983 labels. Fig. 4 shows the classification performance of PML-ED on the four datasets. Specifically, we vary α within the interval [0.5, 1] with a step size of 0.1 and test the resulting changes in the five evaluation criteria. As α changes, PML-ED achieves relatively stable performance on HL and Cove across all four datasets. For RL, OE, and AP, the performance slowly deteriorates as α increases. Thus, as described in Fig. 4, when α = 0.5, the criteria of HL, RL, OE, and Cove achieve small values and AP obtains a large value, implying that 0.5 is a reasonable threshold for α.

Comparison with classic PML methods
Generally, as shown in Tables 4-8, PAMB demonstrates poor performance across all five criteria, obtaining the worst average rank on RL, OE, Cove, and AP and ranking seventh on HL. NATAL achieves the worst average rank on HL and ranks seventh on RL. PML-NI obtains the second-best average rank on RL, OE, Cove, and AP, but ranks sixth on HL. Although P-VLS achieves the second-best average rank on HL, its average ranks on RL, OE, Cove, and AP are 6, 4, 7, and 7, respectively. P-MAP, PML-LFC, and PML-LCom show unremarkable performance among the compared methods. Specifically, for PML-LFC, the average ranks on the 28 datasets for HL, RL, OE, Cove, and AP are 4.07, 4.18, 4.25, 4.07, and 4.11. In contrast, our proposed PML-ED model shows clear advantages on all five criteria. Its average ranks on the 28 datasets for HL, RL, OE, Cove, and AP are 2.36, 2.79, 3.04, 2.82, and 2.82, respectively, outperforming the other methods across all five criteria. For AP, PML-ED obtains the best results on 10 datasets (YeastBP, delicious, enron, Eurlex-dc, Eurlex-sm, genbase, mediamill, yahoo-Entertainment, yahoo-Health, and yeast) and ranks second on 5 datasets (YeastCC, YeastMF, scene, Water-quality, and Mirflickr). It can thus be concluded that PML-ED presents the best performance on datasets covering the domains of text, image, music, biology, video, and chemistry, implying that PML-ED can be applied to different application scenarios. It should be noted that PAMB cannot be run on CAL500, delicious, and Eurlex-dc; the corresponding results in Tables 4-8 are marked as NA (Not Available).

Comparison with DL-based methods
Table 9 shows the experimental results of PML-ED, PML-MT, and PML-GAN on 11 datasets for the five criteria. Since PML-GAN cannot be evaluated by the Cove criterion and PML-MT cannot be run on yeast and Music_style, the corresponding results in Table 9 are marked as NA (Not Available).
To present the results in Table 9 more intuitively, we further introduce a "win/draw/lose" count diagram to compare the performance of the DL-based methods, as described in Fig. 5. The basic principle is that if one algorithm is better than the others on one dataset for a criterion, it counts as a "win" on this criterion; if it is worse than the others, it counts as a "lose"; otherwise it counts as a "draw".
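The counting rule can be sketched as follows. This is a simplified illustration that compares one method against the best of the others on each dataset; the function name and the NA-skipping behavior are our assumptions:

```python
def win_draw_lose(results, method, lower_is_better=True):
    """Count wins/draws/losses of `method` against the best of the other
    methods on each dataset.  results: {dataset: {method_name: score}};
    datasets where `method` is missing (NA) are skipped.
    """
    win = draw = lose = 0
    for scores in results.values():
        if method not in scores:
            continue
        others = [v for name, v in scores.items() if name != method]
        if not others:
            continue
        best = min(others) if lower_is_better else max(others)
        mine = scores[method]
        if mine == best:
            draw += 1
        elif (mine < best) == lower_is_better:
            win += 1
        else:
            lose += 1
    return win, draw, lose
```

For HL, RL, OE, and Cove, lower values are better; for AP, `lower_is_better=False` would be used.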
As shown in Fig. 5, compared with PML-GAN and PML-MT, the proposed PML-ED achieves the best results on 7 datasets under the HL criterion, 7 under RL, 5 under OE, and 6 under Cove. Additionally, it achieves draw results on 1 dataset under HL, 2 under RL, 2 under OE, and 5 under Cove. This indicates that PML-ED has superior performance compared to PML-GAN and PML-MT.

Time complexity comparison
We give the comparison of time complexity of the nine compared PML methods in Table 10. For PML-ED, k_n represents the number of convolution kernels and k_s represents the size of the convolution kernel. k in P-VLS, P-MAP, and PML-ED is the number of nearest neighbors. r in NATAL and PML-LCom is the rank of the matrix decomposed by singular value decomposition (SVD). In PML-ED, m is the number of samples in the training set, d is the number of features, q is the total number of labels, and T is the number of iterations, as in PML-LCom, PML-MT, PML-GAN, and PAMB. In addition, l in PAMB is the number of columns of the coding matrix, and N_s is the number of support vectors.
In order to assess the real-world performance of the ten compared algorithms, we report their running times on all 28 datasets in Table 11. The experimental environment was an Intel(R) Core(TM) i9-13900K CPU @ 3.00 GHz with 64 GB of memory, an NVIDIA GeForce RTX 4090 24 GB GPU, Windows 10 64-bit, Python 3.8.10, and MATLAB R2018a. It can be seen that the running time of PML-ED is shorter than those of PML-LFC, NATAL, and PAMB on most datasets, but longer than those of PML-NI, PML-LFC, P-VLS, and P-MAP on most datasets. We also observe that the proposed PML-ED, as well as NATAL and PAMB, runs for a long time on some large-scale datasets. Nevertheless, advancements in distributed and parallel computation (DPC) technologies have significantly accelerated the execution of PML algorithms [12,25]. Speedups of up to 200 times, as reported in [19], or even 266 times according to [25], are achievable on large-scale datasets compared to traditional sequential CPU execution. For example, for the large-scale dataset Eurlex-sm, adopting the DPC technology in [12] (a speedup of up to 200 times) would reduce the real running time to about 0.79 minutes. Thus, DPC technologies can mitigate the high time complexity by enhancing the computational efficiency of PML-ED.

Interpretability discussion of PML-ED model
In existing research, neural network interpretability [47] primarily focuses on visual expression, often in the form of heat maps for image data. In this work, we use the changes in evaluation criteria caused by feature masking as the interpretability indicator for the PML-ED model. Inspired by the idea in Grad-CAM [24] that gradients reflect feature importance, we explain the effectiveness of the Encoder-Decoder framework in feature extraction and expression, using both original sample features and label semantic features. We conducted the mask-based interpretability experiments on both a benchmark dataset (yeast) and a real-world application dataset (Mirflickr).
For the original sample features, we calculate the gradient of each feature in the network for each instance. We posit that features with larger gradient values contribute significantly to network decision-making, so masking these features should cause notable changes in the evaluation indicators. Specifically, for the yeast dataset, each prediction input has size X × 103, where X represents the number of samples and 103 the number of features. During the prediction process, we calculate the corresponding gradients in the network, which also form an X × 103 matrix; each row is a 1 × 103 gradient vector reflecting the importance of the different features of the sample. We mask the sample features with the highest gradient values by replacing them with random values. We also construct a comparison group that randomly masks the same number of features. By comparing the changes in evaluation indicators caused by the two types of masks, we validate the effectiveness of the feature encoder in feature extraction. The same procedure and comparison group were applied to the Mirflickr dataset. Table 12 shows the impact of the two original-sample-feature masking methods on the yeast and Mirflickr datasets.
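The masking step can be sketched as follows. This is a hedged illustration using a linear scorer, for which the input gradient is available in closed form; in PML-ED the per-instance gradients would instead come from backpropagation through the trained network:

```python
import numpy as np

def gradient_mask(X, W, top_k, rng):
    """Gradient-guided vs. random feature masking for a linear scorer.

    For f(x) = x @ W, the gradient of the summed label scores w.r.t. the
    input is W.sum(axis=1), the same for every instance; |gradient| is
    used here as a feature-importance proxy.  The top_k most important
    features are replaced with random values (the gradient-guided mask),
    and a control group randomizes top_k randomly chosen features.
    """
    grad = np.abs(W.sum(axis=1))             # per-feature importance proxy
    idx = np.argsort(-grad)[:top_k]          # most influential features
    X_masked = X.copy()
    X_masked[:, idx] = rng.random((X.shape[0], top_k))
    rand_idx = rng.choice(X.shape[1], size=top_k, replace=False)
    X_control = X.copy()
    X_control[:, rand_idx] = rng.random((X.shape[0], top_k))
    return X_masked, X_control, idx
```

Feeding `X_masked` and `X_control` through the classifier and comparing the resulting evaluation criteria mirrors the comparison reported in Table 12.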
For the semantic features of labels extracted by the Encoder, we obtain their gradient information in the network and use the global average gradient to reflect the weight of each label vector. Specifically, for the yeast dataset, prediction data of dimensions X × 103 is encoded into label semantic feature vectors of dimensions X × 14 × 128, where X represents the number of samples, 14 is the number of labels, and 128 is the size of each label semantic feature vector. Obtaining the gradient information in the network likewise yields an X × 14 × 128 matrix. We calculate the global average gradient to represent the weight of each feature vector in the network (i.e., by taking the mean over the third dimension, of size 128), thus obtaining a weight matrix of dimensions X × 14. This matrix reflects the importance of the 14 label semantic feature vectors extracted by the Encoder from each of the X samples. Based on this, we mask the important feature vectors (replacing them with random-value vectors) and construct a comparison strategy that randomizes the same number of randomly selected feature vectors, then compare the impact of the two masking methods on the evaluation indicators. The process for Mirflickr is similar to that for yeast. Table 13 shows the impact of the two label-vector masking methods on the yeast and Mirflickr datasets.
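The weighting and masking of label semantic vectors can be sketched as follows (function names are illustrative; the averaging over the third dimension follows the description above):

```python
import numpy as np

def label_vector_weights(grads):
    """Global average gradient per label semantic vector.

    grads has shape (X, q, h): gradients w.r.t. the q label semantic
    vectors (each of size h) for X samples.  Averaging over the last
    dimension yields the (X, q) weight matrix described in the text.
    """
    return grads.mean(axis=2)

def mask_top_vectors(Z, weights, top_k, rng):
    """Replace each sample's top_k label vectors with the largest
    |weight| by random vectors (the gradient-guided mask); a control
    group would instead randomize top_k randomly chosen vectors."""
    Z_masked = Z.copy()
    for i in range(Z.shape[0]):
        idx = np.argsort(-np.abs(weights[i]))[:top_k]
        Z_masked[i, idx] = rng.random((top_k, Z.shape[2]))
    return Z_masked
```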
From the results in Tables 12 and 13, it is evident that, for the original sample features, gradient-guided masking leads to significantly larger changes (increases or decreases) in the evaluation indicators than random masking; the same holds for the semantic features of labels. The outcomes of masking both the original sample features and the label semantic features indicate that the Encoder successfully extracts the relevant features of each instance, and that the Decoder effectively extracts the high-order correlations in the label semantic features, thereby verifying the effectiveness of PML-ED.

Table 9
Results comparison for DL-based PML methods.

Table 10
Time complexity of compared methods.

Table 10 columns: PML Methods; Time Complexity (Training process, Prediction process).

Table 11
The running time (seconds) of eight compared algorithms.

Convergence analysis
To objectively analyze the convergence of the proposed PML-ED method, three datasets of different scales, genbase, CAL500, and delicious, were chosen because they represent different label-set sizes (27, 174, and 983 labels). The corresponding proportions of noisy labels are 53.99%, 51.26%, and 61.28%, respectively. Fig. 6 shows the change in the objective function value, defined in Eq. (14), on the three datasets at each iteration. The loss curves fall fast within a couple of iterations and then stabilize. Hence, the results empirically verify the convergence of the PML-ED method in practice.

Statistical test of experimental results
Two statistical tests, the Friedman test and the Nemenyi test [5], were used to examine whether the ranks of the methods differ significantly. The Friedman statistics F_F and the corresponding critical values for each evaluation criterion are shown in Table 14. At a significance level of φ = 0.05, the null hypothesis that the compared methods perform equally is clearly rejected for all evaluation criteria. The Nemenyi test further investigates whether each method performs equally well against the others. Two methods differ significantly if the difference in their average ranks exceeds the critical difference $CD = q_\varphi \sqrt{\frac{k(k+1)}{6N}}$. With q_φ = 3.031, k = 8, and N = 28, the CD is 1.9842 for φ = 0.05. The CDs are outlined in Fig. 7. The compared methods whose average ranks lie within one CD of that of PML-ED are covered by a red line; uncovered methods thus have significantly worse performance than PML-ED. Looking at RL, for example, the average rank of PML-ED is 2.786 and the critical value after adding the CD is 4.770. Given that the average ranks of P-VLS, NATAL, and PAMB are greater than 4.770, they are classified as significantly worse. However, there is no statistical evidence to assert that PML-ED outperforms the remaining methods under RL.
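The critical difference computation can be reproduced directly (a small sketch of the Nemenyi CD formula with the values from the text; the helper function name is illustrative):

```python
import math

def nemenyi_cd(q_alpha, k, n):
    """Critical difference CD = q_alpha * sqrt(k * (k + 1) / (6 * N))."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))

def significantly_worse(avg_rank, control_rank, cd):
    """A method is significantly worse than the control method (PML-ED)
    if its average rank exceeds the control's rank by more than one CD."""
    return avg_rank > control_rank + cd
```

With q_φ = 3.031, k = 8, and N = 28, this yields CD ≈ 1.9842, matching the value used in Fig. 7.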

Limitation and discussion
As demonstrated in Section 4.4.4, although PML-ED runs faster than PML-LFC, NATAL, and PAMB on most datasets, it still has a long running time on some large-scale datasets, such as Eurlex-dc and Eurlex-sm. On these datasets, all compared methods exhibit long running times, which limits PML-ED's application in scenarios with strict runtime requirements.
The application of the proposed PML-ED method in scenarios involving sensitive data, such as healthcare, finance, or personal data management, raises several ethical implications that must be carefully considered. One primary concern is the protection of individual privacy. Given that PML-ED is capable of handling large datasets with noisy labels, there is a risk that sensitive information could be inadvertently revealed or misinterpreted, especially if the data includes personal or confidential details. As shown in Table 2, PML methods have been applied in the health domain, where data is sensitive, including clinical reports (medical dataset) and data on coronary heart disease (CHD-49 dataset). Improper handling of such data could raise ethical issues. Mitigating the privacy risk, especially in the context of advanced techniques like PML-ED, requires a multifaceted approach. Firstly, robust data anonymization is critical: identifiable information should be removed or encrypted so that individual data subjects cannot be easily recognized, and techniques like differential privacy can be employed to preserve overall data utility while protecting individual data points. Secondly, strict access controls and data governance policies are essential: access to sensitive data should be limited to authorized personnel only, with clear protocols for data handling, processing, and storage. Furthermore, incorporating transparency into PML methods involves developing models whose decisions can be interpreted and explained to non-experts. Explainable AI helps in understanding how and why certain data is used or a particular decision is made, which is important in scenarios where incorrect or biased decisions could have serious implications.

Conclusions
PML is widely applicable in real life but remains challenging in the field of machine learning, because noisy labels make the MLL problem more complex. In this work, we propose a novel method for PML based on the Encoder-Decoder framework (PML-ED). The method not only leverages the KNN label attention mechanism and the conditional layer normalization model to extract high-order label correlations, but also exhibits versatility across various application scenarios owing to the few prior assumptions within the Encoder-Decoder framework. Experimental results show that the proposed method achieves the highest average ranking across five evaluation criteria compared with other PML algorithms. Some extra effort will be required to improve and extend the PML-ED method in the future. To enhance its computational efficiency, DPC technology [12,25] could be utilized. Additionally, few-shot PML and the integration of advanced feature selection methods [8,9] with PML are other research issues that deserve deeper investigation.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
are the mean and variance of H_l, respectively. γ_c and β_c ∈ R^{1×h} represent the conditional gain and bias. W_γ, W_β ∈ R^{h×h} and b_γ, b_β ∈ R^{1×h} are the learnable parameters.
For PML-NI and PML-LFC, F_QP(a, b) represents the time complexity of a QP problem with a parameters and b constraint conditions. F_B(m, d) and F′_B(d) represent the time complexities of the training and prediction processes of the binary classifier B, respectively.

Fig. 7 .
Fig. 7. Comparison of PML-ED (control method) against other compared methods using the Nemenyi test.

Table 2
Basic information of PML datasets.

Table 3
Description of noise information in PML datasets.

Table 4
Results comparison for PML methods on HL criterion.

Table 5
Results comparison for PML methods on RL criterion.

Table 6
Results comparison for PML methods on OE criterion.

Table 7
Results comparison for PML methods on Cove criterion.

Table 8
Results comparison for PML methods on AP criterion.

Table 12
Comparison of two original sample feature masks on the datasets of yeast and Mirflickr.

Table 13
Comparison of two label semantic feature masks on the datasets of yeast and Mirflickr.

Table 14
Summary of the Friedman statistics F_F (k = 8, N = 28) and the critical value (k: #comparing methods; N: #datasets).