Dependence maximization based label space dimension reduction for multi-label classification

https://doi.org/10.1016/j.engappai.2015.07.023

Abstract

High dimensionality of the label space poses a crucial challenge to efficient multi-label classification, so it is necessary to reduce the dimensionality of the label space. In this paper, we propose a new algorithm, called dependence maximization based label space reduction (DMLR), which maximizes the dependence between feature vectors and code vectors via the Hilbert–Schmidt independence criterion while minimizing the encoding loss of labels. Two different kinds of instance kernel are discussed: the global kernel used in DMLRG exploits global information, whereas the local kernel used in DMLRL exploits locality information. Experimental results over six categorization problems validate the superiority of the proposed algorithm to state-of-the-art label space dimension reduction methods, improving performance at only a small additional time cost.

Introduction

During the last decade, multi-label classification has aroused the interest of researchers from both engineering and academia because of its wide range of real-world applications. In the multi-label setting, a document may be associated with multiple categories (Ji et al., 2010, Ueda and Saito, 2003); an image may be annotated with several concepts (Boutell et al., 2004). This is rather different from traditional single-label (binary or multi-class) classification, where each document is allowed to be associated with only one category.

A lot of algorithms have been proposed for multi-label classification (Zhang and Zhou, 2014). The current consensus is that label correlations play an important role and should be utilized for performance improvement (Dembczyński et al., 2010, Zhang and Zhang, 2010, Zhang and Zhou, 2014). Most algorithms therefore build their classification models on some label-correlation assumption, such as the ensemble of classifier chains (ECC) (Read et al., 2011) and calibrated label ranking (CLR) (Fürnkranz et al., 2008).

Although these algorithms achieve satisfactory results, they suffer from computational inefficiency in both training and testing; this holds even for the most intuitive approach, binary relevance (BR) (Boutell et al., 2004), which decomposes a multi-label classification problem into several independent binary classification problems, one for each label, following the one-versus-all (OVA) strategy (Hastie et al., 2009). This problem poses a rather crucial challenge to classification, especially when there are a large number of possible labels. Therefore, it is necessary to explore ways that balance classification performance and computational effort.
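As a point of reference, the following is a minimal sketch of binary relevance with an off-the-shelf binary learner; the base classifier (logistic regression) and the helper names are illustrative choices of ours, not the ones used in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def br_fit(X, Y):
    """Fit one independent binary classifier per label column of Y (entries in {-1, +1})."""
    models = []
    for l in range(Y.shape[1]):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X, Y[:, l])          # assumes each label column contains both classes
        models.append(clf)
    return models

def br_predict(models, X):
    """Stack the per-label predictions into an N x L label matrix."""
    return np.column_stack([m.predict(X) for m in models])
```

Because both training and prediction loop over all L labels, the cost grows linearly with the number of labels, which is exactly what makes LSDR attractive when L is large.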

Several algorithms for label space dimension reduction (LSDR) have been proposed along this avenue. They can be categorized into two groups: learning methods and reduction methods. The former group reduces the label space while jointly learning a classifier from the instances to the code vectors, for example multi-label prediction via compressed sensing (CS) (Hsu et al., 2009); the resulting classifier can then be used directly for prediction. However, in order to obtain a promising classifier, these methods often employ complicated algorithms in the learning part, which is again time-consuming. Therefore, the latter group is the mainstream in this avenue.

The latter group focuses on how to compress the label space efficiently and does not prescribe which learning algorithm to apply after compression. An exemplar is principal label space transformation (PLST) (Tai and Lin, 2012), which reduces the dimensionality of the label space simply by analyzing its principal components. A key problem for this group is how to utilize the instances, which remains an open question. Since the ultimate objective is classification, some methods use only a simple model from instances to code vectors, for instance, conditional PLST (CPLST) (Chen and Lin, 2012). Nevertheless, this strategy might be suboptimal because the simple model may over-fit, which negatively affects the subsequent learning stage.
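To make the reduction-method idea concrete, here is a rough sketch of PLST-style encoding and decoding via an SVD of the mean-shifted label matrix; variable names are ours, and the thresholding assumes labels in {-1, +1}.

```python
import numpy as np

def plst_encode(Y, k):
    """Encode an N x L label matrix into N x k code vectors (k < L)."""
    y_mean = Y.mean(axis=0)                       # shift by the mean label vector
    Z = Y - y_mean
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    Vk = Vt[:k].T                                 # L x k projection onto top-k principal directions
    return Z @ Vk, Vk, y_mean                     # code vectors, projection, shift

def plst_decode(T_pred, Vk, y_mean):
    """Round predicted code vectors back to a {-1, +1} label matrix."""
    Y_hat = T_pred @ Vk.T + y_mean
    return np.where(Y_hat >= 0, 1, -1)
```

A regressor is then trained from the instances to the k-dimensional code vectors, and its outputs are decoded back to label vectors at test time.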

In this paper, we propose a new LSDR method, called dependence maximization based label space reduction (DMLR), which belongs to the latter group of reduction methods. Different from previous reduction methods, it assumes that the objective function should consist of two components: an encoding loss and a dependence loss. The former measures the loss incurred by label compression, while the latter measures the dependence between instances and code vectors. Specifically, the encoding loss is the least-squares loss used in PLST, and the dependence loss is based on the Hilbert–Schmidt independence criterion (HSIC) (Gretton et al., 2005). Two different instance kernels are applied, yielding two methods: DMLRG, whose instance kernel exploits global information, and DMLRL, whose instance kernel exploits local information. Experimental results across six data sets from various application domains show that the two proposed algorithms outperform two state-of-the-art LSDR methods, PLST and CPLST, while saving considerable training and testing time compared with a simple representative multi-label classification method, BR. Moreover, DMLRL outperforms DMLRG in most cases and costs similar or less training time thanks to the sparsity of the instance kernel used in DMLRL.
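For readers unfamiliar with HSIC, the snippet below sketches the biased empirical HSIC estimate (Gretton et al., 2005) between an instance kernel K and a kernel on the code vectors; it illustrates only the dependence term, not the full DMLR optimization, and the choice of a linear kernel on the code matrix is an assumption on our part.

```python
import numpy as np

def hsic(K, T):
    """Biased empirical HSIC between an N x N instance kernel K and an N x k code matrix T."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    L = T @ T.T                                   # linear kernel on the code vectors (assumption)
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```

A larger value indicates stronger statistical dependence between the instances and the code vectors, which is the quantity DMLR seeks to maximize alongside minimizing the encoding loss.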

The rest of this paper is organized as follows. Section 2 presents a brief literature review of multi-label classification algorithms, with particular attention to LSDR methods and the HSIC. Section 3 describes the two proposed algorithms, DMLRG and DMLRL, in detail. Experimental results and discussion are given in Section 4. Finally, Section 5 concludes this paper and presents some directions for future work.

Section snippets

Related works

Since this paper focuses on LSDR methods, we present a brief literature review of multi-label classification in Section 2.1 and of existing LSDR methods in Section 2.2. Section 2.3 describes the dependence measurement criterion, HSIC, on which our proposed methods rely. For convenience of presentation, we first give the formulation of multi-label classification.

Let $D=\{(X_i, Y_i)\}_{i=1}^{N}$ be the training set with $N$ examples, where $X_i \in \mathbb{R}^d$ is the $i$th instance (or feature vector) and $Y_i \in \{-1, +1\}^L$ is the corresponding label vector.
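As a toy illustration of this notation (values made up), the instances stack into an $N \times d$ matrix and the label vectors into an $N \times L$ matrix with entries in {-1, +1}:

```python
import numpy as np

X = np.array([[0.2, 1.5, -0.3],
              [1.1, 0.0,  0.7]])                  # N = 2 instances, d = 3 features
Y = np.array([[+1, -1, -1, +1],
              [-1, +1, -1, +1]])                  # N = 2 label vectors, L = 4 labels
```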

DMLR: dependence maximization based label space reduction

In this section, we first present the DMLR method in detail in Section 3.1. Then, in Section 3.2, we discuss how to set the instance kernel K, which plays a key role in DMLR, for different purposes.
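The paper's exact kernel definitions appear in Section 3.2; as a hedged stand-in, the sketch below contrasts a dense global RBF kernel over all instance pairs with a sparse local kernel built from a k-nearest-neighbour graph. Function names and parameter defaults are ours.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.neighbors import kneighbors_graph

def global_instance_kernel(X, gamma=1.0):
    """Dense RBF kernel over all instance pairs (global information)."""
    return rbf_kernel(X, gamma=gamma)

def local_instance_kernel(X, k=10):
    """Sparse kernel keeping only k-nearest-neighbour entries (locality information)."""
    W = kneighbors_graph(X, n_neighbors=k, mode='connectivity')
    W = 0.5 * (W + W.T)                           # symmetrize the neighbourhood graph
    return W.toarray()
```

The sparsity of the local kernel is what allows DMLRL to match or beat DMLRG's training time, as noted in the introduction.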

Experiments and discussion

In this section, we conduct experiments to validate the effectiveness of the proposed methods, DMLRG and DMLRL. Section 4.1 gives the experimental settings and the compared methods. Some details on the data sets are given in Section 4.2, and the experimental results and discussion are presented in Section 4.3.

Conclusion and future work

In this paper, we assumed that the objective function in multi-label label space dimension reduction should consist of two components: a compression loss that measures the quality of label compression and a dependence loss that measures the dependence between the instances and the code vectors. Based on this scheme, we proposed dependence maximization based label space reduction (DMLR). It utilizes the compression loss used in PLST and CPLST, and introduces the HSIC as the dependence measure between instances and code vectors.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Grant nos. 61472305, 61070143, 61303034), the Science and Technology Project of Shaanxi Province, China (Grant no. 2015GY027), and the Fundamental Research Funds for the Central Universities (Grant no. SMC1405).

References (45)

  • Chen, Y.-N., Lin, H.-T., 2012. Feature-aware label space dimension reduction for multi-label classification. In:...
  • Cheng, W., et al., 2009. Combining instance-based learning and logistic regression for multi-label classification. Mach. Learn.
  • Cortes, C., et al., 2012. Algorithms for learning kernels based on centered alignment. J. Mach. Learn. Res.
  • Dembczyński, K., Cheng, W., Hüllermeier, E., 2010. Bayes optimal multilabel classification via probabilistic classifier...
  • Elisseeff, A., Weston, J., 2001. A kernel method for multi-labelled classification. In: Proceedings of Advances in...
  • Fukumizu, K., et al., 2009. Kernel dimension reduction in regression. Ann. Stat.
  • Fürnkranz, J., et al., 2008. Multilabel classification via calibrated label ranking. Mach. Learn.
  • Gretton, A., Bousquet, O., Smola, A.J., Schölkopf, B., 2005. Measuring statistical dependence with Hilbert–Schmidt...
  • Hall, M., et al., 2009. The WEKA data mining software: an update. SIGKDD Explorations.
  • Hastie, T., et al., 2009. The Elements of Statistical Learning.
  • Hotelling, H., 1936. Relations between two sets of variates. Biometrika.
  • Hsu, D., Kakade, S.M., Langford, J., Zhang, T., 2009. Multi-label prediction via compressed sensing. In: Proceedings of...