
1 Introduction

Multi-Label Classification (MLC), which allows an instance to have more than one label at the same time, has recently received a surge of interest in a variety of fields and applications [10, 15]. The main task of MLC is to learn the relationship between an F-dimensional feature vector \(\varvec{x}\) and an L-dimensional binary vector \(\varvec{y}\) from N training instances \(\{(\varvec{x}^{(1)},\varvec{y}^{(1)}),\dots ,(\varvec{x}^{(N)},\varvec{y}^{(N)})\}\), and to predict a binary vector \(\hat{\varvec{y}} \in \{0,1\}^{L}\) for a test instance \(\varvec{x} \in \mathbb {R}^{F}\). To simplify the notation, we use a matrix \(\mathbf {X}=[\varvec{x}^{(1)},\varvec{x}^{(2)},\dots ,\varvec{x}^{(N)}]^{T} \in \mathbb {R}^{N \times F}\) and a matrix \(\mathbf {Y}=[\varvec{y}^{(1)},\varvec{y}^{(2)},\dots ,\varvec{y}^{(N)}]^{T} \in \{0,1\}^{N\times L}\) to express the training set.

A key issue of learning in MLC is how to utilize the dependency between labels [10]. However, an excessive treatment of label dependency causes over-fitting and increases complexity, sometimes to the point of intractability. Thus, many algorithms have been proposed to model the label dependency efficiently and effectively. Embedding is one such approach for MLC. This type of method utilizes label dependency through dimension reduction: the dependency is exploited by reducing the dimension of the label space from L to K (\(\ll \) L). Embedding methods in general learn the relationships between instances in the F-dimensional space and latent labels in a K-dimensional space, and then linearly transform these relationships back to those between instances in the F-dimensional space and the real labels in the L-dimensional space [4–6, 8, 12, 16].

In this paper, we propose a novel nonlinear embedding method. Usually, either the set of labels or the set of instances is embedded [4–6, 8, 16], but in our method, both are embedded at the same time. We realize a mapping into a low-dimensional Euclidean space that keeps three kinds of relationships, instance-instance, label-label and label-instance, as faithfully as possible. In addition, for classification, both a linear and a nonlinear mapping of a test instance are realized.

2 The Proposed Embedding

2.1 Objective Function

In contrast to traditional embedding methods, we explicitly embed both labels and instances into the same K-dimensional space (\(K < F\)) while preserving the relationships among labels and instances. To preserve such relationships, we use a manifold learning method called the Laplacian eigen map [1]. It keeps the distance or the degree of similarity between any pair of points or objects even in a low-dimensional space. For example, given a similarity measure \(\mathbf {W}_{ij}\) between two objects indexed by i and j, we find \(\varvec{z}^{(i)}\) and \(\varvec{z}^{(j)}\) in \(\mathbb {R}^{K}\) so as to minimize \(\sum _{i,j} \mathbf {W}_{ij} \Vert \varvec{z}^{(i)} -\varvec{z}^{(j)}\Vert ^{2}_{2}\) under an appropriate constraint for scaling.

Now we consider embedding both instances and labels at once. Let \(\varvec{g}^{(i)} \in \mathbb {R}^{K}\) be the low-dimensional representation of the ith instance \(\varvec{x}^{(i)}\) in the embedding space, and let \(\varvec{h}^{(l)} \in \mathbb {R}^{K}\) be the representation of the lth label in the same space. In this embedding, we consider three types of relationships: instance-label, instance-instance and label-label. We quantify these relationships by focusing on their localities. More specifically, we realize a mapping that preserves the following three kinds of properties of the training set:

  1. Instance-Label (IL) relationship: The explicit relationship given by (\(\varvec{x}^{(i)},\varvec{y}^{(i)}\)) (\(i=1,\dots ,N\)) should be kept in the embedding as closeness between \(\varvec{g}^{(i)}\) and \(\varvec{h}^{(l_{i})}\), where \(l_{i}\) is any label taking value one in \(\varvec{y}^{(i)}\).

  2. Label-Label (LL) relationship: Frequently co-occurring label pairs should be placed closer together in the embedded space \(\mathbb {R}^{K}\).

  3. Instance-Instance (II) relationship: Instances that are close in \(\mathbb {R}^{F}\) should also be placed close together in \(\mathbb {R}^{K}\).

Let us denote them by \(\mathbf {W}^{(IL)} \in \mathbb {R}^{N \times L}\), \(\mathbf {W}^{(LL)} \in \mathbb {R}^{L \times L}\) and \(\mathbf {W}^{(II)} \in \mathbb {R}^{N \times N}\), respectively. Then our objective function of \(\{\varvec{g}^{(i)},\varvec{h}^{(l)}\}\) becomes, with \(\alpha ,\beta \) (>0),

$$\begin{aligned} O&= 2O_{IL}+ \alpha O_{II}+ \beta O_{LL} \nonumber \\&= 2\sum _{i,l} \mathbf {W}_{il}^{(IL)}\Vert \varvec{g}^{(i)}-\varvec{h}^{(l)} \Vert _{2}^{2} + \alpha \sum _{i,j} \mathbf {W}_{ij}^{(II)}\Vert \varvec{g}^{(i)}-\varvec{g}^{(j)} \Vert _{2}^{2} + \beta \sum _{l,m} \mathbf {W}_{lm}^{(LL)}\Vert \varvec{h}^{(l)}-\varvec{h}^{(m)} \Vert _{2}^{2} \\&= \sum _{s,t} \mathbf {W}_{st} \Vert \varvec{e}^{(s)} - \varvec{e}^{(t)} \Vert ^{2}_{2} \qquad (s,t=1,2,\dots ,N+L), \nonumber \end{aligned}$$
(1)

where \(\varvec{e}^{(s)} = \varvec{g}^{(s)}\) for \(s \le N\) and \(\varvec{e}^{(s)} = \varvec{h}^{(s-N)}\) for \(s > N\), and \(\mathbf {W}_{st}\) is the corresponding entry of \(\mathbf {W}^{(IL)}\), \(\alpha \mathbf {W}^{(II)}\) or \(\beta \mathbf {W}^{(LL)}\), depending on the values of s and t. As their matrix representation, let us use \(\mathbf {G}=[ \varvec{g}^{(1)},\dots , \varvec{g}^{(N)}]^{T} \in \mathbb {R}^{N\times K}\) and \(\mathbf {H}=[ \varvec{h}^{(1)},\dots , \varvec{h}^{(L)}]^{T} \in \mathbb {R}^{L\times K}\). Then, using \(\mathbf {E}=[\mathbf {G}^{T},\mathbf {H}^{T}]^{T} \in \mathbb {R}^{(N+L)\times K}\), our objective function is rewritten as

$$\begin{aligned} O = \sum _{s,t} \mathbf {W}_{st} \Vert \varvec{e}^{(s)} - \varvec{e}^{(t)} \Vert ^{2}_{2} = 2 \text {Tr}(\mathbf {E}^{T}\mathbf {L}\mathbf {E}), \ \ \text {s.t.} \ \mathbf {E}^{T}\mathbf {DE}=\mathbf {I} \end{aligned}$$
(2)

where \(\mathbf {L}=\mathbf {D}-\mathbf {W}\) and \(\mathbf {D}\) is a diagonal matrix with elements \(\mathbf {D}_{ii}=\sum _{j}\mathbf {W}_{ij}\) [1]. The constraint \(\mathbf {E}^{T}\mathbf {DE}=\mathbf {I}\) is imposed to remove an arbitrary scaling factor in the embedding. This formulation is that of the Laplacian eigen map. Next, let us explain how to determine the similarity matrix \(\mathbf {W}\).
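Before detailing the three blocks of \(\mathbf {W}\), the following minimal sketch (in Python; the function and variable names are ours, not from the original implementation) shows how the joint matrix \(\mathbf {W}\), the degree matrix \(\mathbf {D}\) and the Laplacian \(\mathbf {L}\) could be assembled once \(\mathbf {W}^{(IL)}\), \(\mathbf {W}^{(II)}\) and \(\mathbf {W}^{(LL)}\) are available. Note that placing \(\mathbf {W}^{(IL)}\) in both off-diagonal blocks is what produces the factor 2 in front of \(O_{IL}\) in (1).

```python
import numpy as np

def build_laplacian(W_IL, W_II, W_LL, alpha, beta):
    """Assemble the joint (N+L) x (N+L) similarity matrix W, the degree
    matrix D and the graph Laplacian L = D - W (a sketch, not the
    authors' reference code)."""
    N, L = W_IL.shape
    W = np.zeros((N + L, N + L))
    W[:N, :N] = alpha * W_II      # instance-instance block
    W[:N, N:] = W_IL              # instance-label block
    W[N:, :N] = W_IL.T            # label-instance block (keeps W symmetric)
    W[N:, N:] = beta * W_LL       # label-label block
    D = np.diag(W.sum(axis=1))    # D_ii = sum_j W_ij
    return W, D, D - W
```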

Fig. 1. The result of the proposed embedding on the Scene dataset. Only 20 % of the instances are displayed. The numbers indicate the labels \(1,\dots ,6\), and small crosses show the instances. (Color figure online)

Instance-Label Relationship: For the instance-label relationship, we simply use \(\mathbf {W}^{(IL)}=\mathbf {Y}\), so \(\mathbf {W}^{(IL)}\) has elements of zero or one. The corresponding objective function for the instance-label relationship becomes:

$$\begin{aligned} O_{IL}= \sum _{\begin{array}{c} i=1,\dots ,N \\ l=1,\dots ,L \end{array}} \mathbf {W}^{(IL)}_{il}\Vert \varvec{g}^{(i)} -\varvec{h}^{(l)}\Vert ^{2}_{2}, \end{aligned}$$

where \(\mathbf {W}^{(IL)}_{il}=\mathbf {Y}_{il} \in \{0,1\}\).

Instance-Instance Relationship: We use the symmetric k-nearest neighbor relation in \(\mathbb {R}^{F}\) for constructing \(\mathbf {W}^{(II)}\) as seen in [3]. Thus, our second objective function becomes

$$\begin{aligned} O_{II}= \sum _{i,j=1,\dots , N} \mathbf {W}^{(II)}_{ij}\Vert \varvec{g}^{(i)} -\varvec{g}^{(j)}\Vert ^{2}_{2}, \end{aligned}$$

where

$$\begin{aligned} \mathbf {W}^{(II)}_{ij}= {\left\{ \begin{array}{ll} 1 &{} (i \in \mathcal{{N}}_{k}(\varvec{x}^{(j)}) \vee j \in \mathcal{{N}}_{k}(\varvec{x}^{(i)}) ),\\ 0 &{} \text {(otherwise),} \end{array}\right. } \end{aligned}$$

where \(\mathcal{{N}}_{k}(\varvec{x}^{(i)}) \) denotes the index set of k nearest neighbors of the ith instance. It is worth noting that we can construct \(\mathbf {W}^{(II)}\) on the basis of the similarity between \(\varvec{y}^{(i)}\) and \(\varvec{y}^{(j)}\) as seen in [3] instead of that between \(\varvec{x}^{(i)}\) and \(\varvec{x}^{(j)}\) above.
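As an illustration, a symmetric k-NN similarity matrix of this kind could be built as in the sketch below (using scikit-learn's NearestNeighbors; the helper name is ours and this is only one possible realization of the definition above).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def instance_instance_similarity(X, k):
    """Binary symmetric k-NN graph over the N rows of X (N x F)."""
    N = X.shape[0]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own neighbor
    _, idx = nn.kneighbors(X)
    W = np.zeros((N, N))
    for i in range(N):
        W[i, idx[i, 1:]] = 1.0    # mark the k nearest neighbors of instance i
    return np.maximum(W, W.T)     # W_ij = 1 if i in N_k(x_j) or j in N_k(x_i)
```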

Label-Label Relationship: We construct \(\mathbf {W}^{(LL)}\) in such a way that \(\mathbf {W}^{(LL)}_{lm}\) takes a large value when labels l and m co-occur frequently in \(\mathbf {Y}\), and a small value otherwise. Here we again use the symmetric k-nearest neighbor relation, this time with respect to the co-occurrence frequency. The corresponding third objective function becomes

$$\begin{aligned} O_{LL}= \sum _{l,m=1,\dots ,L} \mathbf {W}^{(LL)}_{lm}\Vert \varvec{h}^{(l)} -\varvec{h}^{(m)}\Vert ^{2}_{2}, \end{aligned}$$

where

$$\begin{aligned} \mathbf {W}^{(LL)}_{lm}= {\left\{ \begin{array}{ll} 1 &{} \text {(if}\ l\ \text {is one of top-}k\ \text {co-occurrence labels of}\ m\ \text {and vice versa)},\\ 0 &{} (\text {otherwise}). \end{array}\right. } \end{aligned}$$

Note that \(\mathbf {W}^{(LL)}\) is symmetric, as is \(\mathbf {W}^{(II)}\). Their symmetry guarantees the existence of a solution of (2).
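A possible construction of \(\mathbf {W}^{(LL)}\) from the label co-occurrence counts \(\mathbf {Y}^{T}\mathbf {Y}\) is sketched below; the top-k selection and the symmetrization by a maximum (mirroring the construction of \(\mathbf {W}^{(II)}\)) are our reading of the definition above, and the function name is ours.

```python
import numpy as np

def label_label_similarity(Y, k):
    """Binary symmetric top-k co-occurrence graph over the L labels."""
    C = Y.T @ Y                          # C_lm = number of co-occurrences of labels l and m
    np.fill_diagonal(C, 0)               # ignore self co-occurrence
    L = C.shape[0]
    W = np.zeros((L, L))
    for l in range(L):
        top = np.argsort(C[l])[::-1][:k] # k labels co-occurring most frequently with l
        W[l, top] = 1.0
    return np.maximum(W, W.T)            # symmetrize
```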

The solution of (2) is obtained by solving the following generalized eigen problem:

$$\begin{aligned} \mathbf {L}\mathbf {E}=\lambda \mathbf {D}\mathbf {E}. \end{aligned}$$
(3)

Hence, the optimal solution \(\mathbf {E}\) of the objective function consists of the eigenvectors corresponding to the K smallest eigenvalues, excluding the trivial eigenvector with zero eigenvalue [1].
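A minimal sketch of this step, assuming the dense matrices from the earlier sketch, is given below: scipy's generalized symmetric eigensolver returns eigenvectors satisfying the scaling constraint \(\mathbf {E}^{T}\mathbf {DE}=\mathbf {I}\), the trivial zero-eigenvalue eigenvector is discarded, and the next K columns are split into the instance part \(\mathbf {G}\) and the label part \(\mathbf {H}\). The function name is ours.

```python
from scipy.linalg import eigh

def embed(Lap, D, N, K):
    """Solve Lap e = lambda D e; return G (N x K) and H (L x K)."""
    vals, vecs = eigh(Lap, D)     # ascending eigenvalues, D-orthonormal eigenvectors
    E = vecs[:, 1:K + 1]          # drop the trivial zero-eigenvalue eigenvector
    return E[:N], E[N:]           # instance embeddings G, label embeddings H
```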

An example of this embedding is shown in Fig. 1. This is the result of the mapping for the Scene dataset [11], where \(N=2407\), \(F=294\), \(L=6\) and \(K=2\). In Fig. 1, we can see that the instance-label, instance-instance and label-label relations are fairly well preserved. First, for the instance-label relationship, four instances that share the label subset \(\{3,4\}\) (large brown dots) are mapped between labels 3 and 4. Second, for the label-label relationship, the frequently co-occurring labels 1, 5 and 6 are mapped close together (highlighted by a circle). Finally, for the instance-instance relationship, an instance and its k nearest neighbors (\(k=2\)) in the original F-dimensional space (a blue square and 2 blue diamonds) are placed close together.

Algorithm 1 (pseudo-code)

2.2 Embedding Test Instances

To assign labels to a test instance, we need to embed it into the same low-dimensional space constructed from the training instances and their labels. Unfortunately, the above embedding is not realized as an explicit function, so we have no direct way of mapping a new instance. Therefore, we propose two different ways: a linear mapping and a nonlinear mapping.

In the linear mapping, we approximate the nonlinear mapping from \(\mathbf {X}\) to \(\mathbf {G}\) (the former part of \(\mathbf {E}\)) by a linear mapping \(\mathbf {V}\) such that \( \mathbf {G} \simeq \hat{\mathbf {G}}=\mathbf {XV}\). We use ridge regression to find such a \(\mathbf {V}\):

$$\begin{aligned} \min _{\mathbf {V}} \Vert \mathbf {X}\mathbf {V} - \mathbf {G} \Vert ^{2}_{2} + \lambda \Vert \mathbf {V}\Vert ^{2}_{2}, \end{aligned}$$

where \(\lambda \) is a regularization parameter. A test instance \(\varvec{x}\) is then mapped to \(\varvec{g}=\varvec{x}^{T}\mathbf {V}\).
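A sketch of this linear test-time mapping, using the closed-form ridge solution \(\mathbf {V}=(\mathbf {X}^{T}\mathbf {X}+\lambda \mathbf {I})^{-1}\mathbf {X}^{T}\mathbf {G}\) (function names are ours):

```python
import numpy as np

def fit_linear_map(X, G, lam):
    """Ridge regression: argmin_V ||X V - G||^2 + lam ||V||^2."""
    F = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(F), X.T @ G)

def embed_test_linear(x, V):
    """Map a test instance x (length F) to g = x^T V (length K)."""
    return x @ V
```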

Algorithm 2 (pseudo-code)

In the nonlinear mapping, we again use the k-nearest neighbor relation, this time for the test instance \(\varvec{x}\). We map \(\varvec{x}\) to \(\varvec{g}\), the mean of the embedded points of its k nearest neighbors among the training instances:

$$\begin{aligned} \varvec{g}=\mathop {\text {argmin}}\limits _{\varvec{g}'} \frac{1}{k}\sum _{i \in \mathcal{{N}}_{k}(\varvec{x})} \Vert \varvec{g}' - \varvec{g}^{(i)}\Vert ^{2}_{2}=\frac{1}{k}\sum _{i \in \mathcal{{N}}_{k}(\varvec{x})} \varvec{g}^{(i)}. \end{aligned}$$
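Correspondingly, a sketch of the nonlinear test-time mapping (again with scikit-learn's NearestNeighbors; names are ours):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def embed_test_nonlinear(x, X_train, G, k):
    """Map a test instance x to the mean embedding of its k nearest
    training instances."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = nn.kneighbors(x.reshape(1, -1))
    return G[idx[0]].mean(axis=0)
```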

Since the objective function (2) is solved by the Laplacian eigen map [1], we name the proposed method Multi-Label classification using Laplacian Eigen Map (MLLEM for short). The combined pseudo-code of MLLEM-L (linear mapping of a test instance) and MLLEM-NL (nonlinear mapping of a test instance) is described in Algorithms 1 and 2.

2.3 Computational Complexity

The training procedure of the proposed algorithm (Algorithm 1) can be divided into two parts. The first part constructs k-NN graphs for labels and instances (Steps 3 and 4), in \(O(NL^{2})\) for labels and \(O(FN^{2})\) for instances, respectively. The second part solves the generalized eigen problem (Step 6), which takes \(O((N+L)^{3})\). However, it is known that this cost can be largely reduced when the matrix \(\mathbf {W}\) is sparse and only a small number K of eigenvectors are needed [9]. Therefore, the complexity of the proposed algorithm can be estimated as \(O(NL^{2}+FN^{2})\). This complexity is the same as that of almost all embedding methods, including the methods compared in the experiments.
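As a rough illustration of the sparse case mentioned above, the generalized eigen problem could be solved with a sparse shift-invert eigensolver instead of a dense one (a sketch assuming \(\mathbf {W}\) is stored as a scipy sparse matrix; the small diagonal shift is only there to keep the factorization well-posed, since \(\mathbf {L}\) itself is singular):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def embed_sparse(W, N, K):
    """Smallest K nontrivial generalized eigenpairs of (L, D) for sparse W."""
    d = np.asarray(W.sum(axis=1)).ravel()
    D = sp.diags(d)
    Lap = D - W
    vals, vecs = eigsh(Lap + 1e-9 * sp.eye(W.shape[0]), k=K + 1, M=D,
                       sigma=0, which='LM')   # shift-invert around 0
    order = np.argsort(vals)
    E = vecs[:, order][:, 1:]                 # drop the trivial eigenvector
    return E[:N], E[N:]
```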

In the testing phase, the linear embedding needs \(O(F^{2}N)\) for the ridge regression. In contrast, the nonlinear embedding needs only O(FN) per test instance, which is faster than the linear embedding.

3 Related Work

Label embedding methods for MLC utilize label dependency via the low-rank structure of an embedding space. Recently, several methods based on traditional factorizations [4, 6, 8] and on regressions with various loss functions [12, 13] have been proposed. A method based on Canonical Correlation Analysis [16] is another example. It conceptually embeds both instances and labels at the same time, as the proposed MLLEM does. However, it conducts only a one-sided embedding in the actual classification process, because the linear regression applied after the embedding absorbs the embedding of the other side. Although all of these methods utilize a low-rank structure and succeed in improving classification accuracy, they are limited to linear transformations. In contrast, our MLLEM utilizes label dependency in a nonlinear way, so it is more flexible in its mapping. On the other hand, we have to be careful about overfitting when we use nonlinear mappings. In MLLEM, the nonlinear mapping relies only on the similarity matrices \(\mathbf {W}^{(IL)}\), \(\mathbf {W}^{(II)}\) and \(\mathbf {W}^{(LL)}\), so the risk of overfitting is limited to some extent.

Bhatia et al. proposed a linear embedding method for instances [3]. In their embedding, only the locality of instances in the label space is considered, and ML-KNN [14] is applied in the low-dimensional space. In the sense of using locality, the proposed MLLEM is close to theirs, but it differs in that the label-instance, label-label and instance-instance relationships are all taken into consideration at the same time.

4 Experiments

4.1 Setting

We conducted experiments on thirteen benchmark datasets [11] (Table 1). Each dataset was randomly split into 67 % training instances and 33 % test instances. On the large datasets (delicious, bookmarks and mediamill), we randomly sampled 5000 instances (4000 for training and 1000 for testing) according to [6].

Since all embedding methods return scores of labels rather than a label subset, we used the Area Under the ROC Curve (AUC) and top-k precision to evaluate the results [13]. AUC is a popular criterion for evaluating the ranking of all labels; we computed it from the ROC curve of the true-positive rate against the false-positive rate. We also report Top-1 precision [3].

Table 1. Datasets used in the experiments. \(F_{nnz}\) and \(L_{nnz}\) are the average numbers of non-zero features and labels per instance in the corresponding set, respectively.

We compared the following three state-of-the-art embedding methods with our proposals:

  1. Low-rank Empirical risk minimization for MLC (LEML) [13]

  2. Feature-aware Implicit Encoding (FaIE) [6]

  3. Sparse Local Embedding for Extreme Multi-label Classification (SLEEC) [3]

  4. Proposal with linear embedding (MLLEM-L) and nonlinear embedding (MLLEM-NL)

The proposed MLLEM has five parameters: the numbers of nearest neighbors \(k^{(I)}\) for instances and \(k^{(L)}\) for labels, the weighting parameters \(\alpha \) for \(\mathbf {W}^{(II)}\) and \(\beta \) for \(\mathbf {W}^{(LL)}\), and the dimension K of the embedding space. On all datasets, we used \(K=20\). All the other parameters were tuned by five-fold cross-validation on the training set. The code is available at the authors’ web site. For CPLST and FaIE, we set the dimension of the label space to \(80\,\%\) of the number of labels, following the setting in [3]; the other parameters were tuned as well. We used the implementations provided by the authors [13], [6]. For SLEEC, we set the number of dimensions to \(K=100\), following their setting [3]. We tuned the number of nearest neighbors for the embedding and the number of neighbors for ML-KNN after embedding; the other parameters, such as the smoothing parameter in their regression, were set to the defaults they used. We also used the implementation provided by the authors.

Table 2. Results
Table 3. Training time (\(K=20\)).

4.2 Results

Table 2 shows the average AUC and Top-1 precision on all thirteen benchmark datasets (the larger, the better). SLEEC was the best in AUC, followed by the proposed MLLEM. In Top-1 precision, MLLEM was the best, especially on relatively small-scale datasets. This difference can possibly be explained by the objective function of MLLEM: it ignores the distance between two objects that have no local similarity relation to each other. Therefore, such two objects are occasionally placed close together in the embedding space even though they are not similar. This hurts AUC, which takes the overall ranking into account, whereas for the top-k labels such an adverse effect seldom occurs.

MLLEM was superior to SLEEC, even in AUC, when the number N of instances is relatively small. This is probably because SLEEC considers only the locality of instances (the instance-instance relationship). When the number of available instances is limited, the instance-instance relationship alone is not enough to capture the relationship between features and labels. From this viewpoint, MLLEM is the best choice for small- to medium-sample-size problems (Table 3).

5 Discussion

Since the proposed MLLEM uses the Laplacian eigen map for the nonlinear embedding, there are several ways to increase its scalability, such as an incremental method [2], the Nyström approximation or column sampling [9], and efficient k-NN graph construction [7]. Note that the framework used in MLLEM is very general, owing to the freedom in the choice of the matrix \(\mathbf {W}\) (e.g., using a heat kernel). It can also handle categorical features. This generality is not shared by SLEEC [3].

6 Conclusion

In this paper, we have proposed an embedding-based approach for multi-label classification. The proposed algorithm takes three relationships into consideration: the label-instance, label-label and instance-instance relationships, and realizes a nonlinear mapping. All three relationships are preserved in the embedded low-dimensional space as closeness between instances and individual labels. We have shown that the algorithm is useful for visualizing instances and labels at the same time, which helps us to understand a given multi-label problem, especially how strongly the labels are related to each other. Linear and nonlinear mappings of test instances have also been proposed for classification. In the experiments, the proposed algorithm outperformed the other state-of-the-art methods on datasets with a small number of samples.