
Pattern Recognition

Volume 37, Issue 9, September 2004, Pages 1757-1771

Learning multi-label scene classification

https://doi.org/10.1016/j.patcog.2004.03.009

Abstract

In classic pattern recognition problems, classes are mutually exclusive by definition. Classification errors occur when the classes overlap in the feature space. We examine a different situation, occurring when the classes are, by definition, not mutually exclusive. Such problems arise in semantic scene and document classification and in medical diagnosis. We present a framework to handle such problems and apply it to the problem of semantic scene classification, where a natural scene may contain multiple objects such that the scene can be described by multiple class labels (e.g., a field scene with a mountain in the background). Such a problem poses challenges to the classic pattern recognition paradigm and demands a different treatment. We discuss approaches for training and testing in this scenario and introduce new metrics for evaluating individual examples, class recall and precision, and overall accuracy. Experiments show that our methods are suitable for scene classification; furthermore, our work appears to generalize to other classification problems of the same nature.

Introduction

In traditional classification tasks [1]:

Classes are mutually exclusive by definition. Let χ be the domain of examples to be classified, Y be the set of labels, and H be the set of classifiers mapping χ → Y. The goal is to find the classifier h ∈ H that maximizes the probability that h(x) = y, where y ∈ Y is the ground-truth label of x, i.e., y = argmax_i P(y_i | x).

Classification errors occur when the classes overlap in the selected feature space (Fig. 2a). Various classification methods have been developed to provide different operating characteristics, including linear discriminant functions, artificial neural networks (ANN), k-nearest-neighbor (k-NN), radial basis functions (RBF) and support vector machines (SVM) [1].
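
As a minimal illustration of the single-label decision rule above, the following Python sketch picks the class with the largest estimated posterior; the posterior values here are invented for the example.

import numpy as np

def classify_single_label(posteriors):
    """Return the index of the single most probable class.

    posteriors: 1-D array of estimated posteriors P(y_i | x), one per class.
    """
    return int(np.argmax(posteriors))

# Example with three classes; the second class has the highest posterior.
print(classify_single_label(np.array([0.2, 0.7, 0.1])))  # prints 1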

However, in some classification tasks, it is likely that some data belongs to multiple classes, causing the actual classes to overlap by definition. In text or music categorization, documents may belong to multiple genres, such as government and health, or rock and blues [2], [3]. Architecture may belong to multiple genres as well. In medical diagnosis, a disease may belong to multiple categories, and genes may have multiple functions, yielding multiple labels [4].

A problem domain receiving renewed attention is semantic scene classification [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], categorizing images into semantic classes such as beaches, sunsets or parties. Semantic scene classification finds application in many areas, including content-based indexing and organization and content-sensitive image enhancement.

Many current digital library systems allow a user to specify a query image and search for images “similar” to it, where similarity is often defined only by color or texture properties. This so-called “query by example” process has often proved inadequate [19]. Knowing the category of a scene helps narrow the search space dramatically, simultaneously increasing the hit rate and reducing the false alarm rate.

Knowledge about the scene category can also find application in content-sensitive image enhancement [16]. While an algorithm might enhance the quality of some classes of pictures, it can degrade others. Rather than applying a generic algorithm to all images, we could customize it to the scene type (allowing us, for example, to retain or enhance the brilliant colors of sunset images while reducing the warm-colored cast from tungsten-illuminated scenes).

In the scene classification domain, many images may belong to multiple semantic classes. Fig. 1(a) shows an image that had been classified by a human as a beach scene. However, it is clearly both a beach scene and an urban scene. It is not a fuzzy member of each (due to ambiguity), but is a full member of each class (due to multiplicity). Fig. 1(b) (beach and mountains) is similar.

Much research has been done on scene classification recently, e.g., [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18]. Most systems are exemplar-based, learning patterns from a training set using statistical pattern recognition techniques. A variety of features and classifiers have been proposed; most systems use low-level features (e.g., color, texture). However, none addresses the use of multi-label images.

When choosing their data sets, most researchers either avoid such images, label them subjectively with the base (single-label) class most obvious to them, or consider “beach+urban” as a new class. The last method is unrealistic in most cases because it would substantially increase the number of classes to be considered, and the data in such combined classes is usually sparse. The first two methods have limitations as well. For example, in content-based image indexing and retrieval applications, it would be more difficult for a user to retrieve a multiple-class image (e.g., beach+urban) if we only have exclusive beach or urban labels: two separate queries may have to be conducted and the intersection of the retrieved images taken. In a content-sensitive image enhancement application, it may be desirable for the system to have different settings for beach, urban, and beach+urban scenes. This is impossible using exclusive single labels.

In this work, we consider the following problem:

The base classes are non-mutually exclusive and may overlap by definition (Fig. 2b). As before, let χ be the domain of examples to be classified and Y be the set of labels. Now let B be a set of binary vectors, each of length |Y|. Each vector b ∈ B indicates membership in the base classes in Y (+1 = member, −1 = non-member). H is the set of classifiers mapping χ → B. The goal is to find the classifier h ∈ H that minimizes a distance (e.g., Hamming) between h(x) and b_x for a newly observed example x.

In a probabilistic formulation, the goal of classifying x is to find a set C of one or more base class labels and a threshold T such that P(c|x) > T, ∀c ∈ C.
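
As a minimal illustration of this formulation (the class names, threshold value, and posteriors below are invented for the example, not taken from the paper), the following Python sketch thresholds per-class posteriors into a ±1 membership vector and measures its Hamming distance to a ground-truth vector.

import numpy as np

CLASSES = ["beach", "urban", "field", "mountain"]  # hypothetical label set

def predict_labels(posteriors, threshold=0.5):
    """Return a +1/-1 membership vector: +1 where P(c|x) > T, else -1."""
    return np.where(posteriors > threshold, 1, -1)

def hamming_distance(predicted, truth):
    """Count the base classes on which prediction and ground truth disagree."""
    return int(np.sum(predicted != truth))

posteriors = np.array([0.81, 0.62, 0.07, 0.30])  # hypothetical P(c|x) values
truth = np.array([1, 1, -1, -1])                 # a beach + urban scene

pred = predict_labels(posteriors, threshold=0.5)
print(dict(zip(CLASSES, pred.tolist())))  # {'beach': 1, 'urban': 1, 'field': -1, 'mountain': -1}
print(hamming_distance(pred, truth))      # 0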

Clearly, the mathematical formulation and its physical meaning are distinctly different from those used in classic pattern recognition. Few papers address this problem (see Section 2), and most of these are specialized for text classification or bioinformatics. Based on the multi-label model, we investigate several methods of training and propose a novel training method, “cross-training”. We also propose three classification criteria for testing. Our experiments on scene classification show that the approach succeeds on multi-label images even without an abundance of training data. We also propose a generic evaluation metric that can be tailored to applications requiring different degrees of error forgiveness.

It is worth noting that multi-label classification is different from fuzzy logic-based classification. Fuzzy logic is used as a means to cope with ambiguity in the feature space between multiple classes for a given sample, not as an end for achieving multi-label classification. The fuzzy membership stems from ambiguity, and a de-fuzzification step is often eventually used to derive a crisp decision (typically by choosing the class with the highest membership value). For example, a foliage scene and a sunset scene may share some warm, bright colors; if color features are used, there is therefore confusion between the two scene classes in the selected feature space, and fuzzy logic would be suitable for handling this problem.

In contrast, multi-label classification is a unique problem in that a sample may possess multiple properties of multiple classes. The content for different classes can be quite distinct: for example, there is little confusion between beach (sand, water) and city (buildings).

The only commonality between fuzzy-logic classification and multi-label classification is the use of membership functions. However, fuzzy membership functions are correlated: when one membership value is high, the others tend to be low, and vice versa [20]. In the multi-label case, by contrast, the memberships are largely coincident (e.g., a resort on the beach). In practice, the sum of fuzzy memberships is usually normalized to 1, while no such constraint applies to the multi-label problem (e.g., a beach resort scene is both a beach scene and a city scene, each with full certainty).
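
As a small numerical sketch of this contrast (all membership values are invented): fuzzy memberships are normalized to sum to 1 and then de-fuzzified to a single label, whereas multi-label memberships are asserted independently and can both be high.

import numpy as np

# Fuzzy-style memberships for an ambiguous sunset/foliage image, normalized
# so that the values sum to 1; de-fuzzification would pick the argmax.
fuzzy = np.array([0.6, 0.4])
print(fuzzy.sum(), int(np.argmax(fuzzy)))  # 1.0 0

# Multi-label memberships for a beach-resort image: the scene is fully a
# beach scene and fully an urban scene, so both values can be near 1 and
# no sum-to-one constraint applies.
multi_label = np.array([0.95, 0.90])
print(multi_label.sum())  # about 1.85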

These differences aside, it is conceivable that one could use the learning strategies described in this paper in combination with a fuzzy classifier, much as they are used with the pattern classifiers in this study.

In this paper, we first review past work related to multi-label classification. In Section 3, we describe our training models and testing criteria. Section 4 contains the proposed evaluation methods. Section 5 contains the experimental results obtained by applying our approaches to multi-labeled scene classification. We conclude with a discussion and suggestions for future work.

Section snippets

Related work

The sparse literature on multi-label classification is primarily geared to text classification or bioinformatics. For text classification, Schapire and Singer [3] proposed BoosTexter, extending AdaBoost to handle multi-label text categorization. However, they note that controlling complexity due to overfitting in their model is an open issue. McCallum [2] proposed a mixture model trained by EM, selecting the most probable set of labels from the power set of possible classes and using heuristics

Multi-label classification

In this section, we describe possible approaches for training and testing with multi-label data. Consider two classes, denoted by ‘+’ and ‘x’ respectively. Examples belonging to both the ‘+’ and ‘x’ classes simultaneously are denoted by ‘*’ (see Fig. 2b).
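
As one plausible illustration (not necessarily the exact procedure compared in this section), the Python sketch below builds one binary training set per base class from such data, using a ‘*’ example as a positive example for both the ‘+’ and ‘x’ problems.

def one_vs_rest_training_sets(examples, classes=('+', 'x')):
    """Build one binary training set per base class from multi-label data.

    examples: list of (feature_vector, label_set) pairs, where label_set is a
    subset of `classes`; an example carrying both labels plays the role of '*'.
    Returns {class: list of (feature_vector, +1 or -1)}.
    """
    training_sets = {c: [] for c in classes}
    for features, labels in examples:
        for c in classes:
            # A multi-label ('*') example is kept as a positive example for
            # every base class it belongs to.
            training_sets[c].append((features, 1 if c in labels else -1))
    return training_sets

# Toy data: a pure '+' example, a pure 'x' example, and a '*' example.
data = [([0.1, 0.9], {'+'}), ([0.8, 0.2], {'x'}), ([0.5, 0.5], {'+', 'x'})]
for cls, rows in one_vs_rest_training_sets(data).items():
    print(cls, rows)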

Evaluating multi-label classification results

Evaluating the performance of multi-label classification is different from evaluating the performance of classic single-label classification. Standard evaluation metrics include precision, recall, accuracy, and F-measure [29]. In multi-label classification, the evaluation is more complicated, because a result can be fully correct, partly correct, or fully incorrect. Take an example belonging to classes c1 and c2. We may get one of the following results (a minimal scoring sketch follows the list):

  1. c1, c2 (correct),
  2. c1 (partly correct),
  3. c1, c3
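
The metric proposed in Section 4 is not reproduced in this snippet; as one common way to grade fully correct, partly correct, and fully incorrect predictions on a per-example basis, the following Python sketch uses a Jaccard-style overlap score (an illustrative assumption, not necessarily the paper's metric).

def example_score(true_labels, predicted_labels):
    """Per-example score in [0, 1]: 1 for a fully correct label set, 0 when no
    labels match, and partial credit in between (Jaccard-style overlap)."""
    true_set, pred_set = set(true_labels), set(predicted_labels)
    if not true_set and not pred_set:
        return 1.0
    return len(true_set & pred_set) / len(true_set | pred_set)

truth = {"c1", "c2"}
print(example_score(truth, {"c1", "c2"}))  # 1.0   -> fully correct
print(example_score(truth, {"c1"}))        # 0.5   -> partly correct
print(example_score(truth, {"c1", "c3"}))  # ~0.33 -> partly correct
print(example_score(truth, {"c3"}))        # 0.0   -> fully incorrect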

Experimental results

We applied the above training and testing methods to semantic scene classification. As discussed in the Introduction, scene classification finds application in many areas, including content-based image analysis and organization and content-sensitive image enhancement. We now describe our baseline classifier and features and present the results.

Discussions

As shown in Table 1, some combined classes contain very few examples. The above experimental results show that the increase in accuracy due to the cross-training model is statistically significant; furthermore, these good multi-label results are produced even without an abundance of training data.

We now analyze the results obtained by using C-criterion and cross-training.1 The

Conclusions and future work

In this paper, we have presented an extensive comparative study of possible approaches to training and testing in multi-label classification. In particular, we contribute the following:

  • Cross-training, a new training strategy to build classifiers. Experimental results show that cross-training is more efficient in using training data and more effective in classifying multi-label data.

  • The C-criterion, using a threshold selected by the MAP principle, is effective for multi-label classification. Other

Acknowledgements

Boutell and Brown were supported by a grant from Eastman Kodak Company, by the NSF under Grant Number EIA-0080124, and by the Department of Education (GAANN) under Grant Number P200A000306. Shen was supported by DARPA under Grant Number F30602-03-2-0001.


References (34)

  • J.R. Smith et al., Image classification and querying using composite region templates, Comput. Vision Image Understanding (1999).
  • R. Duda et al., Pattern Classification (2001).
  • A. McCallum, Multi-label text classification with a mixture model trained by EM, in: AAAI’99 Workshop on Text Learning, ...
  • R. Schapire et al., BoosTexter: a boosting-based system for text categorization, Mach. Learning (2000).
  • A. Clare et al., Knowledge Discovery in Multi-label Phenotype Data (2001).
  • M. Boutell, J. Luo, R.T. Gray, Sunset scene classification using simulated image recomposition, in: International ...
  • C. Carson, S. Belongie, H. Greenspan, J. Malik, Recognition of images in large databases using a learning framework, ...
  • J. Fan, Y. Gao, H. Luo, M.-S. Hacid, A novel framework for semantic image classification and benchmark, in: ACM SIGKDD ...
  • Q. Iqbal et al., Retrieval by classification of images containing large manmade objects using perceptual grouping, Pattern Recognition (2001).
  • P. Lipson, E. Grimson, P. Sinha, Configuration based scene classification and image indexing, 1997, Proc. IEEE ...
  • A. Oliva et al., Modeling the shape of the scene: a holistic representation of the spatial envelope, Int. J. Comput. Vision (2001).
  • A. Oliva, A. Torralba, Scene-centered description from spatial envelope properties, in: Second Workshop on Biologically ...
  • S. Paek, S.-F. Chang, A knowledge engineering approach for image classification based on probabilistic reasoning ...
  • N. Serrano, A. Savakis, J. Luo, A computationally efficient approach to indoor/outdoor scene classification, in: ...
  • Y. Song, A. Zhang, Analyzing scenery images by monotonic tree, ACM Multimedia Systems J. 8 (6) 495–511, ...
  • M. Szummer, R.W. Picard, Indoor–outdoor image classification, in: IEEE International Workshop on Content-based Access ...
  • A. Torralba, P. Sinha, Recognizing indoor scenes, Technical Report, AI Memo 2001-015, CBCL Memo 202, MIT, July ...

    About the Author: MATTHEW BOUTELL received the B.S. degree (with High Distinction) in Mathematical Science from Worcester Polytechnic Institute in 1993 and the M.Ed. degree from the University of Massachusetts in 1994. Currently, he is a Ph.D. student in Computer Science at the University of Rochester. He served for several years as a mathematics and computer science instructor at Norton High School and at Stonehill College. His research interests include computer vision, pattern recognition, probabilistic modeling, and image understanding. He is a student member of the IEEE.

    About the Author: JIEBO LUO received his Ph.D. degree in Electrical Engineering from the University of Rochester in 1995. He is currently a Senior Principal Research Scientist in the Eastman Kodak Research Laboratories. His research interests include image processing, pattern recognition, and computer vision. He has authored over 80 technical papers and holds 20 granted US patents. Dr. Luo was the Chair of the Rochester Section of the IEEE Signal Processing Society in 2001, and the General Co-Chair of the IEEE Western New York Workshop on Image Processing in 2000 and 2001. He was also a member of the Organizing Committee of the 2002 IEEE International Conference on Image Processing and a Guest Co-Editor for the Journal of Wireless Communications and Mobile Computing Special Issue on Multimedia Over Mobile IP. Currently, he is serving as an Associate Editor of the journals Pattern Recognition and Journal of Electronic Imaging, an adjunct faculty member at Rochester Institute of Technology, and an At-Large Member of the Kodak Research Scientific Council. Dr. Luo is a Senior Member of the IEEE.

    About the Author: XIPENG SHEN received the M.S. degree in Computer Science from the University of Rochester in 2002 and the M.S. degree in Pattern Recognition and Intelligent Systems from the Chinese Academy of Sciences. He is currently a Ph.D. graduate student in the Department of Computer Science, University of Rochester. His research interests include image processing, machine learning, program analysis and optimization, and speech and language processing.

    About the Author: CHRISTOPHER BROWN (B.A. Oberlin 1967, Ph.D. University of Chicago 1972) is Professor of Computer Science at the University of Rochester, where he has been since finishing a postdoctoral fellowship at the School of Artificial Intelligence at the University of Edinburgh in 1974. He is coauthor of COMPUTER VISION with his Rochester colleague Dana Ballard. His current research interests are computer vision and robotics, integrated parallel systems performing animate vision (the interaction of visual capabilities and motor behavior), and the integration of planning, learning, sensing, and control.

    A short version of this paper was published in the Proceedings of the SPIE 2004 Electronic Imaging Conference.
