Learning multi-label scene classification☆
Introduction
In traditional classification tasks [1]:
Classes are mutually exclusive by definition. Let χ be the domain of examples to be classified, Y be the set of labels, and H be the set of classifiers for χ→Y. The goal is to find the classifier h∈H maximizing the probability of h(x)=y, where y∈Y is the ground truth label of x, i.e.,

h* = arg max_{h∈H} P(h(x) = y).
Classification errors occur when the classes overlap in the selected feature space (Fig. 2a). Various classification methods have been developed to provide different operating characteristics, including linear discriminant functions, artificial neural networks (ANN), k-nearest-neighbor (k-NN), radial basis functions (RBF) and support vector machines (SVM) [1].
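Under this formulation every example receives exactly one label. As a minimal illustrative sketch (the data, function, and distances here are ours, not the paper's), a 1-nearest-neighbor classifier in the spirit of the k-NN methods above always returns a single, exclusive class:

```python
import math

def nearest_neighbor(train, x):
    """Single-label 1-NN: labels are mutually exclusive, so the
    classifier returns exactly one of them for any query point."""
    def dist(a, b):
        # Euclidean distance in the selected feature space
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    # argmin over training examples; the nearest one decides the label
    return min(train, key=lambda pair: dist(pair[0], x))[1]

# toy feature vectors with hypothetical scene labels
train = [((0.0, 0.0), "beach"), ((1.0, 1.0), "urban"), ((0.9, 0.1), "beach")]
print(nearest_neighbor(train, (0.8, 0.2)))  # → beach
```

Classification errors in this setting arise exactly when points of different classes lie close together in the feature space, which is the overlap situation described next.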
However, in some classification tasks, it is likely that some data belongs to multiple classes, causing the actual classes to overlap by definition. In text or music categorization, documents may belong to multiple genres, such as government and health, or rock and blues [2], [3]. Architecture may belong to multiple genres as well. In medical diagnosis, a disease may belong to multiple categories, and genes may have multiple functions, yielding multiple labels [4].
A problem domain receiving renewed attention is semantic scene classification [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], categorizing images into semantic classes such as beaches, sunsets or parties. Semantic scene classification finds application in many areas, including content-based indexing and organization and content-sensitive image enhancement.
Many current digital library systems allow a user to specify a query image and search for images “similar” to it, where similarity is often defined only by color or texture properties. This so-called “query by example” process has often proved inadequate [19]. Knowing the category of a scene narrows the search space dramatically, simultaneously increasing the hit rate and reducing the false alarm rate.
Knowledge about the scene category can also find application in content-sensitive image enhancement [16]. While an algorithm might enhance the quality of some classes of pictures, it can degrade others. Rather than applying a generic algorithm to all images, we could customize it to the scene type (allowing us, for example, to retain or enhance the brilliant colors of sunset images while reducing the warm-colored cast from tungsten-illuminated scenes).
In the scene classification domain, many images may belong to multiple semantic classes. Fig. 1(a) shows an image that was classified by a human as a beach scene. However, it is clearly both a beach scene and an urban scene. It is not a fuzzy member of each (due to ambiguity), but a full member of each class (due to multiplicity). Fig. 1(b) (beach and mountains) is similar.
Much research has been done on scene classification recently, e.g., [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18]. Most systems are exemplar-based, learning patterns from a training set using statistical pattern recognition techniques. A variety of features and classifiers have been proposed; most systems use low-level features (e.g., color, texture). However, none addresses the use of multi-label images.
When choosing their data sets, most researchers either avoid such images, label them subjectively with the single base class most obvious to them, or treat “beach+urban” as a new class. The last method is unrealistic in most cases because it would substantially increase the number of classes to be considered, and the data in such combined classes is usually sparse. The first two methods have limitations as well. For example, in content-based image indexing and retrieval applications, it would be more difficult for a user to retrieve a multiple-class image (e.g., beach+urban) if we only have exclusive beach or urban labels: two separate queries would have to be conducted and the intersection of the retrieved images taken. In a content-sensitive image enhancement application, it may be desirable for the system to have different settings for beach, urban, and beach+urban scenes. This is impossible using exclusive single labels.
In this work, we consider the following problem:
The base classes are non-mutually exclusive and may overlap by definition (Fig. 2b). As before, let χ be the domain of examples to be classified and Y be the set of labels. Now let B be a set of binary vectors, each of length |Y|. Each vector b∈B indicates membership in the base classes in Y (+1=member,−1=non-member). H is the set of classifiers for χ→B. The goal is to find the classifier h∈H that minimizes a distance (e.g., Hamming), between h(x) and bx for a newly observed example x.
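The vectors b and the Hamming objective above can be made concrete with a few lines of code (a sketch; the label ordering and example vectors are ours):

```python
def hamming_distance(pred, truth):
    """Count the base classes on which the predicted and true
    ±1 membership vectors h(x) and b_x disagree."""
    assert len(pred) == len(truth)
    return sum(p != t for p, t in zip(pred, truth))

# label order: (beach, urban, mountain); +1 = member, -1 = non-member
truth = (+1, +1, -1)   # a beach+urban scene
pred  = (+1, -1, -1)   # classifier recovered only the beach label
print(hamming_distance(pred, truth))  # → 1
```

A perfect classifier achieves distance 0 on every example; a partly correct prediction, as above, is penalized only for the classes it gets wrong.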
In a probabilistic formulation, the goal of classifying x is to find one or more base class labels in a set C for a threshold T such that

C = {c ∈ Y : P(c|x) ≥ T}.
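In other words, every class whose estimated posterior clears the threshold is reported, so the answer may contain zero, one, or several labels. A minimal sketch (the posterior values and names are hypothetical):

```python
def label_set(posteriors, threshold):
    """Return every base class c with estimated P(c|x) >= threshold.
    Unlike an argmax decision, this may return 0, 1, or several labels."""
    return sorted(c for c, p in posteriors.items() if p >= threshold)

# hypothetical per-class posteriors for one image x
posteriors = {"beach": 0.85, "urban": 0.60, "mountain": 0.05}
print(label_set(posteriors, threshold=0.5))  # → ['beach', 'urban']
```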
Clearly, the mathematical formulation and its physical meaning are distinctly different from those used in classic pattern recognition. Few papers address this problem (see Section 2), and most of those are specialized for text classification or bioinformatics. Based on the multi-label model, we investigate several methods of training and propose a novel training method, “cross-training”. We also propose three classification criteria for testing. When applying our methods to scene classification, our experiments show that our approach is successful on multi-label images even without an abundance of training data. We also propose a generic evaluation metric that can be tailored to applications requiring different degrees of error forgiveness.
It is worth noting that multi-label classification is different from fuzzy-logic-based classification. Fuzzy logic is used as a means of coping with ambiguity in the feature space between multiple classes for a given sample, not as a means of achieving multi-label classification. The fuzzy membership stems from ambiguity, and a de-fuzzification step is often eventually used to derive a crisp decision (typically by choosing the class with the highest membership value). For example, a foliage scene and a sunset scene may share some warm, bright colors; if color features are used, there is therefore confusion between the two scene classes in the selected feature space, and fuzzy logic would be suitable for resolving it.
In contrast, multi-label classification is a unique problem in that a sample may possess multiple properties of multiple classes. The content for different classes can be quite distinct: for example, there is little confusion between beach (sand, water) and city (buildings).
The only commonality between fuzzy-logic classification and multi-label classification is the use of membership functions. However, fuzzy membership functions are correlated: when one membership takes high values, the others tend to take low values, and vice versa [20]. In the multi-label case, on the other hand, the membership functions are largely independent and may coincide (e.g., a resort on the beach). In practice, the sum of fuzzy memberships is usually normalized to 1, while no such constraint applies to the multi-label problem (e.g., a beach resort scene is both a beach scene and a city scene, each with full certainty).
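The normalization difference can be seen numerically. In the sketch below (our own illustration, not from the paper), softmax stands in for normalized fuzzy memberships and independent sigmoids stand in for multi-label memberships:

```python
import math

def fuzzy_memberships(scores):
    """Fuzzy-style memberships via softmax: forced to sum to 1,
    so no two classes can both be near full membership."""
    exps = {c: math.exp(s) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

def multilabel_memberships(scores):
    """Multi-label-style memberships via independent sigmoids:
    several classes can be near 1 at once."""
    return {c: 1.0 / (1.0 + math.exp(-s)) for c, s in scores.items()}

scores = {"beach": 4.0, "urban": 4.0, "mountain": -4.0}
fuzzy = fuzzy_memberships(scores)       # beach and urban split the mass: ~0.5 each
multi = multilabel_memberships(scores)  # beach and urban are both ~0.98
```

With equally strong evidence for beach and urban, the normalized version can never declare full membership in both, while the independent version can.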
These differences aside, it is conceivable that the learning strategies described in this paper could be used in combination with a fuzzy classifier, much as they are used with the pattern classifiers in this study.
In this paper, we first review past work related to multi-label classification. In Section 3, we describe our training models and testing criteria. Section 4 contains the proposed evaluation methods. Section 5 contains the experimental results obtained by applying our approaches to multi-labeled scene classification. We conclude with a discussion and suggestions for future work.
Section snippets
Related work
The sparse literature on multi-label classification is primarily geared to text classification or bioinformatics. For text classification, Schapire and Singer [3] proposed BoosTexter, extending AdaBoost to handle multi-label text categorization. However, they note that controlling complexity due to overfitting in their model is an open issue. McCallum [2] proposed a mixture model trained by EM, selecting the most probable set of labels from the power set of possible classes and using heuristics
Multi-label classification
In this section, we describe possible approaches for training and testing with multi-label data. Consider two classes, denoted by ‘+’ and ‘x’ respectively. Examples belonging to both the ‘+’ and ‘x’ classes simultaneously are denoted by ‘*’ (see Fig. 2b).
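One natural way to train per-class binary classifiers from such data is to let each multi-label (‘*’) example serve as a positive example for every class it belongs to. The sketch below is our own illustration of building those training sets, not the paper's exact cross-training procedure:

```python
def per_class_training_sets(examples, classes):
    """Build one binary (+1/-1) training set per base class from
    multi-label data. A '*' example (member of several classes) is
    a positive example for each of its classes.
    Illustrative sketch only; names and data are ours."""
    sets = {c: [] for c in classes}
    for features, labels in examples:
        for c in classes:
            sign = +1 if c in labels else -1
            sets[c].append((features, sign))
    return sets

# a '+' example, an 'x' example, and a '*' example belonging to both
examples = [((0.1, 0.2), {"+"}), ((0.9, 0.8), {"x"}), ((0.5, 0.5), {"+", "x"})]
sets = per_class_training_sets(examples, ["+", "x"])
print(sets["+"])  # the '*' example appears as a positive for '+'
```

Any binary classifier (SVM, ANN, etc.) can then be fit to each per-class set independently.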
Evaluating multi-label classification results
Evaluating the performance of multi-label classification is different from evaluating performance of classic single-label classification. Standard evaluation metrics include precision, recall, accuracy, and F-measure [29]. In multi-label classification, the evaluation is more complicated, because a result can be fully correct, partly correct, or fully incorrect. Take an example belonging to classes c1 and c2. We may get one of the following results:
- 1. c1, c2 (correct),
- 2. c1 (partly correct),
- 3. c1, c3
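A per-example score that gives partial credit for such partly correct results can be written down directly. The sketch below uses the Jaccard overlap of the predicted and true label sets, raised to a tunable exponent so that stricter applications can punish errors more harshly; the function and parameter names are ours:

```python
def alpha_score(truth, pred, alpha=1.0):
    """Per-example score in [0, 1]: 1 for a fully correct label set,
    0 for a fully incorrect one, intermediate for partly correct.
    With alpha = 1 this is the Jaccard overlap |T∩P| / |T∪P|;
    larger alpha forgives errors less."""
    t, p = set(truth), set(pred)
    if not t and not p:
        return 1.0  # vacuously correct: no labels expected, none predicted
    return (len(t & p) / len(t | p)) ** alpha

truth = {"c1", "c2"}
print(alpha_score(truth, {"c1", "c2"}))      # fully correct  → 1.0
print(alpha_score(truth, {"c1"}))            # partly correct → 0.5
print(alpha_score(truth, {"c3"}, alpha=2))   # fully incorrect → 0.0
```

Averaging this score over a test set yields a single figure of merit whose strictness is controlled by the exponent.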
Experimental results
We applied the above training and testing methods to semantic scene classification. As discussed in the Introduction, scene classification finds application in many areas, including content-based image analysis and organization and content-sensitive image enhancement. We now describe our baseline classifier and features and present the results.
Discussions
As shown in Table 1, some combined classes contain very few examples. The above experimental results show that the increase in accuracy due to the cross-training model is statistically significant; furthermore, these good multi-label results are produced even without an abundance of training data.
We now analyze the results obtained by using the C-criterion and cross-training.
Conclusions and future work
In this paper, we have presented an extensive comparative study of possible approaches to training and testing in multi-label classification. In particular, we contribute the following:
- • Cross-training, a new training strategy for building classifiers. Experimental results show that cross-training is more efficient in its use of training data and more effective in classifying multi-label data.
- • The C-criterion, using a threshold selected by the MAP principle, is effective for multi-label classification.
Acknowledgements
Boutell and Brown were supported by a grant from Eastman Kodak Company, by the NSF under Grant Number EIA-0080124, and by the Department of Education (GAANN) under Grant Number P200A000306. Shen was supported by DARPA under Grant Number F30602-03-2-0001.
References (34)
- et al., Image classification and querying using composite region templates, Comput. Vision Image Understanding (1999)
- et al., Pattern Classification (2001)
- A. McCallum, Multi-label text classification with a mixture model trained by EM, in: AAAI’99 Workshop on Text Learning, ...
- et al., BoosTexter: a boosting-based system for text categorization, Mach. Learning (2000)
- et al., Knowledge Discovery in Multi-label Phenotype Data (2001)
- M. Boutell, J. Luo, R.T. Gray, Sunset scene classification using simulated image recomposition, in: International ...
- C. Carson, S. Belongie, H. Greenspan, J. Malik, Recognition of images in large databases using a learning framework, ...
- J. Fan, Y. Gao, H. Luo, M.-S. Hacid, A novel framework for semantic image classification and benchmark, in: ACM SIGKDD ...
- et al., Retrieval by classification of images containing large manmade objects using perceptual grouping, Pattern Recognition (2001)
- P. Lipson, E. Grimson, P. Sinha, Configuration based scene classification and image indexing, in: Proc. IEEE ..., 1997
- et al., Modeling the shape of the scene: a holistic representation of the spatial envelope, Int. J. Comput. Vision
About the Author–MATTHEW BOUTELL received the B.S. degree (with High Distinction) in Mathematical Science from Worcester Polytechnic Institute in 1993 and the M.Ed. degree from the University of Massachusetts in 1994. Currently, he is a Ph.D. student in Computer Science at the University of Rochester. He served for several years as a mathematics and computer science instructor at Norton High School and at Stonehill College. His research interests include computer vision, pattern recognition, probabilistic modeling, and image understanding. He is a student member of the IEEE.
About the Author–JIEBO LUO received his Ph.D. degree in Electrical Engineering from the University of Rochester in 1995. He is currently a Senior Principal Research Scientist in the Eastman Kodak Research Laboratories. His research interests include image processing, pattern recognition, and computer vision. He has authored over 80 technical papers and holds 20 granted US patents. Dr. Luo was the Chair of the Rochester Section of the IEEE Signal Processing Society in 2001, and the General Co-Chair of the IEEE Western New York Workshop on Image Processing in 2000 and 2001. He was also a member of the Organizing Committee of the 2002 IEEE International Conference on Image Processing and a Guest Co-Editor for the Journal of Wireless Communications and Mobile Computing Special Issue on Multimedia Over Mobile IP. Currently, he is serving as an Associate Editor of the journal of Pattern Recognition and Journal of Electronic Imaging, an adjunct faculty member at Rochester Institute of Technology, and an At-Large Member of the Kodak Research Scientific Council. Dr. Luo is a Senior Member of the IEEE.
About the Author–XIPENG SHEN received the M.S. degree in Computer Science from the University of Rochester in 2002 and the M.S. degree in Pattern Recognition and Intelligent Systems from the Chinese Academy of Sciences. He is currently a Ph.D graduate student at the Department of Computer Science, University of Rochester. His research interests include image processing, machine learning, program analysis and optimization, speech and language processing.
About the Author–CHRISTOPHER BROWN (B.A. Oberlin 1967, Ph.D. University of Chicago 1972) is Professor of Computer Science at the University of Rochester, where he has been since finishing a postdoctoral fellowship at the School of Artificial Intelligence at the University of Edinburgh in 1974. He is coauthor of COMPUTER VISION with his Rochester colleague Dana Ballard. His current research interests are computer vision and robotics, integrated parallel systems performing animate vision (the interaction of visual capabilities and motor behavior), and the integration of planning, learning, sensing, and control.
☆ A short version of this paper was published in the Proceedings of the SPIE 2004 Electronic Imaging Conference.