
Pattern Recognition

Volume 37, Issue 9, September 2004, Pages 1757-1771

Learning multi-label scene classification

https://doi.org/10.1016/j.patcog.2004.03.009

Abstract

In classic pattern recognition problems, classes are mutually exclusive by definition. Classification errors occur when the classes overlap in the feature space. We examine a different situation, occurring when the classes are, by definition, not mutually exclusive. Such problems arise in semantic scene and document classification and in medical diagnosis. We present a framework to handle such problems and apply it to the problem of semantic scene classification, where a natural scene may contain multiple objects such that the scene can be described by multiple class labels (e.g., a field scene with a mountain in the background). Such a problem poses challenges to the classic pattern recognition paradigm and demands a different treatment. We discuss approaches for training and testing in this scenario and introduce new metrics for evaluating individual examples, class recall and precision, and overall accuracy. Experiments show that our methods are suitable for scene classification; furthermore, our work appears to generalize to other classification problems of the same nature.

Introduction

In traditional classification tasks [1]:

Classes are mutually exclusive by definition. Let χ be the domain of examples to be classified, Y be the set of labels, and H be the set of classifiers mapping χ → Y. The goal is to find the classifier h ∈ H that maximizes the probability that h(x) = y, where y ∈ Y is the ground-truth label of x, i.e., y = argmax_i P(y_i | x).

Classification errors occur when the classes overlap in the selected feature space (Fig. 2a). Various classification methods have been developed to provide different operating characteristics, including linear discriminant functions, artificial neural networks (ANN), k-nearest-neighbor (k-NN), radial basis functions (RBF) and support vector machines (SVM) [1].
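
As a minimal illustration of the single-label decision rule above, the following Python sketch picks the class with the largest estimated posterior; the posterior values here are invented for the example.

import numpy as np

def classify_single_label(posteriors):
    """Return the index of the single most probable class.

    posteriors: 1-D array of estimated posteriors P(y_i | x), one per class.
    """
    return int(np.argmax(posteriors))

# Example with three classes; the second class has the highest posterior.
print(classify_single_label(np.array([0.2, 0.7, 0.1])))  # prints 1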

However, in some classification tasks, it is likely that some data belongs to multiple classes, causing the actual classes to overlap by definition. In text or music categorization, documents may belong to multiple genres, such as government and health, or rock and blues [2], [3]. Architecture may belong to multiple genres as well. In medical diagnosis, a disease may belong to multiple categories, and genes may have multiple functions, yielding multiple labels [4].

A problem domain receiving renewed attention is semantic scene classification [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], categorizing images into semantic classes such as beaches, sunsets or parties. Semantic scene classification finds application in many areas, including content-based indexing and organization and content-sensitive image enhancement.

Many current digital library systems allow a user to specify a query image and search for images “similar” to it, where similarity is often defined only by color or texture properties. This so-called “query by example” process has often proved inadequate [19]. Knowing the category of a scene helps narrow the search space dramatically, simultaneously increasing the hit rate and reducing the false alarm rate.

Knowledge about the scene category can also find application in content-sensitive image enhancement [16]. While an algorithm might enhance the quality of some classes of pictures, it can degrade others. Rather than applying a generic algorithm to all images, we could customize it to the scene type (allowing us, for example, to retain or enhance the brilliant colors of sunset images while reducing the warm-colored cast from tungsten-illuminated scenes).

In the scene classification domain, many images may belong to multiple semantic classes. Fig. 1(a) shows an image that had been classified by a human as a beach scene. However, it is clearly both a beach scene and an urban scene. It is not a fuzzy member of each (due to ambiguity), but is a full member of each class (due to multiplicity). Fig. 1(b) (beach and mountains) is similar.

Much research has been done on scene classification recently, e.g., [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18]. Most systems are exemplar-based, learning patterns from a training set using statistical pattern recognition techniques. A variety of features and classifiers have been proposed; most systems use low-level features (e.g., color, texture). However, none addresses the use of multi-label images.

When choosing their data sets, most researchers either avoid such images, label them subjectively with the base (single-label) class most obvious to them, or consider “beach+urban” as a new class. The last method is unrealistic in most cases because it would substantially increase the number of classes to be considered, and the data in such combined classes is usually sparse. The first two methods have limitations as well. For example, in content-based image indexing and retrieval applications, it would be more difficult for a user to retrieve a multiple-class image (e.g., beach+urban) if we only have exclusive beach or urban labels: two separate queries may have to be conducted and the intersection of the retrieved images taken. In a content-sensitive image enhancement application, it may be desirable for the system to have different settings for beach, urban, and beach+urban scenes. This is impossible using exclusive single labels.

In this work, we consider the following problem:

The base classes are non-mutually exclusive and may overlap by definition (Fig. 2b). As before, let χ be the domain of examples to be classified and Y be the set of labels. Now let B be a set of binary vectors, each of length |Y|. Each vector b ∈ B indicates membership in the base classes in Y (+1 = member, −1 = non-member). H is the set of classifiers mapping χ → B. The goal is to find the classifier h ∈ H that minimizes a distance (e.g., Hamming) between h(x) and b_x for a newly observed example x.

In a probabilistic formulation, the goal of classifying x is to find a set C of one or more base class labels and a threshold T such that P(c|x) > T, ∀c ∈ C.
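
As a minimal illustration of this formulation (the class names, threshold value, and posteriors below are invented for the example, not taken from the paper), the following Python sketch thresholds per-class posteriors into a ±1 membership vector and measures its Hamming distance to a ground-truth vector.

import numpy as np

CLASSES = ["beach", "urban", "field", "mountain"]  # hypothetical label set

def predict_labels(posteriors, threshold=0.5):
    """Return a +1/-1 membership vector: +1 where P(c|x) > T, else -1."""
    return np.where(posteriors > threshold, 1, -1)

def hamming_distance(predicted, truth):
    """Count the base classes on which prediction and ground truth disagree."""
    return int(np.sum(predicted != truth))

posteriors = np.array([0.81, 0.62, 0.07, 0.30])  # hypothetical P(c|x) values
truth = np.array([1, 1, -1, -1])                 # a beach + urban scene

pred = predict_labels(posteriors, threshold=0.5)
print(dict(zip(CLASSES, pred.tolist())))  # {'beach': 1, 'urban': 1, 'field': -1, 'mountain': -1}
print(hamming_distance(pred, truth))      # 0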

Clearly, the mathematical formulation and its physical meaning are distinctly different from those used in classic pattern recognition. Few papers address this problem (see Section 2), and most of these are specialized for text classification or bioinformatics. Based on the multi-label model, we investigate several methods of training and propose a novel training method, “cross-training”. We also propose three classification criteria for testing. Our experiments on scene classification show that the approach succeeds on multi-label images even without an abundance of training data. We also propose a generic evaluation metric that can be tailored to applications requiring different degrees of error forgiveness.

It is worth noting that multi-label classification is different from fuzzy logic-based classification. Fuzzy logic is used as a means to cope with ambiguity in the feature space between multiple classes for a given sample, not as an end for achieving multi-label classification. The fuzzy membership stems from ambiguity, and a de-fuzzification step is often eventually used to derive a crisp decision (typically by choosing the class with the highest membership value). For example, a foliage scene and a sunset scene may share some warm, bright colors; if color features are used, there is therefore confusion between the two scene classes in the selected feature space, and fuzzy logic would be suitable for handling this problem.

In contrast, multi-label classification is a unique problem in that a sample may possess multiple properties of multiple classes. The content for different classes can be quite distinct: for example, there is little confusion between beach (sand, water) and city (buildings).

The only commonality between fuzzy-logic classification and multi-label classification is the use of membership functions. However, fuzzy membership functions are correlated: when one membership value is high, the others tend to be low, and vice versa [20]. In the multi-label case, by contrast, the memberships are largely coincident (e.g., a resort on the beach). In practice, the sum of fuzzy memberships is usually normalized to 1, while no such constraint applies to the multi-label problem (e.g., a beach resort scene is both a beach scene and a city scene, each with full certainty).
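
As a small numerical sketch of this contrast (all membership values are invented): fuzzy memberships are normalized to sum to 1 and then de-fuzzified to a single label, whereas multi-label memberships are asserted independently and can both be high.

import numpy as np

# Fuzzy-style memberships for an ambiguous sunset/foliage image, normalized
# so that the values sum to 1; de-fuzzification would pick the argmax.
fuzzy = np.array([0.6, 0.4])
print(fuzzy.sum(), int(np.argmax(fuzzy)))  # 1.0 0

# Multi-label memberships for a beach-resort image: the scene is fully a
# beach scene and fully an urban scene, so both values can be near 1 and
# no sum-to-one constraint applies.
multi_label = np.array([0.95, 0.90])
print(multi_label.sum())  # about 1.85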

These differences aside, it is conceivable that one could use the learning strategies described in this paper in combination with a fuzzy classifier, much as they are used with the pattern classifiers in this study.

In this paper, we first review past work related to multi-label classification. In Section 3, we describe our training models and testing criteria. Section 4 contains the proposed evaluation methods. Section 5 contains the experimental results obtained by applying our approaches to multi-labeled scene classification. We conclude with a discussion and suggestions for future work.

Section snippets

Related work

The sparse literature on multi-label classification is primarily geared to text classification or bioinformatics. For text classification, Schapire and Singer [3] proposed BoosTexter, extending AdaBoost to handle multi-label text categorization. However, they note that controlling complexity due to overfitting in their model is an open issue. McCallum [2] proposed a mixture model trained by EM, selecting the most probable set of labels from the power set of possible classes and using heuristics

Multi-label classification

In this section, we describe possible approaches for training and testing with multi-label data. Consider two classes, denoted by ‘+’ and ‘x’ respectively. Examples belonging to both the ‘+’ and ‘x’ classes simultaneously are denoted by ‘*’ (see Fig. 2b).
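
As one plausible illustration (not necessarily the exact procedure compared in this section), the Python sketch below builds one binary training set per base class from such data, using a ‘*’ example as a positive example for both the ‘+’ and ‘x’ problems.

def one_vs_rest_training_sets(examples, classes=('+', 'x')):
    """Build one binary training set per base class from multi-label data.

    examples: list of (feature_vector, label_set) pairs, where label_set is a
    subset of `classes`; an example carrying both labels plays the role of '*'.
    Returns {class: list of (feature_vector, +1 or -1)}.
    """
    training_sets = {c: [] for c in classes}
    for features, labels in examples:
        for c in classes:
            # A multi-label ('*') example is kept as a positive example for
            # every base class it belongs to.
            training_sets[c].append((features, 1 if c in labels else -1))
    return training_sets

# Toy data: a pure '+' example, a pure 'x' example, and a '*' example.
data = [([0.1, 0.9], {'+'}), ([0.8, 0.2], {'x'}), ([0.5, 0.5], {'+', 'x'})]
for cls, rows in one_vs_rest_training_sets(data).items():
    print(cls, rows)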

Evaluating multi-label classification results

Evaluating the performance of multi-label classification is different from evaluating the performance of classic single-label classification. Standard evaluation metrics include precision, recall, accuracy, and F-measure [29]. In multi-label classification, the evaluation is more complicated, because a result can be fully correct, partly correct, or fully incorrect. Take an example belonging to classes c1 and c2. We may get one of the following results (a minimal scoring sketch follows the list):

  1. c1, c2 (correct),
  2. c1 (partly correct),
  3. c1, c3
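
The metric proposed in Section 4 is not reproduced in this snippet; as one common way to grade fully correct, partly correct, and fully incorrect predictions on a per-example basis, the following Python sketch uses a Jaccard-style overlap score (an illustrative assumption, not necessarily the paper's metric).

def example_score(true_labels, predicted_labels):
    """Per-example score in [0, 1]: 1 for a fully correct label set, 0 when no
    labels match, and partial credit in between (Jaccard-style overlap)."""
    true_set, pred_set = set(true_labels), set(predicted_labels)
    if not true_set and not pred_set:
        return 1.0
    return len(true_set & pred_set) / len(true_set | pred_set)

truth = {"c1", "c2"}
print(example_score(truth, {"c1", "c2"}))  # 1.0   -> fully correct
print(example_score(truth, {"c1"}))        # 0.5   -> partly correct
print(example_score(truth, {"c1", "c3"}))  # ~0.33 -> partly correct
print(example_score(truth, {"c3"}))        # 0.0   -> fully incorrect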

Experimental results

We applied the above training and testing methods to semantic scene classification. As discussed in the Introduction, scene classification finds application in many areas, including content-based image analysis and organization and content-sensitive image enhancement. We now describe our baseline classifier and features and present the results.

Discussions

As shown in Table 1, some combined classes contain very few examples. The above experimental results show that the increase in accuracy due to the cross-training model is statistically significant; furthermore, these good multi-label results are produced even without an abundance of training data.

We now analyze the results obtained by using C-criterion and cross-training.1 The

Conclusions and future work

In this paper, we have presented an extensive comparative study of possible approaches to training and testing in multi-label classification. In particular, we contribute the following:

  • Cross-training, a new training strategy to build classifiers. Experimental results show that cross-training is more efficient in using training data and more effective in classifying multi-label data.

  • The C-criterion, using a threshold selected by the MAP principle, is effective for multi-label classification. Other

Acknowledgements

Boutell and Brown were supported by a grant from Eastman Kodak Company, by the NSF under Grant Number EIA-0080124, and by the Department of Education (GAANN) under Grant Number P200A000306. Shen was supported by DARPA under Grant Number F30602-03-2-0001.


References (34)

  • J.R. Smith et al., Image classification and querying using composite region templates, Comput. Vision Image Understanding (1999).
  • R. Duda et al., Pattern Classification (2001).
  • A. McCallum, Multi-label text classification with a mixture model trained by EM, in: AAAI’99 Workshop on Text Learning, ...
  • R. Schapire et al., BoosTexter: a boosting-based system for text categorization, Mach. Learning (2000).
  • A. Clare et al., Knowledge Discovery in Multi-label Phenotype Data (2001).
  • M. Boutell, J. Luo, R.T. Gray, Sunset scene classification using simulated image recomposition, in: International ...
  • C. Carson, S. Belongie, H. Greenspan, J. Malik, Recognition of images in large databases using a learning framework, ...
  • J. Fan, Y. Gao, H. Luo, M.-S. Hacid, A novel framework for semantic image classification and benchmark, in: ACM SIGKDD ...
  • Q. Iqbal et al., Retrieval by classification of images containing large manmade objects using perceptual grouping, Pattern Recognition (2001).
  • P. Lipson, E. Grimson, P. Sinha, Configuration based scene classification and image indexing, 1997, Proc. IEEE ...
  • A. Oliva et al., Modeling the shape of the scene: a holistic representation of the spatial envelope, Int. J. Comput. Vision (2001).
  • A. Oliva, A. Torralba, Scene-centered description from spatial envelope properties, in: Second Workshop on Biologically ...
  • S. Paek, S.-F. Chang, A knowledge engineering approach for image classification based on probabilistic reasoning ...
  • N. Serrano, A. Savakis, J. Luo, A computationally efficient approach to indoor/outdoor scene classification, in: ...
  • Y. Song, A. Zhang, Analyzing scenery images by monotonic tree, ACM Multimedia Systems J. 8 (6) 495–511, ...
  • M. Szummer, R.W. Picard, Indoor–outdoor image classification, in: IEEE International Workshop on Content-based Access ...
  • A. Torralba, P. Sinha, Recognizing indoor scenes, Technical Report, AI Memo 2001-015, CBCL Memo 202, MIT, July ...

    About the Author: MATTHEW BOUTELL received the B.S. degree (with High Distinction) in Mathematical Science from Worcester Polytechnic Institute in 1993 and the M.Ed. degree from the University of Massachusetts in 1994. Currently, he is a Ph.D. student in Computer Science at the University of Rochester. He served for several years as a mathematics and computer science instructor at Norton High School and at Stonehill College. His research interests include computer vision, pattern recognition, probabilistic modeling, and image understanding. He is a student member of the IEEE.

    About the Author: JIEBO LUO received his Ph.D. degree in Electrical Engineering from the University of Rochester in 1995. He is currently a Senior Principal Research Scientist in the Eastman Kodak Research Laboratories. His research interests include image processing, pattern recognition, and computer vision. He has authored over 80 technical papers and holds 20 granted US patents. Dr. Luo was the Chair of the Rochester Section of the IEEE Signal Processing Society in 2001, and the General Co-Chair of the IEEE Western New York Workshop on Image Processing in 2000 and 2001. He was also a member of the Organizing Committee of the 2002 IEEE International Conference on Image Processing and a Guest Co-Editor for the Journal of Wireless Communications and Mobile Computing Special Issue on Multimedia Over Mobile IP. Currently, he is serving as an Associate Editor of the journals Pattern Recognition and Journal of Electronic Imaging, an adjunct faculty member at Rochester Institute of Technology, and an At-Large Member of the Kodak Research Scientific Council. Dr. Luo is a Senior Member of the IEEE.

    About the Author: XIPENG SHEN received the M.S. degree in Computer Science from the University of Rochester in 2002 and the M.S. degree in Pattern Recognition and Intelligent Systems from the Chinese Academy of Sciences. He is currently a Ph.D. graduate student in the Department of Computer Science, University of Rochester. His research interests include image processing, machine learning, program analysis and optimization, and speech and language processing.

    About the Author: CHRISTOPHER BROWN (B.A. Oberlin 1967, Ph.D. University of Chicago 1972) is Professor of Computer Science at the University of Rochester, where he has been since finishing a postdoctoral fellowship at the School of Artificial Intelligence at the University of Edinburgh in 1974. He is coauthor of COMPUTER VISION with his Rochester colleague Dana Ballard. His current research interests are computer vision and robotics, integrated parallel systems performing animate vision (the interaction of visual capabilities and motor behavior), and the integration of planning, learning, sensing, and control.

    A short version of this paper was published in the Proceedings of the SPIE 2004 Electronic Imaging Conference.
