Pattern Recognition Letters
Volume 30, Issue 2, 15 January 2009, Pages 88-97

Semantic object classes in video: A high-definition ground truth database

https://doi.org/10.1016/j.patrec.2008.04.005

Abstract

Visual object analysis researchers are increasingly experimenting with video, because motion cues are expected to help with detection, recognition, and other analysis tasks. This paper presents the Cambridge-driving Labeled Video Database (CamVid) as the first collection of videos with object class semantic labels, complete with metadata. The database provides ground truth labels that associate each pixel with one of 32 semantic classes.

The database addresses the need for experimental data to quantitatively evaluate emerging algorithms. While most videos are filmed with fixed-position CCTV-style cameras, our data was captured from the perspective of a driving automobile. The driving scenario increases the number and heterogeneity of the observed object classes. Over 10 min of high-quality 30 Hz footage is provided, with corresponding semantically labeled images at 1 Hz and, in part, at 15 Hz.

The CamVid Database offers four contributions that are relevant to object analysis researchers. First, the per-pixel semantic segmentation of over 700 images was specified manually, and was then inspected and confirmed by a second person for accuracy. Second, the high-quality and large resolution color video images in the database represent valuable extended duration digitized footage to those interested in driving scenarios or ego-motion. Third, we filmed calibration sequences for the camera color response and intrinsics, and computed a 3D camera pose for each frame in the sequences. Finally, in support of expanding this or other databases, we present custom-made labeling software for assisting users who wish to paint precise class-labels for other images and videos. We evaluate the relevance of the database by measuring the performance of an algorithm from each of three distinct domains: multi-class object recognition, pedestrian detection, and label propagation.

Introduction

Training and rigorous evaluation of video-based object analysis algorithms require data that is labeled with ground truth. Video labeled with semantic object classes has two important uses. First, it can be used to train new algorithms that leverage motion cues for recognition, detection, and segmentation. Second, such labeled video finally makes it possible to evaluate existing video algorithms quantitatively.

This paper presents the CamVid Database, which is to our knowledge, the only currently available video-based database with per-pixel ground truth for multiple classes. It consists of the original high-definition (HD) video footage and 10 min of frames which volunteers hand-labeled according to a list of 32 object classes. The pixel precision of the object labeling in the frames allows for accurate training and quantitative evaluation of algorithms. The database also includes the camera pose and calibration parameters of the original sequences. Further, we propose the InteractLabeler, an interactive software system to assist users with the manual labeling task. The volunteers’ paint strokes were logged by this software and are also included with the database.

We agree with the authors of Yao et al. (2007) that perhaps in addition to pixel-wise class labels, the semantic regions should be annotated with their shape or structure, or perhaps also organized hierarchically. Our data does not contain such information, but we propose that it may be possible to develop a form of high-level boundary-detection in the future that would convert this and other pixel-wise segmented data into a more useful form.
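As a purely illustrative sketch of that idea (the database itself ships only the pixel-wise labels), class boundaries can be recovered from an integer label map by marking pixels whose class differs from that of a neighbour; the function name and array conventions below are our own assumptions, not part of the database.

    import numpy as np

    def class_boundaries(label_map):
        """Mark pixels where the class label changes between 4-neighbours.

        label_map: 2D integer array of per-pixel class indices.
        Returns a boolean array of the same shape, True on class boundaries.
        """
        boundaries = np.zeros(label_map.shape, dtype=bool)
        # Compare each pixel with its right and bottom neighbours.
        boundaries[:, :-1] |= label_map[:, :-1] != label_map[:, 1:]
        boundaries[:-1, :] |= label_map[:-1, :] != label_map[1:, :]
        return boundaries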

So far, modern databases have featured still images, to emphasize the breadth of object appearance. Object analysis algorithms are gradually maturing to the point where scenes (Lazebnik et al., 2006, Oliva and Torralba, 2001), landmarks (Snavely et al., 2006), and whole object classes (Rabinovich et al., in press) could be recognized in still images for a majority of the test data (Fei-Fei et al., 2006).

We anticipate that the greatest future innovations in object analysis will come from algorithms that take advantage of spatial and temporal context. Spatial context has already proven very valuable, as shown by Hoiem et al. (2006), who took particular advantage of perspective cues. Yuan et al. (2007) showed the significant value of layout context and region-adaptive grids in particular. Yuan et al. experimentally demonstrated improved performance for region annotation of objects from a subset of the Corel Stock Photo CDs, which they had to annotate themselves for lack of existing labels. Their lexicon of 11 concepts overlaps with our 32 classes. Our database is meant to enable similar innovations, but for temporal as well as spatial context. Fig. 1 lists the most relevant photo and video databases used for either recognition or segmentation.

The performance of dedicated detectors for cars (Leibe et al., 2007) and pedestrians (Dalal and Triggs, 2005) is generally quantified using data in which individual entities have been counted, or by measuring the overlap with annotated bounding boxes. A number of excellent still-image databases have become available recently, with varying amounts of annotation. The Microsoft Research Cambridge database (Shotton et al., 2006) is among the most relevant, because it includes per-pixel class labels for every photograph in the set. The LabelMe effort (Russell et al., 2005) has cleverly leveraged the internet and interest in annotated images to gradually grow a database of polygon outlines that approximate object boundaries. The PASCAL Visual Object Classes Challenge provides datasets and also invites authors to submit and compare the results of their respective object classification (and now segmentation) algorithms.
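For reference, the bounding-box overlap used by such detection benchmarks is typically measured as intersection-over-union; the following is a minimal sketch, where the (x_min, y_min, x_max, y_max) box format and the 0.5 acceptance threshold are common conventions rather than requirements of any particular database.

    def iou(box_a, box_b):
        """Intersection-over-union of two boxes given as (x_min, y_min, x_max, y_max)."""
        ax0, ay0, ax1, ay1 = box_a
        bx0, by0, bx1, by1 = box_b
        # Width and height of the intersection rectangle (zero if the boxes are disjoint).
        iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
        ih = max(0.0, min(ay1, by1) - max(ay0, by0))
        inter = iw * ih
        union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
        return inter / union if union > 0 else 0.0

    # A detection is commonly counted as correct when its IoU with a
    # ground truth box exceeds 0.5.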

However, no equivalent initiative exists for video. It is reasonable to expect that the still-frame algorithms would perform similarly on frames sampled from video. However, to test this hypothesis, we were unable to find suitable existing video data with ground truth semantic labeling.

In the context of video-based object analysis, many advanced techniques have been proposed for object segmentation (Marcotegui et al., 1999, Deng and Manjunath, 2001, Patras et al., 2003, Wang et al., 2005, Agarwala et al., 2004). However, the numerical evaluation of these techniques is often missing or limited: the results of video segmentation algorithms are usually illustrated by a few segmentation examples, without quantitative evaluation. Interestingly, for detection in security and criminal events, the PETS Workshop (The PETS, 2007) provides benchmark data consisting of event logs (for three types of events) and bounding boxes. TRECVid (Smeaton et al., 2006) is one of the reigning event analysis datasets, containing shot-boundary information and flags indicating when a given shot "features" sports, weather, studio, outdoor, and similar events.
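With per-pixel ground truth such as ours, that quantitative evaluation becomes straightforward. As a minimal sketch (the metrics reported in our experiments may be defined differently), global and per-class pixel accuracy can be computed as follows; the optional void label for unlabeled pixels is an assumption.

    import numpy as np

    def pixel_accuracy(pred, gt, num_classes, void_label=None):
        """Global and per-class pixel accuracy of a predicted label map.

        pred, gt: 2D integer arrays of class indices with identical shape.
        void_label: optional class index excluded from scoring (e.g. unlabeled pixels).
        """
        valid = np.ones(gt.shape, bool) if void_label is None else gt != void_label
        correct = (pred == gt) & valid
        global_acc = correct.sum() / valid.sum()
        per_class = []
        for c in range(num_classes):
            mask = (gt == c) & valid
            per_class.append(correct[mask].sum() / mask.sum() if mask.any() else float("nan"))
        return global_acc, per_class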

We propose this new ground truth database to allow numerical evaluation of various recognition, detection, and segmentation techniques. The proposed CamVid Database consists of the following elements:

  • the original video sequences (Section 2.1);

  • the intrinsic calibration (Section 2.2);

  • the camera pose trajectories (Section 2.3; a projection sketch using these poses follows this list);

  • the list of class labels and pseudo-colors (Section 3);

  • the hand labeled frames (Section 3.1);

  • the stroke logs for the hand labeled frames (Section 3.2).
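As a hedged illustration of how the intrinsic calibration and per-frame camera poses listed above might be used together, the sketch below projects 3D world points into a frame with a pinhole model; the world-to-camera pose convention and matrix shapes are our own assumptions, not the database's file format.

    import numpy as np

    def project_points(points_world, K, R, t):
        """Project 3D world points into pixel coordinates with a pinhole model.

        points_world: (N, 3) array of world coordinates.
        K: (3, 3) intrinsic matrix (e.g. from the calibration sequences).
        R, t: world-to-camera rotation (3, 3) and translation (3,) for one frame.
        """
        cam = points_world @ R.T + t      # world -> camera coordinates
        pix = cam @ K.T                   # camera -> homogeneous pixel coordinates
        return pix[:, :2] / pix[:, 2:3]   # perspective divide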

The database and the InteractLabeler software (Section 4.2) will be available for download from the web. A short video provides an overview of the database; it is available as supplemental material to this article, as well as on the database page itself.

Section snippets

High quality video

We drove with a camera mounted inside a car and filmed over two hours of video footage. The CamVid Database presented here is the resulting subset, lasting 22 min, 14 s. A high-definition 3CCD Panasonic HVX200 digital camera was used, capturing 960 × 720 pixel frames at 30 fps (frames per second). Note that the pixel aspect ratio on the camera is not square and was kept as such to avoid interpolation and quality degradation.
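For readers who need square pixels (e.g. for display, or for algorithms that assume them), a frame can be rescaled horizontally after the fact; in the sketch below the target width of 1280 is an assumption about the camera's anamorphic HD mode, and working directly on the stored 960 × 720 grid is equally valid.

    import cv2

    def to_square_pixels(frame, display_width=1280):
        """Rescale a stored 960x720 frame to square pixels for display.

        display_width = 1280 is an assumed 16:9 display size for the camera's
        anamorphic mode; the stored frames themselves are left untouched.
        """
        height = frame.shape[0]
        return cv2.resize(frame, (display_width, height), interpolation=cv2.INTER_CUBIC)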

Semantic classes and labeled data

After surveying the greater set of videos, we identified 32 classes of interest to drivers. The class names and their corresponding pseudo-colors are given in Fig. 4. They include fixed objects, types of road surface, moving objects (including vehicles and people), and ceiling (sky, tunnel, archway). The relatively large number of classes (32) implies that labeled frames provide a rich semantic description of the scene from which spatial relationships and context can be learned.
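Since each class is encoded in the labeled frames as a pseudo-color (Fig. 4), a typical first step for training or evaluation is to convert the RGB label images into integer class maps. In the sketch below the two example colors are placeholders only; the full 32-entry palette should be taken from Fig. 4.

    import numpy as np

    # Placeholder palette: the actual pseudo-colors of all 32 classes are given in Fig. 4.
    PALETTE = {
        (64, 128, 64): 0,   # hypothetical color for class 0
        (128, 0, 0): 1,     # hypothetical color for class 1
        # ... remaining classes ...
    }

    def pseudocolor_to_classes(label_rgb):
        """Map an (H, W, 3) pseudo-color label image to an (H, W) class-index map."""
        classes = np.full(label_rgb.shape[:2], -1, dtype=np.int32)  # -1 = unknown color
        for color, idx in PALETTE.items():
            classes[np.all(label_rgb == np.array(color), axis=-1)] = idx
        return classes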


Production of the labeled frames data

For 701 frames extracted from the database sequence, we hired 13 volunteers (the “labelers”) to manually produce the corresponding labeled images. They painted the areas corresponding to a predefined list of 32 object classes of interest, given a specific palette of colors (Fig. 4).

In this section, we give an overview of the website (Section 4.1) and the labeling software (Section 4.2) that were designed for this task. The website has allowed us to train volunteers and then exchange original and labeled images with them.

Applications and results

To evaluate the potential benefits of the CamVid Database, we measured the performance of several existing algorithms. Unlike many databases which were collected for a single application, the CamVid Database was intentionally designed for use in multiple domains. Three performance experiments examine the usefulness of the database for quantitative algorithm testing. The algorithms address, in turn, (i) object recognition, (ii) pedestrian detection, and (iii) segmented label propagation in video.

Discussion

The long term goals of object analysis research require that objects, even in motion, are identifiable when observed in the real world. To thoroughly evaluate and improve these object recognition algorithms, this paper proposes the CamVid annotated database. Building of this database is a direct response to the formidable challenge of providing video data with detailed semantic segmentation.

The CamVid Database offers four contributions that are relevant to object analysis researchers, as summarized in the abstract.

Acknowledgements

This work has been carried out with the support of Toyota Motor Europe. We are grateful to John Winn for help during filming.

References (36)

  • Patras, I., et al., 2003. Semi-automatic object-based video segmentation with labeling of color segments. Signal Process.: Image Comm.
  • Agarwala, A., et al., 2004. Keyframe-based tracking for rotoscoping and animation. ACM Trans. Graphics.
  • Bileschi, S., 2006. CBCL Streetscenes: towards scene understanding in still images, Tech. Rep. MIT-CBCL-TR-2006,...
  • Bouguet, J.-Y., 2004. Camera Calibration Toolbox for MATLAB....
  • Boujou, 2007. 2d3 Ltd....
  • Burt, P., et al., 1981. Segmentation and estimation of image region properties through cooperative hierarchical computation. IEEE Syst. Man Cybern. (SMC).
  • Comaniciu, D., et al., 1997. Robust analysis of feature spaces: Color image segmentation. IEEE Conf. Comput. Vision Pattern Recognition (CVPR), Puerto Rico.
  • Dalal, N., Triggs, B., 2005. Histograms of oriented gradients for human detection. In: IEEE Comput. Vision Pattern...
  • Dalal, N., Triggs, B., Schmid, C., 2006. Human detection using oriented histograms of flow and appearance. In: Eur....
  • Deng, Y., et al., 2001. Unsupervised segmentation of color-texture regions in images and video. IEEE Trans. Pattern Anal. Machine Intell. (PAMI).
  • Efros, A.A., Berg, A.C., Mori, G., Malik, J., 2003. Recognizing action at a distance. In: IEEE Internat. Conf. Comput....
  • Facebook homepage, 2007....
  • Fauqueur, J., Brostow, G., Cipolla, R., 2007. Assisted video object labeling by joint tracking of regions and...
  • Fei-Fei, L., et al., 2006. One-shot learning of object categories. IEEE Trans. Pattern Anal. Machine Intell. (PAMI).
  • Felzenszwalb, P., et al., 2004. Efficient graph-based image segmentation. Internat. J. Comput. Vision (IJCV).
  • Griffin, G., Holub, A., Perona, P., 2007. Caltech-256 object category dataset, Tech. Rep. 7694, California Institute of...
  • Hoiem, D., Efros, A.A., Hebert, M., 2006. Putting objects in perspective. In: Proc. IEEE Comput. Vision Pattern...
  • Lazebnik, S., Schmid, C., Ponce, J., 2006. Beyond bags of features: spatial pyramid matching for recognizing natural...