Semantic object classes in video: A high-definition ground truth database
Introduction
Training and rigorous evaluation of video-based object analysis algorithms require data that is labeled with ground truth. Video labeled with semantic object classes has two important uses. First, it can be used to train new algorithms that leverage motion cues for recognition, detection, and segmentation. Second, it enables quantitative evaluation of existing video algorithms.
This paper presents the CamVid Database, which is, to our knowledge, the only currently available video-based database with per-pixel ground truth for multiple object classes. It consists of the original high-definition (HD) video footage and 10 min of frames which volunteers hand-labeled according to a list of 32 object classes. The pixel precision of the object labeling in the frames allows for accurate training and quantitative evaluation of algorithms. The database also includes the camera pose and calibration parameters of the original sequences. Further, we propose the InteractLabeler, an interactive software system to assist users with the manual labeling task. The volunteers’ paint strokes were logged by this software and are also included with the database.
We agree with Yao et al. (2007) that, in addition to pixel-wise class labels, the semantic regions should perhaps be annotated with their shape or structure, or organized hierarchically. Our data does not contain such information, but we propose that it may be possible to develop a form of high-level boundary detection in the future that would convert this and other pixel-wise segmented data into a more useful form.
So far, modern databases have featured still images, to emphasize the breadth of object appearance. Object analysis algorithms are gradually maturing to the point where scenes (Lazebnik et al., 2006, Oliva and Torralba, 2001), landmarks (Snavely et al., 2006), and whole object classes (Rabinovich et al., in press) could be recognized in still images for a majority of the test data (Fei-Fei et al., 2006).
We anticipate that the greatest future innovations in object analysis will come from algorithms that take advantage of spatial and temporal context. Spatial context has already proven very valuable, as shown by Hoiem et al. (2006), who took particular advantage of perspective cues. Yuan et al. (2007) showed the significant value of layout context and region adaptive grids in particular. Yuan et al. experimentally demonstrated improved performance for region annotation of objects from a subset of the Corel Stock Photo CDs, which they had to annotate themselves for lack of existing labels. Their lexicon of 11 concepts overlaps with our 32 classes. Our database is meant to enable similar innovations, but for temporal as well as spatial context. Fig. 1 lists the most relevant photo and video databases used for either recognition or segmentation.
The performance of dedicated detectors for cars (Leibe et al., 2007) and pedestrians (Dalal and Triggs, 2005) is generally quantified using data in which individual entities have been counted, or by measuring the overlap with annotated bounding boxes. A number of excellent still-image databases have become available recently, with varying amounts of annotation. The Microsoft Research Cambridge database (Shotton et al., 2006) is among the most relevant, because it includes per-pixel class labels for every photograph in the set. The LabelMe (Russell et al., 2005) effort has cleverly leveraged the internet and interest in annotated images to gradually grow their database of polygon outlines that approximate object boundaries. The PASCAL Visual Object Classes Challenge provides datasets and also invites authors to submit and compare the results of their respective object classification (and now segmentation) algorithms.
However, no equivalent initiative exists for video. It is reasonable to expect that the still-frame algorithms would perform similarly on frames sampled from video. However, to test this hypothesis, we were unable to find suitable existing video data with ground truth semantic labeling.
In the context of video-based object analysis, many advanced techniques have been proposed for object segmentation (Marcotegui et al., 1999, Deng and Manjunath, 2001, Patras et al., 2003, Wang et al., 2005, Agarwala et al., 2004). However, the numerical evaluation of these techniques is often missing or limited. The results of video segmentation algorithms are usually illustrated by a few segmentation examples, without quantitative evaluation. Interestingly, for detection in security and criminal events, the PETS Workshop (The PETS, 2007) provides benchmark data consisting of event logs (for three types of events) and bounding boxes. TRECVid (Smeaton et al., 2006) is one of the reigning event analysis datasets, containing shot-boundary information and flags indicating when a given shot "features" sports, weather, studio, outdoor, etc. events.
We propose this new ground truth database to allow numerical evaluation of various recognition, detection, and segmentation techniques. The proposed CamVid Database consists of the following elements:
- the original video sequences (Section 2.1);
- the intrinsic calibration (Section 2.2);
- the camera pose trajectories (Section 2.3);
- the list of class labels and pseudo-colors (Section 3);
- the hand-labeled frames (Section 3.1);
- the stroke logs for the hand-labeled frames (Section 3.2).
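The intrinsic calibration and camera pose elements combine in the standard pinhole camera model, which maps 3D world points to pixel coordinates. The sketch below illustrates that projection; the intrinsic values (focal lengths, principal point) and the pose are hypothetical placeholders, not the database's actual calibration.

```python
# Pinhole projection: map a 3D world point to pixel coordinates using
# intrinsics (fx, fy, cx, cy) and a camera pose (R, t).
# All numeric values here are illustrative placeholders only.

def project(point_w, R, t, fx, fy, cx, cy):
    # World -> camera frame: X_c = R * X_w + t
    xc = [sum(R[i][j] * point_w[j] for j in range(3)) + t[i] for i in range(3)]
    # Perspective divide, then apply the intrinsic parameters.
    u = fx * xc[0] / xc[2] + cx
    v = fy * xc[1] / xc[2] + cy
    return u, v

# Identity pose; a point 2 m ahead of the camera on the optical axis
# projects to the principal point (cx, cy):
R = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
t = [0.0, 0.0, 0.0]
u, v = project([0.0, 0.0, 2.0], R, t, fx=900.0, fy=900.0, cx=480.0, cy=360.0)
print(u, v)  # 480.0 360.0
```

The same two ingredients, intrinsics plus per-frame pose, are what the database's calibration and trajectory files supply for each sequence.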
The database and the InteractLabeler (Section 4.2) software shall be available for download from the web.1 A short video provides an overview of the database. It is available as supplemental material to this article, as well as on the database page itself.
High quality video
We drove with a camera mounted inside a car and filmed over two hours of video footage. The CamVid Database presented here is the resulting subset, lasting 22 min, 14 s. A high-definition 3CCD Panasonic HVX200 digital camera was used, capturing 960 × 720 pixel frames at 30 fps (frames per second). Note the pixel aspect ratio on the camera is not square and was kept as such to avoid interpolation and quality degradation.2
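A non-square pixel aspect ratio means the stored frame width differs from the display width. As a minimal sketch, assuming a 4:3 pixel aspect ratio (common for 960 × 720 DVCPRO HD footage, but an assumption here, not a value stated in the paper), the stored frames would display at 1280 × 720:

```python
# Display size of footage stored with non-square pixels: multiply the
# stored width by the pixel aspect ratio. The 4:3 ratio below is an
# assumed example, not taken from the CamVid documentation.
from fractions import Fraction

def display_size(stored_w, stored_h, pixel_aspect):
    return int(stored_w * pixel_aspect), stored_h

w, h = display_size(960, 720, Fraction(4, 3))
print(w, h)  # 1280 720
```

Algorithms that measure distances or shapes in pixel units should account for this anisotropy rather than assume square pixels.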
Semantic classes and labeled data
After surveying the greater set of videos, we identified 32 classes of interest to drivers. The class names and their corresponding pseudo-colors are given in Fig. 4. They include fixed objects, types of road surface, moving objects (including vehicles and people), and ceiling (sky, tunnel, archway). The relatively large number of classes (32) implies that labeled frames provide a rich semantic description of the scene from which spatial relationships and context can be learned.
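Since each class is identified by a pseudo-color, consuming the labeled frames typically starts by decoding RGB values back into integer class IDs. The sketch below shows one way to do this; the three palette entries are illustrative placeholders, and the full 32-class palette is the one given in Fig. 4.

```python
# Decode a pseudo-colored label image into integer class IDs.
# The RGB triples below are illustrative placeholders; the actual
# class/color palette is defined in Fig. 4 of the paper.
PALETTE = {
    (128, 128, 128): 0,  # e.g. "Sky"
    (128, 0, 0):     1,  # e.g. "Building"
    (64, 0, 128):    2,  # e.g. "Car"
}
VOID = 255  # ID for pixels whose color is not in the palette

def decode(label_rgb):
    """label_rgb: 2D grid of (r, g, b) tuples -> 2D grid of class IDs."""
    return [[PALETTE.get(px, VOID) for px in row] for row in label_rgb]

row = [(128, 128, 128), (64, 0, 128), (0, 0, 0)]
print(decode([row]))  # [[0, 2, 255]]
```

Reserving a "void" ID for unmatched colors makes it easy to exclude unlabeled pixels from training and scoring.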
Production of the labeled frames data
For 701 frames extracted from the database sequence, we hired 13 volunteers (the “labelers”) to manually produce the corresponding labeled images. They painted the areas corresponding to a predefined list of 32 object classes of interest, given a specific palette of colors (Fig. 4).
In this section, we give an overview of the website (Section 4.1) and the labeling software (Section 4.2) that were designed for this task. The website allowed us to train volunteers and then exchange original and labeled frames with them.
Applications and results
To evaluate the potential benefits of the CamVid Database, we measured the performance of several existing algorithms. Unlike many databases which were collected for a single application, the CamVid Database was intentionally designed for use in multiple domains. Three performance experiments examine the usefulness of the database for quantitative algorithm testing. The algorithms address, in turn, (i) object recognition, (ii) pedestrian detection, and (iii) segmented label propagation in video.
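With per-pixel ground truth, quantitative testing reduces to comparing a predicted label map against the hand-labeled one. A minimal sketch of a common scoring scheme, global and per-class pixel accuracy over flat lists of class IDs, is shown below; the void-handling convention is an assumption, not the paper's exact protocol.

```python
# Global and per-class pixel accuracy against per-pixel ground truth.
# Pixels with the "void" label in the ground truth are excluded from
# scoring (an assumed convention for illustration).
from collections import Counter

def pixel_accuracy(pred, truth, void=255):
    correct, total = Counter(), Counter()
    for p, t in zip(pred, truth):
        if t == void:
            continue  # unlabeled ground-truth pixels are not scored
        total[t] += 1
        if p == t:
            correct[t] += 1
    per_class = {c: correct[c] / total[c] for c in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_class

truth = [0, 0, 1, 1, 2, 255]
pred  = [0, 1, 1, 1, 2, 0]
overall, per_class = pixel_accuracy(pred, truth)
print(overall)  # 0.8
```

Reporting per-class as well as global accuracy matters here because class frequencies in road scenes are highly unbalanced (road and building pixels dominate).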
Discussion
The long-term goals of object analysis research require that objects, even in motion, are identifiable when observed in the real world. To thoroughly evaluate and improve these object recognition algorithms, this paper proposes the CamVid annotated database. Building this database is a direct response to the formidable challenge of providing video data with detailed semantic segmentation.
The CamVid Database offers four contributions that are relevant to object analysis researchers.
Acknowledgements
This work has been carried out with the support of Toyota Motor Europe. We are grateful to John Winn for help during filming.
References
- Patras, I., et al., 2003. Semi-automatic object-based video segmentation with labeling of color segments. Signal Process.: Image Comm.
- Agarwala, A., et al., 2004. Keyframe-based tracking for rotoscoping and animation. ACM Trans. Graphics.
- Bileschi, S., 2006. CBCL Streetscenes: towards scene understanding in still images. Tech. Rep. MIT-CBCL-TR-2006, …
- Bouguet, J.-Y., 2004. Camera Calibration Toolbox for MATLAB.
- Boujou, 2007. 2d3 Ltd.
- Burt, P.J., et al., 1981. Segmentation and estimation of image region properties through cooperative hierarchical computation. IEEE Syst. Man Cybern. (SMC).
- Comaniciu, D., Meer, P., 1997. Robust analysis of feature spaces: color image segmentation. In: IEEE Conf. Comput. Vision Pattern Recognition (CVPR), Puerto Rico.
- Dalal, N., Triggs, B., 2005. Histograms of oriented gradients for human detection. In: IEEE Conf. Comput. Vision Pattern Recognition (CVPR).
- Dalal, N., Triggs, B., Schmid, C., 2006. Human detection using oriented histograms of flow and appearance. In: Eur. Conf. Comput. Vision (ECCV).
- Deng, Y., Manjunath, B.S., 2001. Unsupervised segmentation of color-texture regions in images and video. IEEE Trans. Pattern Anal. Machine Intell. (PAMI).
- Fei-Fei, L., et al., 2006. One-shot learning of object categories. IEEE Trans. Pattern Anal. Machine Intell. (PAMI).
- Felzenszwalb, P., Huttenlocher, D., 2004. Efficient graph-based image segmentation. Internat. J. Comput. Vision (IJCV).