Pattern Recognition Letters
Volume 30, Issue 2, 15 January 2009, Pages 88-97

Semantic object classes in video: A high-definition ground truth database

https://doi.org/10.1016/j.patrec.2008.04.005

Abstract

Visual object analysis researchers are increasingly experimenting with video, because motion cues are expected to help with detection, recognition, and other analysis tasks. This paper presents the Cambridge-driving Labeled Video Database (CamVid) as the first collection of videos with object class semantic labels, complete with metadata. The database provides ground truth labels that associate each pixel with one of 32 semantic classes.

The database addresses the need for experimental data to quantitatively evaluate emerging algorithms. While most videos are filmed with fixed-position CCTV-style cameras, our data was captured from the perspective of a driving automobile. The driving scenario increases the number and heterogeneity of the observed object classes. Over 10 min of high-quality 30 Hz footage is provided, with corresponding semantically labeled images at 1 Hz and, in part, at 15 Hz.

The CamVid Database offers four contributions that are relevant to object analysis researchers. First, the per-pixel semantic segmentation of over 700 images was specified manually, and was then inspected and confirmed by a second person for accuracy. Second, the high-quality and large resolution color video images in the database represent valuable extended duration digitized footage to those interested in driving scenarios or ego-motion. Third, we filmed calibration sequences for the camera color response and intrinsics, and computed a 3D camera pose for each frame in the sequences. Finally, in support of expanding this or other databases, we present custom-made labeling software for assisting users who wish to paint precise class-labels for other images and videos. We evaluate the relevance of the database by measuring the performance of an algorithm from each of three distinct domains: multi-class object recognition, pedestrian detection, and label propagation.

Introduction

Training and rigorous evaluation of video-based object analysis algorithms require data that is labeled with ground truth. Video labeled with semantic object classes has two important uses. First, it can be used to train new algorithms that leverage motion cues for recognition, detection, and segmentation. Second, such labeled video finally makes it possible to evaluate existing video algorithms quantitatively.

This paper presents the CamVid Database, which is to our knowledge, the only currently available video-based database with per-pixel ground truth for multiple classes. It consists of the original high-definition (HD) video footage and 10 min of frames which volunteers hand-labeled according to a list of 32 object classes. The pixel precision of the object labeling in the frames allows for accurate training and quantitative evaluation of algorithms. The database also includes the camera pose and calibration parameters of the original sequences. Further, we propose the InteractLabeler, an interactive software system to assist users with the manual labeling task. The volunteers’ paint strokes were logged by this software and are also included with the database.

We agree with the authors of Yao et al. (2007) that perhaps in addition to pixel-wise class labels, the semantic regions should be annotated with their shape or structure, or perhaps also organized hierarchically. Our data does not contain such information, but we propose that it may be possible to develop a form of high-level boundary-detection in the future that would convert this and other pixel-wise segmented data into a more useful form.
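As a purely illustrative sketch of that idea (the database itself ships only the pixel-wise labels), class boundaries can be recovered from an integer label map by marking pixels whose class differs from that of a neighbour; the function name and array conventions below are our own assumptions, not part of the database.

    import numpy as np

    def class_boundaries(label_map):
        """Mark pixels where the class label changes between 4-neighbours.

        label_map: 2D integer array of per-pixel class indices.
        Returns a boolean array of the same shape, True on class boundaries.
        """
        boundaries = np.zeros(label_map.shape, dtype=bool)
        # Compare each pixel with its right and bottom neighbours.
        boundaries[:, :-1] |= label_map[:, :-1] != label_map[:, 1:]
        boundaries[:-1, :] |= label_map[:-1, :] != label_map[1:, :]
        return boundaries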

So far, modern databases have featured still images, to emphasize the breadth of object appearance. Object analysis algorithms are gradually maturing to the point where scenes (Lazebnik et al., 2006, Oliva and Torralba, 2001), landmarks (Snavely et al., 2006), and whole object classes (Rabinovich et al., in press) could be recognized in still images for a majority of the test data (Fei-Fei et al., 2006).

We anticipate that the greatest future innovations in object analysis will come from algorithms that take advantage of spatial and temporal context. Spatial context has already proven very valuable, as shown by Hoiem et al. (2006), who took particular advantage of perspective cues. Yuan et al. (2007) showed the significant value of layout context and region-adaptive grids in particular. Yuan et al. experimentally demonstrated improved performance for region annotation of objects from a subset of the Corel Stock Photo CDs, which they had to annotate themselves for lack of existing labels. Their lexicon of 11 concepts overlaps with our 32 classes. Our database is meant to enable similar innovations, but for temporal as well as spatial context. Fig. 1 lists the most relevant photo and video databases used for either recognition or segmentation.

The performance of dedicated detectors for cars (Leibe et al., 2007) and pedestrians (Dalal and Triggs, 2005) is generally quantified using data in which individual entities have been counted, or by measuring the overlap with annotated bounding boxes. A number of excellent still-image databases have become available recently, with varying amounts of annotation. The Microsoft Research Cambridge database (Shotton et al., 2006) is among the most relevant, because it includes per-pixel class labels for every photograph in the set. The LabelMe effort (Russell et al., 2005) has cleverly leveraged the internet and interest in annotated images to gradually grow a database of polygon outlines that approximate object boundaries. The PASCAL Visual Object Classes Challenge provides datasets and also invites authors to submit and compare the results of their respective object classification (and now segmentation) algorithms.
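For reference, the bounding-box overlap used by such detection benchmarks is typically measured as intersection-over-union; the following is a minimal sketch, where the (x_min, y_min, x_max, y_max) box format and the 0.5 acceptance threshold are common conventions rather than requirements of any particular database.

    def iou(box_a, box_b):
        """Intersection-over-union of two boxes given as (x_min, y_min, x_max, y_max)."""
        ax0, ay0, ax1, ay1 = box_a
        bx0, by0, bx1, by1 = box_b
        # Width and height of the intersection rectangle (zero if the boxes are disjoint).
        iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
        ih = max(0.0, min(ay1, by1) - max(ay0, by0))
        inter = iw * ih
        union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
        return inter / union if union > 0 else 0.0

    # A detection is commonly counted as correct when its IoU with a
    # ground truth box exceeds 0.5.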

However, no equivalent initiative exists for video. It is reasonable to expect that the still-frame algorithms would perform similarly on frames sampled from video. However, to test this hypothesis, we were unable to find suitable existing video data with ground truth semantic labeling.

In the context of video-based object analysis, many advanced techniques have been proposed for object segmentation (Marcotegui et al., 1999, Deng and Manjunath, 2001, Patras et al., 2003, Wang et al., 2005, Agarwala et al., 2004). However, the numerical evaluation of these techniques is often missing or limited: the results of video segmentation algorithms are usually illustrated by a few segmentation examples, without quantitative evaluation. Interestingly, for detection in security and criminal events, the PETS Workshop (The PETS, 2007) provides benchmark data consisting of event logs (for three types of events) and bounding boxes. TRECVid (Smeaton et al., 2006) is one of the reigning event analysis datasets, containing shot-boundary information and flags indicating when a given shot "features" sports, weather, studio, outdoor, and similar events.
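With per-pixel ground truth such as ours, that quantitative evaluation becomes straightforward. As a minimal sketch (the metrics reported in our experiments may be defined differently), global and per-class pixel accuracy can be computed as follows; the optional void label for unlabeled pixels is an assumption.

    import numpy as np

    def pixel_accuracy(pred, gt, num_classes, void_label=None):
        """Global and per-class pixel accuracy of a predicted label map.

        pred, gt: 2D integer arrays of class indices with identical shape.
        void_label: optional class index excluded from scoring (e.g. unlabeled pixels).
        """
        valid = np.ones(gt.shape, bool) if void_label is None else gt != void_label
        correct = (pred == gt) & valid
        global_acc = correct.sum() / valid.sum()
        per_class = []
        for c in range(num_classes):
            mask = (gt == c) & valid
            per_class.append(correct[mask].sum() / mask.sum() if mask.any() else float("nan"))
        return global_acc, per_class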

We propose this new ground truth database to allow numerical evaluation of various recognition, detection, and segmentation techniques. The proposed CamVid Database consists of the following elements:

  • the original video sequences (Section 2.1);

  • the intrinsic calibration (Section 2.2);

  • the camera pose trajectories (Section 2.3; a projection sketch using these poses follows this list);

  • the list of class labels and pseudo-colors (Section 3);

  • the hand labeled frames (Section 3.1);

  • the stroke logs for the hand labeled frames (Section 3.2).
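As a hedged illustration of how the intrinsic calibration and per-frame camera poses listed above might be used together, the sketch below projects 3D world points into a frame with a pinhole model; the world-to-camera pose convention and matrix shapes are our own assumptions, not the database's file format.

    import numpy as np

    def project_points(points_world, K, R, t):
        """Project 3D world points into pixel coordinates with a pinhole model.

        points_world: (N, 3) array of world coordinates.
        K: (3, 3) intrinsic matrix (e.g. from the calibration sequences).
        R, t: world-to-camera rotation (3, 3) and translation (3,) for one frame.
        """
        cam = points_world @ R.T + t      # world -> camera coordinates
        pix = cam @ K.T                   # camera -> homogeneous pixel coordinates
        return pix[:, :2] / pix[:, 2:3]   # perspective divide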

The database and the InteractLabeler software (Section 4.2) will be available for download from the web. A short video provides an overview of the database; it is available as supplemental material to this article, as well as on the database page itself.

Section snippets

High quality video

We drove with a camera mounted inside a car and filmed over two hours of video footage. The CamVid Database presented here is the resulting subset, lasting 22 min, 14 s. A high-definition 3CCD Panasonic HVX200 digital camera was used, capturing 960 × 720 pixel frames at 30 fps (frames per second). Note that the pixel aspect ratio on the camera is not square and was kept as such to avoid interpolation and quality degradation.
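For readers who need square pixels (e.g. for display, or for algorithms that assume them), a frame can be rescaled horizontally after the fact; in the sketch below the target width of 1280 is an assumption about the camera's anamorphic HD mode, and working directly on the stored 960 × 720 grid is equally valid.

    import cv2

    def to_square_pixels(frame, display_width=1280):
        """Rescale a stored 960x720 frame to square pixels for display.

        display_width = 1280 is an assumed 16:9 display size for the camera's
        anamorphic mode; the stored frames themselves are left untouched.
        """
        height = frame.shape[0]
        return cv2.resize(frame, (display_width, height), interpolation=cv2.INTER_CUBIC)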

Semantic classes and labeled data

After surveying the greater set of videos, we identified 32 classes of interest to drivers. The class names and their corresponding pseudo-colors are given in Fig. 4. They include fixed objects, types of road surface, moving objects (including vehicles and people), and ceiling (sky, tunnel, archway). The relatively large number of classes (32) implies that labeled frames provide a rich semantic description of the scene from which spatial relationships and context can be learned.
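Since each class is encoded in the labeled frames as a pseudo-color (Fig. 4), a typical first step for training or evaluation is to convert the RGB label images into integer class maps. In the sketch below the two example colors are placeholders only; the full 32-entry palette should be taken from Fig. 4.

    import numpy as np

    # Placeholder palette: the actual pseudo-colors of all 32 classes are given in Fig. 4.
    PALETTE = {
        (64, 128, 64): 0,   # hypothetical color for class 0
        (128, 0, 0): 1,     # hypothetical color for class 1
        # ... remaining classes ...
    }

    def pseudocolor_to_classes(label_rgb):
        """Map an (H, W, 3) pseudo-color label image to an (H, W) class-index map."""
        classes = np.full(label_rgb.shape[:2], -1, dtype=np.int32)  # -1 = unknown color
        for color, idx in PALETTE.items():
            classes[np.all(label_rgb == np.array(color), axis=-1)] = idx
        return classes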


Production of the labeled frames data

For 701 frames extracted from the database sequence, we hired 13 volunteers (the “labelers”) to manually produce the corresponding labeled images. They painted the areas corresponding to a predefined list of 32 object classes of interest, given a specific palette of colors (Fig. 4).

In this section, we give an overview of the website (Section 4.1) and the labeling software (Section 4.2) that were designed for this task. The website has allowed us to train volunteers and then exchange original and labeled images with them.

Applications and results

To evaluate the potential benefits of the CamVid Database, we measured the performance of several existing algorithms. Unlike many databases which were collected for a single application, the CamVid Database was intentionally designed for use in multiple domains. Three performance experiments examine the usefulness of the database for quantitative algorithm testing. The algorithms address, in turn, (i) object recognition, (ii) pedestrian detection, and (iii) segmented label propagation in video.

Discussion

The long term goals of object analysis research require that objects, even in motion, are identifiable when observed in the real world. To thoroughly evaluate and improve these object recognition algorithms, this paper proposes the CamVid annotated database. Building of this database is a direct response to the formidable challenge of providing video data with detailed semantic segmentation.

The CamVid Database offers four contributions that are relevant to object analysis researchers, as summarized in the abstract.

Acknowledgements

This work has been carried out with the support of Toyota Motor Europe. We are grateful to John Winn for help during filming.

References (36)

  • Patras, I., et al., 2003. Semi-automatic object-based video segmentation with labeling of color segments. Signal Process.: Image Comm.
  • Agarwala, A., et al., 2004. Keyframe-based tracking for rotoscoping and animation. ACM Trans. Graphics.
  • Bileschi, S., 2006. CBCL Streetscenes: towards scene understanding in still images, Tech. Rep. MIT-CBCL-TR-2006,...
  • Bouguet, J.-Y., 2004. Camera Calibration Toolbox for MATLAB....
  • Boujou, 2007. 2d3 Ltd....
  • Burt, P., et al., 1981. Segmentation and estimation of image region properties through cooperative hierarchical computation. IEEE Syst. Man Cybern. (SMC).
  • Comaniciu, D., et al., 1997. Robust analysis of feature spaces: Color image segmentation. IEEE Conf. Comput. Vision Pattern Recognition (CVPR), Puerto Rico.
  • Dalal, N., Triggs, B., 2005. Histograms of oriented gradients for human detection. In: IEEE Comput. Vision Pattern...
  • Dalal, N., Triggs, B., Schmid, C., 2006. Human detection using oriented histograms of flow and appearance. In: Eur....
  • Deng, Y., et al., 2001. Unsupervised segmentation of color-texture regions in images and video. IEEE Trans. Pattern Anal. Machine Intell. (PAMI).
  • Efros, A.A., Berg, A.C., Mori, G., Malik, J., 2003. Recognizing action at a distance. In: IEEE Internat. Conf. Comput....
  • Facebook homepage, 2007....
  • Fauqueur, J., Brostow, G., Cipolla, R., 2007. Assisted video object labeling by joint tracking of regions and...
  • Fei-Fei, L., et al., 2006. One-shot learning of object categories. IEEE Trans. Pattern Anal. Machine Intell. (PAMI).
  • Felzenszwalb, P., et al., 2004. Efficient graph-based image segmentation. Internat. J. Comput. Vision (IJCV).
  • Griffin, G., Holub, A., Perona, P., 2007. Caltech-256 object category dataset, Tech. Rep. 7694, California Institute of...
  • Hoiem, D., Efros, A.A., Hebert, M., 2006. Putting objects in perspective. In: Proc. IEEE Comput. Vision Pattern...
  • Lazebnik, S., Schmid, C., Ponce, J., 2006. Beyond bags of features: spatial pyramid matching for recognizing natural...