Evaluation of video activity localizations integrating quality and quantity measurements

https://doi.org/10.1016/j.cviu.2014.06.014

Highlights

  • A new evaluation procedure for action localization is proposed.

  • We introduce performance graphs showing quantity as a function of quality.

  • A single performance measure integrates out quality constraints.

  • Soft upper bounds are estimated from experimental data.

  • The algorithms entered in the ICPR 2012 HARL competition are evaluated.

Abstract

Evaluating the performance of computer vision algorithms is classically done by reporting classification error or accuracy, when the problem at hand is the classification of an object in an image, the recognition of an activity in a video, or the categorization and labeling of the image or video. If, in addition, the detection of an item in an image or a video and/or its localization are required, frequently used metrics are Recall and Precision, as well as ROC curves. These metrics give quantitative performance values which are easy to understand and to interpret, even by non-experts. However, an inherent problem is the dependency of quantitative performance measures on the quality constraints that we need to impose on the detection algorithm. In particular, an important quality parameter of these measures is the spatial or spatio-temporal overlap between a ground-truth item and a detected item, and this needs to be taken into account when interpreting the results.

We propose a new performance metric addressing and unifying the qualitative and quantitative aspects of performance measures. The performance of a detection and recognition algorithm is illustrated intuitively by performance graphs which present quantitative performance values, such as Recall, Precision and F-Score, as functions of the quality constraints imposed on the detection. In order to compare the performance of different computer vision algorithms, a representative single performance measure is computed from the graphs by integrating out all quality parameters. The evaluation method can be applied to different types of activity detection and recognition algorithms. The performance metric has been tested on several activity recognition algorithms participating in the ICPR 2012 HARL competition.
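
To make this threshold dependency concrete, the following minimal Python sketch (an illustration of the general idea only, not the protocol defined in the paper; the greedy matcher and the 1-D interval overlap are simplifying assumptions) shows how the same detections receive different Recall and Precision values under different overlap constraints:

# Illustration only: quantitative scores depend on the quality constraint.
def interval_iou(a, b):
    # Intersection-over-union of two 1-D intervals given as (start, end).
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_precision(ground_truth, detections, t):
    # Greedy one-to-one matching: a detection counts as correct only if
    # its overlap with an unmatched ground-truth item reaches t.
    matched = set()
    for det in detections:
        for i, gt_item in enumerate(ground_truth):
            if i not in matched and interval_iou(det, gt_item) >= t:
                matched.add(i)
                break
    tp = len(matched)
    recall = tp / len(ground_truth) if ground_truth else 1.0
    precision = tp / len(detections) if detections else 1.0
    return recall, precision

gt = [(10, 50), (70, 120)]    # ground-truth action intervals (frames)
dets = [(12, 48), (90, 140)]  # hypothetical detections
for t in (0.1, 0.5, 0.9):
    print(t, recall_precision(gt, dets, t))

Loosening the constraint to 0.1 accepts both detections; tightening it to 0.5 rejects the second one and halves both scores. This is exactly the dependency that the proposed performance graphs make explicit.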

Section snippets

Introduction and related work

Applications such as video surveillance, robotics, source selection, and video indexing often require the recognition of actions and activities based on the motion of different actors in a video, for instance people or vehicles. Certain applications may require assigning activities to one of a set of predefined classes, while others may focus on the detection of abnormal or infrequent activities. This task is inherently more difficult than more traditional tasks like object recognition in

The performance metric

We propose a new performance metric for algorithms that detect and recognize complex activities in realistic environments. The goals of these algorithms are:

  • To detect relevant human behavior in the midst of motion clutter originating from unrelated background activity, e.g., other people walking past the scene or other irrelevant actions.

  • To recognize detected actions among the given action classes.

  • To localize actions temporally and spatially.

  • To be able to manage multiple actions in the scene (a simplified evaluation sketch follows this list).
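
Continuing the toy example from the abstract (a minimal sketch under the same simplifying assumptions; the actual metric uses separate spatial and temporal quality constraints and soft upper bounds, which are omitted here), a quantity/quality curve can be traced by sweeping the quality threshold, and a single comparable score obtained by integrating out the threshold:

def f_score(recall, precision):
    # Harmonic mean of Recall and Precision.
    denom = recall + precision
    return 2.0 * recall * precision / denom if denom > 0 else 0.0

def integrated_performance(ground_truth, detections, steps=10):
    # Quantity/quality curve: F-Score as a function of the quality threshold.
    thresholds = [i / steps for i in range(1, steps + 1)]
    curve = [f_score(*recall_precision(ground_truth, detections, t))
             for t in thresholds]
    # Averaging the curve integrates out the quality constraint,
    # yielding one scalar for ranking algorithms.
    return sum(curve) / len(curve), list(zip(thresholds, curve))

score, curve = integrated_performance(gt, dets)  # reuses gt, dets from above
print("integrated performance:", round(score, 3))

A high integrated score then requires detections that remain correct under increasingly strict localization constraints, not merely detections that pass one fixed threshold.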

The LIRIS/ICPR 2012 HARL dataset

The LIRIS human activities dataset has been designed for recognizing complex and realistic actions in a set of videos, where each video may contain one or more actions occurring concurrently. Table 1 shows the list of actions to be recognized. Some of them are interactions between two or more humans, such as a discussion or giving an item to another person. Other actions are characterized as interactions between humans and objects, for instance talking on a telephone, leaving baggage unattended, etc. Note that simple

Results of the ICPR 2012 HARL competition

The proposed performance metric was tested on six different detection and recognition algorithms. Four methods are submissions to the ICPR 2012 HARL competition, which was held in conjunction with the 2012 International Conference on Pattern Recognition. Two additional methods have been applied to the same dataset.

The HARL competition ran for roughly 12 months, from October 2011 to October 2012. The video frames of the competition dataset (described in Section 3) were

Conclusion

This paper has introduced a new performance metric which makes it possible to evaluate human activity detection, recognition and localization algorithms. Taking localization information into account is a non-trivial task, as the evaluation needs to decide, for each activity, whether it has been successfully detected under given detection quality constraints. The inherent dependency between performance and quality has been identified and a set of quantity/quality curves has been introduced to describe the detection
