Evaluation of video activity localizations integrating quality and quantity measurements

https://doi.org/10.1016/j.cviu.2014.06.014

Highlights

  • A new evaluation procedure for action localization is proposed.

  • We introduce performance graphs showing quantity as a function of quality.

  • A single performance measure integrates out quality constraints.

  • Soft upper bounds are estimated from experimental data.

  • The algorithms entered in the ICPR 2012 HARL competition are evaluated.

Abstract

Evaluating the performance of computer vision algorithms is classically done by reporting classification error or accuracy, when the problem at hand is the classification of an object in an image, the recognition of an activity in a video, or the categorization and labeling of the image or video. If, in addition, the detection of an item in an image or a video and/or its localization are required, frequently used metrics are Recall and Precision, as well as ROC curves. These metrics give quantitative performance values which are easy to understand and to interpret, even by non-experts. However, an inherent problem is the dependency of quantitative performance measures on the quality constraints that we need to impose on the detection algorithm. In particular, an important quality parameter of these measures is the spatial or spatio-temporal overlap between a ground-truth item and a detected item, and this needs to be taken into account when interpreting the results.

We propose a new performance metric addressing and unifying the qualitative and quantitative aspects of performance measures. The performance of a detection and recognition algorithm is illustrated intuitively by performance graphs which present quantitative performance values, such as Recall, Precision and F-Score, as functions of the quality constraints imposed on the detection. In order to compare the performance of different computer vision algorithms, a representative single performance measure is computed from the graphs by integrating out all quality parameters. The evaluation method can be applied to different types of activity detection and recognition algorithms. The performance metric has been tested on several activity recognition algorithms participating in the ICPR 2012 HARL competition.
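
To make this threshold dependency concrete, the following minimal Python sketch (an illustration of the general idea only, not the protocol defined in the paper; the greedy matcher and the 1-D interval overlap are simplifying assumptions) shows how the same detections receive different Recall and Precision values under different overlap constraints:

# Illustration only: quantitative scores depend on the quality constraint.
def interval_iou(a, b):
    # Intersection-over-union of two 1-D intervals given as (start, end).
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_precision(ground_truth, detections, t):
    # Greedy one-to-one matching: a detection counts as correct only if
    # its overlap with an unmatched ground-truth item reaches t.
    matched = set()
    for det in detections:
        for i, gt_item in enumerate(ground_truth):
            if i not in matched and interval_iou(det, gt_item) >= t:
                matched.add(i)
                break
    tp = len(matched)
    recall = tp / len(ground_truth) if ground_truth else 1.0
    precision = tp / len(detections) if detections else 1.0
    return recall, precision

gt = [(10, 50), (70, 120)]    # ground-truth action intervals (frames)
dets = [(12, 48), (90, 140)]  # hypothetical detections
for t in (0.1, 0.5, 0.9):
    print(t, recall_precision(gt, dets, t))

Loosening the constraint to 0.1 accepts both detections; tightening it to 0.5 rejects the second one and halves both scores. This is exactly the dependency that the proposed performance graphs make explicit.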

Section snippets

Introduction and related work

Applications such as video surveillance, robotics, source selection, and video indexing often require the recognition of actions and activities based on the motion of different actors in a video, for instance people or vehicles. Certain applications may require assigning activities to one of a set of predefined classes, while others may focus on the detection of abnormal or infrequent activities. This task is inherently more difficult than more traditional tasks like object recognition in

The performance metric

We propose a new performance metric for algorithms that detect and recognize complex activities in realistic environments. The goals of these algorithms are:

  • To detect relevant human behavior in the midst of motion clutter originating from unrelated background activity, e.g., other people walking past the scene or other irrelevant actions.

  • To recognize detected actions among the given action classes.

  • To localize actions temporally and spatially.

  • To be able to manage multiple actions in the scene (a simplified evaluation sketch follows this list).
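
Continuing the toy example from the abstract (a minimal sketch under the same simplifying assumptions; the actual metric uses separate spatial and temporal quality constraints and soft upper bounds, which are omitted here), a quantity/quality curve can be traced by sweeping the quality threshold, and a single comparable score obtained by integrating out the threshold:

def f_score(recall, precision):
    # Harmonic mean of Recall and Precision.
    denom = recall + precision
    return 2.0 * recall * precision / denom if denom > 0 else 0.0

def integrated_performance(ground_truth, detections, steps=10):
    # Quantity/quality curve: F-Score as a function of the quality threshold.
    thresholds = [i / steps for i in range(1, steps + 1)]
    curve = [f_score(*recall_precision(ground_truth, detections, t))
             for t in thresholds]
    # Averaging the curve integrates out the quality constraint,
    # yielding one scalar for ranking algorithms.
    return sum(curve) / len(curve), list(zip(thresholds, curve))

score, curve = integrated_performance(gt, dets)  # reuses gt, dets from above
print("integrated performance:", round(score, 3))

A high integrated score then requires detections that remain correct under increasingly strict localization constraints, not merely detections that pass one fixed threshold.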

The LIRIS/ICPR 2012 HARL dataset

The LIRIS human activities dataset has been designed for recognizing complex and realistic actions in a set of videos, where each video may contain one or more actions occurring concurrently. Table 1 shows the list of actions to be recognized. Some of them are interactions between two or more humans, such as a discussion or giving an item to another person. Other actions are characterized as interactions between humans and objects, for instance talking on a telephone, leaving baggage unattended, etc. Note that simple

Results of the ICPR 2012 HARL competition

The proposed performance metric was tested on six different detection and recognition algorithms. Four methods are submissions to the ICPR 2012 HARL competition, which was held in conjunction with the 2012 International Conference on Pattern Recognition. Two additional methods have been applied to the same dataset.

The HARL competition ran for roughly 12 months, from October 2011 to October 2012. The video frames of the competition dataset (described in Section 3) were

Conclusion

This paper has introduced a new performance metric which makes it possible to evaluate human activity detection, recognition and localization algorithms. Taking localization information into account is a non-trivial task, as the evaluation needs to decide, for each activity, whether it has been successfully detected under given detection quality constraints. The inherent dependency between performance and quality has been identified and a set of quantity/quality curves has been introduced to describe the detection
