ABSTRACT
Event detection in real surveillance videos with complicated background environments is a very hard task. Unlike traditional retrospective and interactive systems for this task, which operate mainly on video fragments located within the event-occurrence time, in this paper we propose a new interactive system built on mid-level discriminative representations (patches/shots) that are closely related to the event (and may occur outside the event-occurrence period) and are easier to detect than video fragments. By virtue of such easily distinguished mid-level patterns, our framework realizes an effective division of labor between computers and human participants. The computer's task is to train classifiers on a set of mid-level discriminative representations, and to sort all candidate mid-level representations in the evaluation set by classifier score. The human participant's task is then to search for events using the clues offered by these sorted mid-level representations. For the computer, such mid-level representations, with their more concise and consistent patterns, can be detected more accurately than the video fragments used in the conventional framework; for the human participant, the events of interest implied by these location-anchored mid-level representations are much easier to find than in conventional video fragments containing entire scenes. Both properties make our framework practical for real surveillance event detection applications.
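The computer-side pipeline described above — score every candidate mid-level representation with a trained classifier and present them to the reviewer in decreasing order of score — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the linear discriminant fit from class means is a hypothetical stand-in for whatever classifier is trained on the mid-level patch descriptors, and the toy 2-D descriptors replace real patch features.

```python
import numpy as np

def rank_candidates(train_pos, train_neg, candidates):
    """Rank candidate mid-level patches by a linear classifier score.

    A simplified stand-in for the paper's pipeline: fit a linear
    discriminant (here, the difference of class means, a placeholder
    for a trained classifier) on positive/negative patch descriptors,
    then sort evaluation candidates by decreasing score so a human
    reviewer can inspect the most event-like patches first.
    """
    mu_pos = train_pos.mean(axis=0)
    mu_neg = train_neg.mean(axis=0)
    w = mu_pos - mu_neg                      # direction separating the classes
    b = -0.5 * w @ (mu_pos + mu_neg)         # threshold at the midpoint
    scores = candidates @ w + b
    order = np.argsort(-scores)              # highest-scoring candidate first
    return order, scores[order]

# toy 2-D descriptors: positives cluster near (1, 1), negatives near (-1, -1)
pos = np.array([[1.0, 1.2], [0.9, 1.0], [1.1, 0.8]])
neg = np.array([[-1.0, -1.1], [-0.9, -1.0], [-1.2, -0.8]])
cands = np.array([[1.0, 1.0], [0.0, 0.0], [-1.0, -1.0]])

order, sorted_scores = rank_candidates(pos, neg, cands)
print(order)  # → [0 1 2]: the candidate nearest the positive cluster ranks first
```

The human participant would then inspect candidates in `order`, stopping once the events of interest have been located — the point of the division of labor is that high-ranked, location-anchored patches are far quicker to verify than full-scene video fragments.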
Index Terms
- Interactive Surveillance Event Detection through Mid-level Discriminative Representation