A bag-of-regions representation for video classification

Choi, Min-Kook; Wang, Ziyu; Lee, Hyun-Gyu; Lee, Sang-Chul

doi:10.1007/s11042-015-2876-y

Published: 17 September 2015

Volume 75, pages 2453–2472, (2016)

Multimedia Tools and Applications

Abstract

A bag-of-regions (BoR) representation of a video sequence is a spatio-temporal tessellation for use in high-level applications such as video classification and action recognition. We obtain a BoR representation of a video sequence by extracting regions that exist in the majority of its frames and largely correspond to a single object. First, the significant regions are obtained using unsupervised frame segmentation based on the JSEG method. A tracking algorithm for splitting and merging the regions is then used to generate a relational graph of all regions in the segmented sequence. Finally, we perform a connectivity analysis on this graph to select the most significant regions, which are then used to create a high-level representation of the video sequence. We evaluated our representation using an SVM classifier for video classification and achieved about 85% average precision on the UCF50 dataset.
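
The connectivity-analysis step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the graph construction or the selection criterion, so the sketch assumes that regions are nodes labelled (frame, region_id), that the tracker emits an edge whenever a region in one frame is matched to a region in the next (a split or merge simply yields several edges), and that a group of connected regions counts as significant when it covers more than half of the frames. The function name persistent_region_groups and the parameter min_fraction are hypothetical.

```python
from collections import defaultdict, deque


def persistent_region_groups(edges, num_frames, min_fraction=0.5):
    """Return connected groups of regions that span most of the sequence.

    edges: pairs of nodes, each node a (frame_index, region_id) tuple,
           assumed to come from a region tracker (hypothetical input).
    num_frames: total number of frames in the sequence.
    min_fraction: assumed threshold; a group is kept only if it covers
                  strictly more than this fraction of the frames.
    """
    # Build an undirected adjacency map of the relational graph.
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    kept = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Breadth-first search collects one connected component.
        component = []
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        # Keep the component only if it appears in a majority of frames.
        frames_covered = {frame for frame, _ in component}
        if len(frames_covered) > min_fraction * num_frames:
            kept.append(sorted(component))
    return kept


if __name__ == "__main__":
    # Toy sequence of 4 frames: region "a" splits in frame 2 and merges
    # again in frame 3; region "b" is only visible in frames 0 and 1.
    toy_edges = [
        ((0, "a"), (1, "a")),
        ((1, "a"), (2, "a1")),
        ((1, "a"), (2, "a2")),
        ((2, "a1"), (3, "a")),
        ((2, "a2"), (3, "a")),
        ((0, "b"), (1, "b")),
    ]
    for group in persistent_region_groups(toy_edges, num_frames=4):
        print(group)
```

In the toy input, the group containing region "a" covers all four frames and is kept, while region "b" covers only two of four frames and is discarded under the assumed majority threshold. A real pipeline would also attach appearance descriptors to each kept group before classification; that part is outside this sketch.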

Acknowledgments

This work was supported by (1) the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Science, ICT & Future Planning (2012M3C4A7032781); (2) the ICT R&D program of MSIP/IITP [2014(10047078), 3D reconstruction technology development for scene of car accident using multi view black box image]; and (3) an INHA UNIVERSITY Research Grant.

Author information

Authors and Affiliations

Department of Computer and Information Engineering, Inha University, 100 Inha-ro, Yonghyun-dong, Nam-gu, Incheon, Republic of Korea
Min-Kook Choi, Ziyu Wang, Hyun-Gyu Lee & Sang-Chul Lee

Corresponding author

Correspondence to Sang-Chul Lee.

About this article

Cite this article

Choi, MK., Wang, Z., Lee, HG. et al. A bag-of-regions representation for video classification. Multimed Tools Appl 75, 2453–2472 (2016). https://doi.org/10.1007/s11042-015-2876-y

Received: 25 November 2014
Revised: 22 April 2015
Accepted: 10 August 2015
Published: 17 September 2015
Issue Date: March 2016
DOI: https://doi.org/10.1007/s11042-015-2876-y
