Abstract
A bag-of-regions (BoR) representation of a video sequence is a spatio-temporal tessellation for use in high-level applications such as video classification and action recognition. We obtain a BoR representation of a video sequence by extracting regions that exist in the majority of its frames and largely correspond to a single object. First, the significant regions are obtained using unsupervised frame segmentation based on the JSEG method. A tracking algorithm for splitting and merging the regions is then used to generate a relational graph of all regions in the segmented sequence. Finally, we perform a connectivity analysis on this graph to select the most significant regions, which are then used to create a high-level representation of the video sequence. We evaluated our representation using an SVM classifier for video classification and achieved about 85% average precision on the UCF50 dataset.
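The connectivity-analysis step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the relational graph is given as a list of edges between nodes of the form `(frame_index, region_id)`, treats each connected component as one tracked region (so splits and merges stay in the same component), and keeps components that appear in more than half of the frames. The function name `persistent_components`, the `min_fraction` parameter, and the toy data are all hypothetical.

```python
from collections import defaultdict


def persistent_components(edges, num_frames, min_fraction=0.5):
    """Return connected components of the region graph that cover more
    than `min_fraction` of the frames (hypothetical selection rule).

    `edges` is an iterable of pairs of nodes; each node is a tuple
    (frame_index, region_id) produced by some upstream tracker.
    """
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    # Merge the two endpoints of every tracking link.
    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Collect the nodes of each component under its root.
    components = defaultdict(set)
    for node in list(parent):
        components[find(node)].add(node)

    # Keep components whose regions appear in the majority of frames.
    kept = []
    for nodes in components.values():
        frames_covered = {frame for frame, _ in nodes}
        if len(frames_covered) > min_fraction * num_frames:
            kept.append(sorted(nodes))
    return sorted(kept)


# Toy sequence of 4 frames: region "a" splits in frame 2 and merges again
# in frame 3, while region "b" disappears after frame 1.
edges = [
    ((0, "a"), (1, "a")),
    ((1, "a"), (2, "a1")),
    ((1, "a"), (2, "a2")),
    ((2, "a1"), (3, "a")),
    ((2, "a2"), (3, "a")),
    ((0, "b"), (1, "b")),
]
kept = persistent_components(edges, num_frames=4)
```

In this toy input only the component built from region "a" survives, because it covers all four frames, whereas "b" covers exactly half and is dropped. A real pipeline would need a selection criterion that matches the paper's definition of significance, which the abstract does not specify.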
Acknowledgments
This work was supported by (1) the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (2012M3C4A7032781), (2) the ICT R&D program of MSIP/IITP [2014(10047078), 3D reconstruction technology development for scene of car accident using multi view black box image], and (3) an INHA UNIVERSITY Research Grant.
Cite this article
Choi, MK., Wang, Z., Lee, HG. et al. A bag-of-regions representation for video classification. Multimed Tools Appl 75, 2453–2472 (2016). https://doi.org/10.1007/s11042-015-2876-y
DOI: https://doi.org/10.1007/s11042-015-2876-y