A bag-of-regions representation for video classification

Multimedia Tools and Applications

Abstract

A bag-of-regions (BoR) representation of a video sequence is a spatio-temporal tessellation for use in high-level applications such as video classification and action recognition. We obtain a BoR representation of a video sequence by extracting regions that exist in the majority of its frames and largely correspond to a single object. First, the significant regions are obtained using unsupervised frame segmentation based on the JSEG method. A tracking algorithm that accounts for the splitting and merging of regions is then used to generate a relational graph of all regions in the segmented sequence. Finally, we perform a connectivity analysis on this graph to select the most significant regions, which are then used to create a high-level representation of the video sequence. We evaluated our representation using an SVM classifier for video classification and achieved about 85% average precision on the UCF50 dataset.
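The connectivity analysis described above can be illustrated with a minimal sketch. The abstract does not specify the graph construction or the selection criterion, so everything here is an assumption made for illustration: nodes are hypothetical `(frame_index, region_label)` pairs, edges are hypothetical tracking links between regions in adjacent frames, connected components are found with a simple union-find, and a component counts as significant when it covers more than a chosen fraction (`min_fraction`, a hypothetical parameter) of the frames.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Group region nodes into connected components using union-find."""
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent links to the root, shortening the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return list(groups.values())


def significant_tracks(nodes, edges, num_frames, min_fraction=0.5):
    """Keep components whose regions cover more than min_fraction of the frames.

    The threshold is an assumed stand-in for "exists in the majority of frames".
    """
    kept = []
    for component in connected_components(nodes, edges):
        frames_covered = {frame for frame, _ in component}
        if len(frames_covered) > min_fraction * num_frames:
            kept.append(sorted(component))
    return sorted(kept)


# Toy sequence of 4 frames; nodes are (frame_index, region_label).
nodes = [(0, "a"), (1, "a"), (2, "a"), (3, "a"), (0, "b"), (1, "b"), (2, "c")]
edges = [
    ((0, "a"), (1, "a")),
    ((1, "a"), (2, "a")),
    ((2, "a"), (3, "a")),
    ((0, "b"), (1, "b")),
]
tracks = significant_tracks(nodes, edges, num_frames=4)
print(tracks)
```

In this toy input, only the track labelled "a" spans more than half of the four frames, so it is the only one retained. Splits and merges would appear as nodes with more than one link into the next or previous frame, which the component grouping handles without special cases.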

Acknowledgments

This work was supported by (1) the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Science, ICT & Future Planning (2012M3C4A7032781), (2) the ICT R&D program of MSIP/IITP [2014(10047078), 3D reconstruction technology development for scene of car accident using multi view black box image], and (3) an INHA UNIVERSITY Research Grant.

Author information

Corresponding author

Correspondence to Sang-Chul Lee.

About this article

Cite this article

Choi, MK., Wang, Z., Lee, HG. et al. A bag-of-regions representation for video classification. Multimed Tools Appl 75, 2453–2472 (2016). https://doi.org/10.1007/s11042-015-2876-y

  • DOI: https://doi.org/10.1007/s11042-015-2876-y
