Abstract
Understanding human activity remains challenging even with recently developed 3D/depth sensors. To address this problem, this work investigates a novel deep structured model that adaptively decomposes an activity instance into temporal parts using convolutional neural networks. Our model advances traditional deep learning approaches in two aspects. First, we incorporate latent temporal structure into the deep model, accounting for the large temporal variations of diverse human activities. In particular, we utilize latent variables to decompose the input activity into a number of temporally segmented sub-activities, and accordingly feed them into the parts (i.e. sub-networks) of the deep architecture. Second, we incorporate a radius–margin bound as a regularization term into our deep model, which effectively improves generalization performance for classification. For model training, we propose a principled learning algorithm that iteratively (i) discovers the optimal latent variables (i.e. the ways of activity decomposition) for all training instances, (ii) updates the classifiers based on the generated features, and (iii) updates the parameters of the multi-layer neural networks. In the experiments, our approach is validated on several complex scenarios for human activity recognition and demonstrates superior performance over other state-of-the-art approaches.
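To make the latent-variable step concrete, here is a minimal sketch of step (i) of the learning algorithm: searching over contiguous temporal decompositions of an activity into sub-activities and keeping the highest-scoring one. All names (`segmentations`, `part_score`, `infer_latent`) and the linear part scorers are our illustrative assumptions, not the paper's implementation; in the actual model the part features come from CNN sub-networks and the classifiers carry the radius–margin regularizer.

```python
# Illustrative sketch only: real features come from CNN sub-networks,
# and classifiers are max-margin with a radius-margin bound.
import itertools

def segmentations(n_frames, num_parts):
    """Enumerate all ways to cut n_frames into num_parts contiguous sub-activities."""
    for cuts in itertools.combinations(range(1, n_frames), num_parts - 1):
        bounds = (0,) + cuts + (n_frames,)
        yield [list(range(bounds[i], bounds[i + 1])) for i in range(num_parts)]

def part_score(weights, frames, part):
    """Score one temporal part: dot product of the part-average feature with weights."""
    avg = [sum(frames[t][d] for t in part) / len(part) for d in range(len(weights))]
    return sum(w * a for w, a in zip(weights, avg))

def infer_latent(weights_per_part, frames):
    """Step (i): pick the temporal decomposition maximizing the total part score."""
    return max(
        segmentations(len(frames), len(weights_per_part)),
        key=lambda segs: sum(part_score(w, frames, p)
                             for w, p in zip(weights_per_part, segs)),
    )
```

Steps (ii) and (iii) would then re-fit the classifiers and update the network parameters with the inferred decomposition held fixed, and the loop repeats until convergence.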
Notes
We implement the 3D-CNN model of Ji et al. (2013). For fair comparison, parameter pre-training and dropout have also been employed in our implementation, and the configuration of 3D-CNN is the same as that of our model except that we set \(M = 1\) for 3D-CNN.
References
Amer, M. R., & Todorovic, S. (2012). Sum-product networks for modeling activities with stochastic structure. In CVPR, pp 1314–1321
Bayer, J., Osendorfer, C., Korhammer, D., Chen, N., Urban, S., & van der Smagt, P. (2014). On fast dropout and its applicability to recurrent networks. In ICLR
Brendel, W., & Todorovic, S. (2011). Learning spatiotemporal graphs of human activities. In ICCV, pp 778–785
Chapelle, O., Vapnik, V., Bousquet, O., & Mukherjee, S. (2002). Choosing multiple parameters for support vector machines. Machine Learning, 46(1–3), 131–159.
Chaquet, J. M., Carmona, E. J., & Fernandez-Caballero, A. (2013). A survey of video datasets for human action and activity recognition. Computer Vision and Image Understanding, 117(6), 633–659.
Cheng, Z., Qin, L., Huang, Q., Jiang, S., Yan, S., & Tian, Q. (2011). Human group activity analysis with fusion of motion and appearance information. In ACM Multimedia, pp 1401–1404
Chung, K. M., Kao, W. C., Sun, C. L., Wang, L. L., & Lin, C. J. (2003). Radius margin bounds for support vector machines with the RBF kernel. Neural Computation, 15(11), 2643–2681.
Do, H., & Kalousis, A. (2013). Convex formulations of radius-margin based support vector machines. In ICML
Do, H., Kalousis, A., & Hilario, M. (2009). Feature weighting using margin and radius based error bound optimization in SVMs. In Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science (Vol. 5781, pp. 315–329). Berlin Heidelberg: Springer.
Donahue, J., Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., & Darrell, T. (2015). Long-term recurrent convolutional networks for visual recognition and description. In CVPR
Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1627–1645.
Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Gupta, R., Chia, A. Y., Rajan, D., Ng, E. S., & Lung, E. H. (2013). Human activities recognition using depth images. In ACM Multimedia, pp 283–292
Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507.
Huang, F. J., & LeCun, Y. (2006). Large-scale learning with SVM and convolutional nets for generic object categorization. In CVPR, pp 284–291
Ji, S., Xu, W., Yang, M., & Yu, K. (2013). 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1), 221–231.
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014). Large-scale video classification with convolutional neural networks. In CVPR
Koppula, H. S., Gupta, R., & Saxena, A. (2013). Learning human activities and object affordances from RGB-D videos. International Journal of Robotics Research (IJRR), 32(8), 951–970.
Koppula, H. S., & Saxena, A. (2013). Learning spatio-temporal structure from RGB-D videos for human activity detection and anticipation. In ICML, pp 792–800
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp 1097–1105
LeCun, Y., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, W., & Jackel, L. D. (1990). Handwritten digit recognition with a back-propagation network. In Advances in neural information processing systems
Liang, X., Lin, L., & Cao, L. (2013). Learning latent spatio-temporal compositional model for human action recognition. In ACM Multimedia, pp 263–272
Lin, L., Wang, X., Yang, W., & Lai, J. H. (2015). Discriminatively trained and-or graph models for object shape detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(5), 959–972.
Lin, L., Wu, T., Porway, J., & Xu, Z. (2009). A stochastic graph grammar for compositional object representation and recognition. Pattern Recognition, 42(7), 1297–1307.
Luo, P., Tian, Y., Wang, X., & Tang, X. (2014). Switchable deep network for pedestrian detection. In CVPR
Luo, P., Wang, X., & Tang, X. (2013a). A deep sum-product architecture for robust facial attributes analysis. In ICCV, pp 2864–2871
Luo, P., Wang, X., & Tang, X. (2013b). Pedestrian parsing via deep decompositional neural network. In ICCV, pp 2648–2655
Ni, B., Wang, G., & Moulin, P. (2013a). RGBD-HuDaAct: A color-depth video database for human daily activity recognition. In Consumer Depth Cameras for Computer Vision, Lecture Notes in Computer Science (pp. 193–208). Springer.
Ni, B., Pei, Y., Liang, Z., Lin, L., & Moulin, P. (2013b). Integrating multi-stage depth-induced contextual information for human action recognition and localization. In International Conference and Workshops on Automatic Face and Gesture Recognition, pp 1–8
Oreifej, O., & Liu, Z. (2013). HON4D: Histogram of oriented 4D normals for activity recognition from depth sequences. In CVPR, pp 716–723
Packer, B., Saenko, K., & Koller, D. (2012). A combined pose, object, and feature model for action understanding. In CVPR, pp 1378–1385
Pei, M., Jia, Y., & Zhu, S. (2011). Parsing video events with goal inference and intent prediction. In ICCV, pp 487–494
Sadanand, S., & Corso, J. J. (2012). Action bank: A high-level representation of activity in video. In CVPR, pp 1234–1241
Scovanner, P., Ali, S., & Shah, M. (2007). A 3-dimensional SIFT descriptor and its application to action recognition. In ACM Multimedia, pp 357–360
Sermanet, P., Kavukcuoglu, K., Chintala, S., & LeCun, Y. (2013). Pedestrian detection with unsupervised multi-stage feature learning. In CVPR
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), 1929–1958.
Sung, J., Ponce, C., Selman, B., & Saxena, A. (2012). Unstructured human activity detection from RGBD images. In ICRA, pp 842–849
Tang, K., Fei-Fei, L., & Koller, D. (2012). Learning latent temporal structure for complex event detection. In CVPR, pp 1250–1257
Tu, K., Meng, M., Lee, M. W., Choi, T., & Zhu, S. (2014). Joint video and text parsing for understanding events and answering queries. IEEE Transactions on Multimedia, 21(2), 42–70.
Vapnik, V. (1998). Statistical learning theory. New York: John Wiley and Sons.
Venugopalan, S., Xu, H., Donahue, J., Rohrbach, M., Mooney, R., & Saenko, K. (2015). Translating videos to natural language using deep recurrent neural networks. In North American Chapter of the Association for Computational Linguistics
Wang, J., Liu, Z., Wu, Y., & Yuan, J. (2012). Mining actionlet ensemble for action recognition with depth cameras. In CVPR, pp 1290–1297
Wang, Y., & Mori, G. (2011). Hidden part models for human action recognition: Probabilistic vs. max-margin. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(7), 1310–1323.
Wang, K., Wang, X., & Lin, L. (2014). 3D human activity recognition with reconfigurable convolutional neural networks. In ACM Multimedia
Wang, C., Wang, Y., & Yuille, A. L. (2013). An approach to pose-based action recognition. In CVPR, pp 915–922
Wang, J., & Wu, Y. (2013). Learning maximum margin temporal warping for action recognition. In ICCV, pp 2688–2695
Wu, C. F. J. (1983). On the convergence properties of the EM algorithm. Annals of Statistics, 11(1), 95–103.
Wu, P., Hoi, S., Xia, H., Zhao, P., Wang, D., & Miao, C. (2013). Online multimodal deep similarity learning with application to image retrieval. In ACM Multimedia, pp 153–162
Xia, L., & Aggarwal, J. (2013). Spatio-temporal depth cuboid similarity feature for activity recognition using depth camera. In CVPR, pp 2834–2841
Xia, L., Chen, C., & Aggarwal, J. K. (2012). View invariant human action recognition using histograms of 3D joints. In CVPRW, pp 20–27
Yang, X., Zhang, C., & Tian, Y. (2012). Recognizing actions using depth motion maps-based histograms of oriented gradients. In ACM Multimedia, pp 1057–1060
Yu, K., Lin, Y., & Lafferty, J. (2011). Learning image representations from the pixel level via hierarchical sparse coding. In CVPR, pp 1713–1720
Yun, K., Honorio, J., Chattopadhyay, D., Berg, T. L., & Samaras, D. (2012). Two-person interaction detection using body-pose features and multiple instance learning. In CVPRW
Zhao, X., Liu, Y., & Fu, Y. (2013). Exploring discriminative pose sub-patterns for effective action classification. In ACM Multimedia, pp 273–282
Zhou, X., Zhuang, X., Yan, S., Chang, S. F., Johnson, M. H., & Huang, T. S. (2009). SIFT-bag kernel for video event analysis. In ACM Multimedia, pp 229–238
Zhu, S., & Mumford, D. (2007). A stochastic grammar of images. Foundations and Trends in Computer Graphics and Vision, 2(4), 259–362.
Acknowledgments
This work was supported in part by the Hong Kong Scholar Program, in part by the HK PolyU's Joint Supervision Scheme with the Chinese Mainland, Taiwan and Macao Universities (Grant no. G-SB20), in part by the Guangdong Natural Science Foundation (Grant nos. S2013010013432 and S2013050014548), and in part by the Guangdong Science and Technology Program (Grant no. 2013B010406005).
Additional information
Communicated by M. Hebert.
Cite this article
Lin, L., Wang, K., Zuo, W. et al. A Deep Structured Model with Radius–Margin Bound for 3D Human Activity Recognition. Int J Comput Vis 118, 256–273 (2016). https://doi.org/10.1007/s11263-015-0876-z