Context-driven Multi-stream LSTM (M-LSTM) for Recognizing Fine-Grained Activity of Drivers

  • Conference paper
  • First Online:
Pattern Recognition (GCPR 2018)

Abstract

Automatic recognition of in-vehicle activities has a significant impact on next-generation intelligent vehicles. In this paper, we present a novel Multi-stream Long Short-Term Memory (M-LSTM) network for recognizing driver activities. We bring together ideas from recent work on LSTMs and transfer learning for object detection and body pose, exploring the use of deep convolutional neural networks (CNNs). Recent work has also shown that representations such as hand-object interactions are important cues for characterizing human activities. The proposed M-LSTM integrates these ideas under one framework, in which two streams focus on appearance information at two different levels of abstraction, while the other two streams analyze contextual information involving the configuration of body parts and body-object interactions. The proposed contextual descriptor is built to be semantically rich and meaningful, and when coupled with appearance features it turns out to be highly discriminative. We validate this on two challenging datasets consisting of driver activities.
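The abstract outlines a four-stream design: two appearance streams at different levels of abstraction and two contextual streams (body-part configuration and body-object interactions), fused within the M-LSTM. As a rough illustration of how per-stream LSTMs with late fusion could be wired, the sketch below uses PyTorch; the stream dimensionalities, hidden size, class count and concatenation-based fusion are assumptions for illustration only and do not reflect the authors' exact implementation.

```python
import torch
import torch.nn as nn

class MultiStreamLSTM(nn.Module):
    """Late-fusion multi-stream LSTM sketch (illustrative, not the paper's code)."""

    def __init__(self, stream_dims=(4096, 4096, 36, 128),
                 hidden_size=256, num_classes=10):
        super().__init__()
        # One LSTM per stream: two appearance streams (CNN features at two
        # abstraction levels) and two contextual streams (body-part
        # configuration and body-object interaction descriptors).
        self.lstms = nn.ModuleList(
            [nn.LSTM(d, hidden_size, batch_first=True) for d in stream_dims]
        )
        self.classifier = nn.Linear(hidden_size * len(stream_dims), num_classes)

    def forward(self, streams):
        # streams: list of tensors, each shaped (batch, time, stream_dim)
        finals = []
        for x, lstm in zip(streams, self.lstms):
            _, (h_n, _) = lstm(x)         # h_n: (num_layers, batch, hidden)
            finals.append(h_n[-1])        # final hidden state of each stream
        fused = torch.cat(finals, dim=1)  # late fusion by concatenation
        return self.classifier(fused)


# Example: a batch of 2 clips, 16 frames each, with the assumed feature sizes.
dims = (4096, 4096, 36, 128)
model = MultiStreamLSTM(stream_dims=dims)
clips = [torch.randn(2, 16, d) for d in dims]
logits = model(clips)                     # -> shape (2, 10)
```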



Acknowledgments

This research was supported by Edge Hill University's Research Investment Fund (RIF). We would like to thank Taylor Smith at State Farm Corporation for providing information about their dataset. The GPU used in this research was generously donated by NVIDIA Corporation.

Author information


Corresponding author

Correspondence to Ardhendu Behera.


Electronic supplementary material

Below are the links to the electronic supplementary material.

Supplementary material 1 (mp4 16870 KB)

Supplementary material 2 (pdf 1242 KB)


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Behera, A., Keidel, A., Debnath, B. (2019). Context-driven Multi-stream LSTM (M-LSTM) for Recognizing Fine-Grained Activity of Drivers. In: Brox, T., Bruhn, A., Fritz, M. (eds.) Pattern Recognition. GCPR 2018. Lecture Notes in Computer Science, vol. 11269. Springer, Cham. https://doi.org/10.1007/978-3-030-12939-2_21

  • DOI: https://doi.org/10.1007/978-3-030-12939-2_21

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-12938-5

  • Online ISBN: 978-3-030-12939-2

  • eBook Packages: Computer Science (R0)
