STARNet: spatio-temporal aware recurrent network for efficient video object detection on embedded devices

  • RESEARCH
Machine Vision and Applications

Abstract

Adapting image object detection methods to video remains an unsolved problem. Applied to video, image methods often generalize poorly because of blur, varied and unclear object positions, and low image quality. The lack of a good long-term memory in video object detection is a further challenge. The outputs of successive frames are usually very similar, and a series of successive or non-successive frames carries more information than a single frame; video methods rely on both facts. In this study, we present a novel recurrent cell for feature propagation and identify the optimal location of layers to increase the memory interval. As a result, we achieve higher accuracy than the methods proposed in other studies. Hardware limitations can exacerbate the challenge, so the paper also aims to implement these methods on embedded devices and increase their efficiency there. We achieve 68.7% mAP on the ImageNet VID dataset in real time on embedded devices, at a speed of 52 fps.
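The abstract does not describe the recurrent cell itself, so the following is only a minimal sketch of the general idea of propagating features from frame to frame with a gated recurrent update. It is a generic, textbook-style gate, not the STARNet cell; the function names, the scalar gate weights, and the flat feature vectors are all assumptions made for illustration.

```python
import math


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def propagate(state, features, w_state=1.0, w_input=1.0, bias=0.0):
    """One generic gated recurrent step over a flat feature vector.

    The gate z decides, per element, how much of the previous state to keep
    and how much of the current frame's features to let in.
    Assumption: scalar gate weights stand in for learned convolutions.
    """
    new_state = []
    for h, x in zip(state, features):
        z = sigmoid(w_state * h + w_input * x + bias)
        new_state.append(z * h + (1.0 - z) * x)
    return new_state


def run_sequence(frames, **gate_params):
    """Carry the state across frames; the first frame initializes the memory."""
    state = list(frames[0])
    history = [state]
    for features in frames[1:]:
        state = propagate(state, features, **gate_params)
        history.append(state)
    return history


# Hypothetical per-frame feature vectors for three successive frames.
frames = [[0.0, 1.0], [0.2, 0.8], [0.1, 0.9]]
history = run_sequence(frames)
```

Because each updated element is a convex combination of the previous state and the current features, the state changes smoothly when successive frames are similar, which is the property the abstract relies on.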

References

  1. Bertasius, G., Torresani, L., Shi, J.: Object detection in video with spatiotemporal sampling networks. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 331–346 (2018)

  2. Bewley, A., Ge, Z., Ott, L., Ramos, F., Upcroft, B.: Simple online and realtime tracking. In: 2016 IEEE International Conference on Image Processing (ICIP), IEEE, pp. 3464–3468 (2016)

  3. Chen, L., Ai, H., Zhuang, Z., Shang, C.: Real-time multiple people tracking with deeply learned candidate selection and person re-identification. In: 2018 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp. 1–6 (2018)

  4. Chen, Y., Cao, Y., Hu, H., Wang, L.: Memory enhanced global-local aggregation for video object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10337–10346 (2020)

  5. Cui, Y., Yan, L., Cao, Z., Liu, D.: Tf-blender: temporal feature blender for video object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8138–8147 (2021)

  6. Deng, J., Pan, Y., Yao, T., Zhou, W., Li, H., Mei, T.: Relation distillation networks for video object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7023–7032 (2019)

  7. Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., Van Der Smagt, P., Cremers, D., Brox, T.: Flownet: learning optical flow with convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2758–2766 (2015)

  8. Ehteshami Bejnordi, B., Habibian, A., Porikli, F., Ghodrati, A.: Salisa: saliency-based input sampling for efficient video object detection. In: European Conference on Computer Vision, pp. 300–316. Springer (2022)

  9. Feichtenhofer, C., Pinz, A., Zisserman, A.: Detect to track and track to detect. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3038–3046 (2017)

  10. Galteri, L., Seidenari, L., Bertini, M., Del Bimbo, A.: Spatio-temporal closed-loop object detection. IEEE Trans. Image Process. 26, 1253–1263 (2017)

  11. Habibian, A., Abati, D., Cohen, T.S., Bejnordi, B.E.: Skip-convolutions for efficient video processing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2695–2704 (2021)

  12. Habibian, A., Ben Yahia, H., Abati, D., Gavves, E., Porikli, F.: Delta distillation for efficient video processing. In: European Conference on Computer Vision, pp. 213–229. Springer (2022)

  13. Hajizadeh, M., Sabokrou, M., Rahmani, A.: MobileDenseNet: a new approach to object detection on mobile devices. Expert Syst. Appl. 215, 119348 (2023)

  14. Han, W., Khorrami, P., Paine, T.L., Ramachandran, P., Babaeizadeh, M., Shi, H., Li, J., Yan, S., Huang, T.S.: Seq-nms for video object detection (2016). arXiv preprint arXiv:1602.08465

  15. Kang, K., Li, H., Yan, J., Zeng, X., Yang, B., Xiao, T., Zhang, C., Wang, Z., Wang, R., Wang, X., et al.: T-cnn: tubelets with convolutional neural networks for object detection from videos. IEEE Trans. Circuits Syst. Video Technol. 28, 2896–2907 (2017)

  16. Lin, T.Y., Dollar, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125 (2017)

  17. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollar, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988 (2017)

  18. Liu, M., Zhu, M.: Mobile video object detection with temporally-aware feature maps. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5686–5695 (2018)

  19. Liu, M., Zhu, M., White, M., Li, Y., Kalenichenko, D.: Looking fast and slow: Memory-guided mobile video object detection (2019). arXiv preprint arXiv:1903.10172

  20. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: Ssd: single shot multibox detector. In: European Conference on Computer Vision, pp. 21–37. Springer (2016)

  21. Mao, H., Zhu, S., Han, S., Dally, W.J.: Patchnet–short-range template matching for efficient video processing (2021). arXiv preprint arXiv:2103.07371

  22. Qin, Z., Li, Z., Zhang, Z., Bao, Y., Yu, G., Peng, Y., Sun, J.: Thundernet: towards real-time generic object detection on mobile devices. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6718–6727 (2019)

  23. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788 (2016)

  24. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252 (2015)

  25. Schulter, S., Vernaza, P., Choi, W., Chandraker, M.: Deep network flow for multi-object tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6951–6960 (2017)

  26. Tan, M., Pang, R., Le, Q.V.: Efficientdet: scalable and efficient object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10781–10790 (2020)

  27. Tang, Q., Li, J., Shi, Z., Hu, Y.: Lightdet: a lightweight and accurate object detection network. In: ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp. 2243–2247 (2020)

  28. Wang, S., Zhou, Y., Yan, J., Deng, Z.: Fully motion-aware network for video object detection. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 542–557 (2018)

  29. Wang, Z., Zheng, L., Liu, Y., Li, Y., Wang, S.: Towards real-time multi-object tracking. In: European Conference on Computer Vision, pp. 107–122. Springer (2020)

  30. Wojke, N., Bewley, A., Paulus, D.: Simple online and realtime tracking with a deep association metric. In: 2017 IEEE International Conference on Image Processing (ICIP), IEEE, pp. 3645–3649 (2017)

  31. Woo, S., Park, J., Lee, J.Y., Kweon, I.S.: Cbam: convolutional block attention module. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19 (2018)

  32. Wu, H., Chen, Y., Wang, N., Zhang, Z.: Sequence level semantics aggregation for video object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9217–9225 (2019)

  33. Xiao, F., Lee, Y.J.: Video object detection with an aligned spatial-temporal memory. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 485–501 (2018)

  34. Xu, R., Mu, F., Lee, J., Mukherjee, P., Chaterji, S., Bagchi, S., Li, Y.: Smartadapt: multi-branch object detection framework for videos on mobiles. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2528–2538 (2023)

  35. Yao, C.H., Fang, C., Shen, X., Wan, Y., Yang, M.H.: Video object detection via object-level temporal aggregation. In: European Conference on Computer Vision, pp. 160–177. Springer (2020)

  36. Zhu, X., Dai, J., Yuan, L., Wei, Y.: Towards high performance video object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7210–7218 (2018)

  37. Zhu, X., Wang, Y., Dai, J., Yuan, L., Wei, Y.: Flow-guided feature aggregation for video object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 408–417 (2017)

  38. Zhu, X., Xiong, Y., Dai, J., Yuan, L., Wei, Y.: Deep feature flow for video recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2349–2358 (2017)

Author information

Contributions

MH: Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data Curation, Writing - Original Draft. MS: Investigation, Writing - Original Draft, Writing - Review and Editing, Visualization, Supervision. AR: Writing - Review and Editing, Supervision, Project administration.

Corresponding author

Correspondence to Adel Rahmani.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Comparison of small object detection by the proposed method and EfficientDet [26].

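| Method | Frame 1 | Frame 2 | Frame 3 | Frame 4 |
| --- | --- | --- | --- | --- |
| EfficientDet [26] Localization | (image) | (image) | (image) | (image) |
| EfficientDet [26] Classification | Bird 96%, Bird 77% | Bird 97%, Bird 58% | Bird 94%, Fail | Squirrel 86%, Bird 93% |
| Proposed Method Localization | (image) | (image) | (image) | (image) |
| Proposed Method Classification | Bird 100%, Bird 98% | Bird 100%, Bird 96% | Bird 100%, Bird 94% | Bird 99%, Bird 99% |
| EfficientDet [26] Localization | (image) | (image) | (image) | (image) |
| EfficientDet [26] Classification | Dog 93% | Dog 66% | Cat 62% | Bird 51% |
| Proposed Method Localization | (image) | (image) | (image) | (image) |
| Proposed Method Classification | Dog 99% | Dog 98% | Dog 98% | Dog 98% |
| EfficientDet [26] Localization | (image) | (image) | (image) | (image) |
| EfficientDet [26] Classification | Motorcycle 43%, Car 92% | Motorcycle 48%, Car 91% | Car 57%, Car 97% | Motorcycle 66%, Car 98% |
| Proposed Method Localization | (image) | (image) | (image) | (image) |
| Proposed Method Classification | Motorcycle 69%, Car 96% | Car 80%, Car 97% | Car 89%, Car 97% | Car 92%, Car 99% |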

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Hajizadeh, M., Sabokrou, M. & Rahmani, A. STARNet: spatio-temporal aware recurrent network for efficient video object detection on embedded devices. Machine Vision and Applications 35, 23 (2024). https://doi.org/10.1007/s00138-023-01504-0

  • DOI: https://doi.org/10.1007/s00138-023-01504-0
