Target Detection Using Transformer: A Study Using DETR

Kumar, Akhilesh; Singh, Satish Kumar; Dubey, Shiv Ram

doi:10.1007/978-981-19-7867-8_59

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 586)

Abstract

The Transformer was proposed as a neural network architecture that relies on the attention mechanism, without using recurrence or convolutions. Starting with machine translation, it has since been extended to vision tasks in the form of vision transformers. Among the vision transformers, we explore the DEtection TRansformer (DETR) model proposed in the paper End-to-End Object Detection with Transformers by the team at Facebook AI. The authors demonstrated interesting object detection results with the DETR model, which prompted us to apply the model to the detection of custom objects. Here, we present a way to fine-tune the pre-trained DETR model on a custom dataset. The fine-tuning results improve significantly as the number of training epochs increases, both visually and statistically.

References

Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation (2013). (Online). Available: http://arxiv.org/abs/1311.2524
Girshick, R.: Fast R-CNN (2015). (Online). Available: http://arxiv.org/abs/1504.08083
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2017). https://doi.org/10.1109/TPAMI.2016.2577031
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 42(2), 386–397 (2020). https://doi.org/10.1109/TPAMI.2018.2844175
Uijlings, J.R.R., van de Sande, K.E.A., Gevers, T., Smeulders, A.W.M.: Selective search for object recognition. Int. J. Comput. Vision 104(2), 154–171 (2013). https://doi.org/10.1007/s11263-013-0620-5
Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You Only Look Once: Unified, Real-Time Object Detection (2015). (Online). Available: http://arxiv.org/abs/1506.02640
Redmon, J., Farhadi, A.: YOLOv3: An incremental improvement. arXiv (2018)
Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. In: Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, vol. 2017-January, pp. 6517–6525 (2017). https://doi.org/10.1109/CVPR.2017.690
Liu, W., et al.: SSD: single shot multibox detector. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 9905 LNCS, pp. 21–37 (2016). https://doi.org/10.1007/978-3-319-46448-0_2
Vaswani, A., et al.: Attention Is All You Need (2017). (Online). Available: http://arxiv.org/abs/1706.03762
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-End Object Detection with Transformers (2020). (Online). Available: http://arxiv.org/abs/2005.12872
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: Deformable Transformers for End-to-End Object Detection (2020). (Online). Available: http://arxiv.org/abs/2010.04159
El-Nouby, A., et al.: XCiT: Cross-Covariance Image Transformers (2021). (Online). Available: http://arxiv.org/abs/2106.09681
Li, Y., Zhang, K., Cao, J., Timofte, R., van Gool, L.: LocalViT: Bringing Locality to Vision Transformers (2021). (Online). Available: http://arxiv.org/abs/2104.05707
Wang, W., et al.: Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions (2021). (Online). Available: http://arxiv.org/abs/2102.12122
Zhang, P., et al.: Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding (2021). (Online). Available: http://arxiv.org/abs/2103.15358
Dubey, S.R., Singh, S.K., Chu, W.-T.: Vision Transformer Hashing for Image Retrieval (2021). (Online). Available: http://arxiv.org/abs/2109.12564
Muñoz, E.: Attention is all you need: Discovering the Transformer paper (2020)
Lin, T.Y., et al.: Microsoft COCO: common objects in context. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 8693 LNCS, no. PART 5, pp. 740–755 (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Yu, J., Li, J., Yu, Z., Huang, Q.: Multimodal Transformer with Multi-View Visual Representation for Image Captioning (2019). (Online). Available: http://arxiv.org/abs/1905.07841
Everingham, M., van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The pascal visual object classes (VOC) challenge. Int. J. Comput. Vision 88(2), 303–338 (2010). https://doi.org/10.1007/s11263-009-0275-4
fmassa et al.: facebookresearch/detr (2020)

Acknowledgements

We thank DIPR, DRDO for providing the R&D environment to carry out the research work. We also thank IIIT Allahabad for providing the opportunity to carry out the PhD course under the Working Professional Scheme.

Author information

Authors and Affiliations

Defence Institute of Psychological Research (DIPR), DRDO, Delhi, India
Akhilesh Kumar
Computer Vision and Biometrics Laboratory, Indian Institute of Information Technology, Allahabad, Prayagraj, India
Satish Kumar Singh & Shiv Ram Dubey

Corresponding author

Correspondence to Akhilesh Kumar.

Editor information

Editors and Affiliations

Computer Vision Laboratory, University of Sassari, Alghero, Sassari, Italy
Massimo Tistarelli
Computer Vision and Biometrics Lab, Department of Information Technology, Indian Institute of Information Technology Allahabad, Prayagraj, India
Shiv Ram Dubey
Computer Vision and Biometrics Lab, Department of Information Technology, Indian Institute of Information Technology, Allahabad, India
Satish Kumar Singh
University of Münster, Münster, Germany
Xiaoyi Jiang

About this paper

Cite this paper

Kumar, A., Singh, S.K., Dubey, S.R. (2023). Target Detection Using Transformer: A Study Using DETR. In: Tistarelli, M., Dubey, S.R., Singh, S.K., Jiang, X. (eds) Computer Vision and Machine Intelligence. Lecture Notes in Networks and Systems, vol 586. Springer, Singapore. https://doi.org/10.1007/978-981-19-7867-8_59

DOI: https://doi.org/10.1007/978-981-19-7867-8_59
Published: 06 May 2023
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-7866-1
Online ISBN: 978-981-19-7867-8
eBook Packages: Intelligent Technologies and Robotics (R0)
