Abstract
A bimodal feature alignment based object detection algorithm is proposed to fully fuse visible and infrared image features. First, we propose a two-stream detection model that accepts paired visible and infrared images as simultaneous input. Second, we design a gated fusion network consisting of a bimodal feature alignment module and a feature fusion module; it performs mid-level fusion, serving as the middle layer of the two-stream backbone network. In particular, the bimodal feature alignment module extracts detailed information from the aligned bimodal features by computing a multi-scale bimodal alignment feature vector. The feature fusion module recalibrates the fused bimodal features and then multiplies them with the aligned bimodal features, achieving cross-modal fusion that jointly enhances low-level and high-level features. We validate the proposed algorithm on both the publicly available KAIST pedestrian dataset and a self-built GIR dataset. On the KAIST dataset, the algorithm achieves an accuracy of 77.1%, which is 17.3% and 5.6% higher than the baseline YOLOv5-s detecting visible and infrared images alone, respectively; on the self-built GIR dataset, it achieves a detection accuracy of 91%, which is 1.2% and 14.2% higher than the baseline on visible and infrared images alone, respectively. The detection speed meets real-time requirements.
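The gated fusion described above, aligning the two modal feature maps, recalibrating the fused features channel-wise, and multiplying the result with the aligned features, can be sketched as a PyTorch module. This is a minimal illustration assuming SE-style (squeeze-and-excitation) recalibration and 1×1 alignment convolutions; the module name `GatedFusion`, the `reduction` parameter, and the exact layer choices are assumptions, not the authors' published architecture.

```python
import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    """Illustrative sketch of a gated fusion layer for a two-stream
    backbone: align RGB/IR features, recalibrate the fused map, then
    gate the aligned features with the recalibration weights."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # 1x1 convolutions project each modality into a shared space
        # (assumed alignment mechanism)
        self.align_rgb = nn.Conv2d(channels, channels, kernel_size=1)
        self.align_ir = nn.Conv2d(channels, channels, kernel_size=1)
        # Concatenated bimodal features are reduced back to `channels`
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)
        # SE-style channel recalibration producing a gate in (0, 1)
        self.recalib = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, f_rgb: torch.Tensor, f_ir: torch.Tensor) -> torch.Tensor:
        a_rgb = self.align_rgb(f_rgb)
        a_ir = self.align_ir(f_ir)
        aligned = a_rgb + a_ir                        # aligned bimodal features
        fused = self.fuse(torch.cat([a_rgb, a_ir], dim=1))
        gate = self.recalib(fused)                    # recalibrated fused map
        return aligned * gate                         # cross-modal gated output
```

A layer like this could be inserted at an intermediate stage of each backbone level, which is what "mid-level fusion as the middle layer of the two-stream backbone" suggests; the output keeps the input resolution and channel count, so it drops into a standard detection neck unchanged.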
Acknowledgements
This work is supported by the National Natural Science Foundation of China under grant No. 62072370 and the Natural Science Foundation of Shaanxi Province under grant No. 2023-JC-YB-598.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Sun, Y., Hou, Z., Yang, C., Ma, S., Fan, J. (2023). Object Detection Algorithm Based on Bimodal Feature Alignment. In: Lu, H., Blumenstein, M., Cho, SB., Liu, CL., Yagi, Y., Kamiya, T. (eds) Pattern Recognition. ACPR 2023. Lecture Notes in Computer Science, vol 14406. Springer, Cham. https://doi.org/10.1007/978-3-031-47634-1_30
DOI: https://doi.org/10.1007/978-3-031-47634-1_30
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-47633-4
Online ISBN: 978-3-031-47634-1
eBook Packages: Computer Science (R0)