Abstract
In the field of computer vision, general single-stage object detection methods employ two individual subnets within the detection head, serving classification and localization respectively. However, the lack of explicit modeling of the distinctions and associations between these two tasks makes it difficult to align their spatial feature perception, which leads to sub-optimal detection performance. Although some methods use classification to evaluate localization, this is a compromise rather than multi-task optimization. In this paper, we propose a Task-coordinated Single-stage Object Detector (TSOD) to enhance the coordination of multiple tasks. First, we introduce a Task-decoupled Feature Alignment Mechanism (TFAM), which adaptively provides compatible features for different tasks by decoupling spatial information. For classification and localization, the network adaptively samples from category-sensitive regions and boundary-separable regions, respectively. Second, we propose a Task-interactive Enhancement Mechanism (TEM), which explicitly combines different task-sensitive features for joint classification score prediction and selects samples with high task consistency for training. This interaction mechanism strengthens the consistency between tasks. We conduct extensive experiments on the COCO, Cityscapes, CrowdHuman and WiderFace datasets to evaluate the performance of TSOD. The results demonstrate that our model outperforms several state-of-the-art detectors, achieving a 2.0 AP improvement over the baseline on COCO minival and 50.4 AP with single-model, single-scale testing on COCO test-dev. Additionally, our model, equipped with ResNet-50, performs significantly better than other representative detectors on the Cityscapes, CrowdHuman, and WiderFace datasets, showcasing its robustness and generalizability.
Our study contributes a new perspective to the design of single-stage object detectors by emphasizing the importance of decoupling and interaction, both of which are crucial for task coordination. The experimental results validate the effectiveness of the proposed TSOD and its potential as a leading approach in the field. Code is available at https://github.com/Majiawei/tsod-complete.
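The notion of task consistency mentioned above can be made concrete with a small sketch. The snippet below is not the TSOD implementation. It assumes axis-aligned boxes in (x1, y1, x2, y2) form and measures how closely classification scores track localization quality (IoU with the ground truth) using the Pearson correlation coefficient. The function names `iou` and `pearson` are illustrative assumptions, not names from the released code.

```python
from math import sqrt


def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (may be empty).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant sequence
    return cov / sqrt(var_x * var_y)
```

Under these assumptions, calling `pearson(scores, ious)` on the classification scores and IoUs of a set of predictions gives a value near 1 when the two tasks agree on which predictions are good, and a value near 0 when classification confidence says little about localization quality.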
Abbreviations
- TSOD: Task-coordinated Single-stage Object Detector
- TFAM: Task-decoupled Feature Alignment Mechanism
- TEM: Task-interactive Enhancement Mechanism
- JD: Joint Decision
- TS: Task-consistent Sampling
- AP: Average Precision
- COCO: Common Objects in Context
- NMS: Non-Maximum Suppression
- IoU: Intersection over Union
- IACS: IoU-aware Classification Score
- FPN: Feature Pyramid Network
- DecAdaDconv: Decoupled Adaptive Deformable Convolution
- FPS: Frames Per Second
- FLOPs: Floating Point Operations
- GT: Ground Truth
- PCC: Pearson Correlation Coefficient
Funding
This work was supported in part by the National Key Research and Development Program of China under Grant 2020AAA0109700, in part by the National Science Fund for Distinguished Young Scholars under Grant 62125601, in part by the National Social Science Fund of China under Grant 21FZXB020, and in part by the National Natural Science Foundation of China under Grant 62076024 and Grant 62006018.
Author information
Contributions
All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Jia-Wei Ma, Shu Tian and Haixia Man. The first draft of the manuscript was written by Jia-Wei Ma and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors have no relevant financial or non-financial interests to disclose.
Ethical and informed consent for data used
Ethical approval and informed consent for the data used in this paper were obtained from University of Science and Technology Beijing and all authors.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ma, JW., Tian, S., Man, H. et al. Decoupling and Interaction: task coordination in single-stage object detection. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-19257-x
DOI: https://doi.org/10.1007/s11042-024-19257-x