A Dynamic Feature Interaction Framework for Multi-task Visual Perception

International Journal of Computer Vision

Abstract

Multi-task visual perception has a wide range of applications in scene understanding, such as autonomous driving. In this work, we devise an efficient unified framework that solves multiple common perception tasks, including instance segmentation, semantic segmentation, monocular 3D detection, and depth estimation. Simply sharing the same visual feature representations across these tasks impairs per-task performance, while fully independent task-specific feature extractors lead to parameter redundancy and latency. We therefore design two feature-merge branches that learn a feature basis useful to, and thus shared by, multiple perception tasks; each task then takes the corresponding feature basis as the input to its prediction head. In particular, one feature-merge branch is designed for instance-level recognition and the other for dense predictions. To enhance inter-branch communication, the instance branch passes pixel-wise spatial information about each instance to the dense branch via efficient dynamic convolution weighting. Moreover, a simple but effective dynamic routing mechanism is proposed to isolate task-specific features while leveraging properties common among tasks. Our proposed framework, termed D2BNet, demonstrates a unique approach to parameter-efficient predictions for multi-task perception. In addition, as the tasks benefit from co-training with each other, our solution achieves on-par results in partially labeled settings on nuScenes and outperforms previous work on 3D detection and depth estimation on the Cityscapes dataset under full supervision.
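
To make the dynamic convolution weighting concrete, here is a minimal PyTorch sketch of a CondInst-style formulation consistent with the abstract and footnote 1: the instance branch emits one flattened parameter vector \(\mathbf{t}^{(i)}\) of length \(D'K\) per detected instance, which is reshaped into \(1\times 1\) convolution kernels and applied to the \(K\) basis maps produced by the dense branch. All names and shapes below (apply_dynamic_weights, feature_basis, inst_weights, D', K) are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only: a CondInst-style dynamic weighting in which each
# instance's flattened parameter vector (length D'*K, cf. footnote 1) becomes
# a bank of 1x1 conv kernels applied to the dense branch's K basis maps.
# Function and variable names are assumptions, not the paper's code.
import torch
import torch.nn.functional as F


def apply_dynamic_weights(feature_basis: torch.Tensor,
                          inst_weights: torch.Tensor) -> torch.Tensor:
    """Apply per-instance dynamic 1x1 convolutions to a shared feature basis.

    feature_basis: (1, K, H, W)   K shared basis maps from the dense branch.
    inst_weights:  (N, D'*K)      one flattened kernel vector per instance.
    returns:       (N, D', H, W)  per-instance spatial maps.
    """
    n, dk = inst_weights.shape
    _, k, h, w = feature_basis.shape
    d_prime = dk // k
    # Each row holds D' kernels of size K; reshape into (N*D', K, 1, 1).
    kernels = inst_weights.view(n * d_prime, k, 1, 1)
    # A single conv call evaluates every instance's kernels over the basis.
    out = F.conv2d(feature_basis, kernels)  # (1, N*D', H, W)
    return out.view(n, d_prime, h, w)


# Toy usage: 4 instances, K = 8 basis maps, D' = 2 output channels each.
basis = torch.randn(1, 8, 64, 128)
t = torch.randn(4, 2 * 8)          # each row is a t^(i) of length D'*K = 16
print(apply_dynamic_weights(basis, t).shape)  # torch.Size([4, 2, 64, 128])
```

Because the generated kernels are \(1\times 1\), the whole instance-to-dense interaction reduces to a single convolution over the shared basis, which keeps the per-instance overhead small; this is consistent with the efficiency argument the abstract makes for dynamic weighting.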


Data Availability

Datasets used in this work are all publicly available. The nuScenes and nuImages datasets are available at https://www.nuscenes.org/, Cityscapes at https://www.cityscapes-dataset.com/, and MSCOCO at https://cocodataset.org/.

Notes

  1. This makes \(\mathbf{t}^{(i)}\) a vector of length \(D'K\).

  2. https://git.io/AdelaiDet.

  3. https://github.com/open-mmlab/mmdetection3d/tree/master/configs/nuimages.

  4. https://github.com/facebookresearch/detectron2.


Acknowledgements

This work was supported by the National Key R&D Program of China (No. 2020AAA0106900), the National Natural Science Foundation of China (Nos. U19B2037 and 62206244), the Shaanxi Provincial Key R&D Program (Nos. 2021KWZ-03 and 2023-GHZD-02), and the Natural Science Basic Research Program of Shaanxi (No. 2021JCW-03).

Author information

Corresponding author

Correspondence to Yanning Zhang.

Additional information

Communicated by Shaodi You.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Yuling Xi and Hao Chen contributed equally. Part of this work was done when Yuling Xi was visiting Zhejiang University.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Xi, Y., Chen, H., Wang, N. et al. A Dynamic Feature Interaction Framework for Multi-task Visual Perception. Int J Comput Vis 131, 2977–2993 (2023). https://doi.org/10.1007/s11263-023-01835-5
