A Dynamic Feature Interaction Framework for Multi-task Visual Perception

International Journal of Computer Vision

Abstract

Multi-task visual perception has a wide range of applications in scene understanding, such as autonomous driving. In this work, we devise an efficient unified framework that solves multiple common perception tasks, including instance segmentation, semantic segmentation, monocular 3D detection, and depth estimation. Simply sharing the same visual feature representations across these tasks impairs per-task performance, while fully independent task-specific feature extractors lead to parameter redundancy and latency. We therefore design two feature-merge branches that learn a feature basis useful to, and thus shared by, multiple perception tasks; each task then takes the corresponding feature basis as the input to its prediction head. In particular, one feature-merge branch is designed for instance-level recognition and the other for dense predictions. To enhance inter-branch communication, the instance branch passes pixel-wise spatial information about each instance to the dense branch via efficient dynamic convolution weighting. Moreover, a simple but effective dynamic routing mechanism is proposed to isolate task-specific features while leveraging properties common among tasks. Our proposed framework, termed D2BNet, demonstrates a unique approach to parameter-efficient predictions for multi-task perception. In addition, as the tasks benefit from co-training with each other, our solution achieves on-par results in partially labeled settings on nuScenes and outperforms previous work on 3D detection and depth estimation on the Cityscapes dataset under full supervision.
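
To make the dynamic convolution weighting concrete, here is a minimal PyTorch sketch of a CondInst-style formulation consistent with the abstract and footnote 1: the instance branch emits one flattened parameter vector \(\mathbf{t}^{(i)}\) of length \(D'K\) per detected instance, which is reshaped into \(1\times 1\) convolution kernels and applied to the \(K\) basis maps produced by the dense branch. All names and shapes below (apply_dynamic_weights, feature_basis, inst_weights, D', K) are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only: a CondInst-style dynamic weighting in which each
# instance's flattened parameter vector (length D'*K, cf. footnote 1) becomes
# a bank of 1x1 conv kernels applied to the dense branch's K basis maps.
# Function and variable names are assumptions, not the paper's code.
import torch
import torch.nn.functional as F


def apply_dynamic_weights(feature_basis: torch.Tensor,
                          inst_weights: torch.Tensor) -> torch.Tensor:
    """Apply per-instance dynamic 1x1 convolutions to a shared feature basis.

    feature_basis: (1, K, H, W)   K shared basis maps from the dense branch.
    inst_weights:  (N, D'*K)      one flattened kernel vector per instance.
    returns:       (N, D', H, W)  per-instance spatial maps.
    """
    n, dk = inst_weights.shape
    _, k, h, w = feature_basis.shape
    d_prime = dk // k
    # Each row holds D' kernels of size K; reshape into (N*D', K, 1, 1).
    kernels = inst_weights.view(n * d_prime, k, 1, 1)
    # A single conv call evaluates every instance's kernels over the basis.
    out = F.conv2d(feature_basis, kernels)  # (1, N*D', H, W)
    return out.view(n, d_prime, h, w)


# Toy usage: 4 instances, K = 8 basis maps, D' = 2 output channels each.
basis = torch.randn(1, 8, 64, 128)
t = torch.randn(4, 2 * 8)          # each row is a t^(i) of length D'*K = 16
print(apply_dynamic_weights(basis, t).shape)  # torch.Size([4, 2, 64, 128])
```

Because the generated kernels are \(1\times 1\), the whole instance-to-dense interaction reduces to a single convolution over the shared basis, which keeps the per-instance overhead small; this is consistent with the efficiency argument the abstract makes for dynamic weighting.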


Data Availability

Datasets used in this work are all publicly available. The nuScenes and nuImages datasets are available at https://www.nuscenes.org/, Cityscapes at https://www.cityscapes-dataset.com/, and MSCOCO at https://cocodataset.org/.

Notes

  1. This makes \(\mathbf{t}^{(i)}\) a vector of length \(D'K\).

  2. https://git.io/AdelaiDet.

  3. https://github.com/open-mmlab/mmdetection3d/tree/master/configs/nuimages.

  4. https://github.com/facebookresearch/detectron2.


Acknowledgements

This work was supported by the National Key R&D Program of China (No. 2020AAA0106900), the National Natural Science Foundation of China (Nos. U19B2037 and 62206244), the Shaanxi Provincial Key R&D Program (Nos. 2021KWZ-03 and 2023-GHZD-02), and the Natural Science Basic Research Program of Shaanxi (No. 2021JCW-03).

Author information

Corresponding author

Correspondence to Yanning Zhang.

Additional information

Communicated by Shaodi You.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Yuling Xi and Hao Chen contributed equally. Part of this work was done when Yuling Xi was visiting Zhejiang University.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Xi, Y., Chen, H., Wang, N. et al. A Dynamic Feature Interaction Framework for Multi-task Visual Perception. Int J Comput Vis 131, 2977–2993 (2023). https://doi.org/10.1007/s11263-023-01835-5
