
Unifying Visual Perception by Dispersible Points Learning

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13669)

Abstract

We present a conceptually simple, flexible, and universal visual perception head for various visual tasks, e.g., classification, object detection, instance segmentation, and pose estimation, and for different frameworks, such as one-stage or two-stage pipelines. Our approach effectively identifies an object in an image while simultaneously generating a high-quality bounding box, contour-based segmentation mask, or set of keypoints. The method, called UniHead, views different visual perception tasks as dispersible points learning via a transformer encoder architecture. Given a fixed spatial coordinate, UniHead adaptively scatters it to different spatial points and reasons about their relations with a transformer encoder. It directly outputs the final set of predictions as multiple points, allowing us to perform different visual tasks in different frameworks with the same head design. We show extensive evaluations on ImageNet classification and on all three tracks of the COCO suite of challenges: object detection, instance segmentation, and pose estimation. Without bells and whistles, UniHead unifies these visual tasks via a single head design and achieves performance comparable to expert models developed for each task. We hope our simple and universal UniHead will serve as a solid baseline and help promote universal visual perception research. Code and models are available at https://github.com/Sense-X/UniHead.
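The mechanism described above (scattering a fixed coordinate into a set of points and reasoning over them with a transformer encoder) can be illustrated with a minimal, hypothetical PyTorch sketch. This is not the authors' implementation (see the linked repository for that); the module name DispersiblePointsHead, its arguments, and the offset/refinement layers are illustrative assumptions made only to show the idea.

```python
# Hypothetical sketch of a dispersible-points head: a single anchor coordinate
# is scattered into K point offsets, point features are sampled from the
# feature map, and a transformer encoder reasons over the K points before
# refining their locations. Boxes, contours, or keypoints would all be decoded
# from such a point set.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DispersiblePointsHead(nn.Module):
    def __init__(self, in_channels=256, num_points=9, num_layers=3, num_heads=8):
        super().__init__()
        self.num_points = num_points
        # Predict K (dx, dy) offsets that scatter the anchor into K points.
        self.offset_pred = nn.Linear(in_channels, num_points * 2)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=in_channels, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Refine each point's location from its encoded feature.
        self.point_refine = nn.Linear(in_channels, 2)

    def forward(self, feat_map, anchor_xy):
        # feat_map: (B, C, H, W); anchor_xy: (B, 2), normalized to [-1, 1].
        b, c, _, _ = feat_map.shape
        anchor_feat = F.grid_sample(
            feat_map, anchor_xy.view(b, 1, 1, 2), align_corners=False
        ).view(b, c)
        # Scatter the single anchor into K dispersed points.
        offsets = self.offset_pred(anchor_feat).view(b, self.num_points, 2).tanh()
        points = anchor_xy.unsqueeze(1) + offsets                   # (B, K, 2)
        point_feats = F.grid_sample(
            feat_map, points.unsqueeze(2), align_corners=False
        ).squeeze(-1).permute(0, 2, 1)                               # (B, K, C)
        # Reason about relations among the K points with a transformer encoder.
        encoded = self.encoder(point_feats)
        refined = points + self.point_refine(encoded).tanh()
        return refined                                               # final point set
```

Under these assumptions, the same module could serve detection (decode a box from the point set), instance segmentation (treat the points as a contour), or pose estimation (treat them as keypoints), which is what allows a single head design across tasks.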

J. Liang: this work was done during an internship at SenseTime.



Acknowledgement

The work was supported by the National Key R&D Program of China under Grant 2019YFB2102400.

Author information


Corresponding author

Correspondence to Yu Liu.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 2311 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Liang, J., Song, G., Leng, B., Liu, Y. (2022). Unifying Visual Perception by Dispersible Points Learning. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13669. Springer, Cham. https://doi.org/10.1007/978-3-031-20077-9_26


  • DOI: https://doi.org/10.1007/978-3-031-20077-9_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20076-2

  • Online ISBN: 978-3-031-20077-9

  • eBook Packages: Computer Science, Computer Science (R0)
