ASpanFormer: Detector-Free Image Matching with Adaptive Span Transformer

  • Conference paper
  • Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13692)

Abstract

Generating robust and reliable correspondences across images is a fundamental task for a variety of applications. To capture context at both global and local granularity, we propose ASpanFormer, a Transformer-based detector-free matcher built on a hierarchical attention structure, adopting a novel attention operation that adjusts the attention span in a self-adaptive manner. To achieve this, flow maps are first regressed in each cross-attention phase to locate the center of the search region. Next, a sampling grid is generated around that center, whose size, instead of being empirically fixed, is adaptively computed from a pixel uncertainty estimated along with the flow map. Finally, attention is computed across the two images within the derived regions, referred to as attention spans. In this way, we not only maintain long-range dependencies but also enable fine-grained attention among pixels of high relevance, which compensates for the essential locality and piece-wise smoothness of matching tasks. State-of-the-art accuracy on a wide range of evaluation benchmarks validates the strong matching capability of our method.
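
The three steps above (flow regression, uncertainty-scaled sampling grid, attention restricted to the span) can be illustrated with a short sketch. The following PyTorch snippet is a simplified illustration under assumed tensor shapes, with a hypothetical flow_head module standing in for the paper's flow-regression layers; the fixed grid_size, the way uncertainty scales the grid, and the single-head dot-product attention are all simplifying assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def adaptive_span_cross_attention(feat_q, feat_kv, flow_head, grid_size=5):
    """Illustrative adaptive-span cross-attention (not the paper's code).

    feat_q, feat_kv: [B, C, H, W] feature maps of the two images.
    flow_head: module returning a flow map (absolute x, y target coordinates
    in the other image) and a per-pixel uncertainty sigma."""
    B, C, H, W = feat_q.shape

    # 1) Regress a flow map and its uncertainty from the query features.
    flow, sigma = flow_head(feat_q)          # flow: [B, 2, H, W], sigma: [B, 1, H, W]

    # 2) Build a sampling grid around the flow-predicted center; the grid
    #    extent scales with the estimated uncertainty (the adaptive span).
    offsets = torch.linspace(-1.0, 1.0, grid_size, device=feat_q.device)
    dy, dx = torch.meshgrid(offsets, offsets, indexing="ij")      # [g, g]
    center = flow.permute(0, 2, 3, 1)                             # [B, H, W, 2]
    span = sigma.permute(0, 2, 3, 1)                              # [B, H, W, 1]
    steps = torch.stack((dx.flatten(), dy.flatten()), dim=-1)     # [g*g, 2]
    samples = center[..., None, :] + span[..., None, :] * steps   # [B, H, W, g*g, 2]

    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    norm = torch.empty_like(samples)
    norm[..., 0] = samples[..., 0] / (W - 1) * 2 - 1
    norm[..., 1] = samples[..., 1] / (H - 1) * 2 - 1

    # 3) Gather key/value features inside each pixel's span and attend to them.
    kv = F.grid_sample(feat_kv, norm.view(B, H, W * grid_size**2, 2),
                       align_corners=True)                        # [B, C, H, W*g*g]
    kv = kv.view(B, C, H, W, grid_size**2)
    q = feat_q.unsqueeze(-1)                                      # [B, C, H, W, 1]
    attn = torch.softmax((q * kv).sum(dim=1) / C**0.5, dim=-1)    # [B, H, W, g*g]
    return (kv * attn.unsqueeze(1)).sum(dim=-1)                   # [B, C, H, W]

# Toy shape check with a hypothetical one-layer flow head.
if __name__ == "__main__":
    B, C, H, W = 1, 32, 40, 40
    head = torch.nn.Conv2d(C, 3, 1)        # 2 flow channels + 1 raw uncertainty channel
    def flow_head(x):
        out = head(x)
        return out[:, :2], F.softplus(out[:, 2:3]) * 4.0 + 1.0
    out = adaptive_span_cross_attention(torch.randn(B, C, H, W),
                                        torch.randn(B, C, H, W), flow_head)
    print(out.shape)                       # torch.Size([1, 32, 40, 40])

In the paper this operation is repeated in every cross-attention phase of the hierarchical structure; the sketch above only conveys the span-adaptive sampling idea.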

Author information

Corresponding author

Correspondence to Hongkai Chen.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 14039 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Chen, H. et al. (2022). ASpanFormer: Detector-Free Image Matching with Adaptive Span Transformer. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13692. Springer, Cham. https://doi.org/10.1007/978-3-031-19824-3_2

  • DOI: https://doi.org/10.1007/978-3-031-19824-3_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19823-6

  • Online ISBN: 978-3-031-19824-3

  • eBook Packages: Computer Science, Computer Science (R0)
