Abstract
Generating robust and reliable correspondences across images is a fundamental task for a wide variety of applications. To capture context at both global and local granularity, we propose ASpanFormer, a Transformer-based detector-free matcher built on a hierarchical attention structure that adopts a novel attention operation capable of adjusting its attention span in a self-adaptive manner. To achieve this goal, flow maps are first regressed in each cross-attention phase to locate the center of the search region. Next, a sampling grid is generated around this center, whose size, instead of being empirically fixed, is computed adaptively from a pixel uncertainty estimated along with the flow map. Finally, attention is computed across the two images within the derived regions, referred to as attention spans. In this way, we not only maintain long-range dependencies but also enable fine-grained attention among pixels of high relevance, which compensates for the essential locality and piece-wise smoothness of matching tasks. State-of-the-art accuracy on a wide range of evaluation benchmarks validates the strong matching capability of our method.
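To make the three-step mechanism in the abstract concrete, below is a minimal PyTorch sketch of a single adaptive-span cross-attention step: a regressed flow map gives the center of each search region, the estimated uncertainty scales a sampling grid around it, and attention is computed only over the sampled locations. The function name, the square num_samples × num_samples grid, the interpretation of the uncertainty as a span radius in pixels, and the plain dot-product attention are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F


def adaptive_span_cross_attention(query_feat, key_feat, flow, uncertainty,
                                  num_samples=7):
    """One adaptive-span cross-attention step (illustrative sketch).

    query_feat, key_feat: (B, C, H, W) feature maps of the two images.
    flow:        (B, 2, H, W) predicted search-region centers in key_feat,
                 in absolute pixel coordinates (x, y).
    uncertainty: (B, 1, H, W) per-pixel spread of the flow estimate, in
                 pixels; it sets the radius of each attention span.
    Returns:     (B, C, H, W) message aggregated from key_feat.
    """
    B, C, H, W = query_feat.shape
    device = query_feat.device

    # Step 1: a regular grid of offsets in [-1, 1]^2, later scaled by the
    # per-pixel uncertainty so that confident pixels attend narrowly and
    # uncertain pixels attend widely.
    offsets = torch.linspace(-1.0, 1.0, num_samples, device=device)
    dy, dx = torch.meshgrid(offsets, offsets, indexing="ij")
    grid_offsets = torch.stack([dx, dy], dim=-1).reshape(1, 1, -1, 2)

    # Step 2: place the grid around each predicted center; the span radius
    # is the estimated uncertainty (in pixels).
    center = flow.permute(0, 2, 3, 1).reshape(B, H * W, 1, 2)
    span = uncertainty.permute(0, 2, 3, 1).reshape(B, H * W, 1, 1)
    sample_pts = center + grid_offsets * span          # (B, HW, K, 2)

    # Convert pixel coordinates to the [-1, 1] range grid_sample expects.
    norm = torch.tensor([(W - 1) / 2.0, (H - 1) / 2.0], device=device)
    sample_grid = sample_pts / norm - 1.0
    sampled = F.grid_sample(key_feat, sample_grid, mode="bilinear",
                            align_corners=True)        # (B, C, HW, K)
    sampled = sampled.permute(0, 2, 3, 1)              # (B, HW, K, C)

    # Step 3: attention restricted to the span -- every query pixel attends
    # only to the K = num_samples^2 locations sampled in the other image.
    q = query_feat.permute(0, 2, 3, 1).reshape(B, H * W, 1, C)
    attn = torch.softmax((q * sampled).sum(-1) / C ** 0.5, dim=-1)
    msg = (attn.unsqueeze(-1) * sampled).sum(dim=2)    # (B, HW, C)
    return msg.permute(0, 2, 1).reshape(B, C, H, W)


if __name__ == "__main__":
    # Toy check: random features and an identity "flow" field.
    feat0, feat1 = torch.randn(1, 64, 60, 80), torch.randn(1, 64, 60, 80)
    ys, xs = torch.meshgrid(torch.arange(60.0), torch.arange(80.0),
                            indexing="ij")
    flow = torch.stack([xs, ys]).unsqueeze(0)          # each pixel maps to itself
    unc = torch.full((1, 1, 60, 80), 4.0)              # 4-pixel span radius
    out = adaptive_span_cross_attention(feat0, feat1, flow, unc)
    print(out.shape)                                   # torch.Size([1, 64, 60, 80])
```

The design point mirrored here is that the span is a function of the data rather than a fixed hyperparameter: where the flow estimate is confident, attention collapses to a tight neighborhood around the predicted match, and where it is uncertain, the same number of samples covers a wider area.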
Cite this paper
Chen, H. et al. (2022). ASpanFormer: Detector-Free Image Matching with Adaptive Span Transformer. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13692. Springer, Cham. https://doi.org/10.1007/978-3-031-19824-3_2