Abstract
Generating robust and reliable correspondences across images is a fundamental task for a wide variety of applications. To capture context at both global and local granularity, we propose ASpanFormer, a Transformer-based detector-free matcher built on a hierarchical attention structure that adopts a novel attention operation capable of adjusting its attention span in a self-adaptive manner. To achieve this goal, flow maps are first regressed in each cross-attention phase to locate the center of the search region. Next, a sampling grid is generated around this center, whose size, instead of being empirically fixed, is computed adaptively from a pixel uncertainty estimated along with the flow map. Finally, attention is computed across the two images within the derived regions, referred to as attention spans. In this way, we not only maintain long-range dependencies but also enable fine-grained attention among pixels of high relevance, which compensates for the essential locality and piece-wise smoothness of matching tasks. State-of-the-art accuracy on a wide range of evaluation benchmarks validates the strong matching capability of our method.
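To make the three-step mechanism in the abstract concrete, below is a minimal PyTorch sketch of a single adaptive-span cross-attention step: a regressed flow map gives the center of each search region, the estimated uncertainty scales a sampling grid around it, and attention is computed only over the sampled locations. The function name, the square num_samples × num_samples grid, the interpretation of the uncertainty as a span radius in pixels, and the plain dot-product attention are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F


def adaptive_span_cross_attention(query_feat, key_feat, flow, uncertainty,
                                  num_samples=7):
    """One adaptive-span cross-attention step (illustrative sketch).

    query_feat, key_feat: (B, C, H, W) feature maps of the two images.
    flow:        (B, 2, H, W) predicted search-region centers in key_feat,
                 in absolute pixel coordinates (x, y).
    uncertainty: (B, 1, H, W) per-pixel spread of the flow estimate, in
                 pixels; it sets the radius of each attention span.
    Returns:     (B, C, H, W) message aggregated from key_feat.
    """
    B, C, H, W = query_feat.shape
    device = query_feat.device

    # Step 1: a regular grid of offsets in [-1, 1]^2, later scaled by the
    # per-pixel uncertainty so that confident pixels attend narrowly and
    # uncertain pixels attend widely.
    offsets = torch.linspace(-1.0, 1.0, num_samples, device=device)
    dy, dx = torch.meshgrid(offsets, offsets, indexing="ij")
    grid_offsets = torch.stack([dx, dy], dim=-1).reshape(1, 1, -1, 2)

    # Step 2: place the grid around each predicted center; the span radius
    # is the estimated uncertainty (in pixels).
    center = flow.permute(0, 2, 3, 1).reshape(B, H * W, 1, 2)
    span = uncertainty.permute(0, 2, 3, 1).reshape(B, H * W, 1, 1)
    sample_pts = center + grid_offsets * span          # (B, HW, K, 2)

    # Convert pixel coordinates to the [-1, 1] range grid_sample expects.
    norm = torch.tensor([(W - 1) / 2.0, (H - 1) / 2.0], device=device)
    sample_grid = sample_pts / norm - 1.0
    sampled = F.grid_sample(key_feat, sample_grid, mode="bilinear",
                            align_corners=True)        # (B, C, HW, K)
    sampled = sampled.permute(0, 2, 3, 1)              # (B, HW, K, C)

    # Step 3: attention restricted to the span -- every query pixel attends
    # only to the K = num_samples^2 locations sampled in the other image.
    q = query_feat.permute(0, 2, 3, 1).reshape(B, H * W, 1, C)
    attn = torch.softmax((q * sampled).sum(-1) / C ** 0.5, dim=-1)
    msg = (attn.unsqueeze(-1) * sampled).sum(dim=2)    # (B, HW, C)
    return msg.permute(0, 2, 1).reshape(B, C, H, W)


if __name__ == "__main__":
    # Toy check: random features and an identity "flow" field.
    feat0, feat1 = torch.randn(1, 64, 60, 80), torch.randn(1, 64, 60, 80)
    ys, xs = torch.meshgrid(torch.arange(60.0), torch.arange(80.0),
                            indexing="ij")
    flow = torch.stack([xs, ys]).unsqueeze(0)          # each pixel maps to itself
    unc = torch.full((1, 1, 60, 80), 4.0)              # 4-pixel span radius
    out = adaptive_span_cross_attention(feat0, feat1, flow, unc)
    print(out.shape)                                   # torch.Size([1, 64, 60, 80])
```

The design point mirrored here is that the span is a function of the data rather than a fixed hyperparameter: where the flow estimate is confident, attention collapses to a tight neighborhood around the predicted match, and where it is uncertain, the same number of samples covers a wider area.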
Cite this paper
Chen, H. et al. (2022). ASpanFormer: Detector-Free Image Matching with Adaptive Span Transformer. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13692. Springer, Cham. https://doi.org/10.1007/978-3-031-19824-3_2