What Limits the Performance of Local Self-attention?

Published in: International Journal of Computer Vision

Abstract

Although self-attention is powerful in modeling long-range dependencies, the performance of local self-attention (LSA) is only on par with that of depth-wise convolution. This leaves researchers puzzled about whether to use LSA or its counterparts, which of the two is better, and what limits the performance of LSA. To clarify these questions, we comprehensively investigate LSA and its counterparts from the perspectives of channel setting and spatial processing. We find that the devil lies in the generation and application of attention, where relative position embedding and neighboring filter application are the key factors. Based on these findings, we propose enhanced local self-attention (ELSA), which consists of Hadamard attention and a ghost head. Hadamard attention introduces the Hadamard product to efficiently generate attention over the neighboring area while maintaining a high-order mapping. The ghost head combines attention maps with static matrices to increase channel capacity. Experiments demonstrate the effectiveness of ELSA. Without any architecture or hyperparameter modification, replacing LSA with ELSA as a drop-in module boosts Swin Transformer by up to \(+\)1.4 in top-1 accuracy. ELSA also consistently benefits VOLO from D1 to D5, where ELSA-VOLO-D5 reaches 87.2 top-1 accuracy on ImageNet-1K without extra training images. In addition, we evaluate ELSA on downstream tasks: it significantly improves the baseline by up to \(+\)1.9 box AP / \(+\)1.3 mask AP on COCO and by up to \(+\)1.9 mIoU on ADE20K.
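
To make the two components above concrete, here is a minimal PyTorch sketch of a local-attention block that generates its attention map from a Hadamard (element-wise) product of query and key features and then mixes it with a small set of static "ghost" matrices before weighting a \(k \times k\) neighborhood. The module name ELSASketch, the parameter num_ghosts, and the exact projections are illustrative assumptions drawn only from the description above, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ELSASketch(nn.Module):
    """Toy local-attention block: Hadamard attention + ghost head (illustrative only)."""

    def __init__(self, dim, kernel_size=7, num_ghosts=2):
        super().__init__()
        assert dim % num_ghosts == 0, "channels must split evenly across ghost heads"
        self.k = kernel_size
        self.g = num_ghosts
        # Hadamard attention: attention logits come from an element-wise
        # (Hadamard) product of query and key features rather than a dot product.
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.to_attn = nn.Linear(dim, kernel_size * kernel_size)
        # Ghost head: static matrices combined with the dynamic attention map
        # to increase channel capacity at little extra cost (sketched here as
        # an additive bias per ghost head; the exact combination is an assumption).
        self.ghost = nn.Parameter(torch.zeros(num_ghosts, kernel_size * kernel_size))

    def forward(self, x):
        # x: (B, H, W, C) feature map
        B, H, W, C = x.shape
        hadamard = self.q_proj(x) * self.k_proj(x)            # Hadamard product, (B, H, W, C)
        attn = self.to_attn(hadamard)                         # dynamic logits, (B, H, W, k*k)
        attn = attn.unsqueeze(-2) + self.ghost                # add static maps, (B, H, W, G, k*k)
        attn = attn.softmax(dim=-1)
        # Gather the k*k neighbours of every pixel with a sliding window.
        v = F.unfold(x.permute(0, 3, 1, 2), self.k, padding=self.k // 2)   # (B, C*k*k, H*W)
        v = v.view(B, C, self.k * self.k, H, W).permute(0, 3, 4, 1, 2)     # (B, H, W, C, k*k)
        # Split channels across ghost heads and apply each attention map.
        v = v.reshape(B, H, W, self.g, C // self.g, self.k * self.k)
        out = (v * attn.unsqueeze(-2)).sum(dim=-1)            # weighted sum over the window
        return out.reshape(B, H, W, C)


# Usage on a dummy 14x14 feature map with 64 channels.
if __name__ == "__main__":
    block = ELSASketch(dim=64, kernel_size=7, num_ghosts=2)
    y = block(torch.randn(2, 14, 14, 64))
    print(y.shape)  # torch.Size([2, 14, 14, 64])
```

Relative position information, which the analysis identifies as a key factor, is omitted here for brevity; a faithful version would fold a relative position embedding into the attention logits before the softmax.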

Acknowledgements

This work was supported by Alibaba Group through the Alibaba Research Intern Program and the National Natural Science Foundation of China (No. 61976094).

Author information

Corresponding author

Correspondence to Pichao Wang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Work done during an internship at Alibaba Group.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhou, J., Wang, P., Tang, J. et al. What Limits the Performance of Local Self-attention? Int J Comput Vis 131, 2516–2528 (2023). https://doi.org/10.1007/s11263-023-01813-x
