Abstract
Although self-attention is powerful at modeling long-range dependencies, local self-attention (LSA) performs only on par with depth-wise convolution, which leaves researchers puzzled about whether to use LSA or its counterparts, which of the two is better, and what limits the performance of LSA. To answer these questions, we comprehensively investigate LSA and its counterparts in terms of channel setting and spatial processing. We find that the devil lies in the generation and application of attention, where relative position embedding and neighboring filter application are the key factors. Based on these findings, we propose enhanced local self-attention (ELSA), built on Hadamard attention and the ghost head. Hadamard attention introduces the Hadamard product to efficiently generate attention in the neighboring area while maintaining a high-order mapping; the ghost head combines attention maps with static matrices to increase channel capacity. Experiments demonstrate the effectiveness of ELSA. Without any architecture or hyperparameter modification, replacing LSA with ELSA as a drop-in boosts Swin Transformer by up to \(+\)1.4 top-1 accuracy. ELSA also consistently benefits VOLO from D1 to D5, with ELSA-VOLO-D5 reaching 87.2 top-1 accuracy on ImageNet-1K without extra training images. We further evaluate ELSA on downstream tasks, where it improves the baseline by up to \(+\)1.9 box AP/\(+\)1.3 mask AP on COCO, and by up to \(+\)1.9 mIoU on ADE20K.
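The two ingredients of ELSA can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the neighborhood gather, the absence of the small mapping network inside Hadamard attention, and all function names and shapes below are simplifying assumptions made for exposition only.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def hadamard_attention(q, k_neighbors, rel_pos_bias):
    """Generate attention over a local neighborhood via the Hadamard product.

    q:            (N, C)    query feature for each of N pixels
    k_neighbors:  (N, K, C) key features of the K neighbors of each pixel
    rel_pos_bias: (K,)      relative position embedding (one bias per offset)
    Returns attention weights of shape (N, K), each row summing to 1.
    """
    # The elementwise (Hadamard) product keeps the mapping query-dependent
    # (high-order) while staying cheap; summing over channels yields one
    # logit per neighbor, and the relative position bias is added on top.
    logits = (q[:, None, :] * k_neighbors).sum(axis=-1) + rel_pos_bias
    return softmax(logits, axis=-1)

def ghost_head(attn, static_mul, static_add):
    """Expand one dynamic attention map into G 'ghost' heads by mixing it
    with static matrices (per-head elementwise multiply and add).

    attn:       (N, K) dynamic attention from hadamard_attention
    static_mul: (G, K) static multiplicative matrices
    static_add: (G, K) static additive matrices
    Returns (N, G, K) attention maps, increasing channel capacity at
    negligible dynamic cost.
    """
    return attn[:, None, :] * static_mul[None] + static_add[None]

# Toy example: 16 pixels, a 3x3 neighborhood (K=9), 8 channels, 4 ghost heads.
rng = np.random.default_rng(0)
N, K, C, G = 16, 9, 8, 4
q = rng.standard_normal((N, C))
k = rng.standard_normal((N, K, C))
bias = rng.standard_normal(K)

attn = hadamard_attention(q, k, bias)   # shape (16, 9)
heads = ghost_head(attn,
                   rng.standard_normal((G, K)),
                   rng.standard_normal((G, K)))  # shape (16, 4, 9)
```

In the paper's setting the neighbors come from unfolding a feature map around each pixel and the ghost-head outputs are used to aggregate values per head; those steps are omitted here to keep the contrast with ordinary dot-product attention visible.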
Acknowledgements
This work was supported by Alibaba Group through the Alibaba Research Intern Program and by the National Natural Science Foundation of China (No. 61976094).
Work done during an internship at Alibaba Group.
Cite this article
Zhou, J., Wang, P., Tang, J., et al. (2023). What limits the performance of local self-attention? International Journal of Computer Vision, 131, 2516–2528. https://doi.org/10.1007/s11263-023-01813-x