Abstract
In recent years, transformers have achieved remarkable results in computer vision tasks, matching or even surpassing those of convolutional neural networks (CNNs). Unlike CNNs, however, vision transformers lack strong inductive biases and, to reach state-of-the-art results, rely on large architectures and extensive pre-training on tens of millions of images. Approaches such as adding convolutional layers or adapting the vision transformer architecture mitigate this limitation, yet large volumes of data are still required to attain state-of-the-art performance. Introducing appropriate inductive biases into vision transformers can therefore lead to better convergence and generalization in settings with less training data. To that end, we propose a self-attention regularization method based on the similarity between different image regions. At its core is the Attention Loss, a new loss function devised to penalize self-attention computation between image patches according to the similarity of their Gram matrices, leading to better convergence and generalization, especially for models pre-trained on mid-size datasets. We deploy the method on ARViT, a small-capacity vision transformer, and, after pre-training with a self-supervised pretext task on the ILSVRC-2012 ImageNet dataset, our self-attention regularization method improves ARViT's performance by up to 13% on benchmark classification tasks and achieves results competitive with state-of-the-art vision transformers.
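The abstract does not give the exact formulation of the Attention Loss, but the core idea — penalizing attention mass between image regions whose Gram matrices are dissimilar — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`gram_matrix`, `attention_loss`), the use of cosine similarity between flattened Gram matrices, and the shapes of the inputs are all assumptions made for the sake of the example.

```python
import torch
import torch.nn.functional as F

def gram_matrix(region_feats):
    """Gram matrix of one region's features.

    region_feats: (C, M) tensor -- C channels over M spatial positions.
    Returns a (C, C) matrix of channel-wise correlations, normalized
    by the number of entries (as in style-transfer Gram matrices).
    """
    g = region_feats @ region_feats.t()
    return g / region_feats.numel()

def attention_loss(attn, region_feats):
    """Hypothetical sketch of a Gram-similarity attention regularizer.

    attn:         (N, N) self-attention weights over N image regions.
    region_feats: (N, C, M) per-region feature maps.

    Regions with similar Gram matrices incur little penalty; attention
    spent on dissimilar regions is penalized.
    """
    n = region_feats.shape[0]
    # Flatten each region's Gram matrix into a descriptor vector.
    grams = torch.stack([gram_matrix(f).flatten() for f in region_feats])
    grams = F.normalize(grams, dim=1)
    sim = grams @ grams.t()                # cosine similarity in [-1, 1]
    dissim = (1.0 - sim).clamp(min=0.0)    # 0 for identical regions
    # Weight the attention map by pairwise dissimilarity.
    return (attn * dissim).sum() / n
```

In practice such a term would be added, with a weighting coefficient, to the task loss during pre-training, so that gradients discourage attention between texture-dissimilar patches.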
Notes
Available at: https://github.com/fastai/imagenette.
The difference in training time per epoch when adopting region resolutions \(16\times 16\) and \(32\times 32\) is negligible.
This work was submitted and accepted for the Journal Track of the joint symposium of the 28th International Symposium on Artificial Life and Robotics, the 8th International Symposium on BioComplexity, and the 6th International Symposium on Swarm Behavior and Bio-Inspired Robotics (Beppu, Oita, January 25–27, 2023).
Cite this article
Mormille, L.H., Broni-Bediako, C. & Atsumi, M. Introducing inductive bias on vision transformers through Gram matrix similarity based regularization. Artif Life Robotics 28, 106–116 (2023). https://doi.org/10.1007/s10015-022-00845-9