
Introducing inductive bias on vision transformers through Gram matrix similarity based regularization

Original Article · Artificial Life and Robotics

Abstract

In recent years, transformers have achieved remarkable results in computer vision tasks, matching or even surpassing those of convolutional neural networks (CNNs). However, unlike CNNs, vision transformers lack strong inductive biases and, to achieve state-of-the-art results, rely on large architectures and extensive pre-training on tens of millions of images. Approaches such as adding convolutional layers or adapting the vision transformer architecture mitigate this limitation; however, large volumes of data are still required to attain state-of-the-art performance. Introducing appropriate inductive biases into vision transformers can therefore lead to better convergence and generalization in settings with less training data. To that end, we propose a self-attention regularization method based on the similarity between different image regions. At its core is the Attention Loss, a new loss function devised to penalize self-attention computation between image patches based on the similarity between their Gram matrices, leading to better convergence and generalization, especially for models pre-trained on mid-size datasets. We deploy the method on ARViT, a small-capacity vision transformer, and, after pre-training with a self-supervised pretext task on the ILSVRC-2012 ImageNet dataset, our self-attention regularization method improves ARViT's performance by up to 13% on benchmark classification tasks and achieves results competitive with state-of-the-art vision transformers.
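The Attention Loss described above is, in essence, a penalty that weights each patch-to-patch attention entry by how dissimilar the Gram matrices of the two image regions are. The sketch below is only an illustration of that idea in PyTorch, not the authors' ARViT implementation: the function names (gram_similarity, attention_loss), the tensor shapes, the cosine similarity between flattened Gram matrices, and the omission of a class token are all assumptions made for this example.

```python
# Illustrative sketch only: a Gram-matrix-similarity penalty on self-attention.
# Assumed shapes: region features (B, N, C, HW), attention weights (B, heads, N, N).
import torch
import torch.nn.functional as F


def gram_similarity(patches: torch.Tensor) -> torch.Tensor:
    """Pairwise similarity between per-region Gram matrices.

    patches: (B, N, C, HW) channel features of each image region.
    Returns: (B, N, N) similarity in [0, 1], where 1 means identical Gram matrices.
    """
    # Gram matrix of each region: (B, N, C, C), normalized by the number of positions.
    gram = patches @ patches.transpose(-1, -2) / patches.shape[-1]
    gram = gram.flatten(2)                     # (B, N, C*C)
    gram = F.normalize(gram, dim=-1)           # unit norm for cosine similarity
    return (gram @ gram.transpose(-1, -2)).clamp(min=0.0)


def attention_loss(attn: torch.Tensor, patches: torch.Tensor) -> torch.Tensor:
    """Penalize attention spent between regions with dissimilar Gram matrices.

    attn:    (B, heads, N, N) softmax attention weights between patches.
    patches: (B, N, C, HW) region features used to build the Gram matrices.
    """
    dissim = 1.0 - gram_similarity(patches)    # high where two regions differ
    # Weight every attention entry by the dissimilarity of its two regions.
    return (attn * dissim.unsqueeze(1)).mean()
```

In practice, such a term would be added to the task objective with a weighting coefficient, e.g. loss = task_loss + lam * attention_loss(attn, patches), where lam is a hyperparameter tuned on a validation set.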


Notes

  1. Available at: https://github.com/fastai/imagenette.

  2. The difference in training time per epoch when adopting region resolutions \(16\times 16\) and \(32\times 32\) is negligible.


Author information


Corresponding author

Correspondence to Luiz H. Mormille.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was submitted and accepted for the Journal Track of the joint symposium of the 28th International Symposium on Artificial Life and Robotics, the 8th International Symposium on BioComplexity, and the 6th International Symposium on Swarm Behavior and Bio-Inspired Robotics (Beppu, Oita, January 25–27, 2023).

About this article


Cite this article

Mormille, L.H., Broni-Bediako, C. & Atsumi, M. Introducing inductive bias on vision transformers through Gram matrix similarity based regularization. Artif Life Robotics 28, 106–116 (2023). https://doi.org/10.1007/s10015-022-00845-9

