Abstract
In recent years, transformers have achieved remarkable results in computer vision tasks, matching or even surpassing those of convolutional neural networks (CNNs). Unlike CNNs, however, vision transformers lack strong inductive biases and, to reach state-of-the-art results, rely on large architectures and extensive pre-training on tens of millions of images. Approaches such as adding convolutional layers or adapting the vision transformer architecture mitigate this limitation, yet large volumes of data are still required to attain state-of-the-art performance. Introducing appropriate inductive biases into vision transformers can therefore lead to better convergence and generalization in settings with less training data. To that end, we propose a self-attention regularization method based on the similarity between different image regions. At its core is the Attention Loss, a new loss function devised to penalize self-attention computation between image patches according to the similarity of their Gram matrices, leading to better convergence and generalization, especially for models pre-trained on mid-size datasets. We deploy the method on ARViT, a small-capacity vision transformer, and, after pre-training with a self-supervised pretext task on the ILSVRC-2012 ImageNet dataset, our self-attention regularization method improves ARViT's performance by up to 13% on benchmark classification tasks and achieves results competitive with state-of-the-art vision transformers.
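The abstract does not give the exact formulation of the Attention Loss, but the core idea — penalizing attention mass between image regions whose Gram matrices are dissimilar — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`gram_matrix`, `attention_loss`), the use of cosine similarity between flattened Gram matrices, and the shapes of the inputs are all assumptions made for the sake of the example.

```python
import torch
import torch.nn.functional as F

def gram_matrix(region_feats):
    """Gram matrix of one region's features.

    region_feats: (C, M) tensor -- C channels over M spatial positions.
    Returns a (C, C) matrix of channel-wise correlations, normalized
    by the number of entries (as in style-transfer Gram matrices).
    """
    g = region_feats @ region_feats.t()
    return g / region_feats.numel()

def attention_loss(attn, region_feats):
    """Hypothetical sketch of a Gram-similarity attention regularizer.

    attn:         (N, N) self-attention weights over N image regions.
    region_feats: (N, C, M) per-region feature maps.

    Regions with similar Gram matrices incur little penalty; attention
    spent on dissimilar regions is penalized.
    """
    n = region_feats.shape[0]
    # Flatten each region's Gram matrix into a descriptor vector.
    grams = torch.stack([gram_matrix(f).flatten() for f in region_feats])
    grams = F.normalize(grams, dim=1)
    sim = grams @ grams.t()                # cosine similarity in [-1, 1]
    dissim = (1.0 - sim).clamp(min=0.0)    # 0 for identical regions
    # Weight the attention map by pairwise dissimilarity.
    return (attn * dissim).sum() / n
```

In practice such a term would be added, with a weighting coefficient, to the task loss during pre-training, so that gradients discourage attention between texture-dissimilar patches.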
Notes
Available at: https://github.com/fastai/imagenette.
The difference in training time per epoch when adopting region resolutions \(16\times 16\) and \(32\times 32\) is negligible.
This work was submitted and accepted for the Journal Track of the joint symposium of the 28th International Symposium on Artificial Life and Robotics, the 8th International Symposium on BioComplexity, and the 6th International Symposium on Swarm Behavior and Bio-Inspired Robotics (Beppu, Oita, January 25–27, 2023).
Cite this article
Mormille, L.H., Broni-Bediako, C. & Atsumi, M. Introducing inductive bias on vision transformers through Gram matrix similarity based regularization. Artif Life Robotics 28, 106–116 (2023). https://doi.org/10.1007/s10015-022-00845-9