Abstract
Transformers are neural networks that revolutionized natural language processing and machine learning. They process sequences of inputs, like words, using a mechanism called self-attention, and are commonly trained via masked language modeling (MLM). In MLM, a word is randomly masked in an input sequence, and the network is trained to predict the missing word. Despite the practical success of transformers, it remains unclear what type of data distribution self-attention can learn efficiently. Here, we show analytically that if one decouples the treatment of word positions and embeddings, a single layer of self-attention learns the conditionals of a generalized Potts model with interactions between sites and Potts colors. Moreover, we show that training this neural network is exactly equivalent to solving the inverse Potts problem by the so-called pseudolikelihood method, well known in statistical physics. Using this mapping, we analytically compute the generalization error of self-attention in a model scenario using the replica method.
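For context, a minimal sketch of the pseudolikelihood method for the inverse Potts problem, written in generic notation (the symbols $J$, $h$, $q$, $L$, and $M$ below are standard conventions, not necessarily the paper's parameterization): one fits couplings and fields by maximizing the conditional likelihood of each site given all the others,
$$
P\big(x_i = a \,\big|\, x_{\setminus i}\big)
  = \frac{\exp\!\Big(h_i(a) + \sum_{j \neq i} J_{ij}(a, x_j)\Big)}
         {\sum_{b=1}^{q} \exp\!\Big(h_i(b) + \sum_{j \neq i} J_{ij}(b, x_j)\Big)},
\qquad
\mathcal{L}_{\mathrm{PL}} = \sum_{m=1}^{M} \sum_{i=1}^{L} \log P\big(x_i^{(m)} \,\big|\, x_{\setminus i}^{(m)}\big),
$$
where $i$ runs over the $L$ sites of each of the $M$ training sequences and $q$ is the number of Potts colors. Maximizing $\mathcal{L}_{\mathrm{PL}}$ amounts to predicting each site from the remaining ones, which is the structural analogy to the MLM objective that the abstract refers to.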
- Received 27 April 2023
- Revised 14 December 2023
- Accepted 4 March 2024
DOI: https://doi.org/10.1103/PhysRevResearch.6.023057
Published by the American Physical Society under the terms of the Creative Commons Attribution 4.0 International license. Further distribution of this work must maintain attribution to the author(s) and the published article's title, journal citation, and DOI.