Arithmetic with Language Models: from Memorization to Computation

A better understanding of the emergent computation and problem-solving capabilities of recent large language models is of paramount importance to further improve them and broaden their applicability. This work investigates how a language model, trained to predict the next token, can perform arithmetic computations generalizing beyond training data. Binary addition and multiplication constitute a good testbed for this purpose, since they require a very small vocabulary and exhibit relevant input/output discontinuities making smooth input interpolation ineffective for novel data. We successfully trained a light language model to learn these tasks and ran a number of experiments to investigate the extrapolation capabilities and internal information processing. Our findings support the hypothesis that the language model works as an Encoding-Regression-Decoding machine where the computation takes place in the value space once the input token representation is mapped to an appropriate internal representation.


Introduction
Large Language Models (LLMs) based on the Transformer architecture (Vaswani et al., 2017) have recently demonstrated surprising problem-solving capabilities requiring logical reasoning, advanced information processing and common sense (Bubeck et al., 2023; Wei et al., 2022, 2023). Their huge storage capacity, combined with massive training on terabytes of heterogeneous data, could suggest that memorizing an enormous amount of knowledge is enough to perform well on similar test data. However, validations on carefully selected Out-of-Distribution (OoD) data proved their reasoning capabilities on novel examples requiring non-trivial generalizations. Unfortunately, the depth and width of such models are so large that decoding and understanding the internal information processing is very challenging.
Focusing on arithmetic calculations, some studies (Yuan et al., 2023) demonstrate that recent LLMs (such as GPT-4) can perform additions and multiplications with long-digit operands, for which the number of variants is so high as to exclude exhaustive memorization of the training set. Nevertheless, the computational approach put in place by LLMs, as well as their interpolation/extrapolation capabilities, remain unexplained.
In this work we design some controlled experiments, consisting of simple computation tasks such as binary addition and multiplication, and solve them with two Language Models (LMs) based on the Transformer architecture: (i) the original encoder-decoder architecture by Vaswani et al. (2017) and (ii) a more recent decoder-only architecture denoted as nanoGPT (Karpathy, 2022). In spite of their simplicity, these tasks cannot be solved by pure memorization or smooth interpolation, and investigating how an LM learns them can improve our understanding of the underlying mechanisms. In particular, using a tiny vocabulary of just 5 tokens and a small training set allows us to operate with a light (non-pretrained) LM and use interpretability techniques to investigate internal information processing.
Other studies addressed the ability of LLMs to perform arithmetic computation and trained small LMs to learn these tasks from scratch (see related works in Section 2). However, our aim is different: we are not interested in finding the best LM architecture and setup to maximize accuracy on arithmetic operations, but we look for a simple architecture and setup that effectively solve the task, in order to be able to investigate the underlying computational approach. The main novelty and contribution of this work are the formalization of the hypothesis that our LM works as an Encoding-Regression-Decoding machine and the design of a number of experiments to support and validate this hypothesis (see Table 1).
After a presentation of related works in Section 2, in Section 3 we introduce the experimental testbed and the architecture of the LMs used. Section 4 presents the results achieved and introduces control experiments and elaborations to shed light on the computational approach used to solve the tasks. In Section 5 an ablation study is presented and, finally, in Section 6 we include a final discussion and draw some conclusions.

Table 1 (excerpt). Amnesic probing: prove that the "value" information is crucial to properly compute the output (Appendix D).

LM and LLM capabilities on arithmetic tasks
In Yuan et al. (2023) recent LLMs have been benchmarked on arithmetic tasks, including long-digit addition and multiplication, showing that LLMs such as ChatGPT and GPT-4 can perform reasonably well on these tasks even with no specific tuning. On the other hand, the accuracy of smaller models is markedly lower, and in general they are not able to work with long operands and to generalize to OoD data.
Goat (Liu and Low, 2023), a LLaMA model specifically fine-tuned on arithmetic tasks, performed even better than GPT-4 on large-number additions and subtractions, probably thanks to the consistent (digit-level) tokenization of numbers in LLaMA models. However, it was able to perform multi-digit multiplication and division only by forcing a Chain of Thought (CoT) (Wei et al., 2023) decomposition of such tasks during instruction tuning.
Nogueira et al. (2021) tuned a T5-based pre-trained LM on additions and subtractions, and argued that tokenization and input representation are critical to achieve good accuracy. In particular, in their experiments character-based tokenization works better than sub-word tokenization, and making the digit position explicit in the input string (i.e., inserting after each digit a marker denoting its position in the sequence) generally leads to better accuracy. They also trained a vanilla non-pretrained LM on smaller numbers and found that classical sinusoidal positional embedding does not perform well, so they proposed a tailored position-wise masked embedding. Their paper contains other interesting findings, such as the impact of the digit order (plain or reverse) and of the size of the training set.
Muffo et al. (2023) tuned pre-trained GPT-2 models on 5-digit additions and 2-digit multiplications. They also found that making the digit position explicit in the input sequence helps to improve accuracy. While good accuracy is reported for addition, the tuned models struggle to learn multiplication even on two-digit operands.
Lee et al. (2023) trained small LMs to learn arithmetic tasks, mainly focusing on addition, but also experimenting with subtraction, multiplication, sine and square root. The authors carefully ablated different aspects of the training data to isolate the factors that contribute to the appearance of arithmetic capabilities. In particular, they studied the impact of the input order (plain or reverse) and the utility of providing intermediate information about the decomposition of the task into steps to promote CoT reasoning. Some results and findings included in Lee et al. (2023) will be further discussed throughout this paper.
All the above works provide useful contributions to understand the capabilities and limitations of large and small LMs in dealing with arithmetic tasks, but none of them focuses on the computational approach used to solve them, which is the main purpose of the present work (see Table 1).

Interpretability techniques
A large number of techniques can be used to investigate the internal working of deep neural networks, including Transformers and LMs: see Rauker et al. (2023) for a recent survey. Weights, single neurons, subnetworks/circuits, and activations can be the target of intrinsic approaches (implemented during training) or post-hoc approaches (implemented after training).
Probing is a common technique used to investigate the representations learned by pre-trained LMs: it typically involves training a simple model (denoted as probe) on top of the LM embeddings to predict a given property (Belinkov, 2022). Moreover, structural probing can be used to check whether internal representations encode discrete structures such as syntax trees (Hewitt and Manning, 2019; White et al., 2021). However, probing analyses have been criticized because they disconnect the probing task from the original one and/or reveal correlations instead of causations. Therefore, instead of focusing on the presence of information in internal encodings, some researchers proposed to check whether the removal of some knowledge from the embeddings (e.g., amnesic probing (Elazar et al., 2021)) negatively influences the model's ability to perform a task (Elazar et al., 2021; Lasri et al., 2022). Other interesting approaches to interpretability are mechanistic interpretability (Elhage et al., 2021) and causal abstraction (Geiger et al., 2021): the former aims at reverse engineering the algorithm that a model uses to solve a task and at mapping it to neural circuits; the latter constructs an interpretable causal model and aligns it with neural representations.
In this work we use a mix of intrinsic and post-hoc interpretability techniques: in particular, throughout the experiments we manipulate the training set, change the input representation and the architecture components, perform correlation analyses of embeddings and apply amnesic probing.

Interpretability of arithmetic reasoning with LMs
Stolfo et al. (2023) introduced a causal mediation analysis to point out the LM components (e.g., attention heads, Multi-Layer Perceptrons, MLPs) involved in the information processing of simple arithmetic operations, focusing on the flow of numerical information throughout the model layers/columns. The main outcomes of this study are that: (i) the model processes the representation of numbers and operators with the first layers; (ii) the information is then conveyed (by attention heads) to the last part of the sequence (i.e., the output column), where (iii) it is numerically processed by late MLPs. Nanda et al. (2023) carefully studied the algorithmic approach put in place by a small Transformer to implement modular addition of small numbers. They discovered that the internal implementation is based on discrete Fourier transforms and trigonometric identities to convert addition into rotation on a circle. While the outcomes are somewhat surprising, here the term algorithm must be taken with care: even if the experiments prove that the internal processing well approximates the given equations, the approach is a numerical approximation (based on weight-encoded values) that does not generalize to different moduli (as a symbolic implementation of the equations would).
Both these studies adopted a simplified setting where numbers are presented as single tokens and the output is expected at the last position of the sequence. So the models are not operated in an autoregressive manner and the multi-token encoding/decoding stages are simplified. In Section 6 we discuss how the above findings are compatible with ours.

The tasks
We focused on two simple computation tasks: binary addition and binary multiplication. Using binary encoding allows keeping the vocabulary very compact, since we need to encode only the symbols '0', '1' and a few other tokens. The selected tasks have other nice properties, such as computing input similarities by Hamming distance and easily generating all combinations. Of course, a classical artificial neural network can be trained to learn to sum and multiply two integers or floating-point numbers, but adding/multiplying strings of tokens with an LM is trickier.
More formally, given two integers A, B (both in the range [0, 127]) our input sequence (or prompt) is a 15-token string taking the form:
a_0 a_1 a_2 a_3 a_4 a_5 a_6 ⟨op⟩ b_0 b_1 b_2 b_3 b_4 b_5 b_6
where a_i, b_i ∈ {'0', '1'} are the symbols corresponding to the bits in the i-th position of the binary representation of A and B, respectively, and ⟨op⟩ can be either '+' or '×'.
The expected output string (or input completion) is:
r_0 r_1 ... r_{m-1}
where r_i is the i-th bit in the binary representation of A ⟨op⟩ B, and m is the number of bits of the expected output string R (8 and 14 for addition and multiplication, respectively).
It is worth noting that:
• we are using a fixed-length input/output representation (with zero padding for the unused most significant bits) to make the digit positions more explicit;
• in both the input and the output the Least Significant Bits (LSBs) are provided before the Most Significant Bits (MSBs) (a.k.a. reverse or little-endian order), since this was supposed to simplify model learning. As discussed in Appendix C, this assumption leads to much faster training.
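The following minimal Python sketch illustrates the fixed-length, LSB-first encoding described above; the helper names (to_bits_lsb_first, make_example) are ours and are not taken from the paper's code.

```python
def to_bits_lsb_first(n: int, width: int) -> str:
    """Binary string of n, least significant bit first, zero-padded to `width` bits."""
    return format(n, f"0{width}b")[::-1]

def make_example(a: int, b: int, op: str) -> tuple[str, str]:
    """Build the 15-token prompt and the expected completion (both LSB first)."""
    assert op in ("+", "×") and 0 <= a < 128 and 0 <= b < 128
    prompt = to_bits_lsb_first(a, 7) + op + to_bits_lsb_first(b, 7)   # 7 + 1 + 7 = 15 tokens
    result = a + b if op == "+" else a * b
    out_len = 8 if op == "+" else 14                                  # fixed-length completion
    return prompt, to_bits_lsb_first(result, out_len)

print(make_example(3, 5, "+"))   # ('1100000+1010000', '00010000'), i.e. 3 + 5 = 8
```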
If we consider the sequence-to-sequence mapping underlying the proposed tasks, we note that even in a simple binary addition a slight change in the input (i.e., a single bit) can produce a relevant change in the output because of carry propagation. In the example below a single-bit modification in the input produces an 8-bit modification in the output:
1000000 + 0111111 → 11111110
1000000 + 1111111 → 00000001
Such input-output discontinuity is made more explicit for addition in Appendix A.
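A short sketch, reusing the make_example helper defined above, reproduces this discontinuity by comparing Hamming distances at the input and output levels for the two examples (1 + 126 and 1 + 127 in LSB-first notation).

```python
def hamming(x: str, y: str) -> int:
    """Number of positions at which two equal-length token strings differ."""
    return sum(c1 != c2 for c1, c2 in zip(x, y))

p1, c1 = make_example(1, 126, "+")   # '1000000+0111111' -> '11111110'
p2, c2 = make_example(1, 127, "+")   # '1000000+1111111' -> '00000001'
print(hamming(p1, p2), hamming(c1, c2))   # 1 input bit changed, 8 output bits changed
```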

The architecture
A non-pretrained encoder-decoder Transformer based on the original architecture introduced in Vaswani et al. (2017) was used as the primary LM. Table 2 reports the model setup and parametrization. The small vocabulary used allows us to keep the model small (just 701K learnable parameters) and trainable from scratch with a limited number of examples.
The LM was trained to learn the addition and multiplication tasks separately. For both problems, we exhaustively generated all the 2^14 = 16384 input/output combinations, which were then randomly split into training (3/4 → 12288) and validation (1/4 → 4096) sets. In our experiments we do not need a separate dataset to tune hyperparameters, so our validation set coincides with the test set.
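A minimal sketch of this exhaustive generation and random 3/4-1/4 split is shown below; it reuses the make_example helper introduced earlier and the function name build_dataset is ours.

```python
import random

def build_dataset(op: str, seed: int = 0):
    """All 2^14 = 16384 (prompt, completion) pairs, split into 3/4 train and 1/4 validation."""
    examples = [make_example(a, b, op) for a in range(128) for b in range(128)]
    random.Random(seed).shuffle(examples)
    split = (3 * len(examples)) // 4           # 12288 training examples
    return examples[:split], examples[split:]  # 12288 / 4096

train_set, val_set = build_dataset("+")
print(len(train_set), len(val_set))            # 12288 4096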
An additional control experiment was run where the input sequences were the same as in the addition experiment, but the output completion was randomly generated (with the same length as the addition output, i.e., 8 tokens). In this case, the lack of any dependency between input and output makes it impossible to learn an algorithmic approach (or a smooth mapping) to solve the problem, and the only strategy to learn the training set is memorizing all the sequences.
When the trained LM is used in inference mode, we always pick the most probable token from the logit outputs (i.e., greedy decoding). Two metrics can be used to denote the LM accuracy: token accuracy refers to the probability of generating the next token correctly, while sequence accuracy refers to the probability of generating the whole output string correctly in autoregressive mode (i.e., generating one token at a time and appending it to the current prompt). Most of the experiments have been repeated with a second LM (nanoGPT by Karpathy (2022)), which is a good representative of the decoder-only family. Details are reported in Appendix E.
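The sketch below shows one way to compute sequence accuracy with greedy autoregressive decoding. It assumes a decoder-only interface returning per-position logits; the names model, encode_fn and decode_fn are hypothetical stand-ins, not the paper's actual API.

```python
import torch

@torch.no_grad()
def sequence_accuracy(model, examples, encode_fn, decode_fn, device="cpu"):
    """Share of examples whose whole completion is generated correctly with greedy decoding."""
    correct = 0
    for prompt, target in examples:
        ids = encode_fn(prompt)                                  # list of token ids
        for _ in range(len(target)):                             # autoregressive generation
            logits = model(torch.tensor([ids], device=device))   # (1, len(ids), vocab_size)
            ids.append(int(logits[0, -1].argmax()))              # greedy: most probable token
        correct += decode_fn(ids[-len(target):]) == target
    return correct / len(examples)
```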
All the experiments included in this paper can be easily reproduced by running the code available at: (to be disclosed upon acceptance).

Learning addition and multiplication
Figure 1 shows that our simple LM is able to learn addition in less than 50 epochs, and multiplication in about 250 epochs. As expected, multiplication is more complex and requires more training: this is due to the high non-linearity of this operation (more on this later) and to the greater length of the output (14 vs 8 tokens). The accuracy on the validation set is very close to that on the training set, denoting almost perfect generalization to numbers never seen before. This is a somewhat surprising result, especially considering the limited size of the training data. No grokking was observed (Nanda et al., 2023). Similar results were obtained with nanoGPT (see Figure E.7 in Appendix E). Unlike Nogueira et al. (2021) (see their Appendix B for a similar setup), we were able to learn addition with the native sinusoidal positional encoding. Moreover, in Lee et al. (2023) additions can be effectively learnt by a simple LM, but to reach 100% accuracy the training set had to be balanced in terms of the operand magnitude (i.e., number of digits) and carry propagation.
The effectiveness of our training procedure is probably due to the lower complexity determined by a small vocabulary and a fixed-length representation. As for multiplication, Muffo et al. (2023) were not able to effectively learn two-digit (decimal) multiplication, while Lee et al. (2023) and Liu and Low (2023) had to provide extra intermediate steps in the prompt (denoted as detailed scratchpad) or during instruction tuning, respectively. On the contrary, our model effectively learnt the multiplication of 7-bit binary operands: again, the simplified setup may have been the key.
On the workstation used (with a single Titan RTX GPU) training can be completed in just 8 and 46 minutes for addition and multiplication, respectively. An estimation of the training complexity C of an LLM in terms of floating point operations is 6 × N × T (Kaplan et al., 2020), where N is the number of model parameters (about 701K, as reported in Table 2) and T the number of training tokens. T can be obtained as the product of the training set size (12288 in our experiments, see Section 3.2), the sequence length in tokens (23 and 29 for addition and multiplication, respectively, see Section 3.1) and the number of epochs (50 and 250 for addition and multiplication, respectively). Hence, for addition T is 14M (12288 × 23 × 50) and therefore C is about 59 × 10^12 operations, while for multiplication T is 89M (12288 × 29 × 250) and C is about 374 × 10^12 operations.
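The estimate above can be reproduced with a few lines of Python following the C ≈ 6 N T rule; only the function name is ours.

```python
def training_flops(n_params: int, n_examples: int, seq_len: int, epochs: int) -> int:
    """C ~ 6 * N * T (Kaplan et al., 2020), with T = examples * tokens-per-sequence * epochs."""
    T = n_examples * seq_len * epochs
    return 6 * n_params * T

print(f"{training_flops(701_000, 12288, 23, 50):.2e}")    # addition:       ~5.9e13 FLOPs
print(f"{training_flops(701_000, 12288, 29, 250):.2e}")   # multiplication: ~3.7e14 FLOPs
```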

Control experiment: random output
If the output is randomly generated and therefore has no relation with the input, the only possibility of learning the training set is memorizing the whole data. Figure 2 shows the training results: a much larger number of epochs (i.e., 1000) was necessary to reach a sequence accuracy of 87.8% and, as expected, the validation accuracy did not increase over the epochs. The difficulty of memorizing the training set (many more epochs) is due to the high discontinuity of the input-output mapping. In fact, because of the random output generation, very similar input sequences can be associated with completely different outputs.
Therefore, even if we only consider the accuracy on the training set, this result shows that an exhaustive memorization of the input is much more complex for the LM than solving the addition and multiplication tasks. This leads us to assume that, to efficiently solve the above computation tasks, the LM has found a computational approach (or algorithm) that simplifies the output prediction. Now the question is: what is this approach?

The computational approach
Let us consider two alternative approaches. Symbolic Manipulation (SM): a first idea is that the LM could learn the binary integer addition/multiplication algorithms used by an ALU inside a CPU (see Appendix B for a short reminder). Indeed, the addition algorithm is not complex and can be solved by using a 3-bit truth table (to sum each pair of corresponding bits with the carry-in) and iterative carry-out propagation. However, multiplication (by iterative additions) is much more complex and trickier to learn with a symbolic manipulation approach. Furthermore, as shown in Lee et al. (2023), a simple LM can also learn complex operations such as the sine function or the square root, whose mathematical (and algorithmic) decomposition is very complex, since they require a Taylor expansion and Newton's method, respectively.
Encoding-Regression-Decoding (ERD): if we consider the model architecture (Transformer) used for the LM and the underlying word embedding by vector representations, it is more likely that the LM solves the problem by decomposing it into the following three phases:
1. Encoding (token to value): maps the input sequence (i.e., a_0 a_1 a_2 a_3 a_4 a_5 a_6 ⟨op⟩ b_0 b_1 b_2 b_3 b_4 b_5 b_6) to a suitable vector representation. In principle, two vectors v_A and v_B representing the values (or magnitudes) of A and B are enough.
2. Regression: learns the computation as a supervised regression problem in the vector space: v_R = regress(v_A, v_B). Actually, this regression formulation is an oversimplification of the problem, since in next-token-prediction training the LM works incrementally. This discussion is expanded in Appendix C.
3. Decoding (value to token): maps the value vector v_R back to the token representation (i.e., r_0 r_1 ... r_{m-1}).
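To make the hypothesis concrete, the sketch below mimics the three ERD phases with an explicit pipeline: a hand-written encoder from tokens to magnitudes, a small external regressor trained in value space, and a decoder back to LSB-first tokens. This is a schematic illustration of the hypothesis, not the Transformer's actual internals; the helper names are ours and scikit-learn is used only for convenience.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def encode(prompt: str) -> np.ndarray:
    """Encoding phase (token -> value): map the 15-token prompt to the two operand magnitudes."""
    to_value = lambda bits: sum(int(b) << i for i, b in enumerate(bits))   # LSB-first bits
    return np.array([to_value(prompt[:7]), to_value(prompt[8:15])], dtype=float)

def decode(value: float, n_bits: int) -> str:
    """Decoding phase (value -> token): round, clamp, and emit LSB-first bits."""
    v = min(max(int(round(value)), 0), 2 ** n_bits - 1)
    return format(v, f"0{n_bits}b")[::-1]

# Regression phase: a small MLP learns v_R = regress(v_A, v_B) directly in value space (addition).
ab = np.array([(a, b) for a in range(128) for b in range(128)], dtype=float)
reg = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=3000, random_state=0)
reg.fit(ab, ab[:, 0] + ab[:, 1])

# Full pipeline on the prompt for 3 + 5; ideally the decoded string is '00010000' (= 8, LSB-first).
print(decode(reg.predict(encode("1100000+1010000")[None, :])[0], 8))
```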
It is worth noting that the above Encoding and Decoding phases do not need to be mapped onto the Transformer encoder and decoder (more on this later). The experiments reported in Sections 4.4 and 4.5 support the ERD assumption. The capability of capturing number magnitudes with pretrained embedders was also investigated by Wallace et al. (2019), who successfully trained a simple external regressor to compute the sum of two numbers starting from their embeddings. Other interesting studies on capturing numeracy with embeddings were carried out by Naik et al. (2019) and Sundararaman et al. (2020).

Interpolation vs extrapolation
The random training/validation split performed for the experiments reported in Section 4.1 constitutes a somewhat simplified testbed for learning the two tasks. In fact, a random split leads to a complete (even if sparse) coverage of the input space by both the training and validation sets, where each example in the validation set has a high chance of being close to a training set example, and interpolation is enough to fill the gaps.
Hereafter, we exploit the well-known difficulty of a numerical regressor to work in the extrapolation regime to gain insights about the computational approach of the LM. In particular, we considered two different criteria to isolate specific portions of the input space for the validation set, in order to better investigate extrapolation capabilities:
VS_t = NN_4096((A*, B*))
where NN_4096((A*, B*)) is the set of 4096 pairs (A, B) which are the nearest neighbors to a centroid (A*, B*) according to the Hamming distance between the corresponding token representations (i.e., the number of different tokens at corresponding positions). As centroid (A*, B*) in the token space we used: 1010101 ⟨op⟩ 0101010.
VS_v = {(A, B) : 32 ≤ A < 96, 32 ≤ B < 96}
here the centroid is located in the middle of the value space (64, 64), so VS_v is a square region (of side 64) centered in the value space.
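A small sketch of how the two validation subsets can be built is given below. It reuses the make_example and hamming helpers defined earlier; the exact tie-breaking among equidistant neighbors for VS_t is an assumption, and the half-open bounds for VS_v follow the reconstruction above.

```python
def make_splits(op: str):
    """Validation subsets: VS_t (nearest to a token-space centroid) and VS_v (value-space square)."""
    all_pairs = [(a, b) for a in range(128) for b in range(128)]
    centroid = "1010101" + op + "0101010"
    # VS_t: the 4096 pairs closest to the centroid in token space (Hamming distance on prompts).
    by_token = sorted(all_pairs, key=lambda ab: hamming(make_example(*ab, op)[0], centroid))
    vs_t = set(by_token[:4096])
    # VS_v: the 64x64 square centered at (64, 64) in value space, i.e. 32 <= A, B < 96.
    vs_v = {(a, b) for a, b in all_pairs if 32 <= a < 96 and 32 <= b < 96}
    return vs_t, vs_v

vs_t, vs_v = make_splits("+")
print(len(vs_t), len(vs_v))   # 4096 4096
```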
Both VS_t and VS_v isolate a contiguous data region of 4096 samples to be included in the validation set, but in the former the samples are close in the token representation space, while in the latter they are close in the value space. Since such contiguous portions of space are excluded from the training set, we can expect worse generalization. From the results (see Figure 3) we note that VS_t only marginally affects LM training and generalization, while VS_v has a major impact: in fact, in the second case, for both addition and multiplication the final sequence accuracy is from 4 to 6 percentage points lower. This result strengthens the ERD hypothesis, since: (i) using VS_v excludes a specific contiguous portion of the value space during phase 2 and does not allow to properly train the regressor in this region; (ii) the encoding performed during phase 1 makes the selection performed according to VS_t irrelevant because, after encoding, the corresponding data points remain scattered in the value space and the regressor can easily interpolate among them. Similar results were obtained with nanoGPT (see Figure E.8 in Appendix E).

Looking at internal representations
Understanding the internal representations (embeddings in the vector space) of a trained Transformer is not an easy task. However, in the specific setting considered, we can gain some hints by looking at the distances between the embeddings of different data points (at different layers) and correlating them with the corresponding distances at the input/output levels.
Given an LM trained on addition (or multiplication), we consider the dataset S including the 128 input pairs where the two operands have identical values:
S = {(X, X) : X ∈ [0, 127]}
At the input level (in) we can compute two ordered sets of 8128 (128 × 127/2) distances each:
d_in,t = {hdist(X, Y)}    d_in,v = {|X − Y|}
where hdist(X, Y) is the Hamming distance between the token representations of X and Y, and the subscript letters t and v denote the token and value levels, respectively.
At the output level (out) we can compute the two corresponding sets of distances as:
d_out,t = {hdist(P, Q)}    d_out,v = {|P − Q|}
where (P = X + X and Q = Y + Y) for addition, and (P = X × X and Q = Y × Y) for multiplication.
Finally, for each intermediate level of the Transformer encoder (enc) or decoder (dec) we can compute the Euclidean distances among the corresponding embedding vectors:
d_enc_i = {∥enc_i(X, X) − enc_i(Y, Y)∥_2}    d_dec_i = {∥dec_i(X, X) − dec_i(Y, Y)∥_2}
where enc_i and dec_i are the output vectors obtained by concatenating all the token embeddings (each of dimensionality 64) after the i-th encoder and decoder layer, respectively. For example, enc_i has dimensionality 960 = 64 × 15, where 15 is the number of tokens in the encoder.
Even if the distances in the different sets have different ranges, we can use correlation to find out similarities. If two sets of distances are correlated, we can expect that the corresponding representations/embeddings are correlated as well. Since both Pearson and Spearman correlations (Schober et al., 2018) provided similar outputs, for simplicity in Figure 4 we report only Pearson correlations.
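A sketch of this correlation analysis is shown below. It reuses the to_bits_lsb_first and hamming helpers from earlier sketches; embeddings is a hypothetical dictionary mapping X to the concatenated layer embedding of the input (X, X), and the token-level distance is computed on the operand bits only (for pairs (X, X) this is proportional to the full-prompt distance, so correlations are unchanged).

```python
import numpy as np
from itertools import combinations
from scipy.stats import pearsonr

def distance_sets(embeddings: dict[int, np.ndarray]):
    """Ordered sets of pairwise distances over the 128 inputs (X, X), X in [0, 127]."""
    d_tok, d_val, d_emb = [], [], []
    for x, y in combinations(range(128), 2):            # 128 * 127 / 2 = 8128 pairs
        d_tok.append(hamming(to_bits_lsb_first(x, 7), to_bits_lsb_first(y, 7)))
        d_val.append(abs(x - y))
        d_emb.append(float(np.linalg.norm(embeddings[x] - embeddings[y])))
    return np.array(d_tok), np.array(d_val), np.array(d_emb)

# Correlate embedding distances at a given layer with token- and value-level distances:
# d_tok, d_val, d_emb = distance_sets(layer3_embeddings)   # layer3_embeddings is hypothetical
# print(pearsonr(d_emb, d_tok).statistic, pearsonr(d_emb, d_val).statistic)
```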
The yellow cells in the tables of Figure 4 confirm the low correlation between the token and value representations at both the input and output levels. The blue cells show that the correlation remains quite similar across the encoder layers, as if the encoder were not performing any significant computation (this is confirmed in Section 5, where we achieve similar results by totally removing all intermediate attention and MLP layers in the encoder). More interesting is the trend of correlations across the decoder layers (green cells). In particular, for addition the token representation has high correlation with the first and last layers and low correlation with the central layers, while the value representation has the opposite trend (see also Figure 4.c). These results support the ERD hypothesis and in particular that the initial and final layers in the decoder transform from token to value representation (and vice versa) while the central layers perform regression in the value space. In particular, at layer 3 the correlation at token level is minimum while the correlation at value level is maximum.
For multiplication the low-high-low trend at value level is less evident (Figure 4.d, orange curve), probably because the quadratic dependence of the output on the input (at value level) does not allow learning a simple regressor smoothly working in the whole vector space, and the mapping is performed by piecewise linear approximation in different space regions, which introduces discontinuities that make global distances in the vector space unsuitable to quantify representation similarity.
As discussed in Section 2.2, correlation analyses might be insufficient to prove that the presence of a certain piece of information in the embeddings is really necessary to compute the output (direct causation). So, to further strengthen our hypothesis, we applied an amnesic probing technique (Elazar et al., 2021) and proved that, upon removal of the value information from the embeddings, the LM is no longer capable of performing the right computation. Details are reported in Appendix D.

Ablation study
This section presents the results of an ablation study where the LM architecture was simplified, in order to understand which components are necessary to learn the addition/multiplication computation. Squeezing the encoder (i.e., removing all intermediate attention and MLP layers) does not have a relevant impact; this is consistent with other works claiming that a decoder-only architecture (Liu et al., 2018) can achieve results similar to an encoder-decoder Transformer, and it is further confirmed by the nanoGPT results presented in Appendix E. A simplification of the architecture in terms of (i) reduction of the dimensionality, (ii) reduction of the number of heads and (iii) removal of the fully connected layers is well tolerated, while positional embedding and attention layers are mandatory for the LM to properly perform the token-to-value transformation (and vice versa). Table 3 summarizes the results.
Table 3: Epochs necessary to reach 95% accuracy on the validation set. A dash is used when 95% accuracy is not achieved within 1K epochs: in such cases the accuracy reached is reported within brackets.

[Table 3 body, with columns: Configuration, Addition, Multiplication.]

Discussion and conclusions
In this paper we introduced a simplified setup that allows a light LM to learn binary addition and multiplication. Both the LM architectures considered easily learn the two tasks and generalize well to unseen data, proving that memorization of the training data is neither necessary nor efficient. The experiments on the interpolation/extrapolation capabilities and on the correlation of input-output representations with internal embeddings suggest that the model solves the computational task as a supervised regression problem in the value space, after an initial encoding from tokens to values and before a final decoding from the output value back to tokens. Under this hypothesis: (i) any task that can be solved by a neural network regressor can be solved by an LM as well, with the extra burden of learning the encoding/decoding steps end to end; (ii) when looking at the interpolation/extrapolation capabilities of an LM applied to a mathematical task, we should not concentrate on the input token representation but on the internal representation after encoding, keeping in mind the difficulty of a numerical regressor to work on space regions not covered by the training set; (iii) on a more speculative side, we could guess that modern LLMs learn the number encoding/decoding once and reuse it across different numerical tasks, whereas a specific regressor is learned for each task.
Our ERD hypothesis could be questioned in light of some recent findings by Lee et al. (2023), where providing in the prompt intermediate information (scratchpad) about the decomposition of arithmetic tasks improves the training efficiency and requires fewer examples. This could suggest that a symbolic manipulation approach is adopted to learn by imitating, step by step, the proposed decomposition. However, in most of the cases their model was able to learn the same task (even if more slowly) without scratchpad and/or with wrong scratchpads. As argued by the authors, the higher efficiency is actually in terms of examples and not in terms of tokens, since each scratchpad requires a large number of extra tokens, and we guess these could be used as extra features by the underlying regressor. Furthermore, the scratchpad contribution is negligible for more complex operations such as sine and square root but, unexpectedly, learning such complex operations was simpler than learning multiplication. This is not strange under the ERD hypothesis, where a unary smooth operator like the sine can be learned by a supervised regressor independently of the mathematical method used for its computation.
The algorithmic interpretation that Nanda et al. (2023) provided for modular addition could also suggest that their LM discovered an efficient symbolic manipulation approach; however, as discussed in Section 2.3, it is more likely that a regressor was learned to numerically approximate an efficient sparse Fourier decomposition, under regularization constraints favoring sparsity. Finally, the information flow described in Stolfo et al. (2023) points out that MLPs in the last layers are responsible for the numerical computation of the solution, which is compatible with the hypothesis of a multi-layer regressor.
Of course we are not claiming that all the capabilities of modern LLMs can be explained by regression, but regression is likely to be one of the internal tools that LLMs use to predict the next token when numbers come into play.
As for future research, we plan to: i) further investigate the generalization capabilities of LMs in arithmetic tasks with respect to the composition of the training and test sets (Feng et al., 2023; Keskar et al., 2017); ii) design simplified experiments/setups for tasks that cannot be easily mapped to regression problems, such as chains of reasoning and logic deductions.

Appendix B. Binary addition and multiplication
Binary addition can be performed bit by bit: if a_i and b_i are the current bits, then a two-output 3-bit truth table (Table B.5) can be used to generate the output bit o_i and the carry c_i used when summing the next pair of bits:

[Table B.5 here, with columns: Inputs, Outputs.]
A simple approach to execute binary multiplication is through iterative binary sums. Each bit b_i of the second operand is multiplied by the whole first operand, but this inner multiplication is straightforward, since it results either in a sequence of 0s (if b_i = 0) or in a copy of the first operand (if b_i = 1). This intermediate result is then shifted left and summed to the current output. An example is reported in Figure B.5 below.
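The textbook procedures recalled in this appendix can be summarized in a few lines of Python operating on the paper's LSB-first bit strings; this is a sketch of the classical algorithms, not of the LM's internal computation.

```python
def binary_add(a_bits: str, b_bits: str) -> str:
    """Bitwise addition of LSB-first strings using the 3-bit truth table of Table B.5."""
    out, carry = [], 0
    for i in range(max(len(a_bits), len(b_bits))):
        a = int(a_bits[i]) if i < len(a_bits) else 0
        b = int(b_bits[i]) if i < len(b_bits) else 0
        s = a + b + carry
        out.append(str(s % 2))          # output bit o_i
        carry = s // 2                  # carry c_i for the next position
    out.append(str(carry))
    return "".join(out)

def binary_mul(a_bits: str, b_bits: str) -> str:
    """Shift-and-add multiplication: each bit of b selects a shifted copy of a."""
    acc = "0"
    for i, bit in enumerate(b_bits):
        if bit == "1":
            acc = binary_add(acc, "0" * i + a_bits)   # shift left by i positions, then add
    return acc

print(binary_mul("1100000", "1010000"))   # 3 * 5 = 15 -> '1111000000' (LSB-first)
```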

Appendix D. Amnesic probing results
The outcome of the correlation analyses performed in Section 4.3 suggests that the embeddings in the central layers of the decoder contain information related to the value representation of the output (see Figure 4). However, correlation does not mean causation, and here we investigate deeper. Amnesic probing was proposed by Elazar et al. (2021), building on the approach of Ravfogel et al. (2020), to check to what extent a model output is affected by the removal of specific features or attributes from intermediate-level embeddings. Here we focus on addition and we try to remove some features from the decoder layer 3 embeddings (dec_3(X, Y)). To this purpose, a linear probe (a linear regressor in our case) was trained to predict the output value (X + Y) starting from the dec_3(X, Y) embeddings, and its nullspace is used to project the embeddings into a new space lacking the output value information. Following Ravfogel et al. (2020), due to the simplicity of the linear regressor used, the procedure is repeated twice to remove more information. Our results show that:
1. a simple linear regressor trained on dec_3(X, Y) embeddings can reach high accuracy in predicting X + Y (rmse = 0.28);
2. if the projected embeddings are overwritten in the LM decoder at level 3, and a partial forward pass is performed thereafter, the addition sequence accuracy severely drops from 100% to 0.13%;
3. as indicated in Elazar et al. (2021), since any information removal could hamper the model accuracy, a control test was performed by removing the same amount of information (but along random directions instead of the nullspace directions): in this case the LM final sequence accuracy remained 100%.
This experiment provides further support to the hypothesis that the value information is not only present in the inspected embeddings but is also crucial for the output computation.
On the computational side, we argue that the complexity of amnesic probing is low because it relies on simple steps such as linear regression and nullspace computation, with the former being the most demanding. The complexity of linear regression is O(nd^2 + d^3), where n is the number of training examples and d the dimensionality of the embeddings.
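A minimal single-direction sketch of the nullspace projection underlying this procedure is given below; it simplifies the Ravfogel et al. (2020) approach to one least-squares probe per round, and the names dec3_embeddings and output_values are hypothetical placeholders for the quantities used in this appendix.

```python
import numpy as np

def nullspace_projector(W: np.ndarray) -> np.ndarray:
    """Projection onto the nullspace of W (each row of W is a probe direction to remove)."""
    row_space = W.T @ np.linalg.pinv(W @ W.T) @ W      # projector onto the row space of W
    return np.eye(W.shape[1]) - row_space

def amnesic_step(embeddings: np.ndarray, targets: np.ndarray) -> np.ndarray:
    """Fit a linear probe for the output value, then project its direction out of the embeddings."""
    w, *_ = np.linalg.lstsq(embeddings, targets, rcond=None)   # (d,) probe weights
    P = nullspace_projector(w.reshape(1, -1))
    return embeddings @ P.T

# Repeated twice, as in the text, before overwriting the decoder layer 3 embeddings:
# E1 = amnesic_step(dec3_embeddings, output_values)
# E2 = amnesic_step(E1, output_values)
```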

Appendix E. NanoGPT -a decoder-only LM
To demonstrate that our findings generalize beyond the encoder-decoder architecture of the original Transformer used in this work, the main experiments have been repeated using a second LM, namely the nanoGPT (Karpathy, 2022) decoder-only model. Table E.6 reports the details of the nanoGPT model adopted.

Figure 1: Sequence accuracy. From the left: addition and multiplication. Results are averaged over five runs. Note that the training and validation curves almost overlap. At the end of training the Mean Absolute Error (MAE) on the validation set, between the real and generated operation results, is 0 and 1.3 for addition and multiplication, respectively.

Figure 2: Sequence accuracy using random output in the training set. Results are averaged over five runs.

Figure 3: Sequence accuracy on the Random, VS_t, and VS_v validation subsets for addition (left) and multiplication (right). Results are averaged over five runs. VS_t reaches 100% accuracy on addition (the same as the Random split) and 97.5% accuracy on multiplication (just 1.4% less than the Random split); VS_v reaches 93.7% on addition and 94.3% on multiplication (6.3% and 4.6% less than the Random split, respectively).

Figure 4: Pearson correlation between ordered sets of distances for addition (a) and multiplication (b). Each cell denotes the correlation between the two ordered sets of distances specified in the corresponding row and column. Note that since, for addition, in this experiment the output value is always twice the input, the correlation values (blue and green cells) are the same for the d_in and d_out blocks of values. Graphs (c) and (d) show the correlations of the output distances d_out,t (at token level, blue curves) and d_out,v (at value level, orange curves) with the embedding distances d_dec_i across the 6 decoder layers, for addition and multiplication respectively.

Figure B.5: Example of 4-digit binary multiplication. The sum can be performed incrementally with a two-operand adder.

Figure C.6: Sequence accuracy on the validation set for reverse (default in this work) and plain order of the input and output representations. From left to right: addition and multiplication.

Figure E.7 shows that the nanoGPT model was able to learn addition and multiplication even more efficiently than the original Transformer (compare Figure 1 with Figure E.7). For the training, we used a minibatch size of 128, a standard cross-entropy loss, the AdamW optimizer with a learning rate of 0.001 and betas = 0.9 and 0.98, and gradient clipping to 1.0.
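A minimal PyTorch sketch of a training loop matching these settings is shown below; model and train_loader are hypothetical stand-ins, and the forward-call signature is an assumption rather than nanoGPT's exact API.

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, betas=(0.9, 0.98))
loss_fn = torch.nn.CrossEntropyLoss()

for inputs, targets in train_loader:                          # minibatch size 128
    logits = model(inputs)                                    # (batch, seq_len, vocab_size)
    loss = loss_fn(logits.view(-1, logits.size(-1)), targets.view(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # gradient clipping to 1.0
    optimizer.step()
```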

Figure E.7: Sequence accuracy of the nanoGPT model (refer to Section 4.1 for more details). From the left: addition and multiplication. Results are averaged over five runs. Note that the training and validation curves almost overlap.

Figure E.8 shows the sequence accuracy of the nanoGPT model on the Random, VS_t, and VS_v validation subsets for addition and multiplication (see Section 4.4 for more details). Using the VS_t subset, it reaches 100% and 99.9% accuracy on addition and multiplication, respectively (the same as the Random split) while, using the VS_v subset, it reaches 82.0% on addition and 80.6% on multiplication (18.0% and 19.4% less than the Random split, respectively). Results are in line with those obtained in Section 4.4, but here the difference between VS_t and VS_v is even more significant.

Figure E.8: Sequence accuracy of the nanoGPT model on the Random, VS_t, and VS_v validation subsets for addition (left) and multiplication (right). Results are averaged over five runs.

Table 1: The main contributions of this work.

Table 2: Details of the LM model used in our experiments. The total number of learnable parameters is just 701K, which is several orders of magnitude smaller than recent billion-parameter LLMs.

Table B.5: Two-output 3-bit truth table for binary addition.

Table E.6: Details of the nanoGPT model.