Reservoir Transformers

We demonstrate that transformers obtain impressive performance even when some of the layers are randomly initialized and never updated. Inspired by old and well-established ideas in machine learning, we explore a variety of non-linear “reservoir” layers interspersed with regular transformer layers, and show improvements in wall-clock compute time until convergence, as well as overall performance, on various machine translation and (masked) language modelling tasks.


Introduction
Transformers (Vaswani et al., 2017) have dominated natural language processing (NLP) in recent years, from large scale machine translation (Ott et al., 2018) to pre-trained (masked) language modeling (Devlin et al., 2018; Radford et al., 2018), and are becoming more popular in other fields as well, from reinforcement learning (Vinyals et al., 2019) to speech recognition (Baevski et al., 2019) and computer vision (Carion et al., 2020). Their success is enabled in part by ever increasing computational demands, which has naturally led to an increased interest in improving their efficiency. Scalability gains in transformers could facilitate bigger, deeper networks with longer contexts (Kitaev et al., 2020; Wang et al., 2020; Beltagy et al., 2020; Kaplan et al., 2020; Tay et al., 2020b). Conversely, improved efficiency could reduce environmental costs (Strubell et al., 2019) and hopefully help democratize the technology.
In this work, we explore a simple question: if some layers of the transformer are kept frozen (i.e., never updated after random initialization), can we match the performance of fully learned transformers, while being more efficient? Surprisingly, the answer is resoundingly yes; and what is more, we find that freezing layers may actually improve performance.
Beyond desirable efficiency gains, random layers are interesting for several additional reasons. Fixed randomly initialized networks (Gallicchio and Scardapane, 2020) converge to Gaussian processes in the limit of infinite width (Daniely et al., 2016), have intriguing interpretations in metric learning (Rosenfeld and Tsotsos, 2019; Giryes et al., 2016), and have been shown to provide excellent "priors" either for subsequent learning (Ulyanov et al., 2018) or pruning (Frankle and Carbin, 2018). Fixed layers allow for efficient low-cost hardware implementations (Schrauwen et al., 2007) and can be characterized using only a random number generator and its seed. This could facilitate distributed training and enables highly efficient deployment to edge devices, since it only requires transmission of a single number. The strong performance of networks with fixed layers also sheds new light on the inner workings of BERT (Devlin et al., 2018), and layer-wise interpretations of such models (Rogers et al., 2020; Tenney et al., 2019). It appears that "not all layers are created equal" (Zhang et al., 2019) is true to such an extent that some layers can simply remain random and fixed.
Random projections have a long history in machine learning. By Cover's theorem (Cover, 1965), any high-dimensional non-linear transformation is more likely to be linearly separable than its lower-or-equal-dimensional input space. By Johnson-Lindenstrauss (Johnson and Lindenstrauss, 1984), random projections distort Euclidean distances very little under mild assumptions, which is useful e.g. for dimensionality reduction and random indexing (Sahlgren, 2005). Fixed random layers in neural networks pre-date deep learning by far (Gamba et al., 1961; Baum, 1988). Indeed, random kernel methods have long been influential in machine learning (Rahimi and Recht, 2008, 2009).
arXiv:2012.15045v2 [cs.CL] 1 Jun 2021
One way to think of such layers is as "reservoirs" (Lukoševičius and Jaeger, 2009), where a highly non-linear high-dimensional black box representation is provided to a lightweight "readout" network, as in echo state networks (Jaeger, 2003) and liquid state machines (Maass et al., 2002). The benefit of such an approach is that the reservoir has fixed parameters and is computationally efficient, as it can be pre-computed and does not (necessarily) require backpropagation.
In NLP, Wieting and Kiela (2019) showed that random sentence encoders present a strong baseline for text classification, with subsequent work showing applications in a variety of tasks from summarization to machine translation (Enguehard et al., 2019; Garg et al., 2020; Pilault et al., 2020). To our knowledge, this work is the first to examine this phenomenon in transformers, and the first to recursively alternate reservoirs with subsequent transformer layers acting as readout functions. We introduce "reservoir transformers", wherein fixed random reservoir layers are interspersed with regular updateable transformer layers. The goal of this work is to put our understanding of transformer models on a more solid footing by providing empirical evidence of their capabilities even when some of their parameters are fixed. Our contributions are as follows:
• We introduce an area under the convergence curve metric for measuring performance-efficiency trade-offs, and show that replacing regular transformer layers with reservoir layers leads to improvements.
• We show that the addition of reservoir layers leads to improved test set generalization on a variety of tasks in a variety of settings.
• We show that pre-trained masked language modelling architectures like BERT and RoBERTa (Liu et al., 2019) can benefit from having some of their layers frozen, both during pre-training as well as when fine-tuning on downstream tasks.
• We experiment with different types of reservoir layers, including convolutional and recurrent neural network-based ones.
• We show empirical evidence that the backward pass can be skipped in its entirety by approximating upstream gradients using an approach we call backskipping, which can reduce the training compute further without sacrificing performance.

Approach
This paper is based on a very simple idea. Neural networks are trained via backpropagation, which involves consecutive steps of matrix addition and multiplication, i.e.,
\[ \theta_{t+1} \leftarrow \theta_t - \eta \frac{\partial J}{\partial \theta_t}; \qquad \frac{\partial J}{\partial \theta_t} = \frac{\partial J}{\partial L_n} \frac{\partial L_n}{\partial L_{n-1}} \cdots \frac{\partial L_0}{\partial x} \]
for some objective $J$, parameterization $\theta$ and learning rate $\eta$, with the gradient computed via the chain rule, where $L_i$ is the $i$-th layer of the neural network and $x$ is the input. Let $L = \mathrm{Transformer}(X)$ be a single layer in a Transformer network (Vaswani et al., 2017), i.e.,
\[ H = \mathrm{MultiHeadSelfAttn}(\mathrm{LayerNorm}(X)) + X, \qquad L = \mathrm{FFN}(\mathrm{LayerNorm}(H)) + H. \]
Now, during every "backward pass", we compute the Jacobian for parameters $\theta^L$ at layer $L$, which are used to update the parameters of $L$, $\theta^L_t$, as well as to compute the next layer's Jacobian, thus back-propagating the gradients. In this work however, for some of the layers, we still backpropagate through them to compute gradients for earlier layers, but we never apply the parameter update. As a result, these layers stay fixed at their initialization, saving computational resources.
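The idea of backpropagating through a layer without ever updating it can be illustrated with a deliberately tiny sketch. The three scalar "layers" below (linear, tanh, linear), the initial weights, the learning rate and the squared-error objective are all illustrative assumptions, not the transformer layers or hyperparameters used in this paper; the point is only that the gradient flows through the frozen middle layer while its weight `w2` is never changed.

```python
import math

def forward(w1, w2, w3, x):
    """Three scalar 'layers': linear, tanh (the frozen reservoir), linear."""
    h1 = w1 * x
    h2 = math.tanh(w2 * h1)
    y = w3 * h2
    return h1, h2, y

def train(steps=200, lr=0.1, x=1.0, target=0.5):
    # Hypothetical initial values; w2 belongs to the frozen layer.
    w1, w2, w3 = 0.5, 0.8, 0.3
    losses = []
    for _ in range(steps):
        h1, h2, y = forward(w1, w2, w3, x)
        losses.append(0.5 * (y - target) ** 2)
        # Backward pass via the chain rule, from the output towards the input.
        dy = y - target
        dw3 = dy * h2
        dh2 = dy * w3
        dz2 = dh2 * (1.0 - h2 ** 2)  # gradient flows *through* the frozen layer
        dh1 = dz2 * w2
        dw1 = dh1 * x
        # Parameter updates: the frozen layer's weight w2 is deliberately skipped.
        w1 -= lr * dw1
        w3 -= lr * dw3
    return (w1, w2, w3), losses

params, losses = train()
```

In this toy setting the loss still decreases because the trainable layers on either side adapt to the fixed non-linear transformation in the middle.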

Background
Naturally, never updating some of the parameters is computationally more efficient, as some matrix addition operations can be skipped in the backward pass, but why is this not detrimental to the performance of the network?
The theoretical justification for these approaches lies in two well-known results in machine learning: Cover's theorem (Cover, 1965) on the separability of patterns states that high-dimensional non-linear transformations are more likely to be linearly separable; and the Johnson-Lindenstrauss lemma (Johnson and Lindenstrauss, 1984) shows that (most) random projections distort Euclidean distances very little.
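The Johnson-Lindenstrauss property can be checked numerically with a short sketch. The dimensions (1000 projected down to 256), the Gaussian projection scaled by one over the square root of the output dimension, the seeds and the five random points are all assumptions chosen for illustration; none of them come from the experiments in this paper.

```python
import math
import random

def random_projection(dim_in, dim_out, seed=0):
    """Fixed Gaussian projection, scaled so squared norms are preserved in expectation."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(dim_in)]
            for _ in range(dim_out)]

def project(matrix, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def distance(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

rng = random.Random(1)
points = [[rng.uniform(-1.0, 1.0) for _ in range(1000)] for _ in range(5)]
proj = random_projection(1000, 256)
projected = [project(proj, p) for p in points]

# Ratio of projected to original distance for every pair of points.
ratios = [distance(projected[i], projected[j]) / distance(points[i], points[j])
          for i in range(5) for j in range(i + 1, 5)]
```

Under these assumptions each ratio should lie close to 1, with a spread that shrinks as the output dimension grows.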
Practically, random layers can be seen as a cheap way to increase network depth. There are interesting advantages to this approach. Fixed layers are known to have particularly low-cost hardware requirements and can be easily implemented on high-bandwidth FPGAs with low power consumption (Hadaeghi et al., 2017; Tanaka et al., 2019), or on optical devices (Hicke et al., 2013). This might yield interesting possibilities for training in a distributed fashion across multiple devices, as well as for neuromorphic hardware (Neftci et al., 2017). This approach also facilitates lower-latency deployment of neural networks to edge devices, since weights can be shared simply by sending the seed number, assuming the random number generator is known on both ends.
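The seed-sharing argument can be made concrete with a small sketch. The function name `reservoir_weights`, the plain Gaussian entries and the seed value are hypothetical; the sketch only assumes that both ends use the same deterministic generator, here Python's `random.Random`.

```python
import random

def reservoir_weights(seed, rows, cols):
    """Regenerate a fixed random weight matrix from nothing but its seed."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# The "sender" trains with these weights and transmits only the integer 1234.
sender = reservoir_weights(seed=1234, rows=4, cols=4)
# The "receiver" rebuilds the identical matrix locally from that seed.
receiver = reservoir_weights(seed=1234, rows=4, cols=4)
```

Only the seed and the shape have to be communicated, rather than the full weight matrix.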

Reservoir Transformers
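This work explores inserting random non-linear transformations, or what we call reservoir layers, into transformer networks. Specifically, we experiment with a variety of reservoir layers:
• Transformer Reservoir: The standard transformer layer as described above, but with all parameters fixed after initialization, including the self-attention module.
• FFN Reservoir: A transformer-style fixed feed-forward layer without any self-attention.
• BiGRU Reservoir: A fixed bidirectional Gated Recurrent Unit (Cho et al., 2014) layer.
• CNN Reservoir: A fixed Convolutional Neural Network (LeCun et al., 1998) layer, specifically light dynamical convolutions (Wu et al., 2019).
We find that all these approaches work well, to a certain extent. For clarity, we focus primarily on the first two reservoir layers, but include a broader comparison in Appendix A.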
In each case, contrary to traditional reservoir computing, our reservoir layers are interspersed throughout a regular transformer network, or what we call a reservoir transformer. Since random projections are not learned and might introduce noise, subsequent normal transformer "readout" layers might be able to benefit from additional depth while allowing us to recover from any adverse effects of randomness. For example, previous work has shown that ResNets, with all of their parameters fixed except for the scale and shift parameters of batch normalization, can still achieve high performance, simply by scaling and shifting random features (Frankle et al., 2020). Adding some form of noise to the parameters is also known to help convergence and generalization (Jim et al., 1995, 1996; Gulcehre et al., 2016; Noh et al., 2017).

Evaluation
We evaluate the proposed approach on a variety of well-known tasks in natural language processing, namely: machine translation, language modelling and masked language model pre-training.
We set out to do this work with the main objective of examining any potential efficiency gains, i.e. the relationship between compute time and task performance. This is closely related to efforts in Green AI, which are concerned with the trade-offs between compute, data, and performance (Schwartz et al., 2019). We propose to measure this trade-off via the area under the convergence curve (AUCC): similarly to how the area under the receiver operating characteristic (Bradley, 1997, AUC-ROC) measures a classifier's performance independent of the classification threshold, AUCC measures a model's performance independent of the specific compute budget. Specifically, AUCC is computed as follows:
\[ \mathrm{AUCC} = \int_{t=0}^{T} \sum_{x, y \in D} g_t(f(x), y) \, dt \]
where $f$ is the network, $g$ is the evaluation metric and $D$ is the validation data, measured until convergence time $T$, which is the maximum convergence time of all models included in the comparison. Note that time here is wall-clock time, not iterations. By convergence, we mean that validation performance has stopped improving, and hence the convergence curve whose area we measure plots the desired metric over time. Runs are averaged over multiple seeds and reported with standard deviation. We normalize raw AUCC scores by their maximum to ensure a more interpretable [0, 1] range.
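As a rough sketch of how such a score could be computed from logged validation curves, the code below integrates a piecewise-linear (wall-clock time, metric) curve with the trapezoidal rule up to a horizon T and then normalizes by the maximum. The function names, the trapezoidal rule, the choice to hold the last value flat when a run stops before T, and the two example curves are all assumptions made for illustration; the numbers are not results from this paper.

```python
def area_under_curve(curve, horizon):
    """Trapezoidal area under a (time, metric) curve on [0, horizon].

    Assumes the curve starts at time 0 and is sorted by time. If the run
    stops before the horizon, the last value is held flat (an assumption).
    """
    area = 0.0
    prev_t, prev_v = curve[0]
    for t, v in curve[1:]:
        if t > horizon:
            # Interpolate the metric at the horizon and stop there.
            v = prev_v + (v - prev_v) * (horizon - prev_t) / (t - prev_t)
            t = horizon
        area += 0.5 * (prev_v + v) * (t - prev_t)
        prev_t, prev_v = t, v
        if t >= horizon:
            break
    if prev_t < horizon:
        area += prev_v * (horizon - prev_t)
    return area

def normalized_aucc(curves, horizon):
    """Raw areas divided by their maximum, giving scores in [0, 1]."""
    raw = {name: area_under_curve(c, horizon) for name, c in curves.items()}
    best = max(raw.values())
    return {name: a / best for name, a in raw.items()}

# Made-up curves: (hours, validation BLEU). The second converges earlier.
curves = {
    "regular": [(0, 0), (2, 30), (4, 34)],
    "reservoir": [(0, 0), (1, 30), (3, 34)],
}
scores = normalized_aucc(curves, horizon=4)
```

With these made-up curves the faster-converging model receives the higher score even though both reach the same final metric, which is the trade-off the metric is meant to capture.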
One potential downside of this approach is that the AUCC metric could lead to higher scores for a model that converges quickly but to ultimately worse performance, if measured in a small window. This can be solved by making sure that T is set sufficiently high. We include the raw validation curves in the appendix to demonstrate that the chosen window sizes are sufficient and the results are not influenced by this limitation. In addition, we report the number of trainable parameters and the wall-clock training time until maximum performance (plus 95% and 99% convergence results in the appendix). Finally, we show test set generalization in each experiment. Overall, this gives us a wide set of axes along which to examine models.
We use 8 Volta V100 GPUs for WMT and enwik8, 32 V100 GPUs for RoBERTa and a single V100 for IWSLT. The hyperparameters for IWSLT14 and WMT16 were set to the best-performing values from Ott et al. (2018) and Kasai et al. (2020) respectively. The enwik8 experiment settings followed Bachlechner et al. (2020) and the RoBERTa experiments followed Liu et al. (2019).
All the experiments in this paper were run with 3 random seeds and the mean and standard deviation are reported. For the relatively small IWSLT, the T value in the AUCC metric was set to 4 hours. For the larger WMT, we set it to 20 hours. For enwik8, it was 30 hours; and for the RoBERTa pre-training experiments, it was set to 60 hours.
The projection weights in random layers were initialized using orthogonal initialization (Saxe et al., 2013), since random orthogonal projections should ideally be maximally information-preserving, and since this was found to work well empirically for initializing fixed random representations in previous work (Wieting and Kiela, 2019). Biases and layer norm parameters were initialized using their respective PyTorch defaults (based on Xavier init; Glorot and Bengio, 2010).
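To make the "information-preserving" intuition concrete, the sketch below builds a random orthogonal matrix with Gram-Schmidt orthogonalization of Gaussian vectors. This is a stand-in written in plain Python for illustration only: it is not the initialization routine used in the experiments, and the function name, matrix size and seed are assumptions.

```python
import math
import random

def orthogonal_matrix(n, seed=0):
    """Random n x n matrix with orthonormal rows (Gram-Schmidt on Gaussian vectors)."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Remove the components along the rows accepted so far.
        for u in rows:
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:  # skip (numerically) degenerate draws
            rows.append([a / norm for a in v])
    return rows

W = orthogonal_matrix(8)

# An orthogonal projection leaves the Euclidean norm of its input unchanged.
x = [float(i + 1) for i in range(8)]
Wx = [sum(w * xi for w, xi in zip(row, x)) for row in W]
norm_in = math.sqrt(sum(a * a for a in x))
norm_out = math.sqrt(sum(a * a for a in Wx))
```

Because the rows are orthonormal, `norm_out` matches `norm_in` up to floating-point error, which is one way to read the claim that orthogonal projections preserve information.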
We intersperse reservoir layers in alternating fashion starting from the middle. Specifically, we alternate one reservoir layer with one transformer layer, and place the alternating block in the middle. For example: a 7-layer encoder LLLLLLL in which we replace three layers with reservoirs becomes LRLRLRL, and with two becomes LLRLRLL. See Appendix C for a study comparing this strategy to alternative approaches (e.g., freezing in the bottom, middle or top).
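The placement rule can be sketched as a small helper that reproduces the two examples above. The function name `reservoir_layout` is hypothetical, and the tie-breaking rule when the alternating block cannot be centered exactly (integer division, so the block sits one position closer to the bottom) is an assumption not specified in the text.

```python
def reservoir_layout(num_layers, num_reservoir):
    """Return a string of 'L' (trainable) and 'R' (reservoir) layers.

    The reservoirs form an alternating block R L R ... R that is centered
    in the stack; floor division breaks ties (an assumption).
    """
    if num_reservoir == 0:
        return "L" * num_layers
    block_len = 2 * num_reservoir - 1
    if block_len > num_layers:
        raise ValueError("too many reservoir layers for an alternating layout")
    start = (num_layers - block_len) // 2
    layout = ["L"] * num_layers
    for j in range(num_reservoir):
        layout[start + 2 * j] = "R"
    return "".join(layout)

three = reservoir_layout(7, 3)  # "LRLRLRL"
two = reservoir_layout(7, 2)    # "LLRLRLL"
```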

Experiments
In what follows, we first show our main result, on a variety of tasks: reservoir transformers mostly have better AUCC metrics; less training time per epoch; less convergence time until the best validation performance is achieved; and even improved test set generalization metrics. As a strong baseline method, we compare to LayerDrop (Fan et al., 2019). LayerDrop can also be seen as a method that dynamically bypasses parts of the computation during Transformer training in an attempt to improve efficiency, making it a strong point of comparison for our methods. Then, we examine whether we can minimize the expectation over the gradients of upstream layers in the network such that we do not have to pass gradients through the reservoir layers at all, skipping their backward pass.

Machine Translation
Machine translation (MT) is one of the core tasks of NLP. We demonstrate on two well-known MT datasets, IWSLT'14 German-English and WMT'16 English-German, that reservoir transformers obtain a better AUCC. For the raw validation plots over time that were used to calculate the AUCC, please refer to Appendix F.
Following Kasai et al. (2020), the architecture of the network is an N-layer reservoir transformer encoder, followed by a regular shallow one- or two-layer decoder. This design choice has been shown to lead to very good speed and efficiency trade-offs, and serves as a good baseline for our experiments. Moreover, shallow decoders make it easier to decide where to place reservoir layers (in the encoder) and make it more straightforward to identify where performance gains come from.
Figure 1 shows the results for IWSLT (left) and WMT (middle). On the y-axis we show validation AUCC for the BLEU metric; on the x-axis we show the number of updatable layers in the encoder. The performance of a regular transformer encoder with 6 layers and a reservoir transformer encoder with 6 layers plus N additional reservoir layers are plotted for the same x-axis value to show the total number of updated layers. Plots for the total number of layers (updatable plus not updatable, so essentially shifted versions of the plots) are shown in Appendix E.
WMT is much larger and requires a much deeper encoder, as illustrated by the fact that a certain minimum depth is required for reservoir transformers to achieve a comparable validation AUCC. At test time, reservoir transformers outperform regular transformers for almost all encoder depths. The FFN Reservoir seems to work best in both cases, which is surprising because it does not have any self-attention component at all. This finding suggests that self-attention, or the mechanism to summarize context information, should be learned if present. Once the context features have been gathered, a random projection via a fixed FFN module appears to be beneficial.
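Tables 1 and 2 show the time it took to achieve the maximum validation BLEU score and how that relates to the regular transformer, demonstrating that reservoir transformers consistently converge faster in terms of wall-clock time. We save up to 22% convergence wall-clock time using reservoir transformers with the same number of updateable layers. We save as much as 27% of the time until convergence for a 24-layer model on WMT, as shown in Table 2. One other noticeable point is that the T Reservoir achieves similar performance to LayerDrop on IWSLT and WMT in terms of wall-clock time per epoch and wall-clock time to the best performance. However, on both tasks, the FFN Reservoir performs much better than LayerDrop in terms of efficiency per epoch and achieves better or similar performance in less time in each case. As a point of reference, a half hour gain on IWSLT would translate to a gain of several days in the training of bigger transformer models like GPT-3 (Brown et al., 2020). We observe that reservoir transformers consistently perform better than, or are competitive with, regular transformers, both in terms of validation BLEU AUCC as well as test time BLEU, for all examined encoder depths.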

Language Modelling
To examine whether the same findings hold for other tasks, we evaluate on the enwik8 (LLC, 2009) language modelling task. We examine the BPC (bits per character) rate for a variety of network depths (since the task is language modelling, these layers are in the decoder). The results show that except for the 64-layer regular transformer, which appears to be particularly well suited to this task, we obtain consistently better BPC for all depths. We observe similar trends during test time.

Masked Language Model Pretraining
We train RoBERTa (Liu et al., 2019) models from scratch at a variety of depths, both in the normal and reservoir setting. We find that these networks show minor differences in their best perplexity and similar AUCC perplexity (see Appendix D).
We then examine the performance of these models when fine-tuned on downstream tasks, specifically the well known SST-2 (Socher et al., 2013) and MultiNLI-matched (Williams et al., 2017) tasks. When fine-tuning the reservoir models, we keep the reservoir layers fixed (fine-tuning them as well did not work very well, see Appendix D).
Figure 2 shows the results of fine-tuning. We observe that the reservoir transformer outperforms normal RoBERTa at all depths in both tasks. At lower depth, the improvements are substantial. As a sanity check, we also experiment with freezing some of the layers in a regular pre-trained RoBERTa model during fine-tuning only (Transformer "frozen finetuned" in the figure) and show that this helps a little but is still outperformed by the reservoir transformer.
These findings suggest that we can train a RoBERTa model without updating all of the layers, achieving similar perplexity at a similar computational cost, but with better downstream performance. This strategy could prove to be beneficial in a wide variety of pre-training scenarios.
We follow Jawahar et al. (2019) and investigate what the frozen layers in the Reservoir Transformer have actually "learned" (while being frozen) as measured by probing tasks, reported in Table 4. The set of tasks comprises one surface task, three syntactic tasks, and five semantic tasks.
From the table, we can see that probing performance is generally quite similar between the Transformer and the T Reservoir model. We also noticed that the representations collected after the reservoir layers (3, 5, 7, 9) in the T Reservoir actually have significantly better performance than the regular Transformer representations across all the probing tasks. Related to our findings, Voita and Titov (2020) show that wholly randomly initialized model representations can still have reasonable probing accuracy if they are contextualized, though the accuracy is strictly worse than that of a trained model. These findings have interesting repercussions for the study of "BERTology", as they show that even completely random and frozen layers can represent linguistic phenomena.

Backskipping
With the reservoir transformers as described above, we obtain better efficiency by skipping the "gradient application" matrix addition step in some of the layers (i.e., updating the weights). One step further would be to investigate skipping the entire backward pass for reservoirs altogether, which would save us from having to do the much more expensive matrix multiplication for these layers that is required for the propagation of gradients through a regular layer.
We report on preliminary experiments where in the backward pass we replace the gradients for the layer $L_i$ going into the reservoir $L_{i+1}$ with a noisy estimate (Jaderberg et al., 2017; Czarnecki et al., 2017). Promisingly, Oktay et al. (2020) recently asked "why spend resources on exact gradients when we're going to use stochastic optimization?" and show that we can do randomized autodifferentiation quite successfully. Here, rather than minimizing the actual gradients $\partial L_i / \partial \theta^{L_i}$, we minimize their expectation and train via continuous-action REINFORCE (Williams, 1992). That is, $L_i$ becomes a policy $\pi_a : s \to \mu$ where we sample actions $a \sim \mathcal{N}(\mu, 1)$. We train to minimize the gradient prediction loss via MSE, i.e., $\frac{1}{n} \sum_{i=1}^{n} (R_i - V_i(a))^2$, where the value network $V$ acts as the baseline. $R$ is defined as the mean of the gradients of the top layer $L_{i+2}$, with the sign flipped. Thus, simply put, we train to minimize the expectation of the true gradients at the layer directly following the reservoir. We employ an annealing scheme where we first train the value network and propagate the true gradients during warmup. Afterwards, we anneal the probability of backskipping instead of doing a true backward pass (multiplying the probability by 0.99 every iteration until we only backskip). We experimented with setting $R$ to the negation of the total loss but found the mean upstream gradient reward to work better. We call this approach backskipping.
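The annealing scheme admits more than one reading; the sketch below shows one possible interpretation, in which the probability of doing a true backward pass starts at 1 after warmup and is multiplied by 0.99 at every iteration, so that the probability of backskipping approaches 1. The function names, the warmup length of 100 steps and this interpretation of the schedule are assumptions, not details confirmed by the text.

```python
import random

def backskip_probability(step, warmup_steps, decay=0.99):
    """Probability of skipping the true backward pass at a given step.

    During warmup the true gradients are always propagated. Afterwards the
    probability of a true backward pass decays geometrically (an assumed
    reading of the schedule), so backskipping becomes almost certain.
    """
    if step < warmup_steps:
        return 0.0
    p_true = decay ** (step - warmup_steps + 1)
    return 1.0 - p_true

def use_backskip(step, warmup_steps, rng):
    """Sample whether this iteration replaces the backward pass with an estimate."""
    return rng.random() < backskip_probability(step, warmup_steps)

rng = random.Random(0)
# Hypothetical run: 100 warmup steps, then 400 annealed steps.
decisions = [use_backskip(s, warmup_steps=100, rng=rng) for s in range(500)]
```

Under this reading, no iteration backskips during warmup, and after a few hundred annealed steps virtually every iteration does.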
As shown in Table 3, the backskip reservoir approach leads to a higher maximum BLEU score than the regular transformer, with a much higher AUCC and much lower training time. The encoder depth is 8 with 2 frozen. Appendix G shows the raw validation BLEU curves over time. We observe that this approach helps especially during the earlier stages of training. This finding opens up intriguing possibilities for having parts of neural networks be completely frozen both in the forward as well as in the backward pass, while still contributing to the overall model computation.
The computational cost is heavily reduced given that we completely bypass the expensive backpropagation computation in the reservoir layers. Backskipping thus appears to be a promising approach to further reduce computational costs, and would be even more efficient from a hardware perspective since the circuitry for such layers (which do not need to propagate gradients) can be hardwired.

Related Work
Recent work has shown that modern NLP models are able to function with different numbers of layers for different examples (Elbayad et al., 2019; Fan et al., 2019; He et al., 2021); that different layers specialize for different purposes (Zhang et al., 2019); that layers can be compressed (Li et al., 2020; Zhu et al., 2019; Shen et al., 2020; Sun et al., 2020); and that layers can be reordered (Press et al., 2019). There is a growing body of work in efficient self-attention networks (Tay et al., 2020b), such as linear attention (Wang et al., 2020), on how to process long context information (Beltagy et al., 2020; Ainslie et al., 2020) and on approximations to make transformers more scalable (Kitaev et al., 2020; Katharopoulos et al., 2020). BigBIRD (Zaheer et al., 2020) provides random keys as additional inputs to its attention mechanism. Locality sensitive hashing (LSH) as employed e.g. in Reformer (Kitaev et al., 2020) utilizes a fixed random projection. Random Feature Attention (Peng et al., 2021) uses random feature methods to approximate the softmax function. Performer (Choromanski et al., 2020) computes the transformer's multi-head attention weights as a fixed orthogonal random projection. Closely related to this work, Tay et al. (2020a) showed that randomized alignment matrices in their "Synthesizer" architecture are sufficient for many NLP tasks. While these works focus on random attention, we show that entire layers can be random and fixed. We also show that entire layers can be replaced by fixed random projections that do not have any attention whatsoever.
Beyond transformers, random features have been extensively explored. Examples of this include FreezeOut (Brock et al., 2017), deep reservoir computing networks (Scardapane and Wang, 2017; Gallicchio and Micheli, 2017), as well as applications in domains as varied as text classification (Conneau et al., 2017; Zhang and Bowman, 2018; Wieting and Kiela, 2019) or music classification (Pons and Serra, 2019). It is well known that randomly initialized networks can display impressive performance on their own (Ulyanov et al., 2018; Rosenfeld and Tsotsos, 2019; Ramanujan et al., 2020; Voita and Titov, 2020), which underlies, for example, the recently popularized lottery ticket hypothesis (Frankle and Carbin, 2018; Zhou et al., 2019). We know that learning deep overparameterized networks appears to help in general (Li and Liang, 2018; Du et al., 2019). Our method constitutes a way to add both depth and parameters to transformer networks without much computational cost.

Conclusion
This work demonstrated that state-of-the-art transformer architectures can be trained without updating all of the layers. This complements a long history in machine learning of harnessing the power of random features. We use the "area under the convergence curve" (AUCC) metric to demonstrate that on a variety of tasks, and in a variety of settings, "reservoir transformers" achieve better performance-efficiency trade-offs. We show that such reservoir transformers show better convergence rates and test-set generalization. We demonstrated that the backward pass can be skipped altogether, opening up exciting avenues for future research. Future work includes further investigating hybrid networks and backskipping strategies, as well as utilizing pruning.

A Hybrid Networks and Non-Transformer Reservoirs
We investigate whether reservoir layers need to be transformer-based (or transformers-without-attention, i.e., FFN). We examine two different alternatives: bidirectional Gated Recurrent Units (Cho et al., 2014) and Convolutional Neural Networks (LeCun et al., 1998; Kim, 2014), specifically light dynamical convolutions (Wu et al., 2019). Figure 3 shows the results for these hybrids: depending on the setting, they may obtain a better AUCC than the regular transformer, but this is less consistent than with the other reservoir layers, most likely because these layers have different computational properties. It's possible that these hybrids simply require further tuning, as we found e.g. up-projecting to help for BiGRUs, but studying this is outside of the scope of the current work.

B Deep Decoders
We show that the same results hold for a 6-layer decoder on IWSLT (although less pronounced for AUCC, probably because the decoder is computationally heavier). See Figure 4 and Table 5.

C Freezing Strategy
We explored different strategies for the placement of reservoir layers and found the "alternating" strategy reported in the main body of the paper to work best. Generally, we found repetitive application of reservoirs to yield diminishing returns, as might be expected. See Figure 5.

D RoBERTa Results
Here we present the additional results for RoBERTa, i.e., convergence plots and AUCCs for various depth settings, in Figure 7. As stated in the main paper, the differences in terms of AUCC and convergence between RoBERTa models with and without reservoir layers are limited. Moreover, we plot downstream task performance for SST-2 and MNLI compared to the pretraining wall-clock time in Figure 6. It can be seen that the FFN Reservoir can achieve up to 25% and 10% pretraining time savings while matching the best performance.

E Reservoir Results for Total Layers
Here we present the shifted Reservoir Results for IWSLT14, WMT16, enwik8 and RoBERTa finetuning in Figures 8, 9, 10 and 11, respectively. We show that the same results also hold when replacing normal transformer blocks with reservoir blocks, at least for MT.

F Validation Plots
Here we present the validation plots used to calculate the AUCC, for training an 8-layer encoder, 2-layer decoder model for IWSLT14, a 24-layer encoder, 1-layer decoder model for WMT16, a 48-layer decoder model for enwik8 and a 12-layer decoder model for RoBERTa. It can be clearly observed that given the configurations from Section 3.1, all the models have converged. So when we compute the area under the convergence curve, this depicts the training efficiency of the model (basically time x performance) until convergence. Specifically, we set T sufficiently high for computing the AUCC, which is 4h for IWSLT, 20h for WMT, 30h for enwik8 and 60h for RoBERTa pretraining. From the training plots in the appendix, we can see that each model has converged at that point. The Reservoir model in Figure 12 has 2 layers frozen for IWSLT14, 8 layers frozen for enwik8, and 4 layers frozen for WMT16 and RoBERTa.

G Backskipping
Figure 13 shows the BLEU curves for IWSLT comparing regular vs reservoir vs backskipped transformers, with the latter performing surprisingly well.

Figure 3: IWSLT comparison of different hybrid architectures with different reservoir layers.

Figure 8: Validation BLEU AUCC and test BLEU for IWSLT (high is good). Comparison of regular transformer and reservoir transformer with FFN or Transformer reservoir layers added.

Figure 10: Validation BPC AUCC and test BPC on the enwik8 language modelling task (low is good). Comparison of regular and reservoir transformers for varying depths.

Figure 13: IWSLT comparison of the regular, reservoir and backskipped transformer architectures (encoder has 8 layers with 2 frozen, if any).
Figure 1: Validation (top) and test (bottom) results for IWSLT (left), WMT (middle) and enwik8 language modelling (right). IWSLT and WMT are BLEU (high is good); enwik8 is BPC (low is good). Comparison of regular transformer (blue) and reservoir transformer with FFN (green) or Transformer reservoir (orange) layers added.

Table 2 .
One other noticeable point is that we can see that the T Reservoir achieves similar performance to LayerDrop on IWSLT and WMT in terms of wall-clock per epoch and wallclock time to the best performance.However, on both tasks, FFN Reservoir performs much better than LayerDrop in terms of efficiency per epoch

Table 1: Wall-clock time (averaged over multiple runs) saved for IWSLT for different model types and encoder depths. Max BLEU is for validation. Number of layers is for encoder, decoder depth is kept fixed at 2. The ratio is computed compared to the corresponding number of layers in the regular transformer case.

Table 2: Wall-clock time (averaged over multiple runs) saved for WMT for different model types and encoder depths. Decoder depth is kept fixed at 1.
and achieves better/similar performance in less time in each case. As a point of reference, a half hour gain on IWSLT would translate to a gain of several days in the training of bigger transformer models like
Figure 2: Downstream RoBERTa performance on SST-2 (left) and MultiNLI-matched (right).

Table 3: Validation max BLEU, AUCC at 4h and wall-clock time per epoch (averaged over multiple runs, in seconds) on IWSLT comparing backskipping with regular and reservoir transformers.

Table 5: Wall-clock time (averaged over multiple runs) for IWSLT for different model types and encoder depths. Max BLEU is for validation. Number of layers is for encoder, decoder depth is kept fixed at 6. Ratio is computed compared to comparable number of layers in the normal case.
and Table 5.

Table 6: Wall-clock time (averaged over multiple runs) for IWSLT/WMT for different model types and encoder depths. 95% Max BLEU is for validation.

Table 7: Wall-clock time (averaged over multiple runs) saved for IWSLT/WMT for different model types and encoder depths. 99% Max BLEU is for validation.