Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs

Large-scale pretraining and task-specific fine-tuning is now the standard methodology for many tasks in computer vision and natural language processing. Recently, a multitude of methods have been proposed for pretraining vision and language BERTs to tackle challenges at the intersection of these two key areas of AI. These models can be categorized into either single-stream or dual-stream encoders. We study the differences between these two categories, and show how they can be unified under a single theoretical framework. We then conduct controlled experiments to discern the empirical differences between five vision and language BERTs. Our experiments show that training data and hyperparameters are responsible for most of the differences between the reported results, but they also reveal that the embedding layer plays a crucial role in these massive models.


Introduction
Learning generic multimodal representations from images paired with sentences is a fundamental step towards a single interface for vision and language (V&L) tasks. In pursuit of this goal, many pretrained V&L models have been proposed in the last year, inspired by the success of pretraining in both computer vision (Sharif Razavian et al., 2014) and natural language processing (Devlin et al., 2019). All of these V&L models extend BERT (Devlin et al., 2019) to learn representations grounded in both modalities. They can be classified as either (i) single-stream, where images and text are jointly processed by a single encoder (e.g., Li et al., 2019), or (ii) dual-stream, where the inputs are encoded separately before being jointly modelled (e.g., Tan and Bansal, 2019).
The differences in downstream performance between single- and dual-stream models are currently unclear, with some papers claiming the superiority of one family over the other, while others argue that it is hard to draw any conclusion (Qi et al., 2020).
The first goal of this paper is to understand the mathematical differences between single- and dual-stream models. Our analysis leads to a unified framework of which currently proposed architectures, both single- and dual-stream, are particular instances. We then implement several of the proposed encoders within this framework to empirically measure their differences in a controlled environment. We believe this comparative analysis is crucial to better understand and guide future research of massive models in this vibrant area of AI, ensuring progress is not blurred by confounds.
In fact, there are many differences in the protocols used to train V&L BERTs. In order to better understand these models, we conduct a series of controlled studies to investigate whether differences in downstream performance are explained by: (i) the amount of pretraining data and the pretraining objectives (e.g., Figure 2); (ii) the hyperparameters used to control the learning process; (iii) the variance caused by random initialization when pretraining (e.g., Figure 1); (iv) the variance due to fine-tuning multiple times on a downstream task; (v) being single- or dual-stream architectures; or (vi) the choice of the embedding layer.
In summary, our contributions in this paper are: • We introduce a unified mathematical framework in which currently proposed V&L BERTs are only a subset of the possibilities.
• We release code for VOLTA (Visiolinguistic Transformer architectures), a PyTorch implementation of this framework in order to speed up research in multimodal pretraining.

Figure 1: How does the amount of pretraining data affect downstream performance of V&L BERTs? We find that these models perform more similarly when trained in the same conditions. This plot shows the results from the papers (♦), and when each model is pretrained 10 times on the Conceptual Captions dataset and fine-tuned once on the NLVR2 verification task (•). The area of a marker is proportional to the amount of pretraining data. The result from the VISUALBERT paper is highlighted in a dashed box.
• We conduct a series of controlled studies, finding that several models perform similarly when trained under the same conditions.
• While we find that single- and dual-stream families perform equally well, performance can differ significantly between two models, and the embedding layer plays a key role.
• However, these V&L BERTs are sensitive to weight initialization and state-of-the-art claims should not be made from single runs.

Vision-and-Language BERTs
Given a sequence of tokens {w_1, ..., w_T} and a set of visual features {v_1, ..., v_K}, a shared goal of V&L BERT models is to produce cross-modal representations that are useful for downstream tasks grounded in both modalities.
In this section, we first review how these models embed their inputs to the feature space. Next, we discuss the main differences in the encoders and, finally, highlight a variety of confounds that might affect the performance achieved by these models.

Input Embeddings
Language Input All V&L BERTs adopt the approach of BERT: The input sequence is first tokenized into sub-word units (Wu et al., 2016; Sennrich et al., 2016) and two special tokens [CLS] and [SEP] are added to generate the text sequence {[CLS], w_1, ..., w_T, [SEP]}. The embedding of each token is then given by the sum of three learnable vectors, corresponding to its form, position in the sequence, and segment (Devlin et al., 2019). In addition, VL-BERT also adds the visual feature of the entire image to each token.
Vision Input Typically, visual inputs are also very similar across all V&L BERTs. For a given image, a pretrained object detector is used to extract regions of interest, representing salient image regions. For each region, in addition to its feature vector, the object detector also returns the spatial location of its bounding box, which most V&L BERTs encode analogously to the word position in the language modality. While most approaches embed spatial locations in very similar ways, VL-BERT relies on a more complex geometry embedding, and they are missing altogether in VISUALBERT (Li et al., 2019). Some models also include a special feature [IMG] that denotes the representation of the entire image (e.g., a mean-pooled visual feature with a spatial encoding corresponding to the full image). Finally, PIXEL-BERT (Huang et al., 2020) does not rely on an object detector but directly extracts a set of visual embeddings from the raw image.
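As an illustration of these embedding layers, here is a minimal NumPy sketch. The names and toy dimensions are our own, not those of any released model: token embeddings are summed with position and segment vectors, while region features are combined with a projection of their bounding-box coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden size (toy value; the real models use 768)

# Language side: sum of token, position, and segment vectors
vocab, max_len, n_segments = 100, 16, 2
tok_emb = rng.normal(size=(vocab, d))
pos_emb = rng.normal(size=(max_len, d))
seg_emb = rng.normal(size=(n_segments, d))

def embed_text(token_ids, segment_id=0):
    """Embedding of each token = form + position + segment vector."""
    T = len(token_ids)
    return tok_emb[token_ids] + pos_emb[:T] + seg_emb[segment_id]

# Vision side: detector feature plus a projection of its spatial location
feat_dim = 12  # detector feature size (toy; e.g. 2048 for a ResNet backbone)
W_feat = rng.normal(size=(feat_dim, d))
W_box = rng.normal(size=(5, d))  # normalised (x1, y1, x2, y2, area)

def embed_regions(features, boxes):
    """Each region embedding = projected feature + projected location."""
    return features @ W_feat + boxes @ W_box
```

Both streams end up in the same d-dimensional space, which is what allows a shared encoder (or a pair of coupled encoders) to attend over them jointly.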

Encoders
Single-stream Encoders The majority of V&L BERTs follow the single-stream paradigm (e.g., Li et al., 2019; Li et al., 2020a,b). Here, a standard BERT architecture is given the concatenation of the visual and linguistic features of an image-text pair as input (Figure 3a). This design allows for an early and unconstrained fusion of cross-modal information.
Dual-stream Encoders VILBERT (Lu et al., 2019), LXMERT (Tan and Bansal, 2019), and ERNIE-VIL (Yu et al., 2021) are based on a dual-stream paradigm. Here, the visual and linguistic features are first processed by two independent stacks of Transformer layers. The resulting representations are then fed into cross-modal Transformer layers where intra-modal interactions are alternated with inter-modal interactions (see Figure 3b and c). Interestingly, both VILBERT and LXMERT model inter-modal interactions in the same way: Each stream first computes its query, key, and value matrices, before passing the keys and values to the other modality. By doing so, these models explicitly constrain interactions between modalities at each layer, inhibiting some of the interactions that are possible in a single-stream encoder, while increasing their expressive power through separate sets of learnable parameters.

Pretraining Objectives
V&L BERTs are pretrained by jointly optimizing multiple different self-supervised objectives over tokens and image regions through (weighted) scalarization:

L(θ) = Σ_o λ_o L_o(θ),

where θ denotes a model's parameters, L_o is the o-th objective, and λ_o is its corresponding weight. Commonly adopted objectives are of three types: language, vision, and cross-modal predictions.
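As a sketch, the scalarization is just a weighted sum of the per-objective losses; the weights λ_o are hyperparameters chosen per model, and the loss values below are hypothetical.

```python
def scalarized_loss(losses, weights):
    """L(theta) = sum_o lambda_o * L_o(theta): weighted sum of objectives."""
    return sum(w * l for w, l in zip(weights, losses))

# e.g. hypothetical MLM, masked-region, and image-text matching loss values,
# all with unit weights
total = scalarized_loss([0.7, 1.2, 0.4], [1.0, 1.0, 1.0])
```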
For language prediction, BERT's denoising masked language modeling (MLM) objective is typically used. MLM replaces some tokens with a [MASK] symbol, which are then predicted by using bidirectional text context and image regions.
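A minimal sketch of the masking step. This is simplified: BERT selects 15% of tokens and replaces a selected token with [MASK] only 80% of the time (keeping or corrupting it otherwise), whereas here every selected token becomes [MASK].

```python
import random

def mask_tokens(tokens, rng, p=0.15):
    """Replace a random subset of tokens with [MASK] and record the
    original tokens as prediction targets (special tokens are kept)."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if tok not in ("[CLS]", "[SEP]") and rng.random() < p:
            targets[i] = tok       # position i must be predicted
            masked.append("[MASK]")
        else:
            masked.append(tok)
    return masked, targets
```

The model then predicts each target token from the full bidirectional context, which in the multimodal setting includes the image regions.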
The MLM objective has been extended to image regions via masked region modeling objectives. These typically take the form of either object classification or feature regression, with some papers showing benefits when modeling both. Some models, such as LXMERT, are also optimized over object attribute prediction.
Finally, interactions between the two modalities are explicitly enforced by means of cross-modal objectives. The typical task here is image-text matching (ITM), which extends BERT's next sentence prediction objective to V&L inputs: Given a sequence of tokens and a set of image regions, the model is tasked to predict whether the tokens describe the image.
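A common way to build ITM training examples is to keep each aligned image-caption pair as a positive and re-pair the image with a caption drawn from another example as a negative. This is a sketch; the exact sampling scheme varies across models.

```python
import random

def itm_batch(pairs, rng):
    """Build image-text matching examples from aligned (image, caption)
    pairs: each pair is a positive (label 1), and each image is also
    re-paired with a random other caption as a negative (label 0)."""
    examples = []
    for i, (image, caption) in enumerate(pairs):
        examples.append((image, caption, 1))
        j = rng.randrange(len(pairs) - 1)
        j = j + 1 if j >= i else j  # pick an index different from i
        examples.append((image, pairs[j][1], 0))
    return examples
```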

Further Distinctions
So far, we have given an overview of the core components in V&L BERTs. However, there are several implementation differences between them.
For instance, LXMERT introduces two main variations on the above description of dual-stream models. First, in its inter-modal layer, the parameters of the attention sub-layer are shared between the two streams. This results in the model learning a single function to contextualize image and text inputs, regardless of which modality plays the role of query or context. Second, its intra-modal layer only consists of the multi-head attention block.
Moreover, a wider range of choices can affect the performance of these models: the object detector used (and whether it is also fine-tuned during pretraining), the number of image regions and the maximum text sequence length, the number of layers and their hidden sizes, pooling methods and fine-tuning MLP sizes, the use of text-only data, and optimization hyperparameters (such as the number of pretraining epochs).
Another important distinction is the size and type of pretraining data, which can affect task performance (Figure 2). The size of pretraining datasets ranges from 3M to 10M image-text pairs, over a range of pretraining tasks. The literature distinguishes between ''in-domain'' and ''out-of-domain'' data, each of which may consist of multiple datasets. An in-domain dataset overlaps with common downstream tasks, for example, using VQAv2 (Goyal et al., 2017) as both a pretraining task and a downstream task, while out-of-domain datasets have no expected overlap, for example, Conceptual Captions (Sharma et al., 2018).
A Unified Framework

In this section, we unify the recently proposed single-stream and dual-stream architectures under the same mathematical framework. We start by reviewing the Transformer layer, which forms the core of these architectures; then we explain how this layer has been adapted to encode multimodal data in V&L BERTs, and introduce a gated bimodal Transformer layer that implements all of the architecture variants as special cases.

Transformer Layers
Transformer-based architectures consist of a stack of Transformer layers (Vaswani et al., 2017), each typically having a multi-head attention block (MAB) and a feed-forward block (FFB).

Multi-head Attention Block

Given N_q query vectors Q ∈ R^{N_q×d} and N_v key-value pairs K, V ∈ R^{N_v×d}, an attention function Att(Q, K, V) maps queries to output vectors with a scaled dot-product:

Att(Q, K, V) = ω(QK^T)V,    (1)

where ω denotes a row-wise, scaled softmax:

ω(S) = softmax(S / √d).    (2)

Here, S = QK^T ∈ R^{N_q×N_v} is a score matrix that measures the similarity between each pair of query and key vectors. The output of Eq. (1) is a weighted sum of V, in which a value gets higher weight if its corresponding key has a larger dot product with the query.

Multi-head attention (MHA) extends this function by first projecting Q, K, V into H different matrices and computing the attention of each projection (Eq. (1)). These H output vectors are concatenated together ([·]) and the concatenation is projected with a linear transformation W^O. Given inputs X, Y ∈ R^{N×d}, a multi-head attention block is defined as:

MAB(X, Y) = LN(X + MHA(X, Y, Y)),    (3)

where LN is layer normalization (Ba et al., 2016).
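The attention equations above can be sketched in a few lines of NumPy. This is a minimal illustration with H = 2 heads; the parameter shapes are illustrative, and the per-head projections are implemented by reshaping a single d×d matrix rather than keeping H separate ones.

```python
import numpy as np

def softmax(S, axis=-1):
    """Numerically stable row-wise softmax."""
    e = np.exp(S - S.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Eq. (1): Att(Q, K, V) = softmax(QK^T / sqrt(d)) V."""
    d = Q.shape[-1]
    S = Q @ K.swapaxes(-1, -2) / np.sqrt(d)  # score matrix, N_q x N_v
    return softmax(S) @ V

def layer_norm(X, eps=1e-6):
    """Per-row normalization (learnable gain/bias omitted for brevity)."""
    return (X - X.mean(-1, keepdims=True)) / (X.std(-1, keepdims=True) + eps)

def mab(X, Y, params):
    """Multi-head attention block: MAB(X, Y) = LN(X + MHA(X, Y, Y))."""
    Wq, Wk, Wv, Wo = params              # one (d, d) matrix per projection
    H = 2                                # number of heads (toy value)
    N, d = X.shape
    dh = d // H
    split = lambda M: M.reshape(M.shape[0], H, dh).swapaxes(0, 1)
    Q, K, V = split(X @ Wq), split(Y @ Wk), split(Y @ Wv)
    O = attention(Q, K, V)               # (H, N, dh), one output per head
    O = O.swapaxes(0, 1).reshape(N, d) @ Wo  # concat heads, project
    return layer_norm(X + O)
```

Note that MAB takes two arguments: with Y = X it performs self-attention, and with Y from another modality it performs the cross-attention used by the dual-stream encoders below.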

Feed-forward Block
For an input matrix M ∈ R^{N×d}, the feed-forward block is given by:

FFB(M) = LN(M + MLP(M)).    (4)

Standard Transformer Layer

Let X ∈ R^{N×d} be an embedded input sequence; a standard Transformer layer performing self-attention is a parameterized function f_θ: R^{N×d} → R^{N×d} such that:

f_θ(X) = FFB(MAB(X, X)).    (5)

An encoder of an input X, such as BERT, is then seen as a stack of L Transformer layers, each parametrized by θ_l:

Encoder(X) = f_θL ∘ · · · ∘ f_θ1(X).    (6)

Single-stream Multimodal Transformers
Single-stream V&L BERTs extend BERT by concatenating the embedded visual inputs X_V ∈ R^{N_V×d} and the embedded textual inputs X_L ∈ R^{N_L×d} as a single input X = [X_L; X_V] (hence the name ''single-stream''; Figure 3a), and the attention is over both modalities (Figure 4a). Hence, all single-stream models are of the type defined in the previous section: Encoder(X). The various approaches only differ in the initial V&L embeddings, the pretraining tasks, and the training data.

Dual-Stream Multimodal Transformers
Both VILBERT and LXMERT concurrently introduced inter-modal and intra-modal layers.
Inter-modal Transformer Layer The inter-modal layer explicitly models cross-modal interaction via a cross-modal attention module. Specifically, let M ∈ {L, V} denote either the linguistic (L) or the visual (V) modality, and \M its complementary one. The inter-modal multi-head attention for modality M is given by (Figure 3c):

M_M = MAB(X_M, X_\M).    (7)

Note that the second input to the multi-head attention block (Eq. (3)) is taken from the complementary modality, which means the keys K and values V in the scaled dot-product attention (Eq. (1)) operate across modalities (see Figure 4d and e). The remainder of this layer follows as from Eq. (4).
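A single-head sketch of this cross-modal attention (our own minimal implementation, not the models' actual code): queries come from one modality, while keys and values come from the other, so each token of modality M attends only over the complementary modality.

```python
import numpy as np

def softmax(S):
    e = np.exp(S - S.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def inter_modal_attention(X_M, X_other, Wq, Wk, Wv):
    """Single-head sketch of the inter-modal attention in Eq. (7):
    queries from modality M, keys/values from its complement."""
    Q = X_M @ Wq                        # queries: modality M
    K, V = X_other @ Wk, X_other @ Wv   # keys/values: other modality
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))   # each row attends across modalities
    return A @ V, A
```

Running it with language queries and visual keys/values yields an attention matrix of shape (N_L, N_V): every language token distributes its attention mass entirely over image regions.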

Intra-modal Transformer Layer
The intra-modal layer, on the other hand, is a Transformer layer computing the attention of each modality independently (see Figure 3b). For a modality M:

M_M = MAB(X_M, X_M).    (8)

The rest of the layer follows as in Eq. (4) for VILBERT, while there is no FFB block in LXMERT.

Dual-stream Attentions as Restricted Single-stream Attention
Recall that in single-stream models the input to a Transformer layer is the concatenation of both modalities, X = [X_L; X_V]. Therefore, in each single-stream attention head, the query representation is given by:

Q = XW^Q = [X_L W^Q; X_V W^Q] = [Q_L; Q_V],    (9)

where the L and V subscripts denote the language and visual sub-matrices of the input and the resulting output. A similar expression also holds for the keys K and values V. We note that the score matrix S can then be defined in terms of four sub-matrices (Figure 4a):

S = QK^T = [[Q_L K_L^T, Q_L K_V^T], [Q_V K_L^T, Q_V K_V^T]] = [[S_LL, S_LV], [S_VL, S_VV]].    (10)

Recall from Eq. (1) that the attention matrix is a normalised score matrix S, so each single-stream layer computes both intra-modal attention (the diagonal blocks of S) and inter-modal attention (the anti-diagonal blocks of S). In other words, the dual-stream inter-modal and intra-modal attention functions act as restricted versions of the attention function in any single-stream layer (see Figure 4). As a result, by interleaving inter- and intra-modal layers, dual-stream models introduce an inductive bias towards which interactions the model enforces in each layer.
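This decomposition is easy to verify numerically. The toy example below uses a single head and, as in a single-stream layer, shares W^Q and W^K across modalities.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_l, n_v = 4, 3, 2
X_L = rng.normal(size=(n_l, d))   # language embeddings
X_V = rng.normal(size=(n_v, d))   # vision embeddings
Wq = rng.normal(size=(d, d))
Wk = rng.normal(size=(d, d))

# Single-stream: one score matrix over the concatenated sequence
X = np.vstack([X_L, X_V])
S = (X @ Wq) @ (X @ Wk).T         # shape (n_l + n_v, n_l + n_v)

# Dual-stream scores are exactly its sub-matrices
S_LL = (X_L @ Wq) @ (X_L @ Wk).T  # intra-modal (language)
S_LV = (X_L @ Wq) @ (X_V @ Wk).T  # inter-modal (language queries, vision keys)
S_VV = (X_V @ Wq) @ (X_V @ Wk).T  # intra-modal (vision)
```

The top-left and bottom-right blocks of S are the intra-modal scores, and the off-diagonal blocks are the inter-modal scores, exactly as in Eq. (10).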

Gated Bimodal Transformer Layers
In the previous section, we showed that single-stream attention blocks capture both the inter-modal and intra-modal interactions, separately modeled by dual-stream architectures. We now introduce a general gated bimodal Transformer layer (Figure 3d), of which both single- and dual-stream layers are special cases. By doing so, we can define existing V&L BERTs within a single architecture, which allows us to implement and evaluate several of these models in a controlled environment (see next sections). In addition to the textual embeddings X_L and visual embeddings X_V, this layer takes a set of fixed binary variables {γ, τ} as part of its input: The γ values act as gates that regulate the cross-modal interactions within a layer, while the τ values control whether the parameters are tied between modalities.
The main difference in our gated layer is in its attention functions, originally defined in Eq. (1) and Eq. (2). Here, we extend them to bimodal inputs with controllable multimodal interactions: Each modality computes its own queries, keys, and values, and the attention outputs are projected with separate language and vision output matrices W^O_L and W^O_V. The attention output Att(Q, K, V) with a set of gating values γ is computed from a gated score matrix S_γ. Recall from Eq. (10) that the score matrix S can be defined in terms of intra-modal and inter-modal sub-matrices. Here, the gating values γ = {γ_LL, γ_LV, γ_VL, γ_VV} define the permitted intra-modal and inter-modal interactions. Letting ε → −∞, S_γ is given by:

S_γ = [[S_LL + γ_LL ε, S_LV + γ_LV ε], [S_VL + γ_VL ε, S_VV + γ_VV ε]].

That is, when an attention gate γ is set to 1, the corresponding sub-matrix tends to −∞, while it is unaltered when γ is set to 0. By having a sub-matrix that tends to −∞, we effectively compute the row-wise softmax (i.e., the attention) over the other sub-matrix, hence recovering the inter- and intra-modal attentions. This is similar to the input masking applied in autoregressive Transformer decoders (Vaswani et al., 2017).
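A numerical sketch of the gating mechanism, using a large negative constant in place of ε → −∞ (as in standard decoder masking); the helper names are our own.

```python
import numpy as np

NEG_INF = -1e9  # stands in for epsilon -> -infinity

def softmax(S):
    e = np.exp(S - S.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def gated_scores(S, n_l, gates):
    """Add gamma * eps to each sub-matrix of the score matrix S.
    gates = (g_LL, g_LV, g_VL, g_VV); a gate of 1 blocks that interaction,
    a gate of 0 leaves the sub-matrix unaltered."""
    g_ll, g_lv, g_vl, g_vv = gates
    M = np.zeros_like(S)
    M[:n_l, :n_l] = g_ll * NEG_INF
    M[:n_l, n_l:] = g_lv * NEG_INF
    M[n_l:, :n_l] = g_vl * NEG_INF
    M[n_l:, n_l:] = g_vv * NEG_INF
    return S + M

rng = np.random.default_rng(0)
n_l, n_v = 3, 2
S = rng.normal(size=(n_l + n_v, n_l + n_v))

# Inter-modal layer: block intra-modal interactions (gamma_LL = gamma_VV = 1)
A = softmax(gated_scores(S, n_l, (1, 0, 0, 1)))
```

With gates (1, 0, 0, 1) the attention of each token falls entirely on the other modality, recovering the inter-modal layer; with all gates at 0 the unrestricted single-stream attention is recovered.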
This formulation allows us to control the degree of inter- and intra-modal attention within a layer, allowing us to define existing architectures within a unified mathematical framework. We can recover an inter-modal block (Eq. (7)) by setting γ_LV = γ_VL = 0 and γ_LL = γ_VV = 1. Similarly, the single-stream block (Eq. (3)) can be recovered by setting γ = 0 and tying the learnable parameters between the two streams (τ = 1). Furthermore, the gated bimodal Transformer layer allows us to model a superset of the few combinations considered thus far for cross-modal fusion by multimodal Transformer encoders. One may explore asymmetric streams in which the two modalities interact differently with the bimodal inputs, different ways of interleaving conventional single- and dual-stream blocks, or even different levels of parameter sharing. For example, asymmetric vision-and-language layers might be beneficial for navigation (e.g., Hill et al., 2021) or language-conditioned image generation (e.g., Cho et al., 2020). An exploration of these possibilities is left for future work.

Experimental Setup
In this section, we present the experimental setup for our controlled studies on V&L encoders.
VOLTA In order to facilitate research and development of V&L pretraining, we release VOLTA (Visiolinguistic Transformer architectures), an implementation of our unified framework in PyTorch (Paszke et al., 2019). Our code is built on top of the VILBERT-MT repository, based on PyTorch-Transformers, due to its support for a wide range of V&L tasks. We stress that it is important, for this study, to have a unified implementation that allows us to remove possible confounds due to implementation details and to effectively measure the differences given by the proposed architectures.
Implementation Details V&L BERTs typically extract image features using a Faster R-CNN (Ren et al., 2015) trained on the Visual Genome dataset (VG; Krishna et al. 2017), either with a ResNet-101 (He et al., 2016) or a ResNeXT-152 backbone (Xie et al., 2017). The number of features varies from 10 to 100. Our models are trained with 36 regions of interest extracted by a Faster R-CNN with a ResNet-101 backbone (Anderson et al., 2018). Each model is initialized with the parameters of BERT, following the approaches described in the original papers.[8] Randomly initialized weights follow the standard approach in PyTorch-Transformers (on which these models are built): Fully-connected and embedding layers are initialized from a normal distribution with mean 0.0 and standard deviation 0.02, bias vectors are initially set to 0.0, and the Layer Normalization weight vector to 1.0. We train all models on 4 NVIDIA P100 GPUs and rely on gradient accumulation to obtain larger batches when needed. The parameter sets giving the best validation performance on the pretraining objective are used for downstream tasks.
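The initialization scheme described above can be sketched as follows (an illustrative helper of our own, not VOLTA's actual code):

```python
import numpy as np

def init_weights(shape, rng, kind="linear"):
    """Initialization following the PyTorch-Transformers defaults used here:
    fully-connected and embedding weights ~ N(0, 0.02),
    biases = 0, LayerNorm weight vectors = 1."""
    if kind in ("linear", "embedding"):
        return rng.normal(0.0, 0.02, size=shape)
    if kind == "bias":
        return np.zeros(shape)
    if kind == "layer_norm":
        return np.ones(shape)
    raise ValueError(f"unknown parameter kind: {kind}")
```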
Pretraining As discussed in §2.4, V&L BERTs have been pretrained on datasets of varying size and type.[9] In this paper, we pretrain all of our models on the Conceptual Captions dataset (CC; Sharma et al. 2018), which consists of 3.3M images with weakly associated captions automatically collected from billions of Web pages. This stands in contrast to other datasets, for example, COCO (Lin et al., 2014) or VQA (Antol et al., 2015), where the images are strongly associated with crowdsourced captions or question-answer pairs. The CC dataset is a good candidate for learning generic multimodal representations because of its size, the fact that it was scraped from the Web, and its broad coverage of subject matter.[10] Note that, due to broken links and a subsequent pruning phase in which images also found in the test sets of our evaluation tasks were removed, we pretrain on 2.7M image-text pairs.

[8] Only Tan and Bansal (2019) reported slightly better performance when pretraining from scratch, but they relied on large corpora of in-domain, human-annotated data.
[9] VL-BERT also adds text-only data to avoid overfitting on the short and simple sentences typical of V&L datasets.
[10] We also expect this type of dataset will be easier to collect for low-resource languages in the future.


Results
We perform carefully controlled experiments to investigate the possible reasons for the reported difference in performance between V&L BERTs.

Unified Data and Reimplementation
We start by examining the performance of V&L BERTs pretrained on the same 2.7M CC dataset. Recall from Figure 2 that V&L BERTs have been pretrained on different combinations of datasets, which may explain most of the claimed differences in downstream task performance. Here, we evaluate three models with officially released code: VILBERT, LXMERT, and VL-BERT. The datasets used for fine-tuning are listed in Table 1.

Same Data, Similar Performance Figure 5 shows the results of controlling the pretraining data and pretraining tasks. The results from the papers are reported (♦), alongside our training of these models using the official code. There is a drop in performance for the models we trained on the VQAv2, NLVR2, and image retrieval tasks, compared to the performance reported in the papers. This is not surprising given that the models were pretrained on less data than in the papers. In particular, given that VILBERT was also pretrained on CC but with more image-text pairs, our results corroborate previous studies showing diminishing returns with pretraining data size (e.g., Li et al., 2020a). However, the claimed performance gaps between these models narrow when they are pretrained on the same data. For instance, according to the literature, LXMERT was clearly the best model on VQA tasks, which is likely due to its use of large, in-domain data and a VQA pretraining objective.[14]

VOLTA Implementation We also implemented these models in VOLTA and trained them using their official procedures and hyperparameters. Figure 5 shows that the performance of each of these models (•) closely follows the official implementations in these downstream tasks, confirming the correctness of our framework.

[14] Surprisingly, for VQAv2, each of these models used different proportions of the validation set during training. In our experiments, instead, we use the official training set, which explains why the largest drops in performance are seen here.
There are, however, some larger differences for some of the tasks: In VQAv2, we now see that VILBERT performs slightly worse than the other models (contrary to what we obtained with the official code), and in GQA, LXMERT closes the gap with VILBERT. VILBERT's performance on NLVR2 and COCO image retrieval increases by 2-3 points in the VOLTA framework. As VOLTA is based on the VILBERT code base, these differences might be due to weight initialization, a hypothesis that we test in later sections.
With this first study, we have seen that the performance of these V&L BERTs is similar when they are trained on the same data. Moreover, we demonstrated the correctness of our implementations in VOLTA, in which these models are built following the unified framework introduced in §3. Nevertheless, there are still many possible confounds in the training procedures adopted by these models that might interfere with a fair comparison of these architectures. In the next section, we control these variables to unmask the true gains introduced by a number of multimodal encoders.
• Inputs: Each model used a different maximum number of tokens and LXMERT did not have an overall [IMG] feature. We fix the same maximum number of tokens and add the [IMG] feature to each architecture.
• Encoders: We noticed that VILBERT used higher-dimensional representations for the visual stream. We fix the same dimension as in the linguistic stream, for a fairer comparison against LXMERT and a more intuitive comparison with the single-stream models.
• Pooling: While VL-BERT is the only architecture that does not have a pooling layer, the other V&L BERTs use it for the image-text matching objective. We fix all models to use multiplicative pooling in order to separately learn sentence-level and image-level representations and also model their interactions.

Table 2: Results with our controlled setup. Each model is pretrained using the VOLTA framework with the same fixed hyperparameters on the 2.7M CC dataset, and fine-tuned on downstream tasks.
• Pretraining Objectives: Each model uses a different set of pretraining objectives. We fix them to three: MLM, masked object classification with KL-divergence,[15] and ITM.
• Fine-tuning: We fine-tune each model using the same protocols and sizes for the MLPs.
• Hyperparameters: While VILBERT and VL-BERT were originally pretrained for 10 epochs, LXMERT was pretrained for 20. We fix the number of pretraining epochs to 10, and set other hyperparameters (e.g., the learning rate and its warm-up proportion) to a set of values from the original papers that led to smooth training of all the models, with training curves that closely followed the ones obtained with the original hyperparameters.[16]

Results Table 2 shows the results of our controlled study. First, we note that the performance of VILBERT and VL-BERT is similar to training with their original hyperparameters. In fact, VQAv2 performance improves for VILBERT, showing that dual-stream models do not require different sizes in the two streams. VL-BERT also performs similarly to its official setup, showing that the additional ITM pretraining objective in our controlled setup does not hurt downstream task performance (contrary to the results reported in their paper). We do, however, note that LXMERT performs worse on NLVR2 and VQAv2 in our controlled setup than with its original hyperparameters, suggesting that LXMERT may require more pretraining steps to converge. Overall, the results show that most of the examined models perform similarly in our controlled setup, compared to the official setups.

[15] This object classification objective has been shown to be the single best one for masked region prediction.
[16] Configuration files of this setup are part of our repository.

Fine-tuning Variance
We now turn our attention to the effect of fine-tuning variance on task performance. It has been observed that the fine-tuning of BERT is sensitive to randomness in initialization and data ordering (Dodge et al., 2020). Here, we investigate the sensitivity of the five models used in the controlled study. We fine-tune each model 10 times on the RefCOCO+ and NLVR2 tasks, varying the seed. This changes the training data order and the weight initialization of the classification layer. Figure 7 shows violin plots of the distribution of results, in which the dots represent the experimental observations. We also report an average standard deviation of 0.3 points for these models across both tasks. However, the minimum and maximum scores of a given model often differ by 1 or more points, showing how a single fine-tuning run of these models can lead to incorrect conclusions.

Pretraining Variance
In the previous section, we found substantial variance in the performance of V&L BERTs across 10 fine-tuning runs. We now investigate whether the pretraining phase is similarly affected by different runs. Here, each model in our controlled setup is pretrained 10 times and fine-tuned once on four tasks: VQAv2, RefCOCO+, NLVR2, and Flickr30K image-text retrieval. By varying the seed, we modify the training data order as well as all the layers that are not initialised from BERT (e.g., the visual embeddings, the masked object classification head, and the ITM head in single-stream models). Figure 6 shows violin plots for each task. We start by noting that our first pretraining run (Table 2) of LXMERT was the worst one (its text retrieval recall on Flickr30K is 10 points lower than its mean). We also confirm that LXMERT has a slower convergence rate, with its task performance after 10 epochs showing the largest variance among the V&L BERTs we tested. On the other hand, we find that some of these architectures are less prone to variance caused by the pretraining seed, such as VILBERT for VQA and retrieval tasks, and UNITER for referring expression. Nevertheless, the performance of all of these models can vary by more than 1 point in several tasks solely due to random initialization.

Evaluating Local Decision Boundaries
Previous work has shown that state-of-the-art systems can exploit systematic gaps in the data to learn simple decision rules that let them achieve high performance on test data (Gururangan et al., 2018;Geva et al., 2019;Ribeiro et al., 2019).
In an effort to more accurately estimate model performance, Gardner et al. (2020) proposed contrast sets: datasets in which existing test instances have small but label-changing modifications in order to characterize the correct decision boundary near them. Figure 8 shows the performance of our analyzed models on the NLVR2 contrast set. Similar to Gardner et al. (2020), we see that LXMERT loses around 15 points when evaluated on perturbed samples. Furthermore, models that performed much better on the standard test set now achieve comparable performance to LXMERT, showing that they exploited systematic gaps. That is, all of these V&L BERTs would perform similarly when evaluated on out-of-distribution data.

Single-or Dual-stream Architectures
One of the key design choices that distinguishes V&L BERTs is the number of ''streams'' used by the encoder to process visual and linguistic inputs. Lu et al. (2019) showed how their single-stream baseline performed worse than their dual-stream VILBERT architecture, while Chen et al. (2020) claimed that single-stream UNITER outperformed VILBERT. Our controlled study across several tasks and different pretraining initializations allows us to provide an answer grounded in statistical tests. To do so, we split the models into dual- and single-stream architectures and run a one-way ANOVA (Table 3). After Bonferroni correction, we only find a statistical difference at p < 0.005 (Benjamin et al., 2018) between these two groups for the Flickr30K text retrieval task. On the other hand, running the same test among the various V&L BERTs, without grouping them as single- or dual-stream architectures, returns statistical significance on each task (Table 3). This tells us that the null hypothesis, that the models have the same average performance, does not hold. However, it does not allow us to discern where the statistical differences lie. To do so, we conduct a post-hoc exact test at significance level p < 0.005. Figure 9 shows the corresponding pairwise p-values and highlights significant differences between any two models after Bonferroni correction. For instance, VILBERT is significantly different from all other models in text retrieval on Flickr30K, while UNITER is significantly different on RefCOCO+.
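For reference, the one-way ANOVA statistic used in this comparison reduces to the ratio of between-group to within-group mean squares. Below is a minimal pure-Python version; the paper's analysis additionally derives p-values and applies Bonferroni correction, which this sketch omits.

```python
def f_oneway(groups):
    """One-way ANOVA F statistic over lists of per-run scores:
    between-group mean square / within-group mean square."""
    k = len(groups)                       # number of groups (e.g. model families)
    n = sum(len(g) for g in groups)       # total number of observations
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g)
                    for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

Identical score distributions give F = 0, while well-separated group means give a large F, which is then compared against the F distribution to obtain a p-value.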

The Importance of the Embeddings
Finally, our controlled setup leads us to an interesting finding: the embedding layer (§2.1) plays a crucial role in the final performance of V&L BERTs. In fact, the only difference among VL-BERT, VISUALBERT, and UNITER in our setup is their embedding layer. Figure 6 and Figure 7 show that this can have a drastic impact on downstream performance, although the literature has given little attention to this detail. For instance, the UNITER authors claim that its main contribution is the set of pretraining tasks, while our results, in which all models are trained on the same pretraining tasks, highlight that the embedding layer is an important confound in final performance. Interestingly, VISUALBERT is the only model that does not encode the locations of regions of interest in its embeddings. This leads to considerably lower performance on RefCOCO+, showing that this information is extremely useful for that task.
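To make this design choice concrete, here is a minimal sketch of such a visual embedding layer. The class name, dimensions, and the 5-dimensional box encoding are illustrative assumptions, not any model's exact implementation; the point is only that location-aware variants add a projection of each region's bounding box, whereas a VISUALBERT-style layer would omit it.

```python
import torch
import torch.nn as nn

class VisualEmbedding(nn.Module):
    """Illustrative visual embedding layer (names and dims are hypothetical).

    Projects region-of-interest features into the encoder's hidden space
    and, optionally, adds a projection of each region's location (e.g.,
    normalized box coordinates plus area). Setting use_locations=False
    mimics a model that discards region locations.
    """
    def __init__(self, feat_dim=2048, loc_dim=5, hidden=768, use_locations=True):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, hidden)
        self.loc_proj = nn.Linear(loc_dim, hidden) if use_locations else None
        self.norm = nn.LayerNorm(hidden)
        self.drop = nn.Dropout(0.1)

    def forward(self, feats, boxes):
        x = self.feat_proj(feats)
        if self.loc_proj is not None:
            x = x + self.loc_proj(boxes)  # location-aware embeddings
        return self.drop(self.norm(x))

# 2 images, 36 regions each: feature vectors plus box encodings.
emb = VisualEmbedding().eval()
feats = torch.randn(2, 36, 2048)
boxes = torch.rand(2, 36, 5)
out = emb(feats, boxes)
print(out.shape)  # torch.Size([2, 36, 768])
```

A referring-expression model built on the `use_locations=False` variant has no direct signal about where a region sits in the image, which is consistent with the RefCOCO+ drop observed above.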
Given this result, we conduct one additional experiment to see whether the embedding layer biased our conclusion about dual- and single-stream performance. To test this, we swap the embedding layers of VILBERT (the best dual-stream model) and UNITER (the overall best single-stream model), and pretrain and fine-tune each hybrid once (Figure 10). Consistent with our previous results, the embeddings are especially important for the referring expression and retrieval tasks. However, neither embedding layer performs better across the board, corroborating that dual- and single-stream architectures perform on par and showing that different embedding strategies are necessary to maximise performance in these two families of V&L BERTs.

Limitations
All the experiments in this paper are limited to models that use a specific type of pretrained and frozen visual encoder. While most V&L BERTs follow this paradigm, some studies find it beneficial to learn the visual encoder jointly with the language one (Huang et al., 2020; Radford et al., 2021; Kim et al., 2021). In addition, we only consider base architecture variants (initialized with BERT-Base) pretrained on CC. Studying the effects of visual encoders, pretraining data, and larger models is left as future work.
Although we expect longer pretraining to benefit every model, in our controlled setup we pretrain each model for 10 epochs to reduce resource consumption. We also constrain our hyperparameter search to a small grid of values that have been used in the literature. Finally, we leave a thorough, controlled study of the various pretraining objectives to future work.

Reproducibility and the Environment
From the perspective of reproducible research, there are several advantages to using the VOLTA framework for V&L encoders. First, VOLTA reduces confounds due to differences in implementations, while also enabling fair comparisons with related work. Second, visual and textual data only need to be preprocessed once instead of creating model-specific formats for every V&L BERT.
From a financial perspective, the costs involved in pretraining hamper contributions from many academic institutions and deter the evaluation of multiple trained models, which we showed to be extremely important for V&L BERTs. We estimate that pretraining a single model 10× in our controlled setup and fine-tuning it on 4 downstream tasks requires a 4-GPU machine on AWS for two months, at a cost of ∼$6,000, corresponding to 200 GPU-compute days. Fortunately, we had access to an internal server, but our experiments still required 1,500 GPU days for training and evaluation. While we were able to reduce the financial costs, there are severe environmental and carbon footprint costs in V&L pretraining (Strubell et al., 2019). To amortise the environmental costs, we distribute many of our pretrained V&L BERTs in VOLTA. We hope that VOLTA will serve as a basis for research in V&L pretraining, enabling easy and fair comparisons across architectures, and ensuring that progress is not obfuscated by confounds.

Figure 9: Exact test between any two V&L BERTs. Each box shows the p-value for the corresponding pair of models. Green boxes denote statistical significance at 0.005 after Bonferroni correction. Boxes are dark green if the model on the y-axis outperforms the one on the x-axis, and vice versa for light green.

Figure 10: Results of swapping VILBERT and UNITER embeddings, compared to their performance when pretrained 10 times (box plots).

Conclusion
We introduced and implemented a unified mathematical framework under which recently proposed V&L BERTs can be specified as special cases. We conducted a series of controlled studies within this framework to better understand the differences between several models. We found that the performance of the considered models varies significantly due to random initialization, in both pretraining and fine-tuning. We also found that these models achieve similar performance when trained with the same hyperparameters and data. Notably, some models outperform others, but we found that (a) the single- and dual-stream model families are on par, and (b) the embedding layer plays a crucial role in a model's final performance.
Our fast-paced field rewards the contribution of new methods and state-of-the-art results (Rogers and Augenstein, 2020), which often comes at the expense of controlled comparisons and of training multiple models for variance estimation. In this paper, we showed that several methods for vision-and-language representation learning do not significantly differ when compared in a controlled setting. This finding echoes similar studies of LSTM variants (Greff et al., 2017) and Transformer variants (Narang et al., 2021) that are not significantly better than the original models. Looking to the future, we recommend that new V&L BERTs be pretrained on similar datasets, and that researchers report fine-tuning variance in addition to their best-performing model. We hope that our findings will encourage more controlled evaluations of newly proposed architectures for vision-and-language and beyond.
European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement no. 801199, and by "Research and Development of Deep Learning Technology for Advanced Multilingual Speech Translation," the Commissioned Research of the National Institute of Information and Communications Technology (NICT), Japan.