A survey on knowledge-enhanced multimodal learning

Multimodal learning has been a field of increasing interest, aiming to combine various modalities in a single joint representation. Especially in the area of visiolinguistic (VL) learning, multiple models and techniques have been developed, targeting a variety of tasks that involve images and text. VL models have reached unprecedented performance by extending the idea of Transformers so that both modalities can learn from each other. Massive pre-training procedures enable VL models to acquire a certain level of real-world understanding, although many gaps remain: the limited comprehension of commonsense, factual, temporal and other everyday knowledge aspects questions the extensibility of VL tasks. Knowledge graphs and other knowledge sources can fill those gaps by explicitly providing missing information, unlocking novel capabilities of VL models. At the same time, knowledge graphs enhance explainability, fairness and the validity of decision making, issues of utmost importance for such complex implementations. The current survey aims to unify the fields of VL representation learning and knowledge graphs, and provides a taxonomy and analysis of knowledge-enhanced VL models.


Introduction
Multimodal representation learning has been an area of machine learning that increasingly draws the attention of the research community. Combining information from different modalities, such as images and text, allows for more informative representations, as the modalities provide complementary insights into the same instances. Several works focus on using both the vision and language modalities, introducing tasks such as visual question answering, visual reasoning, visual commonsense reasoning (Zellers et al., 2019), visual entailment (Xie et al., 2018), image captioning (Stefanini et al., 2021), image-text retrieval and inversely text-image retrieval (Dubey, 2021), referring expressions (Krishna et al., 2018), visual explanations (Hendricks et al., 2016) and grounding (Endo et al., 2017), visual-language navigation (Anderson et al., 2018), visual generation from text (Reed et al., 2016c), visual storytelling (Huang et al., 2016b) and its inverse task of story visualization (Li et al., 2019c), and visual dialog (El-Nouby et al., 2019).
Some of the first attempts to combine vision and language faced several limitations due to the restricted capacity of sequential models for language, such as recurrent neural networks, LSTMs (Hochreiter and Schmidhuber, 1997) and GRUs (Cho et al., 2014), which struggle to represent long textual sequences. The area of multimodal learning has seen significant advancements especially since the introduction of the Transformer framework (Vaswani et al., 2017). Several powerful transformer-based variants, such as BERT (Devlin et al., 2019) and GPT-3 (Brown et al., 2020), set the foundations for the surge of visiolinguistic (VL) transformers. The extension of single-modality pre-training requires the introduction of visual features, which allows masked linguistic components to be inferred from visual ones and vice versa, enabling cross-modality relationships to be learned from aligned data. In the meantime, pre-training tasks applied independently on the visual or linguistic components permit intra-modality learning. Fine-tuning on task-specific VL datasets, addressing the tasks of text-image retrieval (Lin et al., 2014), visual question answering (Ren et al., 2015; Hudson and Manning, 2019; Krishna et al., 2016; Gao et al., 2019), visual reasoning (Suhr et al., 2017), visual commonsense reasoning (Zellers et al., 2019) and others, follows the pre-training stage. Models such as LXMERT (Tan and Bansal, 2019), VisualBERT, ViLBERT (Lu et al., 2019, 2020), VL-BERT (Su et al., 2020), UNITER, OSCAR (Li et al., 2020c), ViLT (Kim et al., 2021), CLIP (Radford et al., 2021), SIMVLM (Wang et al., 2021c) and many others have demonstrated state-of-the-art results in multiple VL tasks.
As for recent transformer-based approaches, despite pre-training on large amounts of aligned VL data, usually from the Conceptual Captions (Sharma et al., 2018), COCO (Ren et al., 2015) and Visual Genome (Krishna et al., 2016) datasets, the learned concepts remain limited and lack further explicit information regarding commonsense knowledge, abstract entities or real-world events. Relevant issues in the natural language processing field were addressed by leveraging knowledge graphs, resulting in knowledge-enhanced approaches for several tasks, such as Language Modelling, Natural Language Inference (NLI) (Chen et al., 2018b,a), Language Generation, Dialog Generation (Cui et al., 2021a), Entity Disambiguation, multilingual models, Contextualized Language Embeddings, and models such as KnowBert, E-BERT (Poerner et al., 2020), ERICA, ERNIE, ERNIE-NLI (Bauer et al., 2021), LUKE (Yamada et al., 2020) and others. Therefore, the incorporation of large-scale knowledge graphs and ontologies can also be critical to the quality of multimodal representations and the success of the relevant models on various downstream tasks.
While previous surveys (Baltrušaitis et al., 2017; Kafle et al., 2019; Guo et al., 2019; Mogadala et al., 2021; Uppal et al., 2020; Du et al., 2022) provide analyses and taxonomies of models, tasks and datasets regarding multimodal representations, they do not analyze knowledge-enhanced approaches. In contrast to those works, we focus on the integration and importance of external knowledge in VL models. Even though current trends focus on transformer-based implementations, for the sake of completeness, other techniques that have contributed to the field of knowledge-enhanced VL (KVL) learning are also included. Overall, we aim to bridge the gap between knowledge representation and multimodal deep learning: we provide a broad and comprehensive analysis of both fields, and consequently collect models that have served the various KVL tasks. Finally, we discuss current challenges and limitations of existing datasets and approaches, upon which we suggest potential future directions for this evolving field. To the best of our knowledge, there are no extensive works covering the intersection of these two fundamental fields of AI, which have demonstrated promising directions in many aspects of state-of-the-art research when combined.
The current survey consists of four main parts. The first part covers the preliminaries of multimodal deep learning, analyzing the trends, methods, models and tasks which set the basis for knowledge-enhanced VL (KVL) models. An analysis of graph structures and their representation follows in the second part. The third part is dedicated to a taxonomy of knowledge senses and types, as well as the presentation of popular knowledge bases. The fourth part provides a taxonomy and analysis per KVL task where the usage of external knowledge has been attempted, accompanied by datasets and evaluation methods. Finally, some needs and possible future directions regarding the usage of knowledge in multimodal learning are identified and analyzed.

Background
Multimodal learning is a large and diverse field that involves a variety of data sources, architectural approaches and tasks. By focusing on VL tasks, which exploit text and image data, we can identify a variety of relevant applications. The nature of each task defines the chosen backbone architecture, upon which all consequent approaches are built. More specifically, VL tasks can be divided into discriminative tasks, where the goal is either to provide a matching between modalities or to understand one modality based on the other, and generative tasks, which target image or text generation. Discriminative VL tasks present a long line of research initially based on recurrent neural networks (RNNs) for text representation, with most contemporary approaches favoring the Transformer (Vaswani et al., 2017) framework for its indisputable advantages. On the other hand, generative tasks present an interesting variability in architectural approaches: while language generation tasks conditioned on images are addressed by architectures based on RNNs or Transformers, image generation tasks conditioned on text are mainly tackled by Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), and more recently by Transformers (Esser et al., 2020) and Diffusion models (Dhariwal and Nichol, 2021).
Knowledge-enhanced VL models (KVL models) usually step upon existing approaches for VL representation and then employ various strategies to integrate knowledge. One common first step shared by most VL approaches is the independent encoding of text T and images I, followed by the interaction of the encodings in order to acquire a joint representation. The choice of text encoding (RNN or Transformer) heavily influences the overall architecture of a VL model. On the other hand, image encoding adapts to the needs defined by the text encoding, and the variability in chosen image encoders serves specific improvements, such as performance boosting or a reduction of trainable parameters.
Transformer-based VL models consist of two main stages: pre-training on large amounts of aligned image-text data, such as images and their captions, and then fine-tuning on smaller task-specific datasets. Pre-training learns a generic joint VL representation from independent image and text encodings using certain pre-training tasks (or pre-training objectives) that enforce cross-modality interaction. Fine-tuning is performed independently on each task-specific dataset, leveraging the learned representation from the pre-training stage.
There is a variety of ways to incorporate knowledge K, with most approaches favoring external knowledge sources in the form of widely used knowledge graphs. In this case, a representation based on graph neural networks (GNNs) is the most popular approach towards providing an appropriate encoding for knowledge. However, other knowledge types can be integrated as well, either in the form of knowledge stored in neural network weights (implicit knowledge) or as linguistic knowledge from the web, embedded with a text encoder. Enhancements to VL model performance can also be realized via self-acquired or internal knowledge: without leveraging external knowledge sources, the automatic extraction of specific characteristics from either modality boosts and guides learning, improving knowledge-free baselines. Knowledge K can be fused either in early stages, together with text T and images I, resulting in a KVL representation, or alternatively in later stages and independently of the VL stream, refining and correcting the predictions of the model. KVL models can either target one task at a time (single-task models) or multiple tasks simultaneously (multi-task models). Single-task models present a large variety of architectural implementations so far, while multi-task models are exclusively built upon multimodal transformers, as they heavily rely on the pre-training/fine-tuning scheme. Discriminative tasks are tackled by both single- and multi-task models. On the other hand, generative tasks are only handled by single-task models, as they are harder by nature. The evaluation of the overall model performance is realized per task, based on appropriate task-specific evaluation metrics.
A broad overview of KVL models is provided in Figure 1. The joint representation module is shown as a black box, as most architectural variations analyzed in the following sections are realized within this stage.

Multimodal representation learning
The core of multimodal deep learning revolves around the ways the various modalities are represented independently and interact with each other. Especially for text and images, several representation techniques have been developed throughout the years, owing to the advancements of image classification (Simonyan and Zisserman, 2015) and object detection (Girshick, 2015; Ren et al., 2017) models for vision, as well as distributed language representations (Mikolov et al., 2013; Pennington et al., 2014), recurrent neural networks (Hochreiter and Schmidhuber, 1997; Schuster and Paliwal, 1997; Cho et al., 2014), and attention-based models (Vaswani et al., 2017; Devlin et al., 2019; Brown et al., 2020) for text. Apart from vision and language, more modalities can potentially contribute to a joint representation, such as speech, music, graphs and others.

Text representation
Text representation offers several variations in implementation, significantly influencing multimodal learning as a whole. The two major categories of language embeddings comprise recurrent architectures (RNN/LSTM/GRU) and Transformers. These architectural choices for language representation guide the taxonomy of models per KVL task, due to the diversity they impose on the resulting implementations.
Distributed word representations Traditional and widely used representations such as Word2Vec (Mikolov et al., 2013) have contributed to several components of the various VL tasks, often used as initialization vectors for other methods. Doc2vec (Le and Mikolov, 2014) extends word2vec, achieving a vector representation of a group of words. GloVe (Pennington et al., 2014), a distributed word representation model, is trained on a global word-word co-occurrence frequency matrix and successfully captures both local and global statistics. fastText (Bojanowski et al., 2017) represents each word with a bag of character n-grams in order to capture the internal structure of words. This approach successfully utilizes the morphology of words and is thus able to represent rare formations which may occur in morphologically rich languages. Word semantics cannot be successfully captured by static word representation methods, so contextualization is needed, especially in cases of polysemy, an issue tackled by CoVe (McCann et al., 2018). ELMo (Peters et al., 2018) is another deep word contextualization model which exploits character-level information to form robust representations based on morphological information. Universal Language Model Fine-tuning (ULMFiT) (Howard and Ruder, 2018) achieves robust inductive transfer learning for a variety of downstream NLP tasks, where only fine-tuning is required.
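As an illustration of how distributed representations support semantic similarity, the following sketch ranks words by cosine similarity over a toy, hand-made embedding matrix. The vocabulary, dimensionality and vector values are invented for the example; real Word2Vec/GloVe/fastText embeddings are learned and much higher-dimensional.

```python
import numpy as np

# Toy 4-word vocabulary with hypothetical 3-d embeddings.
vocab = ["king", "queen", "apple", "orange"]
E = np.array([
    [0.90, 0.80, 0.10],   # king
    [0.85, 0.82, 0.12],   # queen
    [0.10, 0.20, 0.90],   # apple
    [0.12, 0.25, 0.85],   # orange
])

def most_similar(word, k=1):
    """Return the k nearest neighbors of `word` by cosine similarity."""
    v = E[vocab.index(word)]
    sims = E @ v / (np.linalg.norm(E, axis=1) * np.linalg.norm(v))
    order = np.argsort(-sims)                 # descending similarity
    return [vocab[i] for i in order if vocab[i] != word][:k]

print(most_similar("king"))   # semantically closest word to "king"
```

The same nearest-neighbor query is what VL models implicitly exploit when such vectors initialize their text components.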

Recurrent neural networks
The basic idea behind recurrent neural networks (RNNs) is the sequential processing of the elements of a finite sequence (x_1, x_2, ..., x_T), one x_t at a time, while retaining context information from the previous elements in the form of the previous node's hidden state h_{t-1}. Both x_t and h_{t-1} contribute to the calculation of the current hidden state h_t, which in turn participates in defining the current output y_t. Unrolled over time, the network implements each time step as a layer, with weights shared across layers, and is trained using backpropagation through time (BPTT) (Lipton et al., 2015). This sequential nature of RNNs fits naturally to language processing, inspiring several implementations in the NLP domain, as well as in multimodal tasks. Other recurrent architectures such as Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) dominated the field of language embeddings in many approaches, with more refined variants such as the bidirectional LSTM (BiLSTM) (Schuster and Paliwal, 1997), the Gated Recurrent Unit (GRU) (Cho et al., 2014) and bidirectional GRUs (BiGRUs) following in later works.
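The recurrence h_t = f(x_t, h_{t-1}) described above can be sketched as follows; the layer sizes, weight initialization and input sequence are arbitrary placeholders for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 8                      # input and hidden dimensions (illustrative)
W_xh = rng.normal(scale=0.1, size=(d_h, d_in))
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))
b_h = np.zeros(d_h)

def rnn_forward(xs):
    """Process a sequence x_1..x_T one element at a time,
    carrying h_{t-1} into the computation of h_t."""
    h = np.zeros(d_h)                 # h_0
    hs = []
    for x_t in xs:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)   # h_t from x_t and h_{t-1}
        hs.append(h)
    return hs

xs = rng.normal(size=(5, d_in))       # a sequence of length T = 5
hs = rnn_forward(xs)
print(len(hs), hs[-1].shape)          # 5 (8,)
```

Each output y_t would be computed from the corresponding h_t; LSTMs and GRUs replace the tanh update with gated variants of the same recurrence.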
Earlier works in VL architectures heavily rely on distributed word embeddings together with sequential models for language representation. However, limitations tied with sequential processing such as vanishing gradients and inability of parallel processing directed research interest towards novel frameworks, such as the Transformer (Vaswani et al., 2017).
Language transformers The introduction of attention mechanisms and especially of the Transformer (Vaswani et al., 2017) framework opened a whole new world of possibilities for language embeddings and several downstream linguistic tasks. The Transformer model relies on an encoder-decoder structure and uses an attention mechanism to construct dependencies between input and output data. The layers of the encoder and the decoder are stacked one upon the other, containing sub-layers with multi-head self-attention and position-wise fully connected feed-forward layers. Residual connections followed by layer normalization are used between the sub-layers of the encoder. The decoder has an additional encoder-decoder multi-head attention sub-layer that helps it focus on the appropriate parts of the encoded input sequence. Moreover, the decoder's self-attention modules are modified so that they force the prediction at a certain position to be based only on the known predictions at previous positions. Transformer architectures prove that neither convolutions nor recurrent units are needed to achieve state-of-the-art performance in linguistic tasks. Currently, most state-of-the-art VL architectures utilize attention mechanisms within their implementation.
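A minimal sketch of the scaled dot-product attention at the heart of the Transformer, softmax(QK^T / sqrt(d_k)) V, using toy matrix sizes; multi-head projections, masking and batching are omitted.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # pairwise query-key compatibilities
    return softmax(scores) @ V        # weighted average of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 16))          # 3 query positions, d_k = 16
K = rng.normal(size=(5, 16))          # 5 key/value positions
V = rng.normal(size=(5, 16))
out = attention(Q, K, V)
print(out.shape)                      # (3, 16)
```

In self-attention, Q, K and V are linear projections of the same input sequence; in the decoder's encoder-decoder attention, K and V come from the encoded input.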
Transformer models for NLP tasks involve a pre-training stage on a large corpus of data, followed by fine-tuning on certain downstream tasks. Language transformers can be divided into two major categories, autoregressive (AR) and autoencoding (AE) language models, depending on whether pre-training is performed in a unidirectional or a bidirectional way. AR language models attempt to estimate the probability distribution of a text corpus, while AE models learn to reconstruct manipulated inputs with the help of surrounding information (Yang et al., 2019b). BERT (Devlin et al., 2019) is a popular bidirectional transformer-based language representation model, able to handle a variety of natural language processing tasks by fine-tuning just one additional output layer. It uses masked language modeling (MLM), randomly hiding some input tokens so that they are inferred from the surrounding words. The pre-training stage is based on unlabeled data, which enables parameter learning. Those parameters are then fine-tuned with labelled data corresponding to certain tasks. RoBERTa offers an optimized extension of BERT, suggesting that longer training, more data and larger batch sizes, as well as training on longer sequences, dynamically altering the masking patterns and removing the next-sentence-prediction loss, are factors that contribute to performance beyond that of the original BERT model. XLNet (Yang et al., 2019b) combines AR and AE language modelling in a single approach. It introduces the pre-training objective of permuted language modelling, attempting to collect information from permutations of the factorization order of text tokens with respect to the AR likelihood estimation, practically inducing bidirectional capabilities into the learning process.
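The masked language modeling objective can be illustrated with the following simplified sketch of BERT's masking scheme (about 15% of tokens become prediction targets; of those, 80% are replaced by [MASK], 10% by a random token, and 10% are left unchanged). The toy vocabulary and sentence are invented; real implementations operate on sub-word token ids.

```python
import random

def mask_tokens(tokens, vocab, p=0.15, seed=0):
    """BERT-style masking: choose ~p of the tokens as prediction targets;
    replace 80% of them with [MASK], 10% with a random token, keep 10%."""
    rng = random.Random(seed)
    out, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < p:
            targets[i] = tok              # the model must recover this token
            r = rng.random()
            if r < 0.8:
                out[i] = "[MASK]"
            elif r < 0.9:
                out[i] = rng.choice(vocab)
            # else: leave the token unchanged
    return out, targets

vocab = ["a", "dog", "runs", "in", "the", "park"]
masked, targets = mask_tokens(["the", "dog", "runs", "in", "the", "park"],
                              vocab, seed=1)
print(masked, targets)
```

During pre-training, the cross-entropy loss is computed only at the target positions, forcing the model to use bidirectional context.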
Generative NLP models have demonstrated impressive results in recent literature. GPT-2 (Radford et al., 2019) is pre-trained on a very large dataset (40GB of text data from over 8 million documents) and utilizes 1.5 billion parameters in order to construct a powerful language model. GPT-3 (Brown et al., 2020) is an AR language model of 175 billion parameters which achieves zero-shot, one-shot and few-shot capabilities. It is capable of synthesizing results at a human-like level, so that writing articles or code in some programming language, learning and reusing new words, unscrambling words and other tasks can be performed almost indistinguishably from humans. T5 (Raffel et al., 2020) introduces a unified format where both inputs and outputs are text. Without changing the model architecture, the hyperparameters or the loss function, T5 generates text to address tasks such as question answering, text summarization and machine translation, as well as classification tasks. BART (Lewis et al., 2019) utilizes a bidirectional encoder, enabling attention from both left and right directions, and an autoregressive (unidirectional) decoder, which allows attending to past tokens only. It is able to handle generative tasks such as question answering, text summarization, conditional text generation and mask filling, and also classification tasks.
As for text encodings for VL models, BERT (Devlin et al., 2019) has become a gold standard for several transformer-based approaches, while fewer implementations utilize variants such as RoBERTa. GPT-2 (Radford et al., 2019), T5 (Raffel et al., 2020) and BART (Lewis et al., 2019) have also served as language encoders.

Visual representation
There is much less diversity in the representation of the visual modality compared to text encoding. Most works rely on widespread convolutional architectures without significant variations, and only recently have some works attempted encoders based on image transformers, which, however, do not impose architectural modifications.
Convolutional Neural Networks Representation of images can involve object-level or image-level features, depending on the granularity of information that needs to be imbued in the representation. A global image representation can be achieved by employing widely used image classification models as feature extractors. Many works rely on CNN-based classifiers such as VGG (Simonyan and Zisserman, 2015) and ResNet, while others prefer more fine-grained local representations supported by object detectors, such as Fast-RCNN (Girshick, 2015) and Faster-RCNN (Ren et al., 2017). Fixed pre-trained models for object feature extraction limit the expressivity of VL transformers, while also being slow. Solutions to this limitation include the usage of grid features as visual tokens (Huang et al., 2020b), or discretized grid features.
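The grid-feature alternative mentioned above can be sketched as a simple reshape of a CNN feature map into a sequence of visual tokens; the feature-map dimensions below are illustrative (a ResNet-style backbone typically yields a map of roughly this shape).

```python
import numpy as np

# A CNN backbone's final feature map of shape (C, H, W); the sizes
# (2048, 7, 7) are placeholders typical of a ResNet-type encoder.
feat = np.random.default_rng(0).normal(size=(2048, 7, 7))

# Grid features: flatten the H x W spatial grid into a sequence of
# H*W visual tokens of dimension C, analogous to a text token sequence.
tokens = feat.reshape(2048, -1).T     # (49, 2048)
print(tokens.shape)
```

Each of the 49 tokens can then be linearly projected to the transformer's hidden size, avoiding the slow region-proposal step of an object detector.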

Image Transformers Recent advancements in visual transformers, as an extension of the aforementioned language Transformers (Vaswani et al., 2017), have influenced the field of image representation. ViT (Dosovitskiy et al., 2021) suggests patch-like parsing of images for feature extraction, resulting in more powerful image representations. Swin Transformer (Liu et al., 2021c) is a more efficient image Transformer due to its usage of self-attention within local image patches, in contrast to the global self-attention of several image transformer implementations, which results in computational complexity that is quadratic in image size. Swin Transformer achieves linear complexity by hierarchically merging larger and larger image patches across layers, with self-attention acting only within each patch. Mirroring the scaling capabilities of NLP Transformers, Swin Transformer V2, with 3 billion trainable parameters, serves as the largest dense vision model so far (Liu et al., 2022).
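The patch-based parsing used by ViT amounts to cutting the image into non-overlapping patches and flattening each one into a vector; a minimal sketch with the standard 224x224 input and 16x16 patches (the subsequent linear projection to the model dimension is omitted).

```python
import numpy as np

def to_patches(img, p):
    """Split an (H, W, C) image into non-overlapping p x p patches,
    each flattened to a vector, as in ViT. H and W must be divisible by p."""
    H, W, C = img.shape
    x = img.reshape(H // p, p, W // p, p, C)
    x = x.transpose(0, 2, 1, 3, 4)        # (H/p, W/p, p, p, C)
    return x.reshape(-1, p * p * C)       # (num_patches, p*p*C)

img = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
patches = to_patches(img, 16)
print(patches.shape)                      # (196, 768): 14*14 patches of 16*16*3
```

Each patch vector, plus a positional embedding, then plays the role a word token plays in a language transformer.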

Sequential models for VL tasks
Even though Transformer-based approaches have dominated the field of multimodal learning, sequential models have offered a variety of interesting solutions in several tasks, and still serve as the default choice in some of them. Most sequential techniques address vision-language cooperation through encoding-decoding schemes which utilize a CNN for images I and an RNN/LSTM/GRU structure for text T. There are different ways for visual and language embeddings to interact, depending on the downstream task: for example, a CNN-based image encoding can be fed as conditioning to an RNN/LSTM/GRU decoder structure for tasks requiring text generation from an image. Alternatively, input text can be embedded using an RNN/LSTM/GRU encoder, and a feed-forward neural network can learn the correlations between the text embeddings and CNN-based image embeddings; this variant can also serve text-image matching tasks. Sequential structures for VL learning remain rather popular in tasks that require language generation, especially in underexplored tasks where more refined architectures have not been attempted yet.
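A toy sketch of the first interaction pattern, where a CNN image feature conditions a recurrent decoder that greedily emits word ids. All sizes and weights are illustrative placeholders, and a plain RNN cell stands in for the LSTM/GRU used in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
d_img, d_h, V = 512, 256, 1000    # image feature, hidden, vocab sizes (toy)

W_init = rng.normal(scale=0.01, size=(d_h, d_img))  # image -> initial state
W_hh = rng.normal(scale=0.01, size=(d_h, d_h))
W_xh = rng.normal(scale=0.01, size=(d_h, V))
W_out = rng.normal(scale=0.01, size=(V, d_h))

def greedy_caption(img_feat, max_len=5, bos=0):
    """The CNN image embedding initializes the decoder's hidden state,
    then word ids are emitted greedily, one per time step."""
    h = np.tanh(W_init @ img_feat)
    tok, out = bos, []
    for _ in range(max_len):
        x = np.eye(V)[tok]                   # one-hot previous token
        h = np.tanh(W_xh @ x + W_hh @ h)     # recurrent state update
        tok = int(np.argmax(W_out @ h))      # greedy choice of next word id
        out.append(tok)
    return out

ids = greedy_caption(rng.normal(size=d_img))
print(len(ids))   # 5
```

Trained versions of this loop, with an LSTM cell and beam search instead of the greedy argmax, underlie classic image captioning pipelines.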

Multimodal Transformers
Multimodal transformers have revolutionized the field of multimodal learning, with almost every state-of-the-art model built upon them. Certain steps are followed from receiving the input data until producing the final result on multimodal tasks, as presented in Figure 2. Initially, given a multimodal dataset, for example a VL dataset D comprising image-text pairs (I, T) that consist of visual features v and textual features w respectively, an appropriate encoding scheme, such as those described in sections 3.1 for T and 3.2 for I, should be decided. The input embedding contains a tokenized text representation, an image encoding and other special embeddings. All transformer-based architectures, whether targeting vision, language or both, consist of the pre-training and fine-tuning stages. For this reason, a multimodal encoding module is designed to receive the embedded input, enabling the interaction between modalities by jointly learning complementary information with the help of well-designed cross-modal pre-training tasks. Finally, a fine-tuning stage adapts the pre-trained model to the final downstream task by training on a smaller labelled dataset.

Special input tokens and embeddings
Special tokens need to be appended to the input before it enters the VL transformer model, in order to indicate the different modalities, as well as the start and sometimes the end of the sequence. An input embedding is formed by combining the text representation, the visual representation, special tokens and other embedding information to guide training. Although different VL transformers follow slightly different strategies regarding input representation, the main constituents are analyzed below. Figure 3 provides a general overview of the input tokens and embeddings.
The input token denoted as [CLS] is a special classification token that defines the start of the input sequence. Linguistic information from text T is appended after the [CLS] token. Usually a WordPiece tokenizer (Wu et al., 2016c), a sub-word based tokenization framework, transforms words into tokens. Afterwards, a numerical format of the tokens is obtained by assigning a unique embedding per sub-word token so that it can be further processed. The text embedding will then be w = w_1, w_2, ..., w_n with w_i ∈ R^d, where n corresponds to the textual sequence length and d is the embedding dimension. A segment token [SEP] is appended after w to separate the different modalities. Segment embeddings indicate the source of each input element by assigning a unique label to each of them. For example, input textual and visual features would be assigned different segment embeddings s_w and s_v respectively, as they come from different modalities. Therefore, the sequence s_w = s_w1, s_w2, ..., s_wn is summed with the text embeddings and similarly, s_v = s_v1, s_v2, ..., s_vk is summed with the visual embeddings. Alternatively, if the input contains two sentences (such as question and answer pairs), each of them is assigned a different segment label. Position embeddings p_i indicate the position of tokens in a sequence {w_1, ..., w_i, ..., w_n}, as originally used in BERT (Devlin et al., 2019).
The summation of token embeddings, segment embeddings, position embeddings and visual embeddings can act as the starting point for the input representation, followed by an appropriate transformer structure. The input embedding generally follows the {[CLS], ŵ_1, ŵ_2, ..., ŵ_n, [SEP], v̂_1, v̂_2, ..., v̂_k, [END]} format, where ŵ and v̂ are the final text and visual embeddings respectively, after the summation of w and v with the segment and/or position embeddings.
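The assembly of the input sequence described above can be sketched as follows; all embeddings here are random placeholders, and the toy positional encoding stands in for learned position embeddings.

```python
import numpy as np

d = 8                                    # embedding dimension (illustrative)
rng = np.random.default_rng(0)
w = rng.normal(size=(3, d))              # n = 3 text token embeddings
v = rng.normal(size=(2, d))              # k = 2 visual embeddings
special = {t: rng.normal(size=d) for t in ("[CLS]", "[SEP]", "[END]")}
seg = {"text": rng.normal(size=d), "vision": rng.normal(size=d)}

def pos(i):
    """Toy positional embedding for position i."""
    return np.full(d, i / 10.0)

# Sum token + segment + position embeddings per element, then concatenate:
# [CLS], w_1..w_n, [SEP], v_1..v_k, [END]
w_hat = [w[i] + seg["text"] + pos(i) for i in range(len(w))]
v_hat = [v[i] + seg["vision"] + pos(i) for i in range(len(v))]
seq = np.stack([special["[CLS]"], *w_hat, special["[SEP]"],
                *v_hat, special["[END]"]])
print(seq.shape)   # (8, 8): 1 + n + 1 + k + 1 positions of dimension d
```

The resulting sequence is what the multimodal encoder consumes; individual models differ mainly in which of these embeddings they include and how they are combined.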

Vision and Language joint encoding
Encoded modalities need to be projected into the same vector space and interact in order to achieve a meaningful joint representation. A concise separation of cross-modality encoding schemes is provided in (Du et al., 2022), where two main categories are identified: fusion encoder and dual encoder. The two approaches can even be combined. The fusion encoder concerns an abundance of VL transformers, and can be further divided into single-stream and two-stream encoding.
Two-stream fusion encoders utilize two separate transformer modules to process images and text respectively. Early VL approaches such as ViLBERT (Lu et al., 2019) and LXMERT (Tan and Bansal, 2019) fall into this category, naturally extending BERT to also process images. Specifically, ViLBERT decomposes images into non-overlapping patches, similar to how tokens serve as inputs in the case of BERT. Text is tokenized and fed together with positional embeddings to its transformer stream. A co-attention module enables interaction and alignment between modalities given their intermediate representations. An extension of ViLBERT to 12 downstream tasks was presented in (Lu et al., 2020). LXMERT (Tan and Bansal, 2019) refines the cross-modal part to achieve advanced performance on downstream tasks. More recent models employing a two-stream fusion encoder are ALBEF, Visual Parsing (Xue et al., 2021) and WenLan (Huo et al., 2021). In general, two-stream encoders demand the training of a separate transformer model for each stream, which is computationally inefficient. Therefore, succeeding approaches focus on single-stream encoders.
Single-stream fusion encoders concern the majority of state-of-the-art VL models. A single transformer network, usually building on the BERT (Devlin et al., 2019) backbone, is used to process image and text representations simultaneously, where alignments are discovered via self-attention. Segment tokens to separate modalities, together with position tokens to indicate aligned token pairs, are added to the input data before it is concatenated and fed into the encoder. VisualBERT, VL-BERT (Su et al., 2020), PixelBERT (Huang et al., 2020b), InterBERT, VLP (Zhou et al., 2019a), Unified-VLP (Zhou et al., 2019b), B2T2 (Alberti et al., 2019), UNITER, XGPT (Xia et al., 2020), ViLT (Kim et al., 2021), VL-T5 (Cho et al., 2021), SOHO and SimVLM (Wang et al., 2021c) belong to the single-stream encoder category. Some of the latest models even present zero-shot learning capabilities, enabling more out-of-the-box capabilities in VL tasks.

Dual encoder is employed by a small family of recent VL models which exploit contrastive learning to provide image-text representations. Separate encoders embed each modality independently, and their representations are projected onto the same vector space with the goal of learning similarity and dissimilarity properties. This is why contrastive learning naturally fits: paired image-text samples are trained to stay close together, while being pushed apart from the rest. CLIP (Radford et al., 2021) implements this strategy utilizing more than 400 million image-text pairs for training, achieving even zero-shot capabilities in retrieving previously unseen matches. ALIGN (Jia et al., 2021) follows the same recipe using over one billion image-text pairs, demonstrating that large-scale uncurated data can compensate for the presence of noise.
FLORENCE (Yuan et al., 2021) exploits both an image-to-language contrastive loss and its reverse, a language-to-image contrastive loss, effectively forming a bi-directional contrastive loss applied to image-label-description triplets.
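The symmetric contrastive objective shared by CLIP, ALIGN and FLORENCE can be sketched as a simplified InfoNCE loss over a toy batch; the temperature value follows common practice, while the embeddings here are random placeholders rather than encoder outputs.

```python
import numpy as np

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings:
    matched pairs (the diagonal) should score higher than mismatches."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature           # (B, B) similarity matrix
    labels = np.arange(len(logits))              # pair i matches pair i

    def xent(l):                                 # row-wise cross-entropy
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the image->text and text->image directions
    return (xent(logits) + xent(logits.T)) / 2

rng = np.random.default_rng(0)
B, d = 4, 32
img, txt = rng.normal(size=(B, d)), rng.normal(size=(B, d))
print(clip_loss(img, txt))
```

When the two embeddings of each pair coincide, the diagonal dominates and the loss approaches zero, which is the behavior the training objective rewards.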
Fusion and dual encoder A couple of models leverage the benefits of both encoding approaches (Du et al., 2022).
FLAVA (Singh et al., 2021) is a holistic approach that utilizes a dual encoder to integrate visual and text representations into a multimodal one, providing unimodal and multimodal reasoning capabilities in the same model. Additionally, VLMo offers a flexible format by providing a fusion encoder for classification tasks and a dual encoder for retrieval tasks.
3.4.3 Self-supervised Pre-training
VL pre-training datasets A VL model can be pre-trained on unimodal data (text and/or visual data independently) and multimodal data (paired image-text data). Unimodal data can be unlabelled, noisy and abundant, such as sets of documents or images, while multimodal data are labelled and cleaner, so that each modality presents another point of view of the same instance. Large-scale datasets containing visual and linguistic information are necessary for pre-training, which aspires to instill a generic understanding of the visual world, natural language and the interactions between them. Widely used datasets can either be in-domain, meaning that their data distribution is very close to the task-specific datasets used in the fine-tuning stage, or out-of-domain, containing data less similar to the downstream tasks but usually being much larger in size. Most VL transformers leverage a corpus consisting of COCO (Lin et al., 2014), Visual Genome (Krishna et al., 2016), SBU (Ordonez et al., 2011) and Conceptual Captions (CC) (Sharma et al., 2018) for pre-training, with fewer models either excluding some of those, or adding more datasets to this corpus.
COCO (Lin et al., 2014) consists of around 106K images/533K captions in the train split, and 25K images/5K captions in the test split, with 5 captions provided per image by different annotators. Those captions are designed to provide an overall (global) understanding of their corresponding images, which represent scenes of multiple objects and their in-between relationships. As most VL tasks are built atop COCO, it is considered an in-domain dataset. Because COCO is also used for downstream tasks, the subsets of COCO used in pre-training and fine-tuning must be mutually exclusive to avoid any data leakage. Visual Genome (VG) (Krishna et al., 2016) is another in-domain dataset, from which several VL tasks emerge. It contains more than 100K images of complex scenes, providing multiple annotations per image regarding scene graphs, objects (3.8M instances), relationships (2.3M), attributes (2.8M), visual questions and answers (1.7M), region descriptions (5.4M) and region scene graphs (3.7M). Region descriptions are captions grounded to image regions, acting as dense captions per image. As VG images partially overlap with COCO images, any COCO image used in downstream tasks should be excluded from VG during pre-training. SBU Captions (Ordonez et al., 2011) consists of 990K images/990K captions in the train split, and 10K images/10K captions in the test split, with each image corresponding to one caption. It is an out-of-domain dataset, larger in size than COCO and VG. Conceptual Captions (CC) (Sharma et al., 2018) is the largest out-of-domain dataset used for pre-training, consisting of more than 3 million images for training and 14K for validation, with 1 caption each.
Pre-training objectives Uni-modal pre-training objectives, concerning either T (language objectives) or I (vision objectives) at a time, as well as cross-modal objectives, which take both modalities into account at once, are used for pre-training. Such objectives teach the models to infer missing information by understanding its surroundings. Self-supervised learning is the most prevalent technique, enabling learning with the help of a corrupted part of the input, or of adversarially matched pairs in a contrastive fashion.
Language objectives are designed to implicitly learn linguistic rules and patterns, so that a model pre-trained on them acquires some 'understanding' of natural language. Some language modeling objectives are analyzed below.
Masked Language Modeling (MLM) is a pre-training task introduced in BERT (Devlin et al., 2019). Tokens in the input sentence are masked at random with a special [MASK] token. The model needs to uncover the actual token by learning the context from its surroundings, thus permitting contextualized representations. The default probability of masking out a token is 0.15. This objective is bidirectional, meaning that a token can be predicted from the tokens to either its right or its left. Prefix Language Modeling (PrefixLM) (Wang et al., 2021c) differs from standard MLM in that it enables bi-directional attention on a prefix of the sequence, while the remaining tokens are predicted autoregressively. Next Sentence Prediction (NSP) refers to the objective where, given a pair of sentences, the model learns to predict whether they could serve as consecutive sentences in a corpus or not.
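As a rough illustration, the corruption step of MLM can be sketched in a few lines of plain Python. The function name and word-level masking are illustrative assumptions; real implementations operate on subword IDs and also sometimes keep or randomly substitute the selected tokens instead of always inserting [MASK]:

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Randomly corrupt tokens with [MASK]; return the corrupted sequence
    and a map from masked positions to the ground truth tokens the model
    must recover from the surrounding context."""
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(MASK_TOKEN)
            targets[i] = tok  # ground truth for the MLM loss
        else:
            corrupted.append(tok)
    return corrupted, targets

# a high mask_prob is used here only to make the toy example visible
corrupted, targets = mask_tokens("the cat sat on the mat".split(),
                                 mask_prob=0.5, seed=0)
```

During pre-training, the cross-entropy loss is computed only at the positions stored in `targets`.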
Vision objectives apply similar ideas to the visual modality. Due to the higher-dimensional nature of images compared to text, the masking tasks are more challenging to design.
Masked Region Modeling (MRM) applies zeros over image regions, asking the model to infer missing parts. Semantic and pixel level information can be retrieved, naturally leading to two sub-tasks: • Classification of regions can be used to obtain a discrete signal corresponding to semantic entities.
• Regression of region features provides a more continuous, pixel-level understanding of the missing part.
Random Pixel Sampling (Huang et al., 2020b) tackles overfitting similarly to the Dropout mechanism: in each iteration, a random subset of pixels is selected and fed into the transformer network. The corrupted image that the model receives each time enables more robust representations, as relying on semantics instead of single pixels is encouraged.
Cross-modal objectives exploit information from both the text and image modalities jointly for learning. The design of cross-modal pre-training tasks is more challenging compared to their unimodal counterparts, as it must be ensured that the model does not rely exclusively on a single modality (Wang et al., 2021c; Du et al., 2022).
Sequence-to-Sequence Language Modeling (Seq2seqLM) applies unidirectional attention, meaning that a masked token can attend only to previous tokens and not to future ones. Seq2seqLM attempts to directly maximize the likelihood of a text-image pair (W, V) from dataset D:

\mathcal{L}_{Seq2seqLM}(\theta) = -\mathbb{E}_{(W,V) \sim D} \sum_{t} \log P_\theta(w_t \mid w_{<t}, V)

Masked Language Modeling with Image (MLMI) attempts to recover the corrupted tokens by also consulting the image, rather than the linguistic part exclusively. Inferring the masked tokens is achieved by minimizing the negative log-likelihood, where w_m are the masked tokens, w_{\backslash m} the unmasked ones, W a text sample, V a visual sample and D the dataset, with (W, V) ∈ D:

\mathcal{L}_{MLMI}(\theta) = -\mathbb{E}_{(W,V) \sim D} \log P_\theta(w_m \mid w_{\backslash m}, V)

Whole Word Masking (WWM) (Kim et al., 2021) is an extension of MLMI which masks out entire words rather than subword tokens, so that the model is encouraged to consult the visual part to infer the corrupted word instead of guessing the missing part from its unmasked constituents. This enforces a harder task on the transformer, achieving better cross-modal alignment. Masked Region Modeling with Language (MRCL) is the complementary task of MLMI. Instead of text tokens, image region features are masked out with some probability (by default 0.15). Masking does not involve a special token for the visual modality; it is implemented by filling the corresponding image regions with zeros. The model is tasked to reconstruct the masked visual regions v_m based on information provided by the unmasked features v_{\backslash m} and the text W, optimizing the objective:

\mathcal{L}_{MRM}(\theta) = \mathbb{E}_{(W,V) \sim D} \, f_\theta(v_m \mid v_{\backslash m}, W)

Reconstruction of visual features can yield two tasks, offering different insights into the high-dimensional problem of feature prediction: • Masked Region Feature Regression (MRFR) refers to producing features of the same dimensionality as the visual region. This is achieved by applying L_2 regression between the predicted and the ground truth visual vector to minimize their distance.
Considering the transformer prediction FC(h_{v_i}), obtained by passing the output of the masked region h_{v_i} through a fully connected (FC) layer, and the region feature \hat{E}_v(v_i) of region v_i, the minimization of the L_2 distance can be written as:

f_\theta(v_m \mid v_{\backslash m}, W) = \sum_{i=1}^{m} \lVert FC(h_{v_i}) - \hat{E}_v(v_i) \rVert_2^2

• Masked Region Classification (MRC) aims to predict the semantic class of the masked image region. The transformer prediction is compared with the output of an object detector, whose highest-confidence object label serves as the actual target. The cross-entropy (CE) loss between the object detector label c(v_i) and the transformer prediction FC(h_{v_i}) over m regions needs to be optimized:

f_\theta(v_m \mid v_{\backslash m}, W) = \sum_{i=1}^{m} CE(c(v_i), FC(h_{v_i}))

An extension of MRC can consider the overall distribution of object detector label predictions instead of focusing exclusively on the top class. In that case, the objective function attempts to minimize the distance between the object detector distribution and the transformer's predicted distribution, which is equivalent to minimizing the KL divergence between those two distributions. The objective function for Masked Region Classification with KL-Divergence (MRC-kl), for a distribution of object detector labels \tilde{c}(v_i) instead of the top object detector label c(v_i), can be written as:

f_\theta(v_m \mid v_{\backslash m}, W) = \sum_{i=1}^{m} D_{KL}(\tilde{c}(v_i) \,\|\, FC(h_{v_i}))

Image-Text Matching (ITM) enables learning visiolinguistic matches at a global level. It can be viewed as a multimodal extension of next sentence prediction (NSP), where the model needs to recognize whether a given pair of text and image is in fact matched or not, as both positive and negative pairs are sampled. The alignment probability between text and image is provided by a score function s_θ, and the binary cross-entropy (CE), for match label y ∈ {0, 1}, needs to be optimized:

\mathcal{L}_{ITM}(\theta) = -\mathbb{E}_{(W,V) \sim D} \left[ y \log s_\theta(W, V) + (1 - y) \log(1 - s_\theta(W, V)) \right]

Word-Region Alignment (WRA) is a more fine-grained version of ITM, where words have to be grounded to image regions.
Contrastive objectives act upon data pairs projected onto the same semantic space, so that the model learns a representation based on their similarity. Cross-Modal Contrastive Learning (CMCL) learns to place matching image-text pairs close together, while pushing apart any mismatched pairs. The contrastive loss for the i-th and j-th pairs sampled from D, where v_i refers to the image of the i-th pair, w_j refers to the text of the j-th pair, s_θ(v_i, w_j) = v_i^T w_j is a scoring function which is maximized for matching pairs, and σ serves as a learnable temperature parameter, can be written as:

\mathcal{L}_{CMCL}(\theta) = -\mathbb{E}_{(v_i, w_i) \sim D} \log \frac{\exp(s_\theta(v_i, w_i)/\sigma)}{\sum_{j} \exp(s_\theta(v_i, w_j)/\sigma)}
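The CMCL objective can be sketched in plain Python for a toy similarity matrix. The function name, the toy scores and the temperature value are illustrative assumptions; real implementations operate on batched embeddings and typically sum both the image-to-text and text-to-image directions:

```python
import math

def cross_modal_contrastive_loss(sim, temperature=0.07):
    """Image-to-text InfoNCE loss over a similarity matrix sim[i][j] = s(v_i, w_j).
    Matching pairs lie on the diagonal; each row is a softmax classification
    problem whose correct class is its own column."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        logits = [s / temperature for s in sim[i]]
        m = max(logits)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        total += -(logits[i] - log_norm)  # -log softmax of the positive pair
    return total / n

# toy cosine similarities: the diagonal (matched pairs) dominates
sim = [[0.9, 0.1, 0.0],
       [0.2, 0.8, 0.1],
       [0.0, 0.1, 0.7]]
loss = cross_modal_contrastive_loss(sim)
```

A well-separated similarity matrix like the one above yields a loss near zero, while a uniform matrix yields the maximum-entropy value log(n).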

Task-specific fine-tuning
Current VL tasks are usually created by extending existing tasks in the NLP or vision domains. They may be either discriminative or generative, and usually there are variants addressing both problems (Mogadala et al., 2021). Visual Question Answering (VQA) Given an image i ∈ I and a natural language question q ∈ T , a VQA model is tasked to predict the correct answer a ∈ T . VQA is an extension of NLP question answering (QA) to include the visual modality. It can be viewed as a classification task, where the predicted answer is selected among a set of candidate answers. Alternatively, free-form answers can be generated, forming a generative VQA task. Widely used datasets for VQA, containing complex scenes of objects and relationships, are the original VQA dataset, VQAv2 (Goyal et al., 2016), GQA (Hudson and Manning, 2019), Visual7W, Visual Genome QA (Krishna et al., 2016) and COCO QA (Ren et al., 2015). Additionally, datasets providing explanations for answer selections, such as VQA-E, have recently emerged.
Visual Entailment (VE) extends the default task of textual entailment by answering whether an image i ∈ I acting as the premise semantically entails the given textual hypothesis h ∈ T . The hypothesis h can either entail, contradict or remain neutral with respect to the premise, providing an answer a to the visual premise i. SNLI-VE (Xie et al., 2019) is a dataset used for VE. Moreover, e-SNLI-VE (Do et al., 2020) corrects label errors present in SNLI-VE due to its automatic assembling while providing human-written explanations in natural language for the corrected SNLI-VE corpus.

Visual Referring Expressions (VRE) or Visual Grounding extends NLP referring expressions, attempting to ground a textual phrase s to an object or region of each image i ∈ I. Datasets used for VRE are CLEVR-Ref+ (Liu et al., 2019a), RefCOCO, RefCOCO+, RefCOCO-g (Yu et al., 2016) and GuessWhat (de Vries et al., 2016). VRE also presents a generative counterpart, Visual Referring Expressions Generation, where natural language generation can be used to create the textual phrases.
Visual Dialog (VD) is the analogue of chatbots, aiming to maintain a meaningful conversation by responding to consecutive textual inputs q ∈ T . VD is tasked to create such a dialog upon a given image i ∈ I. VisDial (Das et al., 2016) is a dataset for VD that was proposed together with the introduction of the task. Other datasets for multimodal dialog are GuessWhat (de Vries et al., 2016) and CLEVR-Dialog (Kottur et al., 2019).
Image retrieval from text or Text-Image Retrieval (TIR) attempts to return the most suitable image i ∈ I within a database according to a natural language description c ∈ T . There is also the inverse task of text retrieval from image or Image-Text Retrieval (ITR) that searches the optimal c ∈ T given an image i ∈ I. Crossmodal retrieval, referring to retrieving any modality from the other one is an extension of the NLP task of document retrieval. Common datasets used in TIR/ITR are COCO (Lin et al., 2014) and Flickr8k/30k (Young et al., 2014) which contain images paired with 5 captions each.
Image Captioning (IC) is a generative task, extending natural language generation (NLG) to describe images: given an image i ∈ I, provide a sentence c that describes it. IC can be viewed as the generative counterpart of text retrieval from image (ITR). Dense Captioning is a fine-grained analogue of IC that requires generating descriptions for image regions instead of providing a single global visual caption. Conceptual Captions (CC) (Sharma et al., 2018), SBU (Ordonez et al., 2011), COCO (Lin et al., 2014) and Flickr8k/30k (Young et al., 2014) are widely used datasets for image captioning.
Visual Storytelling (VIST) is the extension of image captioning to a sequence of N related images i 1 , i 2 , ..., i N ∈ I. Generated captions c 1 , c 2 , ..., c N ∈ T should be consistent with each other throughout the sequence, forming a textual 'story'. Datasets related to VIST are Visual Storytelling Dataset (VIST) (Huang et al., 2016a) which models social language regarding visual concepts, and New York City Storytelling (NYC-Storytelling) (Park and Kim, 2015), which contains narratives from blogs.
Multimodal Machine Translation (MMT) is assisted by the visual modality to translate between two languages, as an extension of the machine translation task. Multi30K-MMT (Elliott et al., 2016) contains multilingual descriptions of images in the English, German, French, and Czech languages.
Visual Reasoning (VR) extends visual perception tasks, such as object detection and classification, semantic segmentation and others. VR needs to predict meaningful relationships between image entities, which is similar to creating a scene graph. Compositional reasoning refers to the task where attributes need to be combined so that the identity of the whole can be inferred. Popular datasets for VR are Compositional Language and Elementary Visual Reasoning (CLEVR) (Johnson et al., 2017), Relational and Analogical Visual rEasoNing (RAVEN), Natural Language Visual Reasoning (NLVR) (Suhr et al., 2017) and Natural Language Visual Reasoning for Real (NLVR2) (Suhr et al., 2019). In order to test the ability of VR models on novel attribute combinations, the CLEVR-CoGenT dataset was proposed as an extension of CLEVR. The real-world compositional questions of GQA (Hudson and Manning, 2019) can also be utilized for VR models. Finally, AQUA (Garcia et al., 2020) is a visual reasoning dataset dedicated to the artistic domain.
Visual Commonsense Reasoning (VCR) attempts to understand an image i ∈ I by incorporating commonsense knowledge relationships to explain the answer a ∈ T derived for each question q ∈ T . It can also be viewed as an extension of the VQA task, where instead of merely providing an answer a to a given visual question q, a rationale r ∈ T justifying the choice of answer needs to be returned as well. The answer a and rationale r can also be generated, thus yielding the Visual Commonsense Generation (VCG) task. Widely used datasets are VCR (Zellers et al., 2019) and Visual COMmonsense rEasoning in Time (Visual COMET).
Vision-and-Language Navigation (VLN) can be the equivalent of either visual navigation or linguistic navigation in the multimodal domain. Some datasets for VLN are Cooperative Vision-and-Dialog Navigation (CVDN) (Thomason et al., 2019), Action Learning From Realistic Environments and Directives (ALFRED) (Shridhar et al., 2020) and others.

Image generation
There is a lot of architectural variation when the modality to be generated is the visual one, greatly diverging from architectures employed for discriminative visual tasks.
Image generation can be performed by using Generative Adversarial Networks (GANs), Transformers, Diffusion models or a combination of them.

Conditional image generation tasks
Conditional Image Generation (cIG) addresses the synthesis of an image i guided by textual information s ∈ T , extending unconditional image generation. Traditionally, text-to-image generation is performed by conditional GANs (cGANs) (Mirza and Osindero, 2014), which attempt to generate realistic images corresponding semantically to a text description. Image generation from text can be considered the generative counterpart of image retrieval from text (TIR), and can also be regarded as the inverse task of image captioning (IC). Text-to-image GANs have been benchmarked on a plethora of datasets: some contain simpler images with simple conditioning, such as ImageNet (Deng et al., 2009), Oxford Flowers (Nilsback and Zisserman, 2008), FFHQ (Karras et al., 2018) and CIFAR (Krizhevsky, 2009); others provide longer conditioning, such as captions in natural language, as in CUB (He and Peng, 2020; Reed et al., 2016b); and others contain even more complex scenes accompanied by captions as conditioning, like COCO (Lin et al., 2014).
Story Visualization (SV) refers to the synthesis of a visual sequence i 1 , i 2 , ..., i N based on an input story c 1 , c 2 , ..., c N ∈ T , the inverse task of visual storytelling (VIST). Once again, GAN architectures are leveraged to produce the sequence of images, which should not only be realistic, but also maintain the serial progression from frame to frame while remaining relevant to their textual descriptions. Datasets accompanying SV research are CLEVR-SV (Li et al., 2019c), Pororo-SV (Kim et al., 2017), Flintstones-SV and DiDeMo-SV (Hendricks et al., 2017).

Generative VL architectures
Text-to-image GANs Adversarial text-to-image synthesis has demonstrated a long line of impressive results by converting textual inputs, such as captions, interactive dialogs, sequential story-like captions, or structured formats of textual input such as scene graphs and layouts, to plausible images. Some of the first ventures (Reed et al., 2016a,d) achieve image synthesis, even though the quality is fairly low. To resolve this limitation, subsequent implementations synthesize images in stages, increasing image resolution step by step. StackGAN and StackGAN++ (Zhang et al., 2018a) exploit stacked generators and discriminators, dedicated to coarser or finer resolutions. AttnGAN focuses on individual words rather than whole sentences to synthesize finer details of the image. This idea is extended by SEGAN, which attends to relevant keywords from the sentence rather than all existing words. Attending to the most significant parts of the sentence with respect to the image helps synthesize images with fewer fuzzy details, as reported in DM-GAN (Zhu et al., 2019). StoryGAN (Li et al., 2019c) introduced the Story Visualization (SV) task, which concerns synthesizing a sequence of related images that remains consistent across story frames. The main StoryGAN architecture consists of a generator and two discriminators (image and story discriminators) guided by an RNN structure responsible for encoding the textual story. Improvements on the basic model were performed in (Zeng et al., 2019), among others. More recent models substituted the RNN encoding scheme with transformers, following the same trend as in other multimodal tasks. Generative VL Transformers Surprisingly, VL transformers in their original form are not capable of generating realistic images despite their impressive capabilities on other visual tasks.
This issue is attributed to the fact that regression-based training objectives, like the one used in LXMERT, cannot handle feature generation in high-dimensional spaces. Taking VL transformers one step further, X-LXMERT (Cho et al., 2020), an extension of LXMERT (Tan and Bansal, 2019), generates caption-conditioned high-fidelity images consistent with their descriptions. X-UNITER follows the same extension logic as X-LXMERT, based on the UNITER architecture. Its results in image generation are comparable to X-LXMERT, showcasing the general applicability of this approach. DALL-E (Ramesh et al., 2021) uses a 12-billion-parameter version of GPT-3 to generate fine-grained and highly diverse images based on corresponding textual descriptions. It demonstrates a large range of conditional synthesis capabilities, such as controlling certain visual attributes, positioning objects accurately, capturing and visualizing 3D scenes, rendering natural effects like reflections, inferring missing details from descriptions, combining unrelated concepts in one image, performing zero-shot reasoning in the visual domain, incorporating external spatial and temporal knowledge, and others. Once again, a connection between model scale and advanced performance is observed in terms of zero-shot generation and the range of generalization capabilities.
Text-to-image diffusion models are setting new baselines for state-of-the-art conditional image generation (Dhariwal and Nichol, 2021). They follow a process of gradually adding Gaussian noise to an image and then learning to reconstruct it. Over the last year, the field of diffusion-based image synthesis has seen a variety of interesting works. Stable Diffusion (Rombach et al., 2021) enables previously computationally demanding image synthesis even in limited-resource scenarios by applying the diffusion training process in the latent space of autoencoders rather than performing pixel-level operations in image space. DALL-E 2 (Ramesh et al., 2022) extends the high-quality results of its predecessor (Ramesh et al., 2021) by using learned text-conditioned image embeddings obtained from CLIP (Radford et al., 2021) as conditioning for a diffusion model acting as the decoder. Synthesized images are photorealistic and faithful to their conditioning, while zero-shot language-guided manipulation of a source image is also possible. The concurrent work of Imagen (Saharia et al., 2022) exploits large pre-trained language models, such as T5 (Raffel et al., 2020), for language encoding and proceeds with image synthesis based on the diffusion process. DreamBooth (Ruiz et al., 2022) steps upon Imagen to contextualize image synthesis given variable context described in text. Therefore, different variations of a visual subject can be obtained, maintaining high synthesis quality.
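The forward (noising) half of the diffusion process described above can be sketched for a toy feature vector. The linear beta schedule, its endpoint values and the function name are illustrative assumptions; real models operate on full image tensors and learn the reverse (denoising) direction:

```python
import math
import random

def forward_diffuse(x0, t, T=1000, beta_start=1e-4, beta_end=0.02, rng=None):
    """Sample x_t from q(x_t | x_0) under a linear beta schedule:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,  eps ~ N(0, I).
    alpha_bar_t is the cumulative product of (1 - beta_i) up to step t."""
    rng = rng or random.Random(0)
    alpha_bar = 1.0
    for i in range(t):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        alpha_bar *= 1.0 - beta
    noised = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * rng.gauss(0, 1)
              for x in x0]
    return noised, alpha_bar

x_t, alpha_bar = forward_diffuse([1.0, -1.0, 0.5], t=1000)
```

At t = 0 the sample is untouched, while at t = T almost all signal has been replaced by Gaussian noise, which is exactly what the learned reverse process must undo.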
Combined GAN/Transformer models Hybrid architectures utilize existing GAN generators together with powerful VL transformers such as CLIP (Radford et al., 2021). Given an input text prompt, CLIP is responsible for guiding image synthesis in the latent space, based on the text-image similarities it has learned. FuseDream (Liu et al., 2021b) follows this paradigm by optimizing the latent space of pre-trained GANs for efficient navigation. Similar hybrid text-to-image implementations are Big Sleep and VQGAN+CLIP.
3.6 Evaluation metrics for VL models
3.6.1 Classification metrics
Discriminative VL tasks that provide an answer chosen among pre-defined candidates can be evaluated via classification metrics. Such tasks include visual question answering (VQA), visual referring expressions (VRE), visual reasoning (VR) and visual entailment (VE).
Accuracy@k measures the proportion of correct answers over all answers, where an answer is considered correct if it belongs to the top-k answers. It can serve as a generic measure of quality for discriminative VL tasks, irrespective of the modality that is predicted as the answer: for example, in the case of VQA, accuracy refers to the linguistic modality, as a textual answer needs to be selected. When bounding boxes have to be predicted, as in the case of VRE, Intersection over Union (IoU) provides a measure of success regarding the overlap between ground truth and predicted bounding boxes. Higher values are better, indicating a larger percentage of overlap.
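A minimal IoU computation for axis-aligned boxes might look as follows; the `(x1, y1, x2, y2)` corner convention and function name are assumptions for illustration:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # intersection rectangle dimensions (zero if the boxes do not overlap)
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

Identical boxes score 1.0, disjoint boxes 0.0; a prediction is typically counted as correct when IoU exceeds a threshold such as 0.5.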

3.6.2 Ranking metrics
Ranking metrics (Manning et al., 2008) provide further insight into the success of retrieving the right answer by reporting the position at which the ground truth answer was ranked. Tasks commonly using ranking metrics are image captioning (IC), visual storytelling (VIST), visual dialog (VD) and machine translation (MT). R@k - Recall@k measures the proportion of total ground truth instances that were found in the top-k rank, without taking their ordering into account. Higher Recall@k scores are better. P@k - Precision@k is the percentage of ground truth answers in the top-k over all retrieved top-k items, without taking their ordering into account. Higher Precision@k scores are better. When ordering is to be considered, MRR - Mean Reciprocal Rank acts as a useful performance measure. The reciprocal rank of an answer is the inverse of the rank position of the first correct answer. Averaging over all instances yields MRR, with higher values indicating better performance. For N instances, if rank_i denotes the position of the first correct answer for instance i, MRR can be written as:

MRR = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{rank_i}

Another order-sensitive metric is Discounted Cumulative Gain (DCG), which measures the gain an answer offers based on its ranking by taking into account a graded relevance scale. The graded relevance value is assumed to reduce logarithmically with respect to the position, thus penalizing highly relevant answers that appear lower in the rank. For rel_i the graded relevance at position i and p a particular rank position, the DCG score at p is defined as:

DCG_p = \sum_{i=1}^{p} \frac{rel_i}{\log_2(i + 1)}

nDCG - Normalized DCG is a normalized rank quality score that represents the ratio between the DCG of the rank returned by an algorithm and the DCG of the ideal ordering (iDCG): nDCG_p = DCG_p / iDCG_p.
Median Rank is the median position of ranked ground truth answers when all answers have been considered. In a similar fashion, Mean Rank denotes the average position of ranked ground truth answers when all answers have been considered. Lower median and mean rank values are better.
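The MRR and Recall@k definitions above can be sketched directly; the function names and toy inputs are illustrative assumptions:

```python
def mean_reciprocal_rank(ranks):
    """MRR over instances, where ranks[i] is the 1-based position
    of the first correct answer for instance i."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def recall_at_k(ranked_lists, relevant_sets, k):
    """Fraction of all relevant items that appear in the top-k
    of their corresponding ranked list (ordering within top-k ignored)."""
    hits = sum(len(set(ranked[:k]) & relevant)
               for ranked, relevant in zip(ranked_lists, relevant_sets))
    total = sum(len(relevant) for relevant in relevant_sets)
    return hits / total
```

For example, first-correct-answer ranks of 1, 2 and 4 give an MRR of (1 + 1/2 + 1/4) / 3.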

3.6.3 Similarity metrics
WUPS -Wu-Palmer similarity (Wu and Palmer, 1994) reports the degree of similarity between two words based on their least common subsumer on WordNet (Fellbaum, 1998) taxonomy. Various thresholds can define the agreement between two candidates, with the most typical being 0.0 and 0.9.
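A sketch of Wu-Palmer similarity over a tiny hand-made taxonomy, used here as a stand-in for WordNet; the child-to-parent map, the depth convention (root at depth 1) and the function name are hypothetical:

```python
def wu_palmer(a, b, parent):
    """Wu-Palmer similarity on a taxonomy given as a child -> parent map:
    sim = 2 * depth(LCS) / (depth(a) + depth(b)), with the root at depth 1."""
    def path_to_root(node):
        path = [node]
        while node in parent:
            node = parent[node]
            path.append(node)
        return path  # node ... root

    pa, pb = path_to_root(a), path_to_root(b)
    ancestors_a = set(pa)
    lcs = next(n for n in pb if n in ancestors_a)  # least common subsumer
    depth = lambda n: len(path_to_root(n))
    return 2.0 * depth(lcs) / (depth(a) + depth(b))

# hypothetical WordNet-like fragment
taxonomy = {"cat": "feline", "feline": "mammal",
            "dog": "canine", "canine": "mammal",
            "mammal": "animal"}
```

Here "cat" and "dog" share "mammal" as their least common subsumer, so their similarity is well below the 0.9 threshold, while identical words score 1.0.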

3.6.4 Language generation metrics
Language generation tasks such as IC, VIST and the generative versions of VQA and VCR (VCG) evaluate produced sentences using the following language generation metrics. One of the oldest metrics, BLEU - Bilingual Evaluation Understudy (Papineni et al., 2002), was originally developed to evaluate machine translation; it compares how closely a machine-generated linguistic output matches text written by a human in a precision-oriented way. It is therefore supposed to demonstrate high agreement with human perception regarding the quality of generated text, with a higher BLEU score indicating higher perceived quality and the maximum BLEU score being equal to 1 or 100%. BLEU can measure not only individual word matches, but also n-gram (grouped word) matches, providing BLEU-N scores for different N (most usually N = 1 for unigrams, 2 for bigrams, 3 for trigrams, 4 for quadrigrams). Nevertheless, it cannot assess the linguistic diversity of generated text and sometimes cannot appropriately evaluate generated text quality in practice. BLEU seems to be rather reliable for long sentences, but not very helpful for short/monolingual ones.
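The core of BLEU-N, the clipped (modified) n-gram precision, can be sketched as follows; a full BLEU score would additionally combine several N values geometrically and apply a brevity penalty, and the function name is an assumption:

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Modified (clipped) n-gram precision, the core term of BLEU-N:
    each candidate n-gram is credited at most as many times
    as it appears in the reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0
```

Clipping handles the classic degenerate case: against the reference "the cat", the candidate "the the the the" gets unigram precision 1/4 rather than 1.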
ROUGE - Recall Oriented Understudy for Gisting Evaluation (Lin, 2004) is designed to evaluate the quality of a summary compared to a human-made one. Similarly to BLEU, n-grams of varying N are utilized to calculate ROUGE-N scores. ROUGE-N offers not only recall, but also precision and F1 evaluation. Specifically, ROUGE-N precision measures how many overlapping n-grams were found relative to the total n-grams of the generated text:

ROUGE-N precision = (count of n-grams common to generated and reference) / (count of n-grams in generated)

ROUGE-N recall assesses how many overlapping n-grams were found relative to the human-made reference:

ROUGE-N recall = (count of n-grams common to generated and reference) / (count of n-grams in reference)

Finally, the ROUGE-N F1 score takes into account both ROUGE-N precision and recall:

ROUGE-N F1 = 2 · (precision · recall) / (precision + recall)

Additionally, there is the ROUGE-L score, which measures the longest common subsequence (LCS) between generated and ground truth text, with a longer LCS implying higher similarity. ROUGE-L can operate at sentence level or summary level and demonstrates precision, recall and F1-score variants. Specifically, ROUGE-L precision measures the LCS length relative to the n-gram count of the generated text:

ROUGE-L precision = LCS(generated, reference) / (count of n-grams in generated)

Moreover, ROUGE-L recall measures the LCS length relative to the n-gram count of the ground truth:

ROUGE-L recall = LCS(generated, reference) / (count of n-grams in reference)

The ROUGE-L F1 score takes into account both ROUGE-L precision and recall, in the same manner as ROUGE-N F1. There are also some other rarely used variants of the ROUGE score. ROUGE-W searches for the weighted longest common subsequence. ROUGE-S (Skip-Bigram Co-Occurrence Statistics) measures the overlap of non-consecutive n-grams between generated and reference sequences. Similar to BLEU, ROUGE scores are more effective on long sentences than on short ones.
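The ROUGE-N precision/recall/F1 computation can be sketched directly; the function name and toy inputs are assumptions, and ROUGE-L would replace the overlap count with the LCS length:

```python
from collections import Counter

def rouge_n(generated, reference, n=1):
    """ROUGE-N precision, recall and F1 from overlapping n-gram counts."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    gen, ref = ngrams(generated), ngrams(reference)
    overlap = sum(min(count, ref[gram]) for gram, count in gen.items())
    precision = overlap / sum(gen.values()) if gen else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

A short generated text fully contained in a longer reference yields perfect precision but low recall, which is why the F1 variant is often reported.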
METEOR - Metric for Evaluation of Translation with Explicit Ordering (Banerjee and Lavie, 2005) relaxes the need for exact word matching, as in the BLEU score, and instead enforces semantic matching, taking into account possible synonyms and paraphrases of words in the reference text. Semantic matches are made possible by the usage of WordNet (Fellbaum, 1998). Unigram alignment between ground truth and generated sequences contributes to the METEOR score. Multiple possible alignments between the two sequences are resolved by selecting the alignment with the fewest crosses when ground truth and generated unigrams are matched. METEOR presents high agreement with human perception regarding the similarity of generated text sequences.
CIDEr - Consensus-based Image Description Evaluation (Vedantam et al., 2014) is a metric inspired by human agreement when ground truth and generated sentences are compared. Such similarity can be expressed via the TF-IDF score of n-grams across reference sentences: n-grams that occur frequently across the whole dataset receive lower weight in the final score, as they are likely less informative (IDF term), while at the same time frequent n-grams within a reference sentence receive higher weight (TF term). Similar to METEOR, CIDEr exploits semantic matching by comparing stemmed versions of the words in sentences. A cosine similarity score between generated and ground truth vectors computed from TF-IDF n-gram weights provides the final CIDEr score.
SPICE - Semantic Propositional Image Captioning Evaluation (Anderson et al., 2016) is an automated evaluation metric for language generation operating over scene graphs, which are synthesized from ground truth and generated captions via dependency parsing. WordNet is utilized for disambiguation and synonym detection. Even though it resolves shortcomings of previous methods related to n-gram overlaps, the scene graph construction stage may induce errors early in the evaluation process.
BLEURT (Sellam et al., 2020) is a BERT-based learned evaluation metric that addresses the shortcomings of traditional BLEU and ROUGE metrics, so that the resulting scores correlate better with human perception. It combines the merits of learning pure linguistic associations with those of human assessments over language metrics. To this end, pre-training on synthetic sentence pairs, designed to capture semantic, syntactic and lexical information, assists in identifying possible dissimilarities between real and generated text. Afterwards, fine-tuning adapts the automatically captured disagreements to actual human ratings.
3.6.5 Image generation metrics
IS - Inception Score (Salimans et al., 2016) evaluates the quality and diversity of generated images by utilizing a pre-trained image classifier, such as Inception V3 (Szegedy et al., 2015). The classifier provides a class probability distribution for each generated image: sharp per-image distributions indicate recognizable, high-quality images, while a diverse marginal distribution over the whole generated set indicates variety. IS has a lowest value of 1.0 and a highest value equal to the number of classes the pre-trained classifier has seen, which in the case of Inception V3 are the 1000 classes of ImageNet (Deng et al., 2009). Higher IS implies better quality and more diversity.
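Assuming the classifier's per-image class probabilities are collected in a matrix, the score follows the definition IS = exp(E_x[KL(p(y|x) || p(y))]) and can be sketched as:

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """IS from an array of per-image class distributions (rows sum to 1).

    IS = exp(mean over images of KL(p(y|x) || p(y))), where p(y) is the
    marginal class distribution over all generated images.
    """
    p_y = probs.mean(axis=0, keepdims=True)  # marginal over the image set
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```

Confident one-hot predictions spread evenly over C classes yield IS ≈ C (the maximum), while identical class distributions for every image yield IS = 1 (the minimum).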
FID - Fréchet Inception Distance (Heusel et al., 2017) compares the distribution of generated images with the distribution of the real ones by taking into account the mean and the covariance of the two distributions. A pre-trained Inception-V3 (Szegedy et al., 2015) model is utilized to extract summary statistics for real and generated images. Lower FID scores are better, indicating higher similarity between generated and real images.
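Given feature vectors for the two image sets (Inception-V3 activations in the original metric; any feature matrix in this sketch), the Fréchet distance between the two fitted Gaussians is ||μ1 − μ2||² + Tr(Σ1 + Σ2 − 2(Σ1Σ2)^(1/2)). A minimal NumPy-only sketch, using the identity Tr((Σ1Σ2)^(1/2)) = Tr((Σ1^(1/2) Σ2 Σ1^(1/2))^(1/2)) to keep the matrix square root on a symmetric matrix:

```python
import numpy as np

def _sqrtm_psd(a):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    w, v = np.linalg.eigh(a)
    w = np.clip(w, 0, None)  # guard against tiny negative eigenvalues
    return (v * np.sqrt(w)) @ v.T

def frechet_distance(feats_real, feats_gen):
    """FID between two sets of feature vectors (rows = samples)."""
    mu1, mu2 = feats_real.mean(0), feats_gen.mean(0)
    c1 = np.cov(feats_real, rowvar=False)
    c2 = np.cov(feats_gen, rowvar=False)
    s1 = _sqrtm_psd(c1)
    # Tr(sqrtm(C1 C2)) computed via the symmetric product sqrt(C1) C2 sqrt(C1)
    covmean = _sqrtm_psd(s1 @ c2 @ s1)
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(c1 + c2 - 2 * covmean))
```

Identical feature sets give a distance of zero; shifting every feature by a constant leaves the covariance term at zero and contributes only the squared mean difference.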
LPIPS -Learned Perceptual Image Patch Similarity (Zhang et al., 2018b) is a metric designed to reflect the perceptual similarity between real and generated images using deep features of image classifiers. Lower LPIPS values indicate more similar images.
R-precision is another widely used metric to evaluate synthesis quality. Despite primarily being a retrieval metric, it can serve conditional generative models: generated images are used as queries to retrieve their ground truth descriptions, with higher retrieval scores indicating that the generated images better reflect their conditioning text.

Graphs
A graph structure G = (V, E) consists of a set of nodes V interconnected by weighted or unweighted edges from a set E. Edges can also be either directed or undirected, depending on the constraints of the relationships they express.
Nodes and edges can additionally carry features. Some distinct subcategories of graphs, often present in real-world data representation scenarios, are mentioned below (Hamilton). Different types of relationships in data raise the need for distinct edge representations, which is satisfied via multi-relational graphs. The edge notation is extended to contain the edge type τ, so that a multi-relational edge is denoted (v_i, τ, v_j) ∈ E. Heterogeneous graphs extend multi-relational graphs by introducing varying node types, so that nodes carry labels forming non-overlapping node sets. Therefore, the node set V can be expressed as a union of node subsets, i.e. V = V_1 ∪ V_2 ∪ ... ∪ V_n for n distinct and disjoint node categories. Those categories most often impose constraints on the edge types as well, in order for the edges to remain meaningful. Multipartite graphs are heterogeneous graphs that exclusively contain edges connecting nodes of different types.
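As a minimal illustration (node names, types and relations are invented), a heterogeneous multi-relational graph can be stored as typed triples, and the multipartite constraint checked directly:

```python
# Heterogeneous graph: every node has a type, every edge a relation type τ.
node_type = {"u1": "user", "u2": "user", "m1": "movie", "m2": "movie"}
edges = [("u1", "rated", "m1"),
         ("u2", "rated", "m1"),
         ("u1", "rated", "m2")]

def is_multipartite(edges, node_type):
    """True if every edge connects nodes of different types."""
    return all(node_type[h] != node_type[t] for h, _, t in edges)
```

Adding a same-type edge, e.g. a follows relation between two users, keeps the graph heterogeneous but violates the multipartite property.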

Knowledge graphs
A knowledge graph (KG) G is a structured representation of facts F, which consists of entities E, relationships R and semantic descriptions. Entities can describe either abstract concepts or actual objects, relationships form the meaningful connections between entities, and semantic descriptions incorporate types and properties of those objects and relationships. KGs are directed and heterogeneous structures that describe human knowledge in the form of head-relationship-tail triplets, often denoted (h, r, t) ∈ F, or equivalently subject-predicate-object, denoted (s, p, o) ∈ F. While existing edges express known facts, there are two scenarios for missing edges: the Open World Assumption (OWA) assumes that unobserved facts are either missing or false, while the Closed World Assumption (CWA) assumes that all unobserved facts are false.
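The distinction matters when querying a KG for an unobserved triple; a toy sketch (the facts are invented examples):

```python
facts = {("Rome", "capital_of", "Italy"),
         ("Italy", "located_in", "Europe")}

def truth_value(triple, kb, assumption="OWA"):
    """Observed triples are true; unobserved ones are false under CWA
    but merely unknown (None) under OWA."""
    if triple in kb:
        return True
    return False if assumption == "CWA" else None
```

Under OWA a missing triple is a candidate for link prediction rather than a negative example, which is why most large KGs adopt it.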

Graph representation
Graph representation has been a field of increasing interest, as it affects numerous applications in artificial intelligence. Similar to language representations, KG embeddings are low-dimensional mappings of the graph entities and relationships. These vector representations capture the semantic information contained in KGs, which can consequently be used for a variety of downstream tasks (Hamilton et al., 2018). Node embeddings constitute a family of graph embedding algorithms aiming to represent the nodes' position and context in a vector space. Popular node embedding methods are based on random walks, with widely used implementations such as node2vec (Grover and Leskovec, 2016), DeepWalk (Perozzi et al., 2014), Large-scale Information Network Embedding (LINE) (Tang et al., 2015), Graph2vec (Narayanan et al., 2017) and others. However, standalone node embeddings are inadequate for multi-relational graph representations, which require an overall representation of nodes, edges and attributes. To this end, shallow translational models exploit geometric capabilities of the vector space to achieve multi-relational graph representations (Hamilton). Graph Neural Networks (GNNs) refer to a broad framework that achieves graph structure and feature representations using deep learning. The basic idea behind GNNs is neural message passing, meaning the exchange and aggregation of neighboring node information through their connections, resulting in updating the node embedding itself. The process repeats by expanding to more distant neighborhoods (hops) in every step, integrating more information regarding graph structure and neighboring features.
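The random-walk stage shared by DeepWalk and node2vec can be sketched as below (node2vec additionally biases the transition probabilities); the resulting walks are treated as "sentences" and fed to a skip-gram model such as word2vec to obtain node embeddings:

```python
import random

def random_walks(adj, walk_len=5, walks_per_node=2, seed=0):
    """Uniform random walks over an adjacency list {node: [neighbors]}."""
    rng = random.Random(seed)
    walks = []
    for start in adj:
        for _ in range(walks_per_node):
            walk = [start]
            # Extend the walk by repeatedly hopping to a random neighbor.
            while len(walk) < walk_len and adj[walk[-1]]:
                walk.append(rng.choice(adj[walk[-1]]))
            walks.append(walk)
    return walks
```

Nodes that co-occur frequently in walks end up close together in the embedding space, capturing both local neighborhood and community structure.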
More specifically (Hamilton), each node v ∈ V of a graph G is initialized with a random embedding h_v^(0), and the following hidden embedding states h_v are computed via a local transition function f such that:

h_v = f(x_v, x_co[v], h_Nv, x_Nv),

where x_v are the features corresponding to the node v, x_co[v] are the features of the edges connected to v, h_Nv are the embedding states of the neighbors of v and x_Nv their features. The output is given by a local output function g:

o_v = g(h_v, x_v).

Graph Convolutional Networks (GCNs) are a variant of convolutional neural networks operating on graphs. However, unlike images, graphs contain unordered nodes with varying numbers of connections. Given a graph G = (V, E, X), the GCN receives as input the feature matrix X containing node features and an adjacency matrix A representing the graph structure. The convolution operation is generalized so that the representation of the current node is obtained by aggregating its own features as well as the features of its neighbors. A nonlinear transformation is then applied to the aggregated features. Many approaches stack multiple convolutional layers so that feature information from more distant neighborhoods can be received (Kipf and Welling, 2016). GCNs can effectively be applied on relational data and serve a variety of relevant downstream tasks. Relational Graph Convolutional Networks (R-GCNs) constitute a GCN variant with relation-specific encodings for KGs, i.e. encodings depending on edge direction and type; therefore, different weights are assigned to different relationships (Schlichtkrull et al., 2017). Graph Attention Networks (GATs) are based on attention mechanisms, which have the ability to handle variable-sized inputs. Attention mechanisms can therefore be applied successfully to the representation of graphs containing nodes with different edge degrees. Indeed, self-attention allows each node to attend to its neighbors, thus computing the hidden encoding.
This process can be parallelized over neighboring pairs, independently of graph structure (Veličković et al., 2018). GNNs are able to learn representations on fixed and homogeneous graphs, but they are less powerful on graphs containing heterogeneous information on nodes and edges. In such cases, embedding extraction mechanisms should be adjusted appropriately to encode heterogeneous information. Various types of relationships create meta-paths between nodes, containing diverse and rich semantic information. GATs are extended by applying a hierarchical attention mechanism with node-level and semantic-level attention, assigning different importances to nodes and meta-paths. Neighboring features are aggregated in a hierarchical fashion, forming the final node embeddings. Graph Transformer Networks (GTNs) can also handle heterogeneity and generate reasonable meta-paths connecting the various nodes, as well as effective node representations. Meta-paths may be of any length up to the number of Graph Transformer layers and contain arbitrary edge types. Multiple generated meta-paths can be considered simultaneously, defining multiple learned graphs. By continuously generating new graph structures from the adjacency matrices contained in the data, more powerful node representations are learned through convolutions, as there are higher chances of finding more useful meta-paths. A GCN is applied on each meta-path, yielding an ensemble of GCNs when all generated meta-paths are considered (Yun et al., 2020). The Heterogeneous Graph Transformer (HGT) is designed to model web-scale heterogeneous graphs using heterogeneous attention. Node- and edge-level dependent parameters are exploited in order to obtain dedicated node and edge representations. HGT extracts all related pairs from a sampled heterogeneous subgraph, so that each pair contains a source node and a target node. The information from source nodes is aggregated to provide a contextualized representation for a target node.
This process can be decomposed into three parts, namely Heterogeneous Mutual Attention, Heterogeneous Message Passing and Target-Specific Aggregation. The weight matrices for the heterogeneous mutual attention, message passing and propagation steps are parameterized using the meta-relations of the graph. Dynamic graphs are handled using the relative temporal encoding technique, which captures arbitrarily long dynamic structural dependencies.
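The convolution operation described above for GCNs can be sketched in a few lines, using the commonly used symmetrically normalized propagation rule of Kipf and Welling, H' = ReLU(D^(-1/2)(A + I)D^(-1/2) H W); variable names are illustrative:

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    """One GCN layer: aggregate self + neighbor features, then transform."""
    a_hat = adj + np.eye(adj.shape[0])             # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))  # inverse sqrt of node degrees
    norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(norm @ feats @ weight, 0.0)  # ReLU nonlinearity
```

Stacking k such layers lets each node aggregate information from its k-hop neighborhood, matching the multi-hop message-passing view above.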

Graph representations in KVL models
An important aspect of KVL models is how knowledge is incorporated alongside the vision and language modalities. First attempts to incorporate knowledge stored in KGs were based on exact matching between existing KG nodes and visual and textual concepts extracted from the given input.
However, exact matching results in errors even when there is only a small discrepancy between KG entities and extracted concepts, limiting the contribution of additional knowledge. In order to better exploit semantically rich structured knowledge sources, more refined strategies, such as the usage of GNN representations, are explored in recent literature. The variability of GNN implementations described previously allows dedicated representations for different graphs, incorporating node, edge and feature information. Retrieved knowledge is therefore more accurate and informative, capable of boosting VL models.
In recent years, there has been an ever-increasing interest in incorporating external knowledge K in VL models. The merits of such an approach are clear: additional knowledge can offer performance boosting, extendability and potentially explainability of existing VL tasks. The most common kinds of additional knowledge offering those benefits are analyzed below.
Hierarchical knowledge refers to is-a relationships forming a tree structure, with the root serving as the most generic concept and parent node of all the rest, while leaves constitute the most specific concepts. For example, cat is-a mammal is an instance of such a hierarchical relationship.
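Hierarchical knowledge of this kind reduces to walking is-a edges toward the root; a toy sketch with invented entries:

```python
# Toy is-a hierarchy stored as child -> parent edges.
is_a = {"cat": "mammal", "mammal": "animal", "animal": "entity"}

def ancestors(concept, is_a):
    """Return the hypernym chain from a concept up to the root."""
    chain = []
    while concept in is_a:
        concept = is_a[concept]
        chain.append(concept)
    return chain
```

Such chains allow a model to generalize: anything known about mammal applies to cat by inheritance.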
Lexical knowledge serves as a structured dictionary, offering linguistic rules, and can resolve issues such as word sense disambiguation. Lexical knowledge can be combined with hierarchical knowledge, providing hypernym/hyponym relationships.
Named entities cover a variety of proper names as instances of entities, including names of people, locations, companies, organizations etc. (Grishman and Sundheim, 1996), for example Joe Biden is the president of the United States. Factual knowledge includes encyclopedic information about the world, such as the historical fact WW2 lasted from 1939 to 1945. Such knowledge can also refer to more specific scientific facts, like knowledge in the medical, biological and chemical domains and many more. Facts can also be combined with named entities, forming statements like Zebras live in Africa.
Commonsense knowledge is the self-evident perception of the world according to humans; sugar is sweet and if I go out in the rain I'll get wet are obvious commonsense statements. We can identify several discrete senses of commonsense knowledge, affecting aspects of the world a human experiences. Such subcategories refer to similarity/dissimilarity concepts, like synonyms and antonyms of words. Another commonsense variant includes part-whole (part-of) relationships, representing concepts belonging to more generic ones or constituting members of a group, for example the bark is a part of a tree, the tree is part of the forest. Part-whole in terms of lexical knowledge is expressed via meronyms (part) and holonyms (whole). Utility relationships describe usage scenarios, such as a fork is used for eating, or capability (wheels can rotate). Spatial information offers knowledge about usual locations of objects in the physical world, for example boats are situated near water, or even geographic information, such as Italy is located in Europe, which sits at the intersection of factual knowledge and named entities. Comparative knowledge provides rules of comparison between objects, for example leopards are larger than cats. Numerical knowledge addresses common enumerations in real life, providing facts such as humans have two eyes. Intents, desires and plans constitute another sense of commonsense knowledge, including facts like hungry people want to eat and a hungry person cooks to eat. Behavioral knowledge results from logical reasoning over commonsense facts, forming rules like a child cannot drink 10 liters of water in one day. Statements like a song is created by a musician or bread is made from wheat belong to creator knowledge.
Event/temporal knowledge contains chronological information and the order of events, blending factual and commonsense knowledge. Events can refer to a large number of chronologically distinct time periods, from widely known events such as world wars, significant political events, sports and social/scientific movements, to more specialized events known to smaller audiences. Temporal sequences can contain chronologically ordered events. For example, COVID-19 started in 2019. Vaccines for COVID-19 were developed during 2020 is a factual sequence of events. A commonsense sequence of events could contain information such as spring comes after winter. Sequences can also refer to causal relationships with the cause preceding the effect, such as the boy dropped a glass of water and then the glass broke, which can also be transformed into hypothetical if/then statements, for example if a boy drops the glass of water, the glass will break, or even counterfactual statements expressing what would have happened if an alternative scenario had occurred, like if the boy had not dropped the glass of water, the glass would not have been broken.
Visual knowledge contains images and possibly additional annotations to connect visual perception with commonsense. Attributes of objects, such as shape, color and texture, can be connected with their visual counterparts, visualizing commonsense situations such as tomatoes are red and round. Visual knowledge is ideal for learning instances of the world about object relationships and attributes, paving the way for the more complex reasoning required in several multimodal tasks. Spatial relationships are naturally combined with images; for example apples placed inside a bowl, bowl placed on a table. More types of relationships can further be visualized, including actions between visual entities such as a girl is holding a tennis racket, object details like black and white striped hat, part-whole has-a relationships such as woman has long blonde hair, or scene text like a truck with Coca-Cola logo. Those rather obvious statements can be extended to commonsense assumptions, such as the temperature is low when an image of an icy landscape is provided. More complex visual instances can provide information about intents (a customer enters a restaurant to eat, a person holding a suitcase and a passport plans to travel), causes/effects (a biker cycling out in the rain will get wet), factual instantiations (an ancient Greek temple of the 5th century BC, girls with Japanese kimono dresses), similarity reasoning (the dog's toy looks similar to a plate), similarity including named entities (a man looking similar to Brad Pitt), creator knowledge (the painting was created by a person holding a paintbrush) and capability (a cat can jump on the tree branch).

Types of knowledge sources
We divide external knowledge into three categories: implicit knowledge, present in non-symbolic form; explicit knowledge, typically stored in structured knowledge bases; and web-crawled knowledge, acquired from various online sources, usually in unstructured form.
Moreover, we can recognize the category of internal knowledge or self-knowledge, which does not rely on external sources, but rather obtains extra knowledge from the existing data.
Implicit knowledge refers to information stored in a non-symbolic form, such as neural network weights. The indisputable popularity of neural architectures in recent deep learning literature has led to numerous relevant contributions, even if their primary goal deviates from knowledge representation. Unsupervised or self-supervised pre-training of transformer models can provide implicit knowledge in several linguistic (Safavi and Koutra, 2021) or multimodal tasks. Therefore, large-scale linguistic and visual data incorporated in the pre-training stage can effectively form unstructured knowledge bases.
Insightful studies have attempted to discover the optimal pre-training regime and whether it serves as a necessary prior for overall performance. The right dataset and design choices are crucial for achieving successful representations, which can become a prerequisite for the overall success of the final task after the proper fine-tuning stage. First, the amount and type of pre-training data need to be examined. Specifically, the relevance of the selected pre-training dataset to the downstream task has been proven more influential than dataset size, an observation that remains valid even when generated datasets are used over out-of-domain natural ones. Only scaling up in-domain data seems to positively impact the final model, encouraging the scalability of implicit knowledge bases. In some cases, especially when the selected pre-training datasets are not diverse enough, pre-training transferability towards downstream tasks is low, so that in fact the pre-training knowledge is deemed insufficient. In the same fashion, even though the pre-training scenario which demonstrates the lowest losses can be regarded as the best prior, it can prove suboptimal without the proper fine-tuning. Data relevance seems to be more significant than model size, even though deeper single-stream models mitigate the semantic gap between images and text in later layers. However, for two-stream models earlier layers present a narrower semantic gap, disagreeing with the single-stream observation (Singh et al., 2020). Other investigations question whether all participating modalities contribute equally to the learned representations, showing that during inference the contribution of language is more prominent than that of vision in both single-stream and two-stream encoder architectures. Nevertheless, rich visual knowledge is encoded in pre-training, effectively capturing visual relationships.
Such observations can offer valuable insights in the complex pre-training procedure and provide enhancements on the knowledge acquired, towards more high-quality, flexible and robust implicit knowledge bases.
Pre-training holds the advantage that it can exploit unlabelled data to achieve a generic understanding of language (Devlin et al., 2019; Brown et al., 2020), which can serve as an initial point for linguistic or VL tasks. In most multimodal cases, however, a level of supervision is required, such as the need for paired images and captions. The current abundance of paired image-text samples resolves the labelling issue to some extent, at least for general-purpose domains. The raw nature of the linguistic, visual and paired image-text data typically used for pre-training alleviates the need for a strict representation, which would limit the flexibility of incorporated knowledge in many aspects, including storage, expressivity and accessibility (Safavi and Koutra, 2021). Additionally, the automatic incorporation of extra information by repeating pre-training or performing fine-tuning can help extend and refine the required implicit knowledge, contrary to handcrafted knowledge bases, which are hard to extend at scale. Even though such pre-training procedures are computationally very expensive, pre-trained transformer models are fortunately offered to the research community in ready-to-use form. Transfer learning can then effectively leverage existing implicit knowledge sources, achieving impressive results in various tasks, which accounts for many advancements in VL tasks and transformer-based implementations in general.
Nevertheless, implicit knowledge is not always sufficient to answer questions requiring general, factual and commonsense knowledge, especially when rare information is requested. Additionally, its black box nature raises concerns about how and what a pre-trained model has learned; for example, biased or erroneous data received during pre-training will be reflected in all later stages, resulting in decreased performance of the downstream VL model. Tracing back the source of such a problem is not possible due to the lack of interpretability tied to implicit knowledge bases.
Explicit knowledge is based on clear, structured facts in the form of a knowledge graph, and it is able to explicitly fill the gaps that cannot be covered via transfer learning. Even though most contemporary multimodal approaches, including transformer-based ones, have acquired a certain understanding of language, visual concepts and their in-between relationships, they cannot effectively handle concepts and relationships they have never seen during training (Ilievski et al., 2021) (excluding implementations that present zero-shot capabilities (Radford et al., 2021; Wang et al., 2021c)). Consequently, even the best VL models will fail when the data distribution is significantly different from the one they have been trained on. The same discrepancy may apply even when an implicit knowledge source is used, if the implicit distribution remains rather distant. For example, a model trained on pairs of generic images and corresponding captions will inevitably present much lower metrics when asked to infer from medical images accompanied by relevant captions with scientific vocabulary. The same limitation is prevalent when there is a shortage of training data (Ilievski et al., 2021). Although an intuitive scenario would suggest repeating the pre-training procedure, so that this extra information is reflected via updated neural weights, the pre-training cost is in reality computationally unaffordable (Sharir et al., 2020) for the majority of research institutions. Even in that case, repeated occurrences of out-of-distribution data would demand pre-training from scratch, or at least fine-tuning, each time, preventing the scalability of related tasks.
Another issue strongly interconnected with massive pre-training is the lack of explainability (Kafle et al., 2019), as it is difficult to track what and how a pre-trained model has learned from data. This black-box nature poses questions regarding how rare concepts or rare combinations present in the data are handled, and whether they can be represented with equal success as the more common ones. At the same time, possible data biases, errors and inconsistencies will be reflected in learned representations, with those issues being hard to capture and resolve beforehand.
On the contrary, in the case of explicit knowledge bases, the degree of contribution of the knowledge base can be measured and evaluated, offering valuable transparent insights regarding the role of knowledge. Such out-of-distribution information is well represented in structured knowledge graphs. Large-scale knowledge can complement pre-trained models by extending their understanding to previously unseen concepts, either by substituting the need for extra training, or by enriching existing datasets to achieve more informative, fair and high-quality representations if (re-)training is necessary. Even in that case, pre-training demands can be reduced, achieving representation capabilities similar to larger models pre-trained without additional explicit knowledge. The quality of such representations is controllable to some extent, a benefit which can be attributed to the explicit and transparent nature of KGs: issues regarding biases, errors, concept drifts and inconsistencies can be captured and resolved easily, exploiting automatic techniques or manual interventions.
In any case, KGs should contain relevant information to the downstream task in order to be beneficial (Ilievski et al., 2021).
There are some downsides regarding the usage of explicit knowledge in the form of KGs. First of all, many KGs may require manual labor for data collection and curation. The same disadvantage also applies to the construction and maintenance of the graph itself. In certain cases, such as in the medical domain, experts are necessary to design and construct dedicated KGs. Moreover, there are difficulties regarding alignment and cooperation between different KGs (Ilievski et al., 2021), sometimes limiting in practice the improvements they can offer.
Combining implicit and explicit information can offer advanced capabilities to downstream tasks, as implicit sources can fuse large-scale general knowledge to a model, while explicit sources can fix errors, enrich existing knowledge and increase a model's transparency.
Web-crawled knowledge refers to unstructured knowledge obtained from the web, which is able to combine benefits present in implicit and explicit knowledge bases. There is no need for labelled data, but also no need for expensive pre-training, which is one major limitation of implicit knowledge. Online sources can be accessed easily, while the amount and the content of retrieved knowledge are easily controlled and customized to the task's needs. A questionable aspect of web-scraped knowledge is its quality, as it is hard to validate any available web source. Low-quality data may deteriorate the final performance of the model, therefore time and effort have to be invested in techniques that ensure high-quality data automatically. Web knowledge can offer some amount of transparency, as the sentence leading to the final prediction can be tracked, even if reasoning is not as fully explicit as in the case of structured graphs.
Internal or self-knowledge is a type of knowledge that does not rely on any external source. On the contrary, it is obtained from the existing textual and visual data themselves. For example, producing a scene graph can enable learning more fine-grained representations compared to utilizing VL data in their original format. Self-knowledge has demonstrated improvements in downstream model performance, especially when detailed disambiguation is necessary, as it enables better associations between existing data. However, self-knowledge does not extend the knowledge a model has already acquired from the data it has been trained on. Furthermore, it is prone to errors associated with the knowledge acquisition process, such as scene graph generation errors.
An overview of the available types of knowledge sources is provided in Figure 4.

Widely used Knowledge graphs
Widely used knowledge graphs in the literature are analyzed below. WordNet (Fellbaum, 1998) is a large-scale lexical database which provides sets of cognitive synonyms, called synsets, for words (nouns, verbs, adjectives and adverbs) of the English language, representing their conceptual and lexical relationships. In total there are 117K WordNet synsets that form a tree structure through their relationships. Verb and noun synsets are interlinked with transitive hierarchical (is-a) relationships, bringing semantically related synsets close together. Therefore, distance on the WordNet graph provides a measure of concept similarity. There is also a distinction between types and instances, with types (common nouns) expressing more specific meanings of a concept (cat is a type of animal), while instances are specific persons, countries and geographic entities (Rome is an instance of a city). The root node of the WordNet tree is the {entity} synset, the most generic concept and parent of all other synsets. Traversing from root to leaves leads to more and more specific concepts, with instances always located at the leaf level. Synsets readily offer word sense disambiguation, as words with many distinct meanings are mapped to different synsets, so that each synset represents a single meaning. Part-whole relationships are also present, and parts of a node can be inherited from parent nodes, but not vice versa. Adjectives are linked via semantic similarity and dissimilarity links, with antonyms and synonyms directly related.
DBPedia (Auer et al., 2007) is a multi-lingual knowledge base that extracts factual structured information from Wikipedia. It covers a variety of domains and evolves together with Wikipedia, thus handling concept drifts. DBPedia also provides SPARQL endpoints to enable queries.
Wikidata (Vrandecic and Krötzsch, 2014) is a free collaborative multilingual knowledge base of more than 6.7k relationships and more than 97 million data items, available to everyone for editing. It provides links of entities to their sources and other databases, endorsing the verifiability of contents.
WebChild (Tandon et al., 2014) is a large commonsense knowledge base automatically collected from the web. It presents high accuracy of statements, with fine-grained entries involving part-whole, comparative, property, activity and spatial relationships. All entries are disambiguated by mapping onto WordNet synsets. In total, WebChild 2.0 (Tandon et al., 2017) contains more than 2 million concepts, interconnected by 18 million relationships.
HasPartKB (Bhakthavatsalam et al., 2020) contains part-whole statements extracted from a large generic corpus, covering numerous common terms. Due to the huge number of possible part-whole instances in the real world, salient parts are preferred, referring to instances that are most probably useful to store. Entities are linked with WordNet and Wikipedia.
Figure 4: Overview of knowledge sources

YAGO4 (Tanon et al., 2020) is a general-purpose knowledge base storing knowledge about people, places, movies, organizations and others. It consists of 2 billion triples and 64 million entities automatically extracted from Wikipedia. WordNet is leveraged for entity and relationship disambiguation. YAGO4 effectively forms a cleaner and more human-readable version of Wikidata, satisfying logical constraints and therefore allowing automated reasoning on the data.
ATOMIC (Hwang et al., 2021) is a commonsense knowledge graph with 1.33M tuples regarding events and entities, incorporating both social and physical senses of everyday experiences. In total, it contains 23 relationship types, with 9 types referring to social interactions, 7 types concerning physical entities and 7 types representing event relationships.
Visual Genome (Krishna et al., 2016) can also be viewed as a knowledge base, thanks to the scene graph annotations that explicitly showcase the entities, relationships and attributes present in a scene, accompanied by the actual visual information of each image. Mappings to WordNet synsets offer disambiguation as well as hierarchical relationships between visual instances. In general, scene graphs can act as spatial and visual knowledge bases.
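To make the "scene graph as knowledge base" view concrete, a scene graph can be reduced to a set of (subject, relationship, object) triples plus per-entity attributes. The scene below is invented for illustration, in the spirit of Visual Genome's annotations rather than its actual API.

```python
# A scene graph serialized as (subject, relationship, object) triples.
# Entities, relationships and attributes here are made-up examples.
scene_graph = [
    ("man", "riding", "horse"),
    ("man", "wearing", "hat"),
    ("horse", "standing on", "grass"),
]
attributes = {"horse": ["brown"], "hat": ["white"]}

def objects_related_to(entity, graph):
    """Entities connected to `entity` by any relationship, in either direction."""
    out = [(r, o) for s, r, o in graph if s == entity]
    inc = [(r, s) for s, r, o in graph if o == entity]
    return out + inc

print(objects_related_to("horse", scene_graph))
# [('standing on', 'grass'), ('riding', 'man')]
```

Queries over such triples give spatial and relational answers ("what is the man riding?") directly from annotation, which is what lets scene graphs serve as visual knowledge bases.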

Multimodal Tasks with knowledge
Recent efforts of integrating internal and external knowledge sources in VL tasks are analyzed in this section, together with the datasets and evaluation methods they follow. As mentioned in section 2, we can divide existing KVL approaches into single-task and multi-task models. Starting from single-task models, we first present works in the most developed field of discriminative KVL tasks, more specifically understanding tasks such as visual question answering (VQA), visual reasoning (VR) and visual commonsense reasoning (VCR), followed by generative tasks, such as image captioning (IC), visual storytelling (VIST), story visualization (SV) and conditional (text-to-image) generation (cIG). Finally, multi-task models, targeting more than one downstream task at once, are analyzed.
Figure 5 presents an overview and taxonomy of KVL tasks. Some tasks, such as visual referring expressions (VRE), visual question answering (VQA), visual commonsense reasoning (VCR) and visual dialog (VD), can be either discriminative or generative, even though their discriminative variants are more widespread, being comparatively easier. Single-task models focus either on generative or understanding tasks, while there is no single-task model focusing exclusively on a retrieval task such as cross-modal retrieval (TIR/ITR) or visual referring expressions (VRE). Those two tasks are tackled, along with others, exclusively by multi-task models. In the same fashion, visual entailment (VE) is only addressed by multi-task models. Finally, a couple of tasks, such as visual-language navigation (VLN) and multimodal machine translation (MMT), still lack a knowledge-enhanced counterpart.
The upcoming sections are primarily organized by task, presenting single-task approaches first and multi-task approaches at the end. Datasets, methods and evaluation metrics are provided for each independent task. A more detailed division of methods is achieved based on the knowledge type (external/internal) and the language representation scheme (sequential models/transformers) per approach.

Datasets
VQA with Explanations (VQA-E) pursues the tractability of the reasoning process leading to an answer. In total, it contains around 108K images and more than 269K explanations assigned to an equal number of QA pairs. Based on VQAv2 (Goyal et al., 2016), it automatically constructs explanations with the help of COCO captions (Lin et al., 2014), as they are connected with VQAv2 images. Caption embeddings and question/answer embeddings are coupled, forming pairs of highest cosine similarity, thus assigning captions to images. The resulting explanations are highly diverse, with more than 171K unique instances, although they cannot cover images with subjective questions, such as emotional ('Do you think this pony is cute?'), commonsense ('Can you cross the street?') or behavioral ('Could you eat all these bananas by yourself?') ones. Human evaluators assess the quality of explanations, measuring whether they are fluent, correct, relevant and complementary to the answer.
DAQUAR (Malinowski and Fritz, 2014) is a dataset of real-world indoor scenes containing fine-grained object categories. The questions and answers related to the images are very rich regarding the objects they mention: 573 unique nouns appear within the whole corpus of questions and answers. Questions requiring commonsense knowledge, such as 'Which object on the table is used for cutting?', are included in DAQUAR, while even spatial questions such as 'What is above the desk in front of scissors?' can benefit from additional knowledge.
COCO-QA (Ren et al., 2015) addresses shortcomings of the DAQUAR (Malinowski and Fritz, 2014) dataset, such as its small size in terms of train/test samples and its limited number of object classes. COCO-QA contains 123,287 images, together with 78,736 train and 38,948 test questions obtained from COCO image descriptions (Lin et al., 2014). Questions are divided into four types with varying numbers of questions in each: Object, Number, Color and Location questions.
KB-VQA (Wu et al., 2016b) has been constructed in order to evaluate VQA models on questions that need visual information as well as external knowledge to explicitly infer the right answer. It includes images from COCO (Lin et al., 2014) containing approximately 150 object classes and 100 scene classes, question-answer pairs following pre-defined templates, and question labels. The questions involved in KB-VQA are divided into three categories: visual questions can be answered by extracting information from the image (such as 'Is there a dog in this image?'); common-sense questions rely on external knowledge contained in commonsense knowledge bases ('How many road vehicles are in this image?'); finally, KB-knowledge questions require information from Wikipedia or similar sources ('When was the home appliance in this image invented?').
Factual VQA (FVQA) is a dataset addressing factual VQA, based on images sampled from COCO (Lin et al., 2014) and ImageNet (Deng et al., 2009), which cover three types of visual content (object, scene and action classes), together with structured visual-related knowledge extracted from DBpedia (Auer et al., 2007), ConceptNet (Speer et al., 2017) and WebChild (Tandon et al., 2014). All this information is stored in a graph of RDF triplets. Annotators construct questions and answers which require both selected visual content and associated facts. In total, FVQA contains 2,190 images of 326 object classes and 221 scene classes, and 5,826 questions of 32 categories, which correspond to 4,216 unique facts.
Knowledge-aware VQA (KVQA) (Shah et al., 2019) targets world-knowledge-aware VQA by filling the gap of named-entity knowledge. It relies on knowledge present in the Wikidata (Vrandecic and Krötzsch, 2014) knowledge graph, resulting in 183K question-answer pairs which involve more than 18K named entities and 24K images.
Outside-knowledge VQA (OK-VQA) (Marino et al., 2019) contains more than 14K diverse and difficult questions of 10 mutually exclusive categories which cannot be answered without involving external knowledge. More than 14K images were sampled and filtered from COCO (Lin et al., 2014). In contrast to previous datasets, OK-VQA does not consult a fixed knowledge graph to guide answer prediction, but dynamically recognizes what knowledge is needed, either structured or unstructured.

Text-KVQA is a very large dataset addressing scene-text recognition for knowledge-enabled VQA. It contains images from book covers (Iwana et al., 2017) and movie posters (mov), as well as Google-scraped images of 1000 business brands. All images are verified to contain scene text relevant to their content. Knowledge bases corresponding to each of those three scene types were constructed based on Wikidata (Vrandecic and Krötzsch, 2014) for business scenes, IMDb (imd) for movie posters and (Iwana et al., 2017) for book covers. The train/validation/test splits enable zero-shot capabilities, as there is no entity overlap between them. The supporting facts are not tied to their corresponding entities, but are instead dynamically mined from the knowledge bases.

Visual7W+KB is an extension of the Visual7W test split which further contains knowledge-based visual questions guided by ConceptNet (Speer et al., 2017). However, the dataset is not tied to a specific knowledge graph, even though ConceptNet is indeed preferred in practice. In total it consists of 16,850 open-domain question-answer pairs and 8,425 images from Visual Genome (Krishna et al., 2016). The questions belong to 7 categories (what, where, when, who, why, which and how), and the answers are in multiple-choice format.
One of the major challenges in knowledge-enhanced VQA is that questions should encourage exploitation of all participating modalities, so data-related weaknesses arise in existing benchmarks. Meanwhile, information leakage between train and test set answers often promotes guessing rather than reasoning. S3VQA (Jain et al., 2021) is a dataset aiming to tackle those issues, containing questions that can be answered only with the help of a knowledge graph together with visual and textual information from the image.

Zero-shot Fact VQA (ZS-F-VQA) (Chen et al., 2021c) extends FVQA for zero-shot learning settings. It considers the image-question-answer triples whose answers belong among the 500 most frequent. The filtered dataset is split into train (seen) and test (unseen) triples which contain non-overlapping answers. In total, 5 splits of the original FVQA dataset are performed, yielding on average 2,732 train and 2,760 test triples.

Art Question Answering (AQUA) (Garcia et al., 2020) is a visual reasoning dataset for the art domain. There are many challenges tied to analyzing and reasoning over artworks. First, there are different levels of abstraction regarding common objects and entities, as many paintings deviate from realism. Therefore, recognizing objects and reasoning about them is much harder compared to the scenes found in most datasets. Moreover, domain knowledge regarding artists, art movements, historical periods and other cultural influences can only be recognized with the help of a knowledge source. This information also affects the interpretation of a painting. QA pairs are generated automatically based on paintings and descriptions of the SemArt (Garcia and Vogiatzis, 2018) dataset, which forms the knowledge source. In total, after cleansing, AQUA contains more than 69K QA training pairs, of which around 29K are visual and 40K are knowledge oriented.

Methods
Keyword-based explicit KG querying
First attempts target the construction of a scalable multimodal knowledge base which aims to answer visual queries that require real-world knowledge. Image classes, attributes and actions are extracted from the images, forming logical rules. The knowledge base built upon those rules contains nodes of visual and textual entities, as well as edges of diverse types between the entities (Zhu et al., 2015). However, this constructed knowledge base remains limited to the visual information present specifically in the SUN (Xiao et al., 2010) dataset. Most subsequent methods utilize already constructed large knowledge bases, targeting a broader range of concepts, commonsense knowledge and more complex questions to be answered.
Towards this direction, early approaches focus on handling open-ended questions regarding the contents of a scene with the assistance of provided external knowledge. Attributes extracted from images using a fine-tuned VGG-16 (Simonyan and Zisserman, 2015) model act as SPARQL queries to knowledge bases such as DBpedia (Auer et al., 2007), and contribute to caption generation. Retrieved knowledge embedded via Doc2Vec (Le and Mikolov, 2014), together with attributes and LSTM-based caption representations, is fed into another LSTM model which generates the final answer (Wu et al., 2016b). An improvement of this version followed in (Wu et al., 2016a), extending the framework to two more datasets, namely DAQUAR-ALL (Malinowski and Fritz, 2014) and its reduced version DAQUAR-REDUCED.
Even in early works on knowledge-enhanced VQA, explainability is addressed as an important topic, defining how a model actually learns from visual content and external knowledge towards concluding to an answer. Therefore, Wang et al. developed a knowledge-enhanced VQA framework that provides the reasoning path from which the answer is inferred. Objects are detected using Fast R-CNN (Girshick, 2015) object detectors trained on ImageNet (Deng et al., 2009) and MS-COCO (Lin et al., 2014); scene classes are extracted from a VGG-16 (Simonyan and Zisserman, 2015) pre-trained on MIT-Places (Zhou et al., 2014); and scene attributes are captured via a VGG-16 pre-trained on ImageNet and fine-tuned on MS-COCO. All those visual concepts form RDF triples and are linked with corresponding DBpedia (Auer et al., 2007) entities. Questions are parsed so that key phrases are extracted and mapped to the knowledge base entities. The same work introduced the KB-VQA dataset. Subsequent works moved away from SPARQL queries on the knowledge graph, instead fully utilizing embedding representations for fact selection and reasoning to provide an answer.

Sequential language models for question encoding
Instead of SPARQL querying from plain keyword extraction, vector representations of the involved modalities set the basis for improved performance and state-of-the-art results in knowledge-enhanced VQA. Initially, fact ranking based on embedding similarity metrics paved the path for successful approaches, upon which graph neural network reasoning further advanced the contribution of external knowledge and overall performance.
Embedding-based fact retrieval from KG
Traditional VQA models utilizing RNNs for language encoding focus on learning a question-answer mapping. Due to the limited and opaque reasoning capabilities of this approach over diverse answers, a more scalable solution was proposed that learns the mapping between questions and KB queries using LSTMs. This approach is explainable, as the fact connecting a question and an answer reveals the reasoning procedure. Projecting question-image pairs and facts onto a common embedding space poses advantages over previous approaches, such as extendability to different knowledge bases and error elimination by avoiding explicit query generation. Images and questions are embedded using CNNs (for objects, scenes and actions) and an LSTM respectively, and they are projected onto a common space using a multi-layer perceptron. Another LSTM is used to retrieve facts from the knowledge base, which are then encoded as GloVe (Pennington et al., 2014) vectors. The dot similarity between the question-image representation and the fact embeddings provides a fact ranking, from which the final answer is inferred. Narasimhan et al., building upon this approach, argue that considering multiple relevant facts instead of a single top-ranked fact at a time leads to better generalization. During the fact retrieval stage, a subset of highly relevant facts is obtained with the help of LSTM-based question embeddings. In the answer prediction stage, each node is represented by concatenating the selected entity representation from the previous stage, visual features from the image and the question embedding. The subgraph formed from all the relevant facts is jointly assessed by a graph convolutional network (GCN) (Kipf and Welling, 2016), followed by a multi-layer perceptron that decides whether each entity constitutes the final answer or not. Based on the KVQA dataset, a memory network (memNet) framework sets the baseline for VQA with knowledge of named entities.
Specifically, entities extracted from the question and the image are used to obtain facts from the Wikidata (Vrandecic and Krötzsch, 2014) knowledge graph. Retrieved facts, together with corresponding entity coordinates from the image, are used to produce memory embeddings via a BiLSTM network, and a similar procedure is followed for the question embeddings. Both representations contribute to the final answer, which is defined by a multi-layer perceptron (Shah et al., 2019). Text present in an image can provide further information towards inferring the correct answer. Extracted text and image areas are fused together with the given question to retrieve relevant facts during the fusion stage, and a multi-relational graph is constructed based on all those components. The text recognition part relies on word proposals assisted by the knowledge graphs accompanying the text-KVQA dataset. Scene proposals were created with the help of the Places dataset for scene recognition (Zhou et al., 2018) and a fine-tuned VGG-16 (Simonyan and Zisserman, 2015). A gated graph neural network (GGNN) performs one-hop reasoning on this graph to derive the final answer.
Multimodal graphs
Unexpected noise in the answer inference process can be attributed to the absence of detailed selection of information during modality fusion. Considering multiple views of the same image offers a new perspective that is closer to human cognition. Multiple knowledge graphs provide visual, semantic and factual information derived from the corresponding images, text and facts respectively, while the visual and the semantic graph can be considered instances of the factual graph. Intra-modal graph convolutions focus on the most relevant parts of each modality. Consequently, cross-modal knowledge reasoning on the fact graph iteratively aggregates information from the visual and semantic graphs using a recurrent module, and after multi-step reasoning the multimodal knowledge is fused in each entity.
The final answer is returned by applying a GCN (Kipf and Welling, 2016) over those entities. This approach offers interpretability by revealing the entity and the modality graph which contributed to the answer. A similar approach utilizes visual, semantic and factual graphs for image representation to eliminate noise during multimodal fusion. The Multi-Modal Heterogeneous Graph Construction stage is responsible for constructing those modality graphs, followed by Cross-Modal Heterogeneous Graph Reasoning, which selects intra-modal knowledge and then performs cross-modal reasoning. Information relevant to the question is extracted from the three graphs via a modality-aware heterogeneous graph convolutional network. Cross-modal convolutions define complementary relevant information transmitted from the visual and semantic graphs to the fact graph. The final answer is returned after reasoning over the aggregated fact information. A major challenge in knowledge-enhanced multimodal tasks is their supervised nature, as a possible absence of ground truth facts may hinder the inference of a proper answer in several approaches. A local subgraph is constructed based on concepts present in the image and the question, aiming to bridge the gap between question-image context and external knowledge. Those subgraph concepts act as anchor points to a knowledge graph, such as ConceptNet (Speer et al., 2017) and Wikidata (Vrandecic and Krötzsch, 2014), enabling the expansion to their immediate neighbors. Moreover, a global subgraph is constructed in a similar fashion for all the candidate answers. In each subgraph the information of neighboring nodes is aggregated to produce embeddings of the anchor concepts, and their similarity to the query embeddings drives the final answer (Li et al., 2020b).
Multiple feature spaces
Addressing the zero-shot setting of knowledge-enhanced VQA, a knowledge graph can help capture semantics outside the training data.
Multiple feature spaces are used for independent alignment between the image/question input and KG entities. The semantic space focuses on the linguistic information of the input (image, question) pair, representing a feature space of relationships; the object space acts as a support entity feature space, capturing visual and textual salient features; finally, the knowledge space is dedicated to answer representation (Chen et al., 2021c).
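Common to the embedding-based methods above is that fact retrieval reduces to ranking fact embeddings by their similarity to a joint question-image embedding. A minimal numpy sketch of this pattern, with random vectors standing in for the learned CNN/LSTM features:

```python
import numpy as np

# Sketch of the recurring fact-retrieval pattern: project the question-image
# pair and every candidate fact into a shared embedding space, then rank the
# facts by dot-product similarity. The embeddings below are random stand-ins
# for the learned features used in the surveyed models.
rng = np.random.default_rng(0)
dim = 64

query = rng.normal(size=dim)              # joint question-image embedding
facts = rng.normal(size=(1000, dim))      # one row per knowledge-base fact

# L2-normalize so the dot product becomes cosine similarity.
query = query / np.linalg.norm(query)
facts = facts / np.linalg.norm(facts, axis=1, keepdims=True)

scores = facts @ query                    # similarity of every fact to the query
top_k = np.argsort(-scores)[:10]          # indices of the 10 best-matching facts
print(top_k, scores[top_k])
```

In the actual systems, the top-ranked fact (or subgraph of facts) then feeds the answer-prediction stage, e.g. a GCN followed by a classifier.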

Transformer-based models
Transformer-based approaches form end-to-end architectures that utilize single-modality or joint representations rather than creating queries to knowledge bases and then injecting the retrieved entities. Thus, we can classify transformer-based approaches into two categories: the first includes architectures that use transformers for text encoding, while the second utilizes multimodal transformers to jointly encode vision and language.
Transformer architectures for language encoding
Similarly to the multimodal-graph approaches above, dedicated graphs for different modalities are also employed, attempting to represent relationships between the visual objects and semantic entities present in a scene graph and a knowledge graph respectively. The scene graph is constructed from visual and question embeddings which form the graph nodes and relationships. Meanwhile, joint image and question embeddings select the most relevant knowledge graph node embeddings to construct the concept graph. Both image-question and knowledge representations are obtained via pre-trained language models for sentence similarity, such as Sentence-BERT (SBERT) (Reimers and Gurevych, 2019) and the Universal Sentence Encoder (USE) (Cer et al., 2018). The most relevant nodes of both the scene and concept graphs are selected via a Graph Attention Network (GAT) (Veličković et al., 2018), which decides the edge weights with respect to the question. A joint embedding incorporates the question embeddings together with the scene graph and knowledge graph outputs.
Even though explicit and structured knowledge bases constitute the majority of the approaches analyzed so far, implicit and unstructured knowledge can also boost the VQA task. GPT-3 (Brown et al., 2020) can retrieve knowledge based on text prompts and effectively reason over it in a few-shot manner: no fine-tuning is required and only a few examples are provided during inference. Captions are first extracted from images using VinVL to form GPT-3 inputs. Regarding sample selection for the few-shot inference stage, both improving the quality and increasing the number of samples have been explored. The top-n prompt examples most similar to the inference-time question are defined by CLIP (Radford et al., 2021), thus maximizing sample relevance. On the other hand, multiple queries corresponding to one inference-time example can be used to retrieve answers from GPT-3 using n example prompts each time, and their ensembling results in the final answer. Contrary to most works on knowledge-enhanced VQA, inferring the answer is a generative task and not a discriminative one among pre-selected candidate answers or graph nodes.

Unimodal pre-trained transformers yield better generalization capabilities than multimodal approaches of comparable size when external knowledge is necessary. Language models employed for the knowledge-enhanced VQA task can even compensate for the limitations of image captioning models, which often fail to fully capture visual semantics. To this end, a pre-trained image captioning system, in this case the multi-task OSCAR (Li et al., 2020c) transformer, is used to extract linguistic information from an image, while a language model such as BERT, acting as an implicit knowledge source, receives the caption and the question to infer an answer. Moreover, text-only and multimodal approaches have complementary capabilities, therefore their combination can yield even more powerful models.
(Salaberria et al., 2021) VIKING is a framework accompanying the AQUA dataset (Garcia et al., 2020) for visual QA in the artistic domain. As questions in AQUA may be either visual or knowledge oriented, a modality selector first decides the right category by receiving the encoded image and question. Visual-oriented questions do not require external knowledge in order to be answered. For knowledge-oriented questions, a two-stage fact retrieval strategy is followed, pairing the given question with the most relevant painting description, which corresponds to the external knowledge fact needed. The first stage utilizes TF-IDF to rank descriptions according to the question, and in the second stage re-ranking is performed using BERT. Finally, a fine-tuned XLNet (Yang et al., 2019b) model provides the final answer.
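As a rough sketch of VIKING's first retrieval stage, TF-IDF ranking of painting descriptions against a question can be written in a few lines of plain Python. The BERT re-ranking stage is omitted, and the descriptions are invented examples, not actual SemArt data.

```python
import math
from collections import Counter

# Toy first-stage retrieval: rank candidate descriptions against a question
# using TF-IDF weighted cosine similarity. Texts are made-up examples.
docs = [
    "portrait of a nobleman painted during the dutch golden age",
    "impressionist landscape with a river at sunset",
    "baroque still life with fruit and a silver goblet",
]

def tfidf(text, idf):
    """Sparse TF-IDF vector of a text, as a word -> weight dict."""
    tf = Counter(text.split())
    return {w: c * idf.get(w, 0.0) for w, c in tf.items()}

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

vocab = {w for d in docs for w in d.split()}
idf = {w: math.log(len(docs) / sum(w in d.split() for d in docs)) for w in vocab}

question = "which painter made this still life with fruit"
ranked = sorted(docs, key=lambda d: cosine(tfidf(question, idf), tfidf(d, idf)),
                reverse=True)
print(ranked[0])  # the still-life description ranks first
```

In VIKING the top TF-IDF candidates would then be re-scored with BERT before the answer model sees them.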
Joint multimodal encoding with attention-based fusion
A slightly different technique does not caption the visual modality, but fuses it with the BERT-embedded question. More specifically, a knowledge graph for artistic VQA is constructed based on the YAGO (Tanon et al., 2020) knowledge graph and the AQUA (Garcia et al., 2020) dataset. A Hierarchical-Knowledge Embedding module is responsible for retrieving relevant relationships r from the knowledge graph that can form an (h, r, t) triple, with question-related entities serving as the head h of the triple and answer-related entities as the tail t. A Network-Based Representation Learning module extracts visual and textual features and fuses them together in order to obtain a VL representation. The fusion part first applies local attention per modality, and then global attention on both text and image, where locally 'attended' text features form the query, and visual features the key and value. Query, key and value are inserted into a multi-head attention unit which further promotes the joint representation to subsequent layers, until a global representation is obtained. Then, a Knowledge-Based Representation Learning module injects hierarchical-knowledge embeddings into the network-based representation. This representation is inserted into a relational module, which performs meta-training: a representation learned on the training data is transferred to a support set of disjoint labels. Finally, the relational module derives the answer.

Joint multimodal encoding with VL transformers
ConceptBERT is one of the first attempts in the end-to-end transformer-based direction, where all modalities are jointly exploited for learning. The first step includes obtaining representations for each individual modality. Visual features are extracted using a pre-trained Faster R-CNN network (Ren et al., 2017) and BERT (Devlin et al., 2019) provides the question representation. ConceptNet (Speer et al., 2017) acts as the commonsense knowledge source, and is encoded using the ConceptNet embedding (Malaviya et al., 2019) method, a Graph Convolutional Network (Kipf and Welling, 2016) variant that relies on message passing from node to node in order to obtain the ConceptNet graph representation. Two modules receive the embedded inputs: a vision-language module consisting of two streams in a ViLBERT (Lu et al., 2019) fashion, and a concept-language module based on the bidirectional Transformer architecture, which model the interactions between the relevant modalities. The outputs of both modules are joined to form a concept-vision-language representation, which finally concludes to the answer via a classifier (Gardères et al., 2020).

Knowledge obtained from the web according to the given question and respective answer can act as a large external implicit knowledge source for the OK-VQA dataset, covering knowledge 'gaps' in several domains without manual human effort. The proposed weakly-supervised framework consists of two phases: the first one (Retriever) retrieves relevant knowledge, which guides answer prediction in the second stage (Reader). Two different approaches are followed for representing the question-image pair inputs: either the question and image are encoded using an LXMERT (Tan and Bansal, 2019) transformer, resulting in a multimodal representation, or the image caption and the question are encoded via BERT (Devlin et al., 2019), leading to an exclusively linguistic representation.
The BERT-based linguistic encoding can contribute to both a neural-based retriever and a term-based (Robertson and Zaragoza, 2009) retriever. A similarity score defines the relevant knowledge for the neural-based retriever, which is further concatenated with the question representation, and consequently with the image, using again an LXMERT model. The KRISP framework addresses the scenario where essential external knowledge is absent during training as well as at test time. Both implicit and explicit knowledge sources are utilized: explicit knowledge combines DBpedia (Auer et al., 2007), ConceptNet (Speer et al., 2017), Visual Genome (Krishna et al., 2016) and hasPart KB (Bhakthavatsalam et al., 2020) in a knowledge graph of 36,000 edges and 8,000 nodes after filtering out irrelevant concepts, while pre-training using BERT can offer implicit knowledge. Explicit visual symbols are extracted from images to constrain the knowledge graph entities to image-related concepts, including objects, parts of objects, attributes and places. Likewise, symbols are extracted from the question to contribute to the formation of a graph of all explicit symbols. A Relational Graph Convolutional Network (RGCN) is used for graph representation, allowing dedicated processing for different edge types and directions. After reasoning, a symbolic prediction vector is returned. Regarding the implicit information stream, a multimodal BERT (MMBERT) model incorporates visual and textual embeddings to produce an implicit prediction vector. Finally, the top-ranked prediction from both vectors defines the answer (Marino et al., 2021).

The presence of scene text can offer valuable information for properly predicting the correct answer. The detected text, along with the image, the relevant knowledge from the Google Knowledge Base (GKB) and the question representation, are fed into a multimodal transformer, enabling interaction through attention mechanisms between the different modalities.
The OCR-extracted text acts as a query to GKB to retrieve candidate entities, which are then disambiguated based on the visual context. External knowledge not only boosts the understanding of scene text even in unseen instances, but also tackles biases present in training data (Dey et al., 2021).

While employing an abundance of knowledge sources can cover more visual topics, a lot of noise may be introduced, as more irrelevant information is retrieved. MAVEx utilizes multi-granular queries to retrieve external knowledge with the purpose of validating and correcting predicted answers among suitable candidates with the help of various knowledge sources. Specifically, a fine-tuned ViLBERT (Lu et al., 2019) model creates a pool of candidate answers, and together with the corresponding question, extracted keywords and phrases from the question and the possible answers are utilized to query external knowledge. Wikipedia, ConceptNet and Google Images act as knowledge sources covering different views of knowledge. Finally, retrieved knowledge instances are matched with the queries to acquire the highest-ranked supporting fact, which returns the degree of agreement with respect to candidate answers, guiding the decision towards the most trustworthy knowledge source. Passage retrieval can serve as an answer selection technique for VQA instead of choosing among pre-defined candidate answers. Both sparse and dense retrieval are investigated. For sparse retrieval, given a question and an image, visual clues such as object names and captions are extracted from the image and BM25 is used to return the k most relevant passages. For dense retrieval, questions and images are jointly encoded in dense vectors using LXMERT. In any case, retrieved external knowledge can be integrated dynamically from diverse and generic sources, without using a fixed knowledge base. Passages containing exactly the ground-truth answer are considered positive.
LXMERT (Tan and Bansal, 2019) is used to encode the question and the image jointly, while BERT encodes the passage, and a dot similarity between them defines the k most relevant passages (Qu et al., 2021).

Based on the E-BERT (Poerner et al., 2020) strategy of knowledge injection without expensive re-training, the LXMERT (Tan and Bansal, 2019) language encoder input is modified to incorporate factual knowledge from Wikipedia by aligning Wikipedia2Vec embeddings with BERT wordpiece vectors. No other change is required within the language encoder's architecture, while the visual encoder remains entirely intact. Only fine-tuning is required to achieve improved accuracy due to knowledge injection. Meanwhile, explainability regarding the visual and textual modalities is also enhanced. For this purpose, BM-GAE (Chefer et al., 2021) is employed to extract visual and token explanations that help identify in which parts knowledge injection was helpful (Garcia-Olano et al., 2021).

S3, presented with the S3VQA dataset, targets answering visual questions based on all the participating modalities simultaneously. Entity spans from the question are selected to be matched with objects of scene graphs corresponding to images. This matching can often be guided by external knowledge sources, which enables answering more complex questions that require multi-hop reasoning. BERT identifies the appropriate question spans, while object detectors propose the objects that most likely fill the spans. WordNet (Fellbaum, 1998) synsets are mapped to the objects, and their hierarchical positions are represented via structural embedding methods. Finally, Google search is used to retrieve the top results for the enriched question representation. Alternatively, the answer can be provided via classification of possible candidate answers.
(Jain et al., 2021)

6.1.3 Evaluation
Classification/ranking metrics are widely used in K-VQA, following the paradigm of knowledge-free VQA, with most works relying on the top-1 accuracy metric for comparison. The accuracy metric is further decomposed to explain the contribution of subcomponents in many works: object, counting, color, and location accuracies, as well as accuracy per question type, are reported (Wu et al., 2016b). Other accuracy reportings include performance per selected knowledge source, per visual concept and per answer source (image or knowledge); the individual accuracies of the stages of the reasoning process, which together contribute to answer prediction; and accuracies per question category (Shah et al., 2019; Gardères et al., 2020). Moreover, precision@k and recall@k have appeared in fewer works, as well as ranking metrics such as MRR. Some early works (Wu et al., 2016b) utilize WUPS with the typically used thresholds of 0.0 and 0.9.
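As a concrete illustration of the decomposed accuracy reporting mentioned above, the sketch below computes top-1 accuracy overall and per question type; the predictions, answers and type labels are invented for the example.

```python
from collections import defaultdict

def decomposed_accuracy(predictions, ground_truth, question_types):
    """Top-1 accuracy overall and per question type (e.g. object,
    counting, color, location), as commonly reported in K-VQA papers."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, gt, qtype in zip(predictions, ground_truth, question_types):
        total["overall"] += 1
        total[qtype] += 1
        if pred == gt:
            correct["overall"] += 1
            correct[qtype] += 1
    return {k: correct[k] / total[k] for k in total}

acc = decomposed_accuracy(
    predictions=["dog", "2", "red", "kitchen"],
    ground_truth=["dog", "3", "red", "kitchen"],
    question_types=["object", "counting", "color", "location"],
)
```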
Human evaluation provides further insights compared to plain accuracy-based scores, which fail to fully describe the success or the shortcomings of a model. Therefore, many authors propose human evaluation experiments to grade the model's responses against human perception, counting the number of agreement instances over all results.
Reasoning paths are also evaluated by employing human judgement, one of the most trustworthy indicators. Due to the transparent reasoning process, failure cases can be traced, revealing the exact stage where the prediction deviated from the intended one. Thus, shortcomings can be attributed to architectural choices, encoding techniques, or even incorrect data annotations. Other metrics regarding explainable reasoning include top-k fact retrieval accuracy for different k values, a crucial step for returning the correct answer in several approaches. Fact recall can also assess the fraction of relevant facts retrieved for a given question.
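The top-k fact retrieval accuracy and fact recall mentioned above can be made precise with a short sketch; the fact identifiers are hypothetical.

```python
def fact_recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant supporting facts found in the top-k retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def retrieval_accuracy_at_k(retrieved, gold_fact, k):
    """Top-k fact retrieval accuracy: 1 if the gold fact is in the top k."""
    return 1.0 if gold_fact in retrieved[:k] else 0.0

# Hypothetical ranked list of retrieved fact ids for one question.
retrieved = ["f3", "f1", "f7", "f2", "f9"]
```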
Generally, benchmarking knowledge-enhanced VQA approaches is not trivial. The number of possible combinations between available knowledge-based datasets and external knowledge sources is rather large compared to the number of available implementations in the field. Additionally, the varying choice of metrics in the literature makes comparisons of model performance even harder. Only recently have implementations started to become more consistent, evaluating results with plain accuracy and leveraging OK-VQA as a widely used dataset.
6.2 Knowledge in Visual Reasoning (K-VR)

6.2.1 Datasets
High-order Visual Question Reasoning (HVQR) (Cao et al., 2019) is a knowledge-based dataset endorsing interpretable visual reasoning using commonsense knowledge. Given an image and a question, an answer is inferred, as well as a reasoning path serving as an explanation. Even though this is similar to the rationales used in knowledge-free datasets for VCR (Zellers et al., 2019), the format of HVQR explanations differs: instead of textual rationales, rules for the whole reasoning path are returned, combining visual and knowledge-oriented triples derived from the scene graph and the commonsense knowledge graph respectively. HVQR contains questions that require multi-step reasoning to infer an answer. Moreover, each knowledge triplet appears only once per question, in order to avoid frequency-based biases. An evaluation scheme validates each step of the reasoning process based on the provided commonsense and scene graphs. More than 157K QA pairs comprise the dataset, of which 289,720 pairs are unique, together with approximately 32K images and corresponding scene graphs from Visual Genome (Krishna et al., 2016). Based on the reasoning steps required for the answer, first-order and second-order questions can be recognized, corresponding to 68,448 and 88,753 questions respectively. Another split defines 87K KB-related and 70K KB-not-related questions. Additionally, 193,449 facts from WebChild (Tandon et al., 2014), ConceptNet (Speer et al., 2017), and DBpedia (Auer et al., 2007) formulate the knowledge base. Scene graphs per image are combined with related entities from the knowledge base, constituting image-specific knowledge graphs.
Compositional Language and Elementary Visual Reasoning (CLEVR) (Johnson et al., 2017) is a synthetic dataset of 3D objects which contains annotations regarding their position and attributes. Those attributes describe the size (small, large), color (red, brown, yellow, green, blue, cyan, purple, gray), shape (cube, cylinder, sphere) and material (rubber, metallic) of each object. Positions belong to 4 types, namely left, right, behind, in front. Highly compositional questions form 5 question categories: Exist, Count, Compare Integer (equal, less, greater), Query Attribute (size, color, material, shape) and Compare Attribute (size, color, material, shape). Moreover, CLEVR contains 90 question families following different program templates, as well as text templates, so that natural language questions can be derived: questions are translated into natural language by filling a template with template parameters. CLEVR is not connected to any external knowledge source, although its limited semantics and the nature of the task it targets leave room for knowledge enhancement. CLEVR CoGenT is a benchmark derived from CLEVR (Johnson et al., 2017) that assesses the ability to capture unseen combinations of attributes during testing, thus showcasing a model's generalization capabilities.
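The template-filling procedure that turns CLEVR question families into natural-language questions can be illustrated as follows; the template string and abridged attribute vocabulary are simplified stand-ins for the dataset's actual question families.

```python
import itertools

# Toy version of a CLEVR question family: a text template plus the
# value sets its parameters range over (vocabulary abridged here).
template = "How many {size} {color} {shape}s are there?"
params = {
    "size": ["small", "large"],
    "color": ["red", "blue"],
    "shape": ["cube", "sphere", "cylinder"],
}

# Instantiating the template over all parameter combinations yields
# the natural-language questions of this family.
questions = [
    template.format(size=s, color=c, shape=sh)
    for s, c, sh in itertools.product(params["size"], params["color"], params["shape"])
]
```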
6.2.2 Methods

6.2.2.1 Sequential language models
External knowledge KM-net (knowledge-routed modular network) is introduced in the same work as the HVQR dataset (Cao et al., 2019), addressing multi-step (compositional) reasoning using visual and commonsense knowledge. Each question is decomposed into consecutive subqueries via LSTM encoder-decoder schemes, which are passed to a visual reasoning module and a commonsense reasoning module to extract different types of knowledge accordingly. The subqueries form a query layout, i.e. a tree structure revealing the relationships of subqueries, with leaf nodes corresponding to distinct words of the queries. A bottom-up attention R-CNN provides visual features for the image. The subqueries are processed sequentially starting from the most specific ones, driven by the KM-net reasoning module. First, the knowledge reasoning module receives question entities from the query layout and returns the most probable candidate entities from the knowledge base. Then, the visual reasoning module receives entities from the scene graph together with the candidate entities of the knowledge module, and fuses the candidate entities, image features and query embedding to derive the answer.
Internal knowledge Self-knowledge includes the usage or construction of a scene graph based on the detected objects, relationships and attributes. At the same time, the question can be parsed into a structured program via an LSTM, producing subqueries that form a tree structure. Given those two graph representations, an encoding is derived for the query. Node attention and edge attention are calculated based on the query embedding. Combining a node attention vector and an edge attention matrix, new objects can be inferred thanks to the graph structure: starting from the attended node vector and traversing an attended edge, a new node vector is obtained. The same procedure can be followed for all subqueries, respecting the structure of the query tree. Logically relating subqueries results in logical operations (such as and, or, not) over attended scene graphs. Finally, based on the question type and the final scene graph, the answer can be provided (Shi et al., 2018).

Evaluation
Classification metrics are commonly used for benchmarking, with answer accuracy providing a general measure of performance (Cao et al., 2019; Shi et al., 2018). Compositional commonsense reasoning heavily relies on the evaluation of the reasoning paths that lead to the final answer (Cao et al., 2019). The accuracy score is further decomposed into KB-related and KB-not-related accuracies, depending on the need for external knowledge; those can in turn be decomposed into first-order and second-order accuracies, regarding the number of reasoning steps required; finally, a more fine-grained categorization provides question-type accuracy, based on the template the query components follow.
Ranking metrics such as average recall are used to evaluate the retrieval success of supporting facts for explanations. Average recall is further decomposed into KB-related and KB-not-related fact recall.

6.3 Knowledge in Visual Commonsense Reasoning (K-VCR)
Various information-rich external knowledge sources can provide insights into unseen concepts that humans would effortlessly infer from the information provided in a scene. This missing commonsense knowledge can guide answer explanation towards the right rationale, revealing whether a more accurate reasoning process is followed by VCR models when knowledge is added.

Datasets
Visual Commonsense Reasoning (VCR) (Zellers et al., 2019) is the dataset which introduced the task and serves both knowledge-free and knowledge-enhanced versions of VCR. It contains 110k unique images from movie scenes and 290k challenging multiple-choice questions, with 290k correct answers and rationales. Images contain annotations which are anchored over questions, answers and rationales. The technique of adversarial matching is chosen for the answers in order to minimize biases: each correct answer appears four times in the whole dataset, once as a positive answer and three times as a negative answer. Therefore, a VCR model will not favor more frequently appearing answers, which would endorse guessing rather than reasoning. This dataset is originally knowledge-free, therefore not necessarily requiring external knowledge, nor is it associated with any knowledge base. Nevertheless, the questions in the various VCR (Zellers et al., 2019) question categories can greatly benefit from the introduction of external knowledge sources, which can explicitly incorporate senses like the ones described in section 5.
Visual Commonsense Graphs (VCG) is a large-scale dataset that provides information regarding temporal commonsense relationships, such as what may have happened before, what may happen in the near future and what the intents of the people present are, based on static images. In total it contains more than 59K images and more than 139K textual descriptions of events at present. Additionally, around 295K intents at present, as well as more than 584K events before and 586K events after complete the dataset, resulting in more than 1.4 million commonsense inferences. People and locations appearing in the images are grounded with their mentions in the textual descriptions.

Transformer-based models
External knowledge Some of the first knowledge-enhanced transformer-based attempts built upon the BERT (Devlin et al., 2019) framework to introduce knowledge-vision-language (KVL) learning as an instance of multimodal learning. In the KVL-BERT architecture, ConceptNet (Speer et al., 2017) is leveraged to enrich sentences with relevant commonsense information. The knowledge-enriched linguistic input is then inserted into a BERT-like multimodal transformer. The preservation of the semantic structure is achieved by using relative position embeddings. However, injected information should only be visible to its corresponding textual entities of the sentence and not to other tokens or visual features, a need that is satisfied via a 'weakening' visible matrix. Moreover, it is possible that different enriched textual tokens in the sentence share the same relative position embeddings, which would make unrelated tokens obtain high self-attention scores, implying that they are related. This contradiction is resolved by imposing a masked self-attention mechanism via the visible matrix, restricting the area a token can attend to. After these treatments, the input is in a form suitable to be fed into a VL transformer, in this case VL-BERT (Su et al., 2020). It was observed that KVL-BERT outperforms its multitask baselines, as well as models dedicated to the VCR task, even though it cannot surpass the performance of knowledge-free VL transformers that invest in additional pre-training.
A somewhat different strategy is employed in the case of Vision-Language-Knowledge Co-Embedding (ViLaKC) (Lee and Kim, 2020): the three modalities are first embedded independently and afterwards fused together. Initially, a knowledge extraction module (KEM) retrieves relevant knowledge from ConceptNet based on concepts appearing in the image, question and candidate answers. The encoding of modalities is performed in the two-stage VLKEM module: first, the independent modality encoding embeds images using ResNet, language using BERT (Devlin et al., 2019) and knowledge using a GCN (Kipf and Welling, 2016). The second stage consists of the co-embedding submodule, which aligns and integrates the three vectors via a multi-head self-attention mechanism. The co-embedder is pre-trained in two phases, the first being task-agnostic, as in several VL transformer models, and the second task-specific, utilizing significantly less data (200K samples) coming from all three modalities. The task-specific pre-training stage introduces novel pre-training tasks, such as masked language modeling with image and knowledge (MLMIK), masked object classification with text and knowledge (MOCTK), and vision-language-knowledge matching (VLKM), in order to enforce co-learning. This joint embedding is then inserted into an answer determination module (ADM) consisting of a fully connected layer followed by a softmax.
The CKRM framework (Wen and Peng, 2021) consists of two stages, the first used for knowledge retrieval and the second for reasoning. SWAG (Zellers et al., 2018), a dataset containing pairs of a situation (context) and possible endings, serves as the commonsense knowledge source, aiming to transfer knowledge regarding everyday situations to the target task of VCR. A source and a task encoder are responsible for receiving (context, ending) pairs and (question, answer) pairs respectively, performing knowledge transfer at different granularity levels. The encoders first use BERT (Devlin et al., 2019), followed by a BiLSTM structure to model temporal interactions of words. Cell-level knowledge transfer refers to the most fine-grained information fusion from the source to the target task, with layer-level and attention-level knowledge corresponding to coarser aspects of information. This strategy allows acquiring knowledge from various perspectives for a more enriched representation. The knowledge-based reasoning module incorporates the multi-level knowledge from the previous stage together with visual features in the knowledge-enriched visual attention module. Finally, a reasoning composition module combines all aspects of knowledge derived from the multi-level transfer procedure and the enriched visual representations to derive the answer.

Evaluation
Classification metrics, especially classification accuracy, are employed for evaluating K-VCR results that follow the multiple-choice format for answers (A) and rationales (R). Accuracy is decomposed by evaluating each of the following aspects independently: 1. Q → A: given a question Q, choose as A one of the 4 available answers and check whether it matches the correct answer.
2. QA → R: given a question Q and the correct answer A, select as R one out of the 4 rationales and check whether it matches the correct rationale.
3. Q → AR: given a question Q, select as A one of the 4 answers and, depending on the selected A, choose one of the 4 rationales. The result is regarded as correct if and only if both the right A and R are chosen.
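The three sub-task accuracies can be summarized in a short sketch, assuming per-sample predictions are available as indices over the 4 candidates; the field names are invented for the example.

```python
def vcr_accuracies(samples):
    """Accuracy for the three VCR sub-tasks from per-sample predictions.

    Each sample is a dict with the model's chosen answer/rationale
    indices and the ground-truth ones (each out of 4 candidates).
    """
    n = len(samples)
    q_a = sum(s["pred_a"] == s["gt_a"] for s in samples) / n
    qa_r = sum(s["pred_r_given_gt_a"] == s["gt_r"] for s in samples) / n
    # Q -> AR: correct only if both the answer and the rationale chosen
    # conditioned on that answer are right.
    q_ar = sum(
        s["pred_a"] == s["gt_a"] and s["pred_r_given_pred_a"] == s["gt_r"]
        for s in samples
    ) / n
    return {"Q->A": q_a, "QA->R": qa_r, "Q->AR": q_ar}

samples = [
    {"pred_a": 1, "gt_a": 1, "pred_r_given_gt_a": 2, "gt_r": 2, "pred_r_given_pred_a": 2},
    {"pred_a": 0, "gt_a": 3, "pred_r_given_gt_a": 1, "gt_r": 1, "pred_r_given_pred_a": 0},
]
acc = vcr_accuracies(samples)
```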
6.4 Knowledge in Image Captioning (K-IC)

Datasets
There are no dedicated knowledge-enhanced or knowledge-demanding datasets for K-IC. Knowledge-enhanced models use COCO captions (Lin et al., 2014), as described in section 3.4.3. Moreover, Flickr30k (Young et al., 2014), a dataset containing 31,783 scene images each accompanied by 5 human-annotated sentences, is widely employed for K-IC.

Sequential language models
External knowledge First attempts at knowledge-enhanced image captioning propose the extension of existing implementations by injecting commonsense knowledge from external sources. Specifically, the backbone image captioning architecture extracts visual features from images via a CNN, which are then inserted into an LSTM to generate a knowledge-free answer. To enhance this baseline with knowledge, objects extracted from the image are used as queries to ConceptNet (Speer et al., 2017). Related ConceptNet entities, regarding either individual objects (direct terms) or the remaining image areas (indirect terms), are fed to a pre-trained LSTM which provides semantic representations for each of those two points of view. Then, visual features, direct-term representations and indirect-term representations are concatenated to form the initial state of another LSTM model, which finally generates the knowledge-enhanced caption (Zhou et al., 2019c). Both visual and commonsense knowledge for image captioning are used in (Hou et al., 2019). The first step includes dense region sampling from images in order to acquire visual and knowledge mappings. Dense visual feature extraction includes the definition of candidate regions, which are clustered together to provide a more concrete representation: the cluster center points for each dense region cluster serve as the corresponding visual features. Consequently, the knowledge mapping receives visual features and knowledge embedding vectors from Visual Genome (Krishna et al., 2016) and returns a knowledge-related representation per region cluster. Both visual and knowledge embeddings resulting from the two mapping procedures are concatenated and then inserted into a commonsense reasoning module. This module projects the two inputs into the same semantic space, from which a semantic graph is constructed under the guidance of commonsense knowledge.
In the relational reasoning module, a GCN (Kipf and Welling, 2016) operates on the semantic graph to obtain relation-aware node features. Finally, an LSTM receiving the knowledge-aware node embeddings as input generates the caption.
Inferring words not appearing in the image remains a challenge in image captioning, as there is no guidance regarding how those unseen words should be inferred for use in captions. Such unmatched elements can be addressed with internal self-knowledge, based on more fine-grained alignments between individual words and image regions achieved by attention mechanisms, and with external commonsense knowledge, to capture implicit information that cannot be derived from the existing data. Objects detected in the image are used to retrieve knowledge from ConceptNet (Speer et al., 2017). Region features extracted from the image via a region proposal network and word-level attention on the sentence part cooperate towards attending to the most salient features of the image. This visual attention guided by language attention, together with the corresponding word embedding, is inserted into an LSTM, which feeds back each previous hidden state to update the word-level attention signal that contributes to the visual attention in every round. The external knowledge is incorporated at a later stage, when the answer is generated; therefore, it can tune the probabilities of LSTM-generated words to be added to the sentence towards more meaningful results. A reinforcement learning training strategy is followed by setting the LSTM as the agent, the words and visual features as the environment, and the generation of the best next word from the captioning model as the policy (Huang et al., 2020a). Even though local information is well represented based on detected objects, image captioning is generally not interpretable and therefore not explicitly controllable. An external knowledge source can help in grounding detected objects with semantic entities from the graph, which in turn provides enriched semantic labels for the objects present in the image.
In order to control the objects appearing in the caption, an attention-based human-interpretable mask is introduced, which assists in diverse caption generation. This mask can be dynamically tuned by a human to influence the resulting caption (Aditya Mogadala, 2020). Off-the-shelf object detectors have served several image captioning architectures. However, some tough situations such as very small objects, occlusion or rare object classes can result in error propagation and negatively impact all subsequent components up to the final caption generation. Commonsense constraints and semantic correlations extracted from Visual Genome (Krishna et al., 2016) can act as priors to guide a more accurate representation. A semantic graph is constructed upon extracted image regions, allowing GCN-based (Kipf and Welling, 2016) reasoning. Specifically, visual semantics such as objects, attributes and relationships are captured by extracting candidate region proposals. CNN-based region features satisfy object and attribute representations, while features from the union areas of regions provide relationship representations. Visual features are projected into the same high-level semantic space as knowledge embeddings derived from Visual Genome. Therefore, knowledge-enhanced visual triplets are formed, respecting rules imposed by knowledge. The semantic graph is built upon those triples. Then, relational reasoning is performed on the semantic graph using a GCN, the output of which is inserted into the LSTM module that generates the answer (Hou et al., 2020). Internal knowledge Visiolinguistic priors are naturally connected with describing images, in the sense that humans logically infer unseen entities given a partial description of a visual situation. Obtaining such priors from existing images and captions is a way of 'creating' knowledge and facilitating reasoning in image captioning models without adding external sources.
Scene graph generation is a widely used technique for self-augmentation of the information present in the dataset. Both images and text need to be represented as graph structures to bridge the two modalities. The Scene Graph Auto-Encoder (SGAE) framework utilizes this graph conversion to instill language priors into the encoder-decoder image captioning structure. More specifically, a learnable dictionary maps the relationships between a sentence and its corresponding scene graph iteratively, reconstructing the initial text from the generated graph in each round. For scene graph generation from text, a pre-trained scene graph parser is utilized, while for the reverse procedure, a trainable RNN decoder converts the dictionary back to text. During this procedure, the dictionary manages to capture the necessary language prior to be transferred for captioning. The learned dictionary can then be inserted into the image-involving pipeline: a scene graph parser converts the image into a scene graph, which is then passed to the dictionary encoded by a GCN (Kipf and Welling, 2016). Finally, decoding the dictionary provides the final caption. Attention mechanisms are able to identify such structured visiolinguistic priors and highlight connections between text and images, therefore augmenting image captioning implementations. Conditional Latent Topic Attention (CLTA) in combination with a sentence prior is able to infuse the model with prior knowledge without the need to construct scene graphs. Latent topic models are able to recognize semantically significant topics, which drive attention mechanisms to capture local and global dependencies in images. Thus, salient visual features emerge through words, and more candidate salient regions are discovered and re-weighted accordingly, if they are associated with a topic contributing to an existing salient region. CLTA implements this re-weighting procedure to construct a context vector.
Moreover, a sentence autoencoder acting as the sentence prior encourages the extraction of more context information and enhances generalization. Both the context vector and the sentence prior are inserted into an LSTM that generates the answer (Goel et al., 2020).

Transformer-based models
Recent knowledge-enhanced image captioning models are implemented based on Transformers, as an expected substitute for sequential models.
External knowledge Named-entity and event knowledge had not been studied in previous image captioning works. This type of information is widely available in news articles, with raw sources being too complicated for language models to infer the right semantics. Special datasets are crafted for this purpose, providing an appropriate form of information for named-entity/event-aware image captioning. The heart of the proposed method is the cross-modal entity matching, which incorporates information from various sources. Sub-graphs are extracted from the image and the article text descriptions, forming structured representations of the input. The nodes of the text sub-graph correspond to named entities and the edges to their in-between relationships, while the image sub-graph is more generic, representing the objects present in the image. The two sub-graphs are linked via the similarity between image sub-graph objects and text sub-graph named entities in the cross-modal entity matching module. This module is trained with the help of multimodal external knowledge from Wikipedia. As a result, a multimodal knowledge graph is produced, containing visual, textual and knowledge information. Embedding representations are obtained for each modality: a GAT (Veličković et al., 2018) produces a multimodal knowledge graph embedding, RoBERTa encodes news captions, and image features are derived from a pre-trained ResNet-152. An entity-aware captioning model receives the visual, textual and multimodal knowledge graph representations, feeding them to a Transformer (Vaswani et al., 2017) decoder to produce the caption. The BART transformer (Lewis et al., 2019) can provide further advancements towards the refined task of Visual Commonsense Generation (VCG), lying at the intersection of the generative image captioning task and the non-generative visual commonsense reasoning task.
To this end, knowledge-enhanced Multimodal BART (KM-BART) was developed, able to incorporate both visual and linguistic information with the help of modality and task-relevant tokens in the transformer input. More specifically, task-relevant tokens are added at the beginning of the input sequence, denoting the task type. For example, for VCG the <before>, <after> and <intent> tokens represent the temporal sequence of events (what happened before, what may happen next) and the intents of the people present in the image. Furthermore, the pre-training task of Knowledge-based Commonsense Generation (KCG) fuses commonsense knowledge from structured sources early in the pipeline, effectively integrating explicit knowledge in an implicit way. Also, the Attribution Prediction (AP) and Relation Prediction (RP) pre-training tasks are used for the first time in knowledge-enhanced VL learning. COMET is a transformer model trained on knowledge bases such as ATOMIC (Hwang et al., 2021) and ConceptNet (Speer et al., 2017) that generates commonsense descriptions, and acts as a knowledge source for KM-BART. Two possible settings are examined for KM-BART, one containing the image and the event description (i.e. some textual information about the image that provides the context of the depicted situation) and a harder one that omits the event description (Xing et al., 2021). Internal knowledge Transformer-based captioning poses some challenges, one of them attributed to the autoregressive (AR) training procedure, which is based on maximum likelihood estimation (MLE). The main issue stemming from MLE is that when the generated sequence does not match the ground-truth one, there is no discrimination between different 'failed' predictions. Therefore, words that are totally unrelated to the ground-truth match are treated the same as semantically similar words. For this reason, a KL divergence term is added to weight semantic relationships between generated words with respect to their ground-truth match.
Moreover, a knowledge graph is used to enrich the transformer input embeddings, infusing contextual information from neighboring entities in the graph. This knowledge graph is constructed from the linguistic information itself, by leveraging cosine similarity between embedded words to position them within a vector space. The original Transformer (Vaswani et al., 2017) architecture is leveraged for the task, with image features, in place of word embeddings, representing the visual modality.
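A minimal sketch of the idea of weighting 'failed' predictions by semantic similarity: instead of a one-hot MLE target, a soft target distribution is built from embedding similarities to the ground-truth word, and a KL divergence term then penalizes an unrelated prediction more than a semantically close one. The toy embeddings and the temperature value are assumptions, not the paper's exact formulation.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p, q, eps=1e-12):
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy vocabulary with 2-d word embeddings; "puppy" is close to "dog".
emb = {"dog": (1.0, 0.1), "puppy": (0.9, 0.2), "car": (-1.0, 0.3)}
vocab = list(emb)
gt = "dog"

# Soft target: distribution over the vocabulary proportional to embedding
# similarity with the ground-truth word (0.1 is an assumed temperature).
target = softmax([cosine(emb[w], emb[gt]) / 0.1 for w in vocab])

# Two predictions that both miss the ground truth: one semantically
# close ("puppy"), one unrelated ("car").
pred_puppy = softmax([0.0, 5.0, 0.0])  # mass on "puppy"
pred_car = softmax([0.0, 0.0, 5.0])    # mass on "car"
```

Under a plain one-hot MLE target both predictions would be penalized equally; the KL term against the soft target discriminates between them.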
Evaluation
Human evaluation can qualitatively assess generated sentences. The human evaluation experiment in  compares the quality of captions generated by different models according to the perception of 30 evaluators. Even though such an experiment is rather subjective, it indicates the importance of language priors. In  human preference is measured in comparison with the previous best performer on the generated before, after and intent sentences.
6.5 Knowledge in Visual Dialog (K-VD)

Datasets
VisDial (Das et al., 2016) is a dataset used in both knowledge-free and knowledge-enhanced versions of VD. It consists of 133k dialogs and an equal number of images from COCO, with the train and validation splits (125k dialogs) assigning 10-round dialogs (QA pairs) per image. In the test split (8k dialogs), random rounds are paired with each image. Some important aspects of this dataset are the presence of coreferences, endorsing the coherence of the conversation at the linguistic level, and temporal continuity in topics, which supports the preservation and consistency of semantic meaning across the dialogs. The questions mostly follow a concrete and rather exploratory pattern: starting by asking about entities involved in COCO captions, then diving into details, trying to define a categorization of the whole scene or the most appropriate setting description, questioning the weather of the scene, exploring key semantics not mentioned previously, and finally validating and expanding the understanding of elements provided in the answers.

Transformer-based models
Internal knowledge Very recently, a knowledge-enhanced implementation for visual dialog was introduced, inspired by the fact that commonsense-related questions are ignored. A visual dialog model requires two necessary inputs: an image and the dialog history. Visual graphs have assisted the task by providing object relationships explicitly, even though this knowledge is not adequate for commonsense inferences. The integration of commonsense knowledge can be well represented with graph-level facts and sentence-level facts. Facts from a commonsense knowledge graph such as ConceptNet (Speer et al., 2017) are extracted based on the cosine similarity between their word embedding representations and the embedding representations of the words in the sentences and the detected objects. Those graph-level facts can complement entities from the visual graph. Therefore, an enriched vision-fact graph can be produced after the individual graphs are purified by removing redundant information. The sentence-level facts are derived from the dialog sentences in the form of (subject, relation, object) triples, forming a graph structure. Similarly to the visual stream, the sentence graph is cleaned and enriched with commonsense knowledge. Finally, a transformer-based fusion module receives the enriched graphs, as well as the question embedding, to provide the answer, exploiting a generative and a discriminative decoder (Zhang et al., 2022).
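The cosine-similarity-based fact selection can be sketched as follows, with averaged word embeddings representing facts and context; the embeddings and ConceptNet-style facts are toy stand-ins.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def sentence_vector(words, emb):
    """Average the embeddings of the words we have vectors for."""
    vecs = [emb[w] for w in words if w in emb]
    dim = len(next(iter(emb.values())))
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# Toy word embeddings; a real system would use pre-trained vectors.
emb = {
    "dog":   (0.9, 0.1, 0.0), "bark": (0.8, 0.3, 0.0),
    "pet":   (0.7, 0.2, 0.1), "car":  (0.0, 0.1, 0.9),
    "drive": (0.1, 0.0, 0.8),
}

# Candidate ConceptNet-style facts, each represented by its word list.
facts = {
    ("dog", "CapableOf", "bark"): ["dog", "bark"],
    ("car", "UsedFor", "drive"): ["car", "drive"],
}

# Context built from the dialog sentences and detected objects.
context = sentence_vector(["dog", "pet"], emb)

best_fact = max(facts, key=lambda f: cosine(sentence_vector(facts[f], emb), context))
```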

Evaluation
Ranking metrics such as NDCG, MRR, R@1, R@5, R@10 and mean rank measure the quality of answer retrieval for visual dialog, for both generative and discriminative answer prediction (Zhang et al., 2022). Human evaluation is used in the generative setting of (Zhang et al., 2022). Specifically, two metrics are provided: the first indicates the percentage of responses passing the Turing test, i.e., the share of generated sentences that could be perceived as human-written; the second measures the number of generated responses perceived as of equal or better quality than specific human responses used as baselines.
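Given the 1-based rank of the ground-truth answer for each question, the simpler of these ranking metrics reduce to a few lines; a minimal sketch (function names are illustrative):

```python
def recall_at_k(ranks, k):
    """Fraction of queries whose ground-truth answer appears in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries; rewards high placements sharply."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def mean_rank(ranks):
    """Average position of the ground-truth answer; lower is better."""
    return sum(ranks) / len(ranks)
```

For example, ranks `[1, 2, 10]` give R@1 = 1/3, R@5 = 2/3 and MRR = (1 + 0.5 + 0.1)/3. NDCG additionally weights multiple relevant candidate answers by graded relevance, which is why it is preferred for VisDial's dense annotations.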

Knowledge in Visual Storytelling (K-VIST)
Visual Storytelling presents many situations where hypothetical concepts can be derived from commonsense and temporal reasoning. Unseen events can enrich, or even be necessary for, appropriate and coherent textual stories. For example, some sequential inferences were presented in the event/temporal knowledge analysis (section 5), providing knowledge such as "the boy dropped a glass of water and then the glass broke". Not all concepts mentioned in this sentence may be explicitly apparent in a frame of the visual sequence. However, a knowledge graph can guide inference by searching for possible connections between concepts appearing in images, and thus acquire such imaginary concepts.

Datasets
There are no dedicated datasets for K-VIST. Instead, relevant literature relies on datasets used for the knowledge-free version of the task, such as VIST (Huang et al., 2016a). This dataset contains more than 81K unique photos in around 20K sequences with corresponding textual stories. The textual stories follow a narrative style, demanding higher-level inference capabilities compared to literal visual descriptions. This requirement extends beyond the majority of visual description tasks, which do not directly focus on sequential coherence or even abstract meanings. Two extra descriptions are provided per frame in order to bridge literal descriptions with narratives: descriptions of images-in-isolation (DII) and images-in-sequence (DIS).
6.6.2 Methods

6.6.2.1 Sequential language models
External knowledge A two-stage structure was proposed in (Yang et al., 2019a), consisting of a reasoning and a generation module. The vision-aware commonsense reasoning module is responsible for extracting the most relevant knowledge from an external knowledge base. Objects detected on all images of a sequence are fed into a GRU which provides a semantic and temporal representation. At the same time, candidate ConceptNet entities are fetched based on the detected objects. Attention modules finally select the most relevant ConceptNet candidates, which, after passing through a GRU, provide the final commonsense representation. The knowledge-augmented generation module receives the extracted commonsense knowledge together with the visual information, as well as the previously generated sentences.
A prevalent issue in VIST is monotonous and repetitive generated stories, which can be attributed to the limited vocabulary of the VIST dataset. In KG-Story (Hsu et al., 2019) the first stage (distill) gathers words from images using object detection and GRUs for word prediction. Potential relationships between pairs of concepts throughout the images are searched for in external knowledge graphs, and if multiple candidates occur, a scoring function ranks their relevancy; this is the enrich stage. Finally, the generate stage utilizes a Transformer which imposes a repetition penalty to mitigate redundant narration. Further modifications to the default Transformer structure are the introduction of an anaphoric expression generator to enhance coreferences and the usage of pronouns, as well as positional encodings of variable length to enable representing stories of different lengths.
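A common way to realize such a repetition penalty at decoding time is the CTRL-style rescaling of logits for already-generated tokens; the sketch below shows this general technique and is not necessarily the exact formulation used in KG-Story.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Down-weight the logits of tokens that already appear in the output.

    Positive logits are divided by the penalty and negative logits are
    multiplied by it, so a repeated token always becomes less likely
    regardless of the sign of its logit.
    """
    out = list(logits)
    for t in set(generated_ids):
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out
```

With `penalty=1.0` the function is a no-op; values around 1.2 are a common default in open-source decoders.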
Addressing again the coherence and novelty of generated stories, the authors of (Xu et al., 2021) propose a three-stage structure corresponding to the imagination, reasoning and writing capabilities of humans. The first stage (imagine) focuses on sequential consistency by extracting the visual topic of a frame through the combination of the current visual features and the sentence generated in the previous step. The knowledge part targets the content of narratives and consists of three graph types: a general commonsense knowledge graph, a scene graph and an event graph. A GCN applied on each graph selects the most suitable knowledge parts, which are combined to form the second stage (reason). Both imagine and reason outputs are fed to the third stage (write), which is responsible for generating the story.

Transformer-based models
External knowledge Towards informative and more diverse stories, (Chen et al., 2021a) is the first knowledge-enhanced approach that utilizes a generative transformer to produce the story output. The concept enrichment stage connects concepts present in images with ConceptNet. Then, a graph attention network (GAT) operates on the graph and image features in order to integrate information from the most appropriate candidate concept nodes, which is passed to the next selection module. The concept selection module utilizes two different selection methods: a Sequential Selection Module (SSM) that operates in an encoder-decoder fashion, outputting selected concepts after encoding the embedded candidate concepts; and a Maximal Clique Selection Module (MCSM) that outputs a maximal clique containing all concepts appropriate for story generation given the concept graph. Finally, the concept-to-story module uses either an RNN structure or a BART language model, with BART demonstrating more diverse stories while preserving quality.

Evaluation
Language generation metrics BLEU, ROUGE, METEOR and CIDEr are widely used automatic metrics that evaluate the linguistic quality of generated stories. Diversity of generated stories is measured via the Distinct-n (Dist-n) score, which indicates the originality of generated text by calculating the ratio of unique n-grams over the whole corpus of generated stories; higher Dist-n scores represent more diverse stories (Yang et al., 2019a; Chen et al., 2021a). Human evaluation is very important for generative tasks, as automatic evaluation metrics cannot assess the full range of linguistic capabilities, especially when it comes to evaluating sequential quality. However, different implementations perform varying human evaluation experiments, which impedes the direct comparison of models.
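Dist-n is simple enough to state directly: the number of distinct n-grams divided by the total number of n-grams across all generated stories. A minimal sketch (whitespace tokenization is an assumption for illustration):

```python
def distinct_n(stories, n=2):
    """Ratio of unique n-grams to total n-grams across a generated corpus.

    Returns a value in [0, 1]; 1.0 means no n-gram is ever repeated.
    """
    ngrams = []
    for story in stories:
        tokens = story.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0
```

For example, the single story "a b a b" contains three bigrams but only two distinct ones, giving Dist-2 = 2/3.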
In (Yang et al., 2019a) four aspects were examined: fluency checks the linguistic quality, relevance measures the success of the textual description in describing visual concepts, informativeness measures the diversity of produced stories, and coherence evaluates the semantic continuity of stories in a sequence. Each aspect receives a score from 1 (worst) to 5 (best) from three evaluators, and the average values serve as the final results. Similarly, in (Xu et al., 2021) relevance, coherence and informativeness are considered, receiving scores from 0 (worst) to 2 (best) from five evaluators. A different human evaluation strategy is followed in (Hsu et al., 2019): comparative experiments between VIST models are performed, asking users to rank generated stories from different models either with or without revealing the corresponding images. This is an indirect evaluation of linguistic quality and coherence, when only text is regarded, and also of semantic relevance, when the corresponding images are provided. The comparative approach is also used in (Chen et al., 2021a), with two evaluators declaring their preference (or a tie) between two models regarding three aspects: relevance and informativeness, similar to (Yang et al., 2019a; Xu et al., 2021), together with logicality, which measures the logical coherence over story sequences. Additionally, overall indicates the evaluator's general preference between the two models.

Datasets
A variety of datasets have been used in visual generation which, however, do not contain any particular sense of knowledge and are widely used in knowledge-free settings. Datasets used in conditional image synthesis include ImageNet (Deng et al., 2009), CIFAR (Krizhevsky, 2009), FFHQ (Karras et al., 2018), Oxford Flowers (Nilsback and Zisserman, 2008), CUB (He and Peng, 2020) and many others.
Sequential synthesis (story visualization) greatly utilizes the Pororo-SV cartoon dataset (Kim et al., 2017). It contains more than 16k pairs of scenes and dialogs extracted from 20 hours of video, 27,328 fine-grained scene descriptions in natural language provided by human annotators, and 8,913 multiple-choice QA pairs related to the story. In total, 10 main characters appear in the frames. Questions are divided into 11 types: Action, Person, Abstract, Detail, Method, Reason, Location, Statement, Causality, Yes/No, Time. FlintstonesSV (fli) is also based on cartoon frames. It is composed of 25,184 densely annotated videos, each containing 75 frames. The annotations include bounding boxes with labels for characters and items in the frames, as well as segmentation masks. Another emerging dataset for Story Visualization is DiDeMo-SV (Maharana et al., 2022; Hendricks et al., 2017), a dataset based on video captions that contains 10,000 videos with more than 40,000 temporally localized textual descriptions.

6.7.2 Methods

6.7.2.1 Knowledge in Conditional Image Generation (K-cIG)
Internal knowledge Even though GANs have been powerful in synthesizing novel images, they cannot handle combinations of attributes they have not encountered in the training data. Therefore, if the textual condition refers to such unseen combinations, the synthesized image sacrifices some of the semantics in order to produce a result that remains within the learned distribution. The insertion of additional knowledge can expand the generated distribution to enhance consistency with the condition without sacrificing fidelity. This translates into two requirements for a GAN model: the generator should become more flexible, and the discriminator more tolerant. KG-GAN meets those requirements by introducing a second generator, trained on domain knowledge by utilizing a novel knowledge loss.
This second generator shares parameters with the original one, which is responsible for synthesizing images conditioned on text. A regression network receives the synthesized images from the seen-image generator and the ones from the knowledge generator, imposing constraints regarding the plausibility of unseen combinations. The semantic vector produced by the knowledge generator is redirected to the seen-image generator to guide generation outside the predefined classes. KG-GAN does not exploit external knowledge sources, but with this simple distribution enhancement it achieves some preliminary zero-shot capabilities.

Knowledge in Story Visualization (K-SV)
External knowledge Story Visualization is another task with limited contributions in knowledge-enhanced settings. Structured information from text can be obtained via parse trees, which permit hierarchical encoding of longer phrases. Missing information regarding visual details in text can be filled in with external knowledge from ConceptNet (Speer et al., 2017). Moreover, conceptually similar sentences that are phrased differently need to be placed close in an embedding space, an issue that external knowledge can again effectively resolve. Spatial knowledge is also underrepresented in most sentences, even though scene synthesis needs detailed information about object positions. Dense captioning, as a form of self-augmenting knowledge, provides detailed positioning information due to the usage of region bounding boxes. The combination of internal spatial and external semantic knowledge can better guide sequential synthesis, resolving all involved aspects such as text-image consistency, visual quality and sequential continuity. A Memory-Augmented Recurrent Tree-Transformer (MARTT) encodes the parse trees for the text, while a Graph Transformer (Yun et al., 2020) embeds the commonsense knowledge. Both embeddings are inserted into the story encoder, which outputs contextualized embeddings for the image generator. The generated images are passed to image and story discriminators, which redirect synthesis based on individual and sequential aspects. Spatial knowledge from dense captioning enforces additional loss functions during training, to provide more explicit information about positions and detailed grounding of characters on the images with respect to their descriptions in the text.
The groundbreaking success of DALL-E (Ramesh et al., 2021, 2022) inspired the usage of massive zero-shot transformer-based generative models in Story Visualization; StoryDALL-E (Maharana et al., 2022) achieves generalization of visual synthesis to unseen textual stories, also extending the task to Story Continuation: in this case, a source image is included in the conditioning, requiring the model to continue the visual story in a consistent way. Story Visualization has been a task lacking sufficient datasets, due to the increased effort needed to construct appropriate ones, either manually or automatically. To this end, external unstructured knowledge obtained from a pre-trained DALL-E (Ramesh et al., 2021) enables even zero-shot sequential synthesis based on an input 'story' text.

Metrics
Image generation metrics (section 3.6.5) such as FID for seen and unseen classes were used in KG-GAN. FID is also used to evaluate the quality of generated frames independently in (Maharana et al., 2022). R-precision indicates quality by measuring the retrieval capabilities of generated frames over ground-truth captions compared to retrieval using the real frames. Classification metrics, such as the Character F1 score, measure the quality of generated characters in predicted images. Also, frame accuracy checks the exact match between the semantics of the ground-truth and generated frames (Maharana et al., 2022). Language metrics are also relevant: viewing SV frames as a video, captions for generated frames can be produced using video captioning techniques. BLEU scores evaluate the quality of captions as an indirect measure of visual quality, based on the idea that well-designed semantics will be captured in captions better than low-quality concepts. Human evaluation can reveal the human perception of quality, as in most generative tasks. Specifically for SV, evaluators need to assess results for visual quality, consistency and relevance compared to the previous state-of-the-art model on the same task (Maharana et al., 2022).

6.8 Multi-task transformers with knowledge

6.8.1 Methods
Multi-task models can easily be built using multimodal transformer backbones. Instead of utilizing external knowledge graphs as in previous methods, many implementations employ self-knowledge exclusively, by obtaining more structured representations from the existing visual and textual data.
External knowledge A natural unification of multiple tasks under the same model would incorporate tasks moving in the same direction, such as cross-modal reasoning tasks or cross-modal retrieval tasks. Indeed, VQA, VCR and VE were unified in the Rationale VT transformer (Marasović et al., 2020), a framework that utilizes visual and linguistic clues to generate free-text rationales. Two knowledge sources attempt to provide reasoning information regarding scenes: a grounded situation recognizer (Pratt et al., 2020), which describes activities in scenes and the entities involved, and draws bounding boxes for entities to visually ground them; and Visual Commonsense Graphs, which fuse commonsense inferences about events and intents so that a temporal perspective of a scene is also considered. Rationales are generated for the VQA-E (visual question answering), E-SNLI-VE (visual entailment) (Do et al., 2020) and VCR (visual commonsense reasoning) (Zellers et al., 2019) datasets. Visual recognition of objects is the first step for visual understanding, followed by capturing their in-between relationships utilizing the knowledge provided by the grounded situation recognizer (Pratt et al., 2020). Higher-level cognition is achieved using knowledge from VisualCOMET, which receives the knowledge stored in Visual Commonsense Graphs to generate commonsense inferences. VisualCOMET is built upon GPT-2 (Radford et al., 2019); therefore a unimodal, purely linguistic input can be provided, utilizing object labels, textual questions/answers and inferences. Alternatively, GPT-2 can be adapted, resulting in a hybrid implementation: visual features and bounding box coordinates act as visual embeddings, combined with VisualCOMET token embeddings indicating the beginning of before, after and intent inferences.
Targeting again reasoning tasks, (Shevchenko et al., 2021) builds on top of LXMERT (Cho et al., 2020) to address the knowledge-enhanced versions of the VQA, VCR and VE tasks on the OK-VQA (Marino et al., 2019), FVQA, NLVR2 (Suhr et al., 2019) and SNLI-VE (Xie et al., 2019) datasets. External knowledge is provided from ConceptNet (Speer et al., 2017) and Wikidata (Vrandecic and Krötzsch, 2014). Knowledge-rich expressions are created by matching embedded knowledge with training sentences from the datasets. Moreover, a training objective targeting the alignment of knowledge embeddings and knowledge-rich expressions encourages learning a global representation structure. Utilizing this objective is proven beneficial during both pre-training and fine-tuning. It is also observed that the introduction of this knowledge-oriented objective smooths the embedding space, which facilitates similarity matching between words.
KB-VLP utilizes knowledge embeddings based on Wikidata (Vrandecic and Krötzsch, 2014) entities, which are concatenated with the visiolinguistic instances as inputs of a VL transformer. Specifically, entity recognition is performed on the text to extract relevant Wikidata entries, which are embedded via Wikipedia2vec to form text-related knowledge embeddings. Object tags obtained from the image are used to obtain image-related knowledge embeddings from relevant Wikidata entities. The input vector consists of 5 components: word embeddings for the text, text-related knowledge embeddings, word embedding sequences for object tags per image, visual features, and image-related knowledge embeddings. Two specialized pre-training objectives are used: the sentence-level objective substitutes elements from the input vector with other random elements, while the token-level objective extends text-image masking to the masking of text-related and image-related knowledge embeddings. Task-specific datasets for KB-VLP are VQA, GQA (Hudson and Manning, 2019) and OK-VQA (Marino et al., 2019) for visual question answering, and NLVR2 (Suhr et al., 2019) for visual reasoning.
Internal knowledge OSCAR (Li et al., 2020c) is one of the models that effortlessly transition from knowledge-free to knowledge-enhanced learning, utilizing self-acquired knowledge in its simplest form. Instead of, rather naively, letting the model infer the correct image-text alignments in an exhaustive way, OSCAR facilitates the procedure with the usage of object tags as intermediaries between text and image instances. This procedure is motivated by the observation that salient objects in the image will most probably also appear in the text. The input to the VL transformer module consists of word tokens, object tag embeddings and visual features. The intermediary object tags form separate semantic spaces, depending on whether they are paired with text or image, yielding two dedicated pre-training objectives. The masked token loss objective views text and tag word representations in the same space, randomly masking each of them and letting the model reconstruct the missing parts through the visual modality. Conversely, the contrastive loss views tags paired with visual features, and randomly replaces the real tag sequence with another one sampled from the dataset, learning to pull apart mismatched tag sequences and bring the matching ones close together. OSCAR succeeds both in understanding tasks, such as cross-modal retrieval (ITR/TIR), visual question answering (on VQA and GQA (Hudson and Manning, 2019)) and visual reasoning (on NLVR2 (Suhr et al., 2019)), and in generation tasks, such as image captioning and novel object captioning.
ERNIE-ViL leverages structured visual knowledge from scene graphs to bridge detailed semantics across vision and language. Such fine-grained representations are important to differentiate between conceptually similar scenes. Scene graph prediction tasks (object, attribute and relationship prediction) encourage learning those fine-grained differences. Even though it does not use external knowledge, ERNIE-ViL internally constructs structured knowledge during cross-modal pre-training. This self-knowledge is sufficient to boost performance in 5 VL tasks, especially those in which fine-grained associations are required, such as visual referring expressions (VRE). Other tasks benefiting from this approach are VCR, VQA and cross-modal retrieval (ITR/TIR).
ROSITA extends the self-knowledge idea by employing both cross-modal and intra-modal knowledge at the same time. Given an image-text pair, the first step is to construct intra-modal graphs, i.e., an image graph and a text graph. The image graph consists of regions (defined by a pre-trained object detector) as nodes, with IoU scores of paired regions acting as edge weights between those regions. Similarly for the text graph, objects, attributes and relationships are extracted from the text to fill the nodes, while edge weights are defined by the co-occurrence frequency between pairs of nodes. In both graphs, a zero similarity score between nodes indicates the absence of an edge. A cross-modal scene graph is derived from the image and text graphs by aligning predicted region tags from the image side with words from the text side based on their textual semantic similarity. By calculating this similarity score for all possible tag-word pairs, edge weights between cross-modal nodes are defined. Nodes connected via cross-modal edges, named anchor nodes, form subgraphs which maintain intra-modal and cross-modal edges, as well as two-hop connections that contain paths of cross-modal followed by intra-modal edges. ROSITA leverages this enhanced representation to boost three downstream tasks: VQA, VRE and ITR.
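The image-graph construction above can be sketched as follows: IoU between region bounding boxes yields edge weights, and a zero score means no edge. This is a generic sketch of the technique, not ROSITA's exact implementation; box format and function names are assumptions.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def image_graph_edges(boxes):
    """Weighted edges between region nodes; zero IoU means no edge."""
    edges = {}
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            w = iou(boxes[i], boxes[j])
            if w > 0:
                edges[(i, j)] = w
    return edges
```

The text graph is built analogously, with co-occurrence frequencies in place of IoU scores.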

Evaluation
Human evaluation is useful in cases of language generation tasks, such as the rationale generation of (Marasović et al., 2020). In this case, the need for human evaluation arises from the observation that distinct rationales, even though not paraphrases of each other, can all be suitable. The following aspects were evaluated: visual plausibility, referring to how well the generated rationales support the answer (in VQA and VCR) or the entailment (VE) given the image, and visual fidelity, measuring the appearance of irrelevant information within otherwise plausible generated rationales. Excluding the images, textual plausibility evaluates generated rationales based exclusively on their support for the answer (in VQA and VCR) or the entailment (VE).
Classification metrics such as accuracy serve as the gold standard for non-generative models on cross-modal reasoning tasks. In (Shevchenko et al., 2021), OK-VQA accuracies per question type are also reported, in order to validate that improvements in commonsense-oriented categories can be attributed to the injection of commonsense knowledge.
Ranking metrics provide valuable insights when retrieval tasks are performed; Recall@k for k = 1, 5, 10 is reported.
Language metrics such as BLEU (Papineni et al., 2002), CIDEr (Vedantam et al., 2014) and SPICE are used for language generation tasks, such as image captioning and novel object captioning (Li et al., 2020c).

7 The future of knowledge in VL

7.1 Explainability and biases
Some early works on KVL tasks widely addressed the need for boosting explainability via knowledge graphs. The complex, opaque reasoning accompanying many state-of-the-art VL models indeed renders explainability a significant aspect. However, as more advanced models emerged, interpretability and fairness became incidental to performance. The pursuit of impressive results often promotes models with more vulnerabilities, which have been significantly underexplored. For example, the leading field of NLP has experienced some non-negligible failures, such as producing completely wrong statements based on a false conditioning (see, e.g., "Aligning Language Models to Follow Instructions"). It is unknown how many such vulnerabilities exist in state-of-the-art models, as there is no systematic way to capture and recognize them, nor is there any guarantee they will not occur. Those issues undermine human trust in black-box models and even make such models susceptible to misuse. Explainability and robustness of VL models are even less explored compared to NLP, leaving lots of space for the development of transparent models and post-hoc explainability methods. Generally, given the interwoven nature of knowledge graphs and explainability, it is expected that sooner or later research interest will return to this direction.

Zero-shot learning
Previous works regarding zero-shot classification tasks (Nayak and Bach, 2020; Geng et al., 2020, 2021) leverage knowledge graphs to transfer feature information from seen to unseen classes. Little work has been done so far towards the more complex task of multimodal zero-shot learning with external knowledge (Chen et al., 2021c), leaving numerous unexplored directions open for future research.

Exploitation and integration of more knowledge senses
Despite the increasing interest towards knowledge-enhanced multimodal learning, there are some significantly underexplored knowledge aspects that could be leveraged in various tasks. For example, factual knowledge does not have a noticeable presence outside of VQA applications. Named entities and events are only addressed in a couple of applications. Temporal knowledge, or even hypotheses and counterfactual thinking, could reveal new aspects of existing tasks with the potential for interesting implementations.

Datasets
Even though dedicated knowledge-based datasets have been developed for VQA (Shah et al., 2019; Marino et al., 2019; Jain et al., 2021), a lack of corresponding ventures has been observed in other VL tasks. Such datasets could either incorporate external knowledge from certain knowledge bases in the first place, or be more flexible and dynamically retrieve knowledge to satisfy more challenging inputs. Suitable, high-quality datasets are the first step towards the evolution of the knowledge-enhanced multimodal learning field. Furthermore, knowledge-based datasets could be combined with explanation-oriented datasets, like VQA-E and e-SNLI-VE (Do et al., 2020), to address the issues mentioned in section 7.1.

Knowledge-enhanced generative tasks
The field at the intersection of knowledge graphs and generative models has been significantly underexplored, despite the major success both fields have experienced in recent years. So far, visual knowledge has been attempted in image synthesis, forming the task of image generation from scene graphs and layouts, with interesting results and improvements on complex scene synthesis (Li et al., 2019b; He et al., 2021b). Also, domain knowledge in GANs has demonstrated some insightful preliminary observations regarding knowledge-guided synthesis of unseen attributes, without the need for massive pre-training. A handful of the aforementioned generative multimodal approaches (Yang et al., 2019a) incorporate knowledge graphs for the tasks of visual storytelling and story visualization. More compelling results could unfold from the combination of various knowledge graphs with multimodal generative models, enabling conditioning on commonsense, hierarchical and factual knowledge, and also enforcing interpretable insights into the generation process.

The need for multi-task learners
An abundance of multi-task knowledge-free transformer-based VL models have emerged in recent literature, presenting impressive results on a variety of downstream tasks by utilizing the same pre-trained body each time. On the other hand, knowledge-enhanced VL models usually target a single task, and only a few knowledge-enhanced multi-task models (Marasović et al., 2020; Shevchenko et al., 2021) have been developed. Even in the case of multi-task transformers, almost half of them utilize only self-knowledge, without exploiting the benefits of additional external sources. At the same time, the harder venture of integrating external knowledge limits the range of tasks that multi-task models target; specifically, current implementations have covered only reasoning tasks. Therefore, as a first future direction, retrieval tasks can also be attempted. Going one step further, some VL tasks addressed in multi-task knowledge-free VL transformers have never been explored in the knowledge-enhanced literature; hence, unified multi-task architectures could explore their knowledge-enhanced capabilities without the need to develop individual, non-reusable approaches. In any case, multi-task knowledge-enhanced models would unlock the full potential of the contributions of knowledge, with competitive architectures pushing the state of the art even further.

Conclusion
Introducing external knowledge in multimodal learning has demonstrated promising research directions, targeting performance, explainability and extendability of existing tasks. In this survey paper, we analyzed the meeting point of visiolinguistic representation learning and knowledge assisted learning, focusing on the contribution of existing knowledge graphs and unstructured knowledge sources. The presented taxonomy of knowledge-enhanced datasets, tasks and models provides one of the first attempts towards structuring the field of knowledge-enhanced VL learning, with the aim to guide future research and address prospects and challenges of this upcoming field.