New Ideas and Trends in Deep Multimodal Content Understanding: A Review

The focus of this survey is on the analysis of two modalities in multimodal deep learning: image and text. Unlike classic reviews of deep learning where monomodal image classifiers such as VGG, ResNet and the Inception module are central topics, this paper examines recent multimodal deep models and structures, including auto-encoders, generative adversarial nets and their variants. These models go beyond simple image classifiers in that they can perform uni-directional (e.g. image captioning, image generation) and bi-directional (e.g. cross-modal retrieval, visual question answering) multimodal tasks. In addition, we analyze two key challenges for better content understanding in deep multimodal applications. We then introduce current ideas and trends in deep multimodal feature learning, such as feature embedding approaches and objective function design, which are crucial in overcoming the aforementioned challenges. Finally, we include several promising directions for future research.


Introduction
Semantic information that helps us illustrate the world usually comes from the different sensory modalities in which an event is processed or experienced (i.e. auditory, tactile, or visual). Thus, the same concept or scene can be presented in different ways. If we consider a scene where "a large yellow dog leaps into the air to catch a frisbee", then one could select audio, video, or an image, which also indicates the multimodal aspect of the problem. To perform multimodal tasks well, it is first necessary to understand the content of the available modalities. Depending on which modalities are available during the testing stage, multimodal applications include bi-directional tasks (e.g. image-sentence search [1][2], visual question answering (VQA) [3][4]) and uni-directional tasks (e.g. image captioning [5][6], image generation [7][8]); both will be introduced in the following sections.
With the powerful capabilities of deep neural networks, data from the visual and textual modalities can be represented as individual features using domain-specific neural networks. Complementary information from these unimodal features is appealing for multimodal content understanding. For example, the individual features can be further projected into a common space by another neural network for a prediction task. For clarity, we illustrate the flowchart of neural networks for multimodal research in Figure 1. On the one hand, since neural networks are composed of successive linear layers and non-linear activation functions, the image or text data is represented at a high level of abstraction, which leads to the "semantic gap" [9]. On the other hand, different modalities are characterized by different statistical properties: an image is a 3-channel RGB array while text is often symbolic. When represented by different neural networks, their features have unique distributions and differences, which leads to the "heterogeneity gap" [10]. That is to say, to understand multimodal content, deep neural networks should be able to reduce the difference between high-level semantic concepts and low-level features in intra-modality representations, as well as construct a common latent space to capture semantic correlations in inter-modality representations.
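As a minimal illustration of this common-space construction (this is a sketch, not a method from any specific surveyed work; the layer sizes, class name, and the L2-normalization step are illustrative assumptions), two small modality-specific networks can project pre-extracted image and text features into a shared latent space:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CommonSpaceProjector(nn.Module):
    """Project monomodal features into a shared latent space (illustrative sketch)."""
    def __init__(self, img_dim=2048, txt_dim=768, latent_dim=512):
        super().__init__()
        # modality-specific embedding networks (assumed to be small MLPs)
        self.img_net = nn.Sequential(nn.Linear(img_dim, 1024), nn.ReLU(), nn.Linear(1024, latent_dim))
        self.txt_net = nn.Sequential(nn.Linear(txt_dim, 1024), nn.ReLU(), nn.Linear(1024, latent_dim))

    def forward(self, img_feat, txt_feat):
        # L2-normalize so features from both modalities become directly comparable
        z_img = F.normalize(self.img_net(img_feat), dim=-1)
        z_txt = F.normalize(self.txt_net(txt_feat), dim=-1)
        return z_img, z_txt

# usage: CNN image features (e.g. 2048-d) and text features (e.g. 768-d) for 8 pairs
projector = CommonSpaceProjector()
z_img, z_txt = projector(torch.randn(8, 2048), torch.randn(8, 768))
similarity = z_img @ z_txt.t()   # cross-modal similarity matrix in the common space
```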
Much effort has gone into mitigating these two challenges to improve content understanding. Some works involve deep multimodal structures such as cycle-consistent reconstruction [11][12][13], while others focus on feature extraction nets such as graph convolutional networks [14][15][16]. In some algorithms, reinforcement learning is combined with deep multimodal feature learning [17][18][19]. These recent ideas are the scope of this survey. In a previous review [20], the authors analyze intrinsic issues for multimodal research but mainly focus on machine learning. Some recent advances in deep multimodal feature learning are introduced in [21], but that work mainly discusses feature fusion structures and regularization strategies.
In this paper, we focus on two specific modalities, image and text, by examining recent related ideas.
First, we focus on the structures of deep multimodal models, including auto-encoders and generative adversarial networks.

In text-to-image generation, this requirement is closely related to the following two aspects: the heterogeneity gap [10] and the semantic gap [9][42]. The first addresses the gap between the high-level concepts of text descriptions and the pixel-level values of an image, while the second exists between synthetic images and real images.
The above issues in the text-to-image application are exactly what generative models attempt to address, through methods such as Variational Auto-Encoders (VAE) [43], auto-regressive models [44] and Generative Adversarial Networks (GANs) [8][22]. Recently, various new ideas and network architectures have been proposed to improve image generation performance. One example is to generate a semantic layout as intermediate information from text data to bridge the heterogeneity gap between image and text [45][46][47].
Image generation is a promising multimodal application with many applicable scenarios such as photo editing and multimedia data creation; thereby, this task has attracted a lot of attention. However, there are two main limitations to be explored further. Similar to image captioning, the first limitation concerns the evaluation metrics. Currently, the Inception Score (IS) [51][52][55], Fréchet Inception Distance (FID) [55], Multi-scale Structural Similarity Index Metric (MS-SSIM) [56][57], and Visual-semantic Similarity (VS) [49] are used to evaluate generation quality. These metrics pay attention to generated image resolution and image diversity, but their assessments are still far from human perception. Another limitation is that, while generative models work well and achieve promising results on single-category object datasets like Caltech-UCSD CUB [58], existing methods are still far from promising on complex datasets like MS-COCO, where one image contains more objects and is described by a complex sentence.
To compensate for these limitations, word-level attention [53], hierarchical text-to-image mapping [46] and memory networks [59] have been explored. In the future, one direction may be to make use of the Capsule idea proposed by Hinton [60], since capsules are designed to capture the concepts of objects [48].

Bi-directional Multimodal Applications
In bi-directional applications, features from the visual modality are translated to the textual modality and vice versa. Representative bi-directional applications are cross-modal retrieval and visual question answering (VQA), where image and text are projected into a common space to explore their semantic correlations.
(1) Cross-modal retrieval

Single-modal and cross-modal retrieval have been researched for decades [61]. Different from single-modal retrieval, cross-modal retrieval returns the most relevant image (text) when given a query text (image).
As for performance evaluation, there are two important aspects: retrieval accuracy and retrieval efficiency.
For the first, it is desirable to explore semantic correlations across image and text features. To meet this requirement, the aforementioned heterogeneity gap and semantic gap are the challenges to deal with. Some novel techniques that have been proposed are as follows: attention mechanisms and memory networks are employed to align relevant features between image and text [62][63][64][65]; bi-directional sequential models (e.g. Bi-LSTM [66]) are used to explore spatial-semantic correlations [1][62]; graph-based embedding and graph regularization are utilized to keep semantic order in the text feature extraction process [67][68]; information theory is applied to reduce the heterogeneity gap in cross-modal hashing [69]; adversarial learning strategies and GANs are used to estimate common feature distributions in cross-modal retrieval [70][71][72]; and metric learning strategies are explored, which consider inter-modality semantic similarity and intra-modality neighborhood constraints [73]. For the second, retrieval efficiency, hashing methods [83][86] are applied for learning compact hash codes with different lengths. However, the problems that should be considered when employing hashing methods for cross-modal retrieval are feature quantization and non-differentiable binary code optimization. Some methods, such as self-supervised learning [83] and continuation [86], have been explored to address these two issues.
Recently, Yao et al. [85] introduce an efficient discrete optimization scheme for binary code learning in which a hash code matrix is constructed. Focusing on feature quantization, Wang et al. [84] introduce a hash code learning algorithm in which the binary codes are generated without relaxation, so that the large quantization error and non-differentiability problems are avoided. Analogously, a straightforward discrete hash code optimization strategy is proposed, more importantly, in an unsupervised way [87].
Although much attention has been paid to cross-modal retrieval, there still exists room for performance improvement. For example, when employing graph-based methods to construct semantic information within two modalities, more context information, such as object link relationships, can be adopted for more effective semantic graph construction [61].

(2) Visual question answering

Visual question answering (VQA) is a challenging task in which an image and a question are given, and a correct answer is then inferred according to the visual content and syntactic principles. We summarize four types of VQA [88] in Figure 2. VQA can be categorized into image question answering and video question answering; in this paper, we target recent advances in image question answering. Since VQA was proposed, it has received increasing attention. For example, training datasets [89] have been built for this task, and network training tips and tricks are presented in [90]. To infer correct answers, VQA systems need to understand the semantics and intent of the questions completely, and should also be able to locate and link the relevant image regions with the linguistic information in the questions. VQA applications present two-fold difficulties: feature fusion and reasoning rationality.
Thus, VQA more closely reflects the difficulty of multimodal content understanding, which makes VQA applications more difficult than other multimodal applications. Compared to other applications, VQA has varied and unknown questions as inputs. Specific details (e.g. the activity of a person) in the image must be identified along with the undetermined questions. Moreover, the rationality of question answering relies on high-level knowledge and the advanced reasoning capability of deep models. As for performance assessment, answers to open-ended questions are difficult to evaluate compared to the other three types in Figure 2, where the answer is typically selected from specific options or contains only a few words [89]. As summarized in Figure 2, the research on VQA includes: free-form open-ended questions [91], where the answer could be words, phrases, or even complete sentences; object counting questions [92], where the answer is the number of objects in an image; multi-choice questions [32]; and Yes/No binary problems [93]. In principle, the multi-choice and Yes/No types can be viewed as classification problems, where deep models infer the candidate with maximum probability as the correct answer; these two types are associated with different answer vocabularies and are solved by training a multi-class classifier. In contrast, object counting and free-form open-ended questions can be viewed as generation problems [89] because the answers are not fixed, only required to relate to the visual content and question details.
Compared to the other three mentioned multimodal applications, VQA is more complex and more open-ended. Although much attention has been paid to visual question answering research, several challenges remain in this field. One is related to accuracy: some keywords in a question might be neglected and some visual content might remain unrecognized or misclassified, because of which a VQA system might give inaccurate or even wrong answers. Another is related to the diversity and completeness of the predicted answer, which is especially crucial for free-form open-ended problems, as the output answers should be as complete as possible to explain the given question, and not limited to a specific domain or restricted language forms [89]. The third one concerns the VQA datasets, which should be less biased. For the existing available datasets, questions that require the use of the image content are often relatively easy to answer, whereas harder questions, such as those beginning with "Why", are comparatively rare and difficult to answer since they need more reasoning [94]. Therefore, the biased question types impair the evaluation of VQA algorithms. As a recommendation, a larger but less biased VQA dataset is necessary.

Challenges for Deep Multimodal Learning
Typically, domain-specific neural networks process different modalities to obtain individual monomodal representations, which are further embedded or aggregated into multimodal features. Importantly, it is still difficult to fully understand how multimodal features are used to perform the aforementioned tasks well. Taking text-to-image generation as an example, we can consider two questions: First, how can we organize two types of data into a unified framework to extract their features? Second, how can we make sure that the generated image has the same content as the sentence describes?
These two kinds of questions are highly relevant to the heterogeneity gap and the semantic gap in deep multimodal learning. We illustrate the heterogeneity gap and the semantic gap in Figure 3. Recently, much effort has gone into addressing these two challenges. These efforts are categorized into two directions: towards minimizing the heterogeneity gap and towards preserving semantic correlation.

Heterogeneity Gap Minimization
On the one hand, although complementary information from multiple modalities is beneficial for multimodal content understanding, their very different statistical properties can impair the learning of this complementary information. On the other hand, image features are extracted from hierarchical networks while text features come from sequential networks. Naturally, these features are distributed inconsistently so that they are not directly comparable, which leads to the heterogeneity gap. Both the modality data and the networks themselves contribute to this discrepancy.
Figure 3 A conceptual illustration of the two challenges. We use different shapes to denote different modalities; the circle represents text feature distributions, and the triangle represents image feature distributions. Different shapes with the same color mean that they are semantically similar in content. Apart from the "semantic gap" [9], which is commonly mitigated in monomodal deep visual tasks such as image classification, the key for deep multimodal content understanding also lies in mitigating the "heterogeneity gap" [10], i.e. reducing the inter-modality gap and exploring the semantic correlations.

Two directions have been explored to reduce the heterogeneity gap: one from the viewpoint of deep multimodal structures and another from the viewpoint of feature learning algorithms.
Auto-encoders and generative adversarial networks (GANs) [22] are two important structures for multimodal representation. We will introduce both of them in the following sections. Generative adversarial networks learn features that bridge image data and text data; for example, GANs are commonly applied to generate images according to their descriptive sentences. This idea has been developed into several variants, such as StackGAN [51], HDGAN [49], and AttnGAN [53]. Auto-encoders are used to correlate multimodal features based on feature encoding and feature reconstruction. For example, Gu et al. [95][96] use a cross-reconstruction method to preserve multimodal semantic similarity, where image (text) features are reconstructed to text (image) features.
In addition, much effort has gone into minimizing the gaps in uni-modal representations. For instance, sequential neural networks (e.g. RNNs) are employed to extract multi-granularity text features, including character-level, word-level, phrase-level and sentence-level features [58][97][98][99]. Graph-based approaches have been introduced to explore semantic relationships in text feature learning [68][100].
Regarding the goal of reducing the heterogeneity gap, uni-modal representations are projected into a common latent space under joint or coordinated constraints. Joint representations combine uni-modal features into the same space, while coordinated representations process uni-modal features separately but with certain similarity and structure constraints [20].

Semantic Correlation Preserving
Preserving semantic similarity is challenging. On the one hand, the differences between high-level semantic concepts (i.e. features) and low-level values (e.g. image pixels) result in a semantic gap in intra-modality embeddings. On the other hand, uni-modal visual and textual representations make it difficult to capture the complex correlations across modalities in multimodal learning.

As images and text are used to describe the same content, they should share similar patterns to some extent. Therefore, uni-modal representations are projected into the common latent space by individual neural networks acting as mapping functions (see Figure 1). To preserve the semantic correlations, one must measure the similarity between multimodal features, for which joint representation learning or coordinated representation learning can be adopted. Joint representation learning is more suitable for scenarios where all modalities are available during the testing stage, such as visual question answering. For other situations where only one modality is available at test time, such as cross-modal retrieval and image captioning, coordinated representation learning is a better option.
Generally, feature vectors from two modalities can be concatenated directly in joint representation learning; the concatenated features are then used for classification or fed into a neural network (e.g. an RNN) for prediction (e.g. producing an answer). Simple feature concatenation is a linear operation and less effective; therefore, advanced pooling-based methods such as compact bilinear pooling [101][102] have been introduced to connect semantically relevant multimodal features. Neural networks are also an alternative for exploring more correlations in the joint representations. For example, Wang et al. [103] introduce a multimodal transformer for disentangling contextual and spatial information so that a unified common latent space for image and text is constructed. Similarly, auto-encoders, as unsupervised structures, are used in several multimodal tasks like cross-modal retrieval [75] and image captioning [104]. The learning capacity of the encoder and decoder is enhanced by improving the structure of the sub-networks, for example by stacking attention [105][106], parallelizing LSTMs [107][104], and ensembling CNNs [108]. Different sub-networks have their own parameters; thereby, auto-encoders have more capacity to learn comprehensive features.
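A minimal sketch of such a joint representation (plain concatenation followed by an MLP classifier, in the spirit of early VQA baselines; all dimensions, the class name, and the dropout rate are illustrative assumptions):

```python
import torch
import torch.nn as nn

class JointEmbeddingClassifier(nn.Module):
    """Joint representation learning: concatenate monomodal features, then classify."""
    def __init__(self, img_dim=2048, txt_dim=1024, hidden=1024, num_answers=1000):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),  # joint embedding layer
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(hidden, num_answers),        # e.g. answer classification in VQA
        )

    def forward(self, img_feat, txt_feat):
        joint = torch.cat([img_feat, txt_feat], dim=-1)  # simple (linear) fusion
        return self.fusion(joint)

# usage: predict answer logits for a batch of 4 image-question pairs
logits = JointEmbeddingClassifier()(torch.randn(4, 2048), torch.randn(4, 1024))
```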
The key point for coordinated representation learning is to design optimal constraint functions. For example, computing the inner product or cosine similarity between two cross-modal features is a simple way to constrain dot-to-dot correlations. Canonical correlation analysis [72][109][110][111] is commonly used to maximize semantic correlations between vectors. For better performance and stability, metric learning methods such as bi-directional objective functions [112][113][73] are utilized. However, mining useful samples and selecting appropriate margin settings remain empirical in metric learning [74]. To address these limitations, new methods such as adversarial learning [71][83][70] and KL-divergence [74] have been introduced. Instead of selecting three-tuple samples, these alternative methods consider the whole feature distributions in a common latent space. In addition, attention mechanisms [33][34][114] and reinforcement learning [115][116] are popularly employed to align relevant features between modalities.
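For concreteness, a commonly used coordinated constraint is a bi-directional triplet ranking loss over the common space. The sketch below (the margin value and the hardest-negative mining choice are illustrative assumptions, not taken from a specific cited work) penalizes mismatched image-text pairs that score higher than matched ones:

```python
import torch

def bidirectional_triplet_loss(z_img, z_txt, margin=0.2):
    """Bi-directional ranking loss over L2-normalized features (row i of each input is a matched pair)."""
    scores = z_img @ z_txt.t()                 # cosine similarities (features assumed normalized)
    pos = scores.diag().view(-1, 1)            # similarity of matched image-text pairs
    # image-to-text direction: every non-matching text is a negative for the anchor image
    cost_i2t = (margin + scores - pos).clamp(min=0)
    # text-to-image direction: every non-matching image is a negative for the anchor text
    cost_t2i = (margin + scores - pos.t()).clamp(min=0)
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_i2t = cost_i2t.masked_fill(mask, 0)
    cost_t2i = cost_t2i.masked_fill(mask, 0)
    # hardest-negative mining within the batch
    return cost_i2t.max(dim=1)[0].mean() + cost_t2i.max(dim=0)[0].mean()
```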
To address the above-mentioned challenges, several new ideas, including methods for feature extraction, structures of deep networks, and approaches for multimodal feature learning, have been proposed in recent years. The advances from these ideas are introduced in the following sections.

Recent Advances in Deep Multimodal Feature Learning
Regarding the aforementioned challenges, exploring content understanding between image and text has attracted sustained attention and much remarkable progress has been made. In general, these advances come mainly from the viewpoint of network structure and the viewpoint of feature extraction/enhancement. To this end, following the natural processing pipeline of multimodal research (see Figure 1), we categorize these research ideas into three groups: deep multimodal structures presented in Section 4.1, multimodal feature extraction approaches introduced in Section 4.2, and common latent space learning described in Section 4.3. Deep multimodal structures constitute the basic frameworks in the community; multimodal feature extraction is the prerequisite that supports the subsequent similarity exploration; and common latent space learning is the last but a critical procedure to make multimodal features comparable. For a general overview of these aspects in multimodal applications, we chart the representative methods in Figure 4.
Deep Multimodal Structures

To bridge images and text, deep multimodal structures usually involve both the computer vision and natural language processing (NLP) fields [174]. For instance, raw images are processed by hierarchical networks such as CNNs, while raw input text can be encoded by sequential networks such as RNN, LSTM [99], and GRU [174].
During the past years, a variety of related methods have blossomed and directly improved the performance of multimodal learning in multimodal applications, as shown in Figure 4.

Deep multimodal structures include generative models and discriminative models. Generative models implicitly or explicitly represent data distributions measured by a joint probability P(X, Y), where both raw data X and ground-truth labels Y are available in supervised scenarios. Discriminative models learn classification boundaries between two different distributions indicated by the conditional probability P(Y|X). Recent representative network structures for multimodal feature learning are auto-encoders and generative adversarial networks, and there are novel works that improve the performance of multimodal research based on these two basic structures (see Figure 4).

Auto-encoders
The main idea of auto-encoders for multimodal learning is to first encode data from a source modality as hidden representations and then use a decoder to generate features (or data) for the target modality.

Thus, auto-encoders are commonly used for dimensionality reduction in bi-directional applications where two modalities are available at the test stage. For this structure, the reconstruction loss is the constraint for training the encoder and decoder to capture the semantic correlations between image and text features well. For clarity, we identify three ways of correlation learning using auto-encoders in Figure 5. For instance, as shown in Figure 5(a), the input images and text are processed separately with non-shared encoders and decoders, after which the hidden representations from the encoders are coordinated through a constraint such as Euclidean distance [117]. The coordinated methods can be replaced by joint methods, as in Figure 5(b), where image and text features are projected into a common space with a shared multilayer perceptron (MLP); subsequently, the joint representation is reconstructed back to the original raw data [125]. Alternatively, feature correlations are captured by cross reconstruction with similarity constraints (e.g. a similarity matrix used as supervisory information [127][96]) between hidden features. The idea of constraining sample similarity is also incorporated with GANs into a cycle-consistent formulation for cross-modal retrieval, as in GXN [95] and CYC-DGH [12] (see Figure 4).
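A minimal sketch of the coordinated scheme in Figure 5(a), where non-shared encoders and decoders reconstruct each modality and a Euclidean constraint coordinates the hidden codes (single linear layers stand in for deeper networks; all sizes and names are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoordinatedAutoEncoders(nn.Module):
    """Two non-shared auto-encoders whose hidden codes are coordinated by a Euclidean constraint."""
    def __init__(self, img_dim=2048, txt_dim=768, hidden=256):
        super().__init__()
        self.enc_img, self.dec_img = nn.Linear(img_dim, hidden), nn.Linear(hidden, img_dim)
        self.enc_txt, self.dec_txt = nn.Linear(txt_dim, hidden), nn.Linear(hidden, txt_dim)

    def forward(self, img, txt):
        h_img, h_txt = self.enc_img(img), self.enc_txt(txt)
        losses = {
            # per-modality reconstruction of the raw inputs
            "recon": F.mse_loss(self.dec_img(h_img), img) + F.mse_loss(self.dec_txt(h_txt), txt),
            # Euclidean coordination between the hidden codes of a matched pair
            "coord": F.mse_loss(h_img, h_txt),
        }
        return h_img, h_txt, losses

# usage: one training step over a batch of 8 matched image-text feature pairs
model = CoordinatedAutoEncoders()
_, _, losses = model(torch.randn(8, 2048), torch.randn(8, 768))
total_loss = losses["recon"] + losses["coord"]
```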
The neural networks contained in the encoder-decoder framework can be modality-specific. For image data, the commonly used neural networks are CNNs, while sequential networks like LSTMs are most often used for text data. When applied to multimodal learning, the decoder (e.g. an LSTM) constructs hidden representations of one modality from another modality; the goal is not to reduce a reconstruction error but to optimize the likelihood of the output. Therefore, most works focus on the decoder, since it is the process that projects features from one modality into another. For example, the idea of [104] is to parallelize more parameters of LSTMs to capture more context information, and similar ideas can be found in "CNN ensemble learning" [108]. Instead of grabbing more information by stacking and paralleling, "Attention-LSTM" [107][175] combines the attention technique with LSTMs to highlight the most relevant correlations, which is more targeted. An adversarial training strategy is employed in the decoder to make all the representations discriminative for semantics but indiscriminative for modalities, so that intra-modal semantic consistency is effectively enhanced [125].
Considering that a fixed decoder structure such as an RNN might limit performance, Wang et al. [26] introduce an evolutionary algorithm to adaptively generate neural network structures in the decoder.

Generative Adversarial Networks
As depicted in Figure 4, adversarial learning based on generative adversarial networks [22] has been widely explored in multimodal tasks such as text-to-image generation [53], but has been less popular in VQA tasks. GANs combine a generative sub-model and a discriminative sub-model into a unified framework in which the two components are trained in an adversarial manner.
Different from auto-encoders, GANs can cope with scenarios in which some data are missing.

To accurately explore the correlations between two modalities, multimodal research works involving GANs have been focusing on the whole network structure and its two components: generator and discriminator.
For the generator, which can also be viewed as an encoder, an attention mechanism is often used to capture the important key points and align cross-modal features, as in AttnGAN [53].

The discriminator, which usually performs binary classification, attempts to discriminate the ground-truth data from the outputs of the generator. Some recent ideas have been proposed to improve the discrimination of GANs. Originally, the discriminator in the first work [22] only needs to classify different distributions into "True" or "False" [8]. However, the discriminator can also perform class label classification, where a label classifier is added on top of the discriminator [57][120]. Apart from label classification, a semantic classifier is designed to further predict the semantic relevance between a synthesized image and a ground-truth image for text-to-image generation [7]. Focusing only on paired samples leads to relatively weak robustness; therefore, unmatched image-text samples can also be fed into the discriminator (e.g. GAN-INT-CLS [8] and AACR [72]) so that the discriminator has a more powerful discriminative capability.
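A matching-aware discriminator update can be sketched as follows, loosely following the idea of feeding real-matched, real-mismatched, and fake pairs (as in GAN-INT-CLS); the network sizes, the fusion scheme, and the 0.5 weighting are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatchingAwareDiscriminator(nn.Module):
    """Scores whether an image feature both looks real and matches a sentence embedding."""
    def __init__(self, img_feat_dim=512, txt_dim=256):
        super().__init__()
        self.img_encoder = nn.Linear(img_feat_dim, 256)   # stand-in for a CNN backbone
        self.joint = nn.Sequential(nn.Linear(256 + txt_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

    def forward(self, img_feat, txt_emb):
        h = torch.cat([self.img_encoder(img_feat), txt_emb], dim=-1)
        return self.joint(h)   # high logit = real AND matching the text

def discriminator_loss(D, real_img, fake_img, matched_txt, mismatched_txt):
    ones = torch.ones(real_img.size(0), 1)
    zeros = torch.zeros(real_img.size(0), 1)
    loss_real = F.binary_cross_entropy_with_logits(D(real_img, matched_txt), ones)
    loss_mismatch = F.binary_cross_entropy_with_logits(D(real_img, mismatched_txt), zeros)
    loss_fake = F.binary_cross_entropy_with_logits(D(fake_img, matched_txt), zeros)
    return loss_real + 0.5 * (loss_mismatch + loss_fake)
```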
According to previous work [48], the whole structures of GANs in multimodal research are categorized into direct methods [8][120][57], hierarchical methods [49][50] and iterative methods [51][52][53]. Contrary to direct methods, hierarchical methods divide the raw data of one modality (e.g. image) into different parts, such as "style" and "structure", and each part is learned separately. Alternatively, iterative methods separate training into a "coarse-to-fine" process where the details of the results from a previous generator are refined. Besides, cycle-consistency from CycleGAN [136] has been introduced for unsupervised image translation, where a self-consistency (reconstruction) loss tries to retain the patterns of the input data after a cycle of feature transformation. This network structure has then been applied to tasks like image generation [13][11] and cross-modal retrieval [95][12] to learn semantic correlations in an unsupervised way.
Preserving semantic correlations between two modalities amounts to reducing the difference between the inconsistently distributed features from each modality, and adversarial learning keeps pace with this goal. In recent years, adversarial learning has been increasingly applied in multimodal tasks (see Figure 4). For example, in the first work on unsupervised image captioning [121], the core idea of GANs is used to generate meaningful text features from a text corpus alone, and cross-reconstruction is performed between synthesized text features and true image features.

Multimodal Feature Extraction
Deep multimodal structures support the subsequent learning process. Feature extraction is closer to exploring visual-textual content relations, which is the prerequisite for discriminating the complementarity and redundancy of multiple modalities. It is well known that image features and text features from different deep models have distinct distributions even though they convey the same semantic concept, which results in the heterogeneity gap. In this section, we introduce several effective multimodal feature extraction methods for addressing the heterogeneity gap. In general, these methods focus on (1) learning structural dependency information to improve the reasoning capability of deep neural networks and (2) storing more information for semantic correlation learning during model execution. Moreover, (3) feature alignment schemes using attention mechanisms are also widely explored for preserving semantic correlations.

Graph Embeddings with Graph Convolutional Networks
Words in a sentence or objects within an image have dependency relationships, and graph-based visual relationship modelling is beneficial for capturing this characteristic [35]. Graph Convolutional Networks (GCNs) are alternative neural networks designed to capture such dependency information. Compared to standard neural networks such as CNNs and RNNs, GCNs build a graph structure that models a set of objects (nodes) and their dependency relationships (edges) in an image or sentence, and embed this graph into a vectorial representation, which is subsequently integrated seamlessly into the follow-up processing steps.
Graph representations reflect the complexity of sentence structure and have been applied to natural language processing tasks such as text classification [176]. In multimodal learning, GCNs have been explored for text feature extraction [35][65] and image feature extraction [15][16][138]. Among these methods, GCNs capture intra-modality semantic relevances according to the neighborhood structure, and capture correlations between two modalities according to supervisory information. Note that the vector representations from graph convolutional networks are fed into subsequent networks (e.g. an "encoder-decoder" framework) for further learning.

GCNs aim at determining the attributes of objects and subsequently characterizing their relationships. On the one hand, GCNs can be applied within a single modality to reduce the intra-modality gap. For instance, Yu et al. [14] introduce a "GCN+CNN" architecture for text feature learning and cross-modal semantic correlation modeling; in their work, Word2Vec and the k-nearest neighbor algorithm are utilized to construct semantic graphs on text features. GCNs are also explored for image feature extraction, such as in image captioning [15][16][138]. In previous work [138], a tree-structure embedding scheme is proposed for semantic graph construction: input images are parsed into several key entities whose relations are organized into a visual parsing tree (VP-Tree). This process can be regarded as an encoder, and the VP-Tree is transformed into an attention module that participates in each state of an LSTM-based decoder. VP-Tree-based graph construction is performed in a rather unified way; alternative methods are introduced to construct more fine-grained semantic graphs [15][16]. Specifically, object detectors (e.g. Faster-RCNN [179]) and visual relationship detectors (e.g. MOTIFS [180]) are used to obtain image regions and spatial relations; semantic graphs and spatial graphs are then constructed based on the detected regions and relations, respectively. Afterwards, GCNs extract visual representations based on the built semantic and spatial graphs.
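A single graph-convolution step over such a graph of detected regions can be sketched as below; the symmetric normalization follows the standard GCN formulation, while the feature sizes and the random adjacency matrix in the usage example are illustrative assumptions:

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One GCN layer: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, node_feats, adj):
        adj = adj + torch.eye(adj.size(0), device=adj.device)        # add self-loops
        deg_inv_sqrt = adj.sum(dim=1).pow(-0.5)
        norm_adj = deg_inv_sqrt.unsqueeze(1) * adj * deg_inv_sqrt.unsqueeze(0)
        return torch.relu(self.linear(norm_adj @ node_feats))        # aggregate neighbors, then transform

# usage: 5 detected regions (nodes) with 2048-d features and a relation adjacency matrix
regions = torch.randn(5, 2048)
adj = torch.randint(0, 2, (5, 5)).float()
adj = ((adj + adj.t()) > 0).float()          # make the toy graph undirected for this sketch
node_embeddings = GCNLayer(2048, 512)(regions, adj)
```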
Graph convolutional networks are also introduced to mitigate the inter-modality gap between image and text [144][143]. Taking the work [143] on VQA as an example, an image is parsed into different objects, scenes, and actions, while the corresponding question is parsed and processed to obtain its question embeddings and entity embeddings. These embedded vectors of image and question are concatenated into node embeddings and then fed into graph convolutional networks for semantic correlation learning. Finally, the output activations from the graph convolutional networks are fed into sequential networks to predict answers.

Memory-augmented Networks
To enable deep networks to understand multimodal content and to have better reasoning capability for various tasks, one solution may be the aforementioned GCNs. Another solution that has gained attention recently is memory-augmented networks. Intuitively, when much information from a mini-batch, or even from the whole dataset, is stored in a memory bank, such networks have a greater capacity to memorize correlations.
In conventional neural networks like RNNs for sequential data learning, the dependency relations between samples are captured by the internal memory of recurrent operations. However, these recurrent operations might be inefficient for understanding and reasoning over extended contexts or complex images. For instance, most captioning models are equipped with RNN-based encoders, which predict a word at every time step based only on the current input and hidden states used as implicit summaries of the previous history.
However, RNNs and their variants often fail to capture long-term dependencies [31]. To address this limitation, memory networks [30] were introduced to augment the memory, primarily for text question answering [88]. Memory networks improve the understanding of both image and text and can "remember" temporally distant information.
Memory-augmented networks can be regarded as recurrent neural networks with explicit attention methods that select certain parts of the information to store in their memory slots. As reported in Figure 4, memory-augmented networks have been used in cross-modal retrieval and VQA. A typical structure is illustrated in Figure 6: a memory block, which acts as a compressor, encodes the input sequence into its memory slots.
The memory slots are a kind of external memory to support learning; the row vectors in each slot are accessed and updated at each time-step. During training, a network such as an LSTM or GRU, which acts as a memory controller, refers to these memory slots to compute reading weights (see Figure 6). According to the weights, the essential information is retrieved to predict the output sequence. Meanwhile, the controller computes writing weights to update the values in the memory slots for the next time-step of training [182].
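The read/write cycle described above can be sketched as follows, using a single-head, content-based addressing scheme; the slot count, dimensions, class name, and the soft write rule are illustrative assumptions rather than the design of any particular cited memory network:

```python
import torch
import torch.nn.functional as F

class SimpleMemory:
    """External memory with content-based read and a soft write update."""
    def __init__(self, num_slots=32, slot_dim=256):
        self.slots = torch.zeros(num_slots, slot_dim)

    def read(self, query):
        # reading weights: similarity between the controller query and every slot
        weights = F.softmax(self.slots @ query, dim=0)          # (num_slots,)
        return weights @ self.slots, weights                    # weighted sum of slot contents

    def write(self, content, weights):
        # soft write: move each slot toward the new content in proportion to its weight
        self.slots = self.slots + weights.unsqueeze(1) * (content - self.slots)

memory = SimpleMemory()
query = torch.randn(256)           # produced by an LSTM/GRU controller at one time-step
read_vec, w = memory.read(query)   # information used to predict the output at this step
memory.write(torch.randn(256), w)  # update slots for the next time-step
```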
Memory networks can expand the "memory" of networks and thus store more information. For example, memory slots are utilized to determine the importance of concatenated visual and text features over the whole training data. Further considering both modalities, a visual knowledge memory network is introduced in which memory slots store key-value vectors computed from images, query questions and a knowledge base [146]. Instead of storing the actual output features, Song et al. [64] adopt memory slots to store prototype concept representations from pre-trained concept classifiers, which is inspired by the process of human memory.

Memory-augmented networks improve the performance of deep multimodal content understanding by offering more information to select from. However, this technique is less popular in image generation, image captioning and cross-modal retrieval than in VQA (see Figure 4). A possible reason is that, in cross-modal retrieval, memory-augmented networks might require extra time when the memory controllers determine when to write to or read from the external memory blocks, which hurts overall retrieval efficiency.

Attention Mechanism
As mentioned in Section 3.2, one challenge for deep multimodal learning is to preserve semantic correlations between modalities; attention mechanisms have been widely explored for this purpose in multimodal tasks [190], including cross-modal retrieval [62][79][65]. In principle, attention mechanisms compute different weights (or importances) according to the relevances between two global (or local) multimodal features and assign these importances to the features. Thereby, the networks are more targeted at the sub-components of the source modality: regions of an image or words of a sentence. To further explore the relevances between two modalities, attention mechanisms are adopted on multi-level feature vectors [150][189], employed in a hierarchical scheme [188][191], and incorporated with graph networks for modelling semantic relationships [35].

To elaborate on the current ideas and trends of attention algorithms, we categorize this popular mechanism into different types. According to the vectors over which the importance weights are computed, we categorize current attention algorithms into four types: visual attention, textual attention, co-attention, and self-attention. Their diagrams are introduced in Figure 7. We further categorize attention algorithms into single-hop and multiple-hop (i.e. stacked attention) according to the number of iterations of importance calculation.

Visual attention. Visual attention assigns relevance weights to different regions or objects of an image, so that the predicted answers are more accurately related to the question type and image content. Visual attention is widely used to learn features from two modalities.
Textual attention. Compared to visual attention, the textual attention approach is relatively less adopted.
As shown in Figure 7(b), it has an opposite computing direction [153][192][193]. The computed weights are based on text features to obtain relevances for different image regions or objects. According to the work [88], the reason why textual attention is necessary is that text features from multimodal models often lack detailed information for a given image. Meanwhile, the application of textual attention is less dominant as it is harder to capture semantic relevances between abstract text data and image data. Moreover, image data often contains content that is irrelevant to the paired text; in other words, the text might describe only some parts within an image.

Co-attention. As shown in Figure 7(c), co-attention computes relevance weights for image and text features jointly. However, the correlations captured with hierarchical text features may vary dramatically, so that the complex correlations are not fully captured [164]. Therefore, Yu et al. [164][184] develop the co-attention mechanism into a generalized Multi-modal Factorized High-order pooling (MFH) block in an asymmetrical way. Thereby, higher-order correlations of multi-modal features achieve a more discriminative image-question representation and further result in significant improvement on VQA performance.
Self-attention. Compared to the co-attention algorithm, self-attention, which considers the intra-modality relations, is less popular in deep multimodal learning. As intra-modality relations are complementary to inter-modality relations, their exploration is considered to improve the feature learning capability of deep networks. It is important to note that when these four types of attention mechanisms are applied, they can be used to highlight the relevances between different image region features and word-level, phrase-level or sentence-level text features. These different cases only require region/object proposal networks and sentence parsers. When multi-level attended features are concatenated, the final features are more beneficial for content understanding in multimodal learning.
As for single-hop and multiple-hop (stacked) attention, the difference lies in whether the attention "layer" is used once or several times. The four mentioned attention algorithms can be applied in a single-hop manner where the relevance weights between image and text features are computed only once. For multiple-hop scenarios, the attention algorithm is adopted hierarchically to perform coarse-to-fine feature learning, that is, in a stacked way [65][188][165][150][105][152][106]. For example, Xu et al. [150] introduce two-hop spatial attention learning for VQA, where the first hop focuses on the whole question and the second focuses on individual words and produces word-level features. Yang et al. [105] also explore multiple attention layers in VQA, whereby sharper and higher-level attention distributions contribute refined query features for predicting more relevant answers. Singh et al. [152] achieve marginal improvements using an "attention on attention" framework in which attention modules are stacked in parallel for image and text feature learning. Nevertheless, a stacked architecture has a tendency toward vanishing gradients [106].
Regarding this, Fan et al. [106] propose stacked latent attention for VQA. In particular, all spatial configuration information contained in the intermediate reasoning process is retained in a pathway of convolutional layers so that the vanishing gradient problem is tackled.
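A single-hop visual attention step (relevance weights over image regions conditioned on a text query) can be sketched as below; stacking such a layer and re-using the attended feature as the next query yields the multiple-hop variant. The dimensions and the additive scoring function are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualAttention(nn.Module):
    """One attention hop: score each image region against a text query vector."""
    def __init__(self, region_dim=2048, query_dim=1024, hidden=512):
        super().__init__()
        self.proj_region = nn.Linear(region_dim, hidden)
        self.proj_query = nn.Linear(query_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, regions, query):
        # regions: (num_regions, region_dim), query: (query_dim,)
        h = torch.tanh(self.proj_region(regions) + self.proj_query(query))
        alpha = F.softmax(self.score(h).squeeze(-1), dim=0)     # relevance weight per region
        return alpha @ regions, alpha                           # attended visual feature

regions = torch.randn(36, 2048)       # e.g. Faster-RCNN region features
question = torch.randn(1024)          # e.g. an RNN-encoded question
attended, alpha = VisualAttention()(regions, question)
```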
In summary, to better understand the content in the visual and textual modalities, attention mechanisms provide an effective way to highlight and align the most relevant features across modalities.

Common Latent Space Learning
As illustrated in Figure 1, feature extractors (e.g. GCNs) yield modality-specific representations.
In other words, these features are distributed inconsistently and are not directly comparable. To this end, it is necessary to further map these monomodal features into a common latent space with the help of an embedding network (e.g. an MLP). Therefore, common latent space learning has been a critical procedure for exploiting multimodal correlations. In the past years, various constraint and regularization methods have been introduced into multimodal applications (see Figure 4). In this section, we review these ideas, such as attention mechanisms, which aim to retain similarities between monomodal image and text features.
According to the taxonomy in [20], multimodal feature learning algorithms include joint and coordinated methods. The joint feature embedding is formulated as:

J = J(x_1, ..., x_n, y_1, ..., y_n),

while coordinated feature embeddings are represented as:

F(x_1, ..., x_n) ~ G(y_1, ..., y_n),

where J refers to the jointly embedded features, and F and G denote the coordinated features. x_1, ..., x_n and y_1, ..., y_n are n-dimensional monomodal feature representations from the two modalities (i.e. image and text). The mapping functions J(·), F(·) and G(·) denote the deep networks to be learned, and "~" indicates that the two monomodal features are separated but related by some similarity constraints (e.g. DCCA [196]).

Joint Feature Embedding
In deep multimodal learning, joint feature embedding is a straightforward way in which monomodal features are combined into the same representation. The fused features can be used for classification in cross-modal retrieval [63], or for sentence generation in VQA [32][36][91].

In early studies, some basic methods are employed for joint feature embedding, such as feature summation, feature concatenation [51][52][53], and element-wise inner product [148][186]; the resultant features are then fed into a multi-layer perceptron to predict similarity scores. These approaches construct a common latent space for features from different modalities but cannot preserve their similarities while fully understanding the multimodal content. Alternatively, more complicated bilinear pooling methods [101] are explored to capture higher-order interactions, and text-guided convolutional kernels are introduced [113][200]. For example, convolutional kernels are initialized under the guidance of text features; these text-guided kernels then operate on extracted image features to maintain semantic correlations [200].
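A low-rank factorized bilinear pooling step, in the spirit of the factorized pooling blocks mentioned above (e.g. MFB/MFH), can be sketched as follows; the rank, dimensions, and the signed square-root and L2-normalization steps are illustrative assumptions rather than an exact reproduction of any cited method:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedBilinearPooling(nn.Module):
    """Low-rank bilinear fusion: project both modalities, multiply element-wise, sum-pool over the rank."""
    def __init__(self, img_dim=2048, txt_dim=1024, out_dim=1000, rank=5):
        super().__init__()
        self.rank = rank
        self.U = nn.Linear(img_dim, out_dim * rank)
        self.V = nn.Linear(txt_dim, out_dim * rank)

    def forward(self, img_feat, txt_feat):
        joint = self.U(img_feat) * self.V(txt_feat)             # element-wise interaction
        joint = joint.view(-1, joint.size(1) // self.rank, self.rank).sum(dim=-1)  # sum-pool over rank
        joint = torch.sqrt(F.relu(joint)) - torch.sqrt(F.relu(-joint))             # signed square root
        return F.normalize(joint, dim=-1)                                          # L2 normalization

# usage: fuse a batch of 4 image-text feature pairs into 1000-d joint representations
fused = FactorizedBilinearPooling()(torch.randn(4, 2048), torch.randn(4, 1024))
```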
The attention mechanisms in Section 4.2.3 can also be regarded as a kind of joint feature alignment method and are widely used for common latent space learning (see Figure 4). Theoretically, these feature alignment schemes aim at finding relationships and correspondences between instances from the visual and textual modalities [20][88]. In particular, the mentioned co-attention mechanism is a case of joint feature embedding in which image and text features are usually treated symmetrically [63]. To sum up, joint feature embedding methods are basic and straightforward ways to allow learning interactions and performing inference over multimodal features. They are more suitable for situations where both image and text raw data are available during inference, and they can be expanded to situations where more than two modalities are present. However, for content understanding among inconsistently distributed features, as reported in previous work [89], there is still potential for improvement in the embedding space.

Coordinated Feature Embedding
Instead of embedding features jointly into a common space, an alternative method is to embed them separately but with some constraints on features according to their similarity (i.e. coordinated embedding).
For example, the above-noted reconstruction loss in auto-encoders can be used to constrain multimodal feature learning in the common space [68][75]. As an alternative, traditional canonical correlation analysis [109] can be used to measure and then maintain the correlations between the two kinds of features [111][196].
To explore semantic correlation in a coordinated way, generally, there are two commonly used categories: classification-based methods and verification-based methods.
For classification-based methods, when class label information is available, the projected image and text features that share the same semantic label are encouraged to fall into the same category in the common latent space [194].

These classification-based and verification-based approaches are widely used for deep multimodal learning. Although the verification-based methods overcome some limits of classification-based methods, they still face disadvantages such as negative sample mining and margin selection, which are inherited from metric learning [74]. Recently, new ideas on coordinated feature embedding have combined adversarial learning, reinforcement learning, and cycle-consistent constraints to pursue higher performance. Several representative approaches are shown in Figure 4.
Combined with adversarial learning. As discussed above, adversarial learning is used to make the embedded features discriminative for semantics but indiscriminative for modalities, thereby estimating a common feature distribution across modalities [70][71][72].

Combined with reinforcement learning. Reinforcement learning has been combined with deep multimodal learning in tasks such as image captioning and cross-modal retrieval [135][82]. Because reinforcement learning avoids exposure bias [19][18] and the non-differentiable metric issue [17][19], it is adopted to promote multimodal correlation modeling. To incorporate reinforcement learning, its basic components are defined (i.e. "agent", "environment", "action", "state" and "reward"). Usually, deep models such as CNNs or RNNs are viewed as the "agent", which interacts with an external "environment" (i.e. text features and image features), while the "action" is the prediction probabilities or words of the deep models, which influence the internal "state" of the deep models (i.e. the weights and biases). The "agent" observes a "reward" to motivate the training process. The "reward" is an evaluation value obtained by measuring the difference between the predictive distribution and the ground-truth distribution, for example, the CIDEr (Consensus-based Image Description Evaluation) score between a generated sentence and a true descriptive sentence. The "reward" plays an important role in adjusting the predictive distribution towards the ground-truth distribution.
Reinforcement learning is commonly used in generative models in which image patch features or word-level features are regarded as sequential inputs. When incorporating reinforcement learning into deep multimodal learning, it is important to define an algorithm to compute the expected gradients and to define the "reward" as a reasonable optimization goal.
For the first term, the expected gradients, the REINFORCE algorithm [204] is widely used as a policy gradient method to compute gradients and then update the "states" via back-propagation [17][203]. For the "reward", instead of measuring the distribution difference, sample similarity is more straightforward to track. As an example, visual-textual similarity is used as the "reward" after deep networks are trained under a ranking loss (e.g. a triplet loss) [19][115][82]. Note that the design of the triplet ranking loss function is diverse, such as in a bi-directional manner [115] or based on inter-modal triplet sampling [19]. For example, Wang et al. [116] devise a three-stage "reward" for three different training scenarios according to similarity scores and binary indicators.
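The REINFORCE update mentioned above can be sketched as follows for a captioning-style decoder: a sequence is sampled, a scalar reward (e.g. a CIDEr or similarity score, provided by an external `reward_fn`) is obtained, and the negative reward-weighted log-likelihood is back-propagated. The decoder interface (`init_state`, `start_token`, `step`) and `reward_fn` are assumptions for illustration, and a baseline for variance reduction is omitted for brevity:

```python
import torch

def reinforce_loss(decoder, image_feat, reward_fn, max_len=20):
    """Policy-gradient loss: L = -(reward) * sum_t log p(w_t | w_<t, image)."""
    log_probs, sampled_words = [], []
    state = decoder.init_state(image_feat)                      # assumed decoder API
    word = decoder.start_token()
    for _ in range(max_len):
        word_logits, state = decoder.step(word, state)          # categorical distribution over vocabulary
        dist = torch.distributions.Categorical(logits=word_logits)
        word = dist.sample()                                    # sample an "action" (next word)
        log_probs.append(dist.log_prob(word))
        sampled_words.append(word)
    reward = reward_fn(sampled_words)                           # e.g. CIDEr of the sampled caption
    return -reward * torch.stack(log_probs).sum()
```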
Combined with cycle-consistent constraint. Class label information or relevance information between image and text is crucial for understanding semantic content. However, this supervisory information is sometimes unavailable, in which case cycle-consistent constraints can be adopted, in the form of a forward cycle F(G(x)) ≈ x and a backward cycle G(F(y)) ≈ y. In these two functions, F(·) is a mapping process from Y to X and G(·) is the reversed process from X to Y. Cycle-consistency has been used in several tasks such as cross-modal retrieval [158][95][12][149][125], image generation [11][13][95] and visual question answering [162].
Cycle-consistency is an unsupervised learning method for exploring semantic correlation in the common latent space. To retain as many correlations as possible, the aforementioned forward and backward cycle-consistent objective functions are necessary, and a feature reconstruction loss typically acts as the cycle-consistency objective. For example, Gorti et al. [11] utilize the cross-entropy loss between generated words and the actual words as the cycle-consistency loss to optimize the text-to-image-to-text translation process. For cross-modal retrieval tasks, Li et al. [158] adopt the Euclidean distance between predictive features and reconstructed features as the cycle-consistency loss, where the two cycle loss functions interact in a coupled manner to produce reliable codes.
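The coupled cycle-consistency constraint can be sketched as below, where one mapping network translates image features to the text feature space and the other maps back; the L1 reconstruction penalty, the network names, and the linear mappings standing in for deeper nets are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(g_img2txt, f_txt2img, img_feat, txt_feat):
    """Forward cycle: img -> G -> F -> img; backward cycle: txt -> F -> G -> txt."""
    forward_cycle = F.l1_loss(f_txt2img(g_img2txt(img_feat)), img_feat)
    backward_cycle = F.l1_loss(g_img2txt(f_txt2img(txt_feat)), txt_feat)
    return forward_cycle + backward_cycle

# usage with two illustrative mapping networks (small linear maps standing in for deeper nets)
g_img2txt = torch.nn.Linear(512, 512)   # image-feature space -> text-feature space (G)
f_txt2img = torch.nn.Linear(512, 512)   # text-feature space -> image-feature space (F)
loss = cycle_consistency_loss(g_img2txt, f_txt2img, torch.randn(8, 512), torch.randn(8, 512))
```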
Currently, the applications of cycle-consistent constraints in deep multimodal learning can be categorized as structure-oriented and task-oriented. The former group focuses on making several components of a whole network into a closed loop in which the output of each component is used as the input of another component. For structure-oriented methods, the cycle-consistent idea is combined with popular deep networks such as GANs; in these methods, image features are projected as "text features" and then reconstructed back to themselves. Currently, the combination with GANs is a popular option since paired correspondences of modalities can be learned in the absence of a certain modality (i.e. via generation). For example, Wu et al. [12] plug a cycle-consistent constraint into the feature projection between image and text, where the inverse feature-learning process is constrained using the least absolute deviation; the whole process learns a couple of generative hash functions through cycle-consistent adversarial learning. Regarding the limit of using a single cycle constraint, Li et al. [158] devise an outer-cycle (for feature representation) and an inner-cycle (for hash code learning) constraint combined with GANs for cross-modal retrieval. Thereby, the number of objects constrained by the cycle-consistency loss increases. Moreover, in their method, the discriminator should distinguish whether an input feature is original (viewed as True) or generated (viewed as False).
For task-oriented methods, cycle-consistency is adopted in dual tasks: an inverse process (task A to task B and back to task A) is used to improve the results. When a whole network performs both tasks well, it indicates that the learned features have captured the semantic correlations of the two modalities. For example, Li et al. [162] combine visual question answering (VQA) and visual question generation (VQG), in which the predicted answer becomes more accurate by combining image content to predict the question; in the end, the complementary relations between questions and answers lead to performance gains. For text-image translation, a captioning network is used to produce a caption corresponding to an image generated from a sentence with GANs [11]. The distances between the ground-truth sentences and the generated captions are exploited to further improve the network, and the inverse translation is beneficial for understanding the text context and the synthesized images. To sum up, there are still open questions in task-oriented ideas, such as the model parameter sharing scheme, and these implicit problems make the models more difficult to train and prone to vanishing gradients. Nevertheless, the task-oriented cycle-consistent constraint unifies multi-task applications into a whole framework and attracts increasing research attention.

Conclusion and Future Directions
In this survey, we have conducted a review of recent ideas and trends in deep multimodal learning (image and text), including popular structures and algorithms. We analyzed two major challenges in deep multimodal learning, namely the heterogeneity gap and the semantic gap.

Figure 8 The achieved progress of cross-modal retrieval on the Flickr30K [205] and the MS-COCO [41] datasets.
One promising direction is to learn a representation that is aligned across three modalities: sound, image, and text, where the network is only trained with "image + text" and "image + sound" pairs. He et al. [209] construct a new benchmark for cross-media retrieval in which image, text, video, and audio are included; it is the first benchmark with four media types for fine-grained cross-media retrieval. However, this direction is still far from satisfactory.
Deep neural networks, including convolutional neural networks and recurrent neural networks, have made monomodal feature extraction and multimodal feature learning end-to-end trainable. The representations from multimodal data can be learned automatically and effectively, without requiring expert knowledge in a certain field, which makes the process of understanding multimodal content more intelligent.

Figure 9 The achieved progress of cross-modal hash retrieval on the MIRFlickr25K [206] and the NUS-WIDE [207] datasets. Hashing methods have higher retrieval efficiency using binary hash codes.
Another promising direction lies in examining scalability and heterogeneity [178]. Finally, generation-based tasks such as image generation and image captioning are effective for unsupervised learning, since numerous labeled training data can be generated from the deep networks. Combined with reinforcement learning, the image generation process becomes more controllable; for example, some fine-grained attributes including texture, shape and color can be specified during deep network training. Once it understands the content between modalities, the deep network, like an agent, will synthesize photo-realistic images, which can be used in other applications.

Table 4 Performance of visual question answering on the VQA 1.0 dataset [211]