Integrating Information Theory and Adversarial Learning for Cross-modal Retrieval

Accurately matching visual and textual data in cross-modal retrieval has been widely studied in the multimedia community. To address the challenges posed by the heterogeneity gap and the semantic gap, we propose integrating Shannon information theory and adversarial learning. In terms of the heterogeneity gap, we integrate modality classification and information entropy maximization adversarially. For this purpose, a modality classifier (as a discriminator) is built to distinguish the text and image modalities according to their different statistical properties. This discriminator uses its output probabilities to compute Shannon information entropy, which measures the uncertainty of the modality classification it performs. Moreover, feature encoders (as a generator) project uni-modal features into a commonly shared space and attempt to fool the discriminator by maximizing its output information entropy. Thus, maximizing information entropy gradually reduces the distribution discrepancy of cross-modal features, thereby achieving a domain confusion state in which the discriminator cannot classify the two modalities confidently. To reduce the semantic gap, Kullback-Leibler (KL) divergence and a bi-directional triplet loss are used to associate the intra- and inter-modality similarity between features in the shared space. Furthermore, a regularization term based on KL-divergence with temperature scaling is used to calibrate the biased label classifier caused by the data imbalance issue. Extensive experiments with four deep models on four benchmarks demonstrate the effectiveness of the proposed approach.


Introduction
Semantic information that helps us understand the world usually comes from different modalities such as video, audio, and text. Namely, the same concept can be presented in different ways. Therefore, it is possible to search for semantically-relevant samples (e.g., images) in one modality given a query item from another modality (e.g., text). With the increasing amount of multimodal data available, more efficient and accurate retrieval methods are still in demand in the multimedia community.
Deep learning methods can effectively embed features from different modalities into a commonly shared space, and then measure the similarity between these embedded features. To date, the "heterogeneity gap" [1] and the "semantic gap" [2] remain challenges to be addressed for cross-modal retrieval. Since the data in different modalities are described by different statistical properties, the heterogeneity gap characterizes the difference between feature vectors from different modalities that have similar semantics but are distributed in different spaces. Similarities between these feature vectors are not well associated, so the vectors are not directly comparable, leading to inconsistent distributions. The semantic gap characterizes the difference between the high-level user perception of the data and the lower-level representations of the data by the computer (i.e., pixels or symbols). To achieve better retrieval performance, it is essential to address these gaps by associating the similarity between cross-modal features in the shared space.

* Corresponding author. E-mail address: m.s.k.lew@liacs.leidenuniv.nl (M.S. Lew).
To capture the semantic correlations between cross-modal features, many approaches have been proposed in recent years. Some approaches focus on designing effective structures from a deep networks perspective. For instance, graph convolutional networks are employed to model the dependencies within visual or textual data [3]. Other approaches focus on designing similarity constraint functions from a deep features perspective. For example, bilinear pooling-based methods are applied to align image and text features and thus accurately capture inter-modality semantic correlations. In other examples, coordinated representation learning methods [4], such as ranking loss [5,6] and cycle-consistency loss [7], are widely used to preserve similarity between cross-modal features. These constraint functions mainly aim at reducing the semantic gap by focusing on the similarity between two-tuple or three-tuple samples. However, they might not directly mitigate the heterogeneity gap caused by the inconsistent feature distributions in the different spaces.

Motivations
Considering the limitations of similarity constraint functions, we propose a new method to perform cross-modal retrieval from two aspects. First, we reduce the heterogeneity gap by integrating Shannon information theory [8] with adversarial learning, in order to construct a better embedding space for cross-modal representation learning. Second, we combine two loss functions, including Kullback-Leibler divergence loss and bi-directional triplet loss, to preserve semantic similarity during the feature embedding procedure, thereby reducing the semantic gap.
To do this, we combine the information entropy predictor and the modality classifier in an adversarial manner. Information entropy maximization and modality classification are two processes trained with competitive goals. Since an image is a 3-channel RGB array while text is often symbolic, uni-modal features extracted from image or text data are characterized by different statistical properties, which can be used to distinguish the original modalities these features belong to. As a result, when these features in the shared space are correctly classified into their original modalities with high confidence, their feature distributions convey less information content, and the modality classifier performs modality classification with lower uncertainty. In contrast, when cross-modal features become modality-invariant and show their commonalities, these features cannot be classified into the modality they originally belong to. In this case, the feature distributions in the shared space convey more information content and higher modality uncertainty.
According to Shannon's information theory [8], we can measure the modality uncertainty in the shared space by computing information entropy. This basic proportional relation provides the principle to mitigate the heterogeneity gap. For this purpose, we integrate modality uncertainty measurement into cross-modal representation learning. As shown in Fig. 1, a modality classifier (in the following we call it the discriminator) is devised to classify the image and text modalities, rather than perform a "true/false" binary classification. This discriminator also provides its output probabilities to calculate the information entropy of the cross-modal feature distributions. At the start of training, the discriminator can classify image and text modalities with high confidence due to their different statistical properties. In contrast, the feature encoders (in the following we call them the generator) project features into a shared space and attempt to fool the discriminator into an incorrect modality classification until the features in the shared space are heavily fused into a confusion state, maximizing the modality uncertainty.
On the basis of this heavily-fused state, we further use similarity constraints on the feature projector to reduce the semantic gap. Specifically, Kullback-Leibler (KL) divergence loss is used to preserve semantic correlations between image and text features by using instance labels as supervisory information. More importantly, we consider the issue of data imbalance and introduce a regularization term based on KL-divergence with temperature scaling to calibrate the biased label classifier. Afterwards, we adopt the commonly used bi-directional triplet loss and instance label classification loss ( i.e. categorical cross-entropy loss) to achieve good retrieval performance.

Our contributions
Our contributions can be summarized as three-fold:

Fig. 1. Conceptual diagram of combining information theory and adversarial learning for cross-modal retrieval. The features Z_i ∈ R^d and Z_t ∈ R^d with dimension d for image-text pairs are extracted using deep neural networks. Shape indicates modality and color denotes pair-wise similarity information. The modality classifier aims to classify the text and image modalities, thereby minimizing the uncertainty of the modality classification it performs (measured by Shannon information entropy). Conversely, the feature encoders project uni-modal features into a commonly shared space and attempt to fool this classifier by maximizing its uncertainty of modality classification, which is computed by the information entropy predictor. The modality classifier and the information entropy predictor are combined in an adversarial manner to reduce the heterogeneity gap. If the classifier's uncertainty is maximized, the features Z_i and Z_t are intertwined into a domain confusion state where this classifier cannot confidently determine which modality each input feature (Z_i or Z_t) belongs to. Namely, this classifier becomes least-confident about its classification results. This process of adversarial combining is introduced in Section 3.2 and Section 4.1. Furthermore, the feature projector aims to associate the semantic similarity by using pair-wise objective functions such as the bi-directional triplet loss.
First, we combine information theory and adversarial learning into an end-to-end framework. Our work is the first to explore information theory in reducing the heterogeneity gap for cross-modal retrieval. This method is beneficial for constructing a shared space for further learning commonalities between cross-modal features, which can be used for tasks in other modalities, such as video-text matching.
Second, we introduce a regularization term based on KL-divergence with temperature scaling to address the issue of data imbalance, which calibrates biased label classifier training and guarantees the accuracy of instance label classification. To the best of our knowledge, this is the first use of such a term to address imbalance issues on retrieval datasets.
Third, we use a bi-directional triplet loss to constrain intra-modality semantics. Aside from these intra-modality constraints, we also consider optimizing inter-modality similarity. We use the instance labels to construct a supervisory matrix. This matrix regularizes the semantic similarity between the projected image (or text) features and text (or image) features by minimizing KL-divergence. This inter-modality constraint is more effective since it focuses on all the projected cross-modal feature distributions in a mini-batch.
The rest of the paper is organized as follows. Related work is reviewed in Section 2. We give definitions and a theoretical analysis for the proposed method in Section 3.2. We present the specific components for implementation, including network structures, objective functions, and optimization, in Section 4. We test the proposed method on four datasets, and the results are reported in Section 5. Finally, the conclusions are given in Section 6.

Cross-modal representation learning and matching
Preserving the similarity between cross-modal features should consider two aspects: inter-modality and intra-modality. Supervision information ( e.g. class label or instance label), if available, is beneficial for learning features from these two aspects. Preserving feature similarity can be realized by using methods such as joint representation learning and coordinated representation learning [4] . Joint representation learning methods project the uni-modal features into the shared space using straightforward strategies such as feature concatenation, summation, and inner product. Subsequently, more complicated bilinear pooling methods, such as multimodal compact bilinear (MCB) pooling, are proposed to explore the semantic correlations of cross-modal features. To regularize the joint representations, deep networks are commonly trained by using objective functions, such as regression-based loss [9,10] .
Coordinated representation learning methods process image and text features separately but impose certain similarity constraints on them [4]. In general, these constraints can be categorized into classification-based and verification-based methods in supervised scenarios. In terms of classification-based methods, both image and text features are used to make a label classification by using a categorical cross-entropy loss function. Because a paired image-text input has the same class label, their features can be associated in the shared space. However, classification-based methods cannot preserve the similarity between inter-modality features well because the similarity between image and text features is not directly regularized.
Verification-based methods, based on metric learning, are proposed to further optimize inter-modality feature learning. Given a similar (or dissimilar) image-text pair, their corresponding features should be verified as similar (or dissimilar). Therefore, the goal of deep networks is to push features of similar pairs closer, while keeping features of dissimilar pairs further apart. Verification-based methods include pair-wise constraints and triplet constraints, which focus on inferring the matching scores of image-text feature pairs [10].
Triplet constraints optimize the distance between positive pairs to be smaller than the distance between negative pairs by a margin. They can capture both intra-modality and inter-modality semantic correlations. For example, a bi-directional triplet loss has been employed to optimize image-to-text and text-to-image ranking [6]. Although triplet constraints are widely used for cross-modal retrieval, the difficulties lie in the mining strategy for negative pairs and the selection of a margin value, which are usually task-specific and empirically selected.
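To make the margin-based ranking idea concrete, the following is a minimal NumPy sketch of a bi-directional triplet loss over a mini-batch. The function name, the margin value, and the sum-over-all-negatives mining strategy are illustrative assumptions, not the exact choices of [6].

```python
import numpy as np

def bidirectional_triplet_loss(img, txt, margin=0.2):
    """Hinge-based bi-directional triplet loss over a mini-batch.

    img, txt: (N, d) L2-normalized features; row j of each is a matched pair.
    All non-matching items in the batch serve as negatives (summed over).
    """
    sim = img @ txt.T                      # (N, N) cosine similarities
    pos = np.diag(sim)                     # matched-pair similarities
    # image-to-text ranking: each image against all non-matching texts
    cost_i2t = np.maximum(0.0, margin + sim - pos[:, None])
    # text-to-image ranking: each text against all non-matching images
    cost_t2i = np.maximum(0.0, margin + sim - pos[None, :])
    n = sim.shape[0]
    mask = 1.0 - np.eye(n)                 # exclude the positive pair itself
    return float(((cost_i2t + cost_t2i) * mask).sum() / n)
```

With hard-negative mining, the sum over negatives would be replaced by a maximum per row, which is the main design choice the text flags as task-specific.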

Adversarial learning for cross-modal retrieval
The afore-mentioned joint and coordinated representation learning approaches focus on two-tuple or three-tuple samples, which may be insufficient for achieving overall good retrieval performance. Adversarial learning, as an alternative method, has shown its powerful capability for modeling feature distributions and learning discriminative representations between modalities when deep networks are trained with competitive objective functions [6,11] .
Recent progress in using adversarial learning for cross-modal retrieval can be categorized as feature-level and loss function-level discriminative models.
From a feature-level perspective, it is possible to preserve semantic consistency by performing a min-max game between inter-modality feature pairs [6]. A straightforward way is to build a discriminator that makes a "true/false" classification between image features (regarded as true), corresponding matched text features (regarded as fake), and unmatched image features from other categories (also regarded as fake) [6]. Alternatively, a cross-modal auto-encoder can be combined to generate features for the other modality. For example, a generator attempts to generate image features from textual data and then regards them as true, while for a discriminator, image features extracted from original images and those from the generated "images" are labeled as true and fake, respectively. The adversarial training explores the semantic correlations of cross-modal representations. Intra-modality discrimination can also be considered in cross-modal adversarial learning, forcing the generator to learn more discriminative features. In this case, the discriminator tends to discriminate the generated features from its original input.

From a loss function-level perspective, instead of making a binary classification (i.e., true or fake), adversarial learning is designed to train two groups of loss functions or two processes with competitive goals. This idea is applied in recent work for cross-modal retrieval [6,11]. To be specific, a feature projector is trained to generate modality-invariant representations in the shared space, while a modality classifier is constructed to classify the generated representations into two modalities. Similarly, in this paper, we combine two networks and train them with two competitive goals.

Information-theoretical feature learning
As mentioned before, feature vectors from different modalities are distributed in different spaces, resulting in the heterogeneity gap, which affects the accuracy of cross-modal retrieval. Therefore, it becomes essential to reduce feature distribution discrepancies and thereby reduce the heterogeneity gap. The solution is to measure and then minimize the distribution discrepancy. For example, the distribution disparity of cross-modal features can be characterized by Maximum Mean Discrepancy (MMD), which is a differentiable distance metric between distributions. However, MMD suffers from sensitivity to the kernel bandwidth and from weak gradients during training.
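For concreteness, a biased empirical estimate of squared MMD with an RBF kernel can be sketched as below; the `bandwidth` parameter is exactly the sensitivity noted above. This is an illustration, not part of the proposed method.

```python
import numpy as np

def mmd_rbf(x, y, bandwidth=1.0):
    """Biased empirical estimate of squared MMD with an RBF kernel.

    x, y: (N, d) samples from two feature distributions. The estimate is
    ~0 when the distributions coincide and grows with their discrepancy;
    results depend strongly on the chosen `bandwidth`.
    """
    def k(a, b):
        # pairwise squared Euclidean distances -> Gaussian kernel values
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bandwidth ** 2))
    return float(k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean())
```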
Information-theoretic methods are used to measure the differences between feature distributions and learn better cross-modal features. As an example, the cross-entropy loss function is widely used to estimate the errors between inference probabilities and ground-truth labels, and the gradients are computed according to these errors. Once the gradients are computed, deep networks can further update their parameters via the back-propagation algorithm. KL-divergence (also called relative entropy) is another popular criterion to characterize the difference between two probability distributions. Minimizing this difference is beneficial for retaining the semantic similarity between features. For example, Zhang et al. [12] employ the KL-divergence to measure the similarity between projected features and supervisory information.
Recently, Shannon information entropy [8] has been used for tasks such as semantic segmentation [13] and cross-modal hash retrieval [14]. These studies indicate that Shannon entropy can be used for multimodal representation learning by estimating uncertainty [8]. Take generative adversarial networks as an example: if the generator makes image features and text features close and minimizes their discrepancy, then the discriminator will become less certain or under-confident, i.e., have high information entropy when predicting which modality each feature comes from. We applied this principle in our previous work [14] to design an objective function that maximizes the domain uncertainty over cross-modal hash codes in a commonly shared space. Deep networks trained using information entropy construct a domain confusion state in which the heterogeneity gap can be effectively reduced. On the basis of this state, other loss functions, such as ranking loss, can be further applied to regularize feature similarity.

Problem formulation
We consider a supervised scenario for cross-modal retrieval. Denote the input images as X_i and the corresponding descriptive sentences as X_t. Each image and its descriptive sentences have the same instance label Y. Therefore, we can organize an input pair (x_i, x_t, y) to train a deep network. To be specific, feature encoders E_1(·; θ_E1) and E_2(·; θ_E2) extract image and text features, respectively, and then further embed these uni-modal features into a shared space by using non-shared sub-networks; the embedded features are denoted Z_i and Z_t. Note that the parameters of the non-shared sub-networks for uni-modal image and text feature embedding are included in θ_E1 and θ_E2, respectively. The goal is to train a deep network to make the embedded features Z_i and Z_t modality-invariant and semantically discriminative, improving the retrieval accuracy.

Fig. 2. (b): Relationship between output probabilities and information content: the more uncertain the shared space, the more information content it conveys. (c): Relationship between modality uncertainty and output probabilities for each modality. When the probabilities predicted for the two modalities are identical, the shared space is intertwined into a domain confusion state (i.e., most uncertain). If one modality is identified with a higher probability (closer to 1) and the other with a lower probability (closer to 0), the domain confusion state is not achieved.
As shown in Fig. 1, the networks E_1, E_2, and the information entropy predictor act as a generator, while the modality classifier acts as a discriminator. The training of the generator and the discriminator is formulated as a min-max game to mitigate the heterogeneity gap. The feature projector attempts to preserve feature similarity under several constraints, which are introduced in Sections 4.2, 4.3, and 4.4.

Information entropy and modality uncertainty
Image features can be extracted from convolutional neural networks, while text features can be extracted from sequential networks. These feature vectors from different modalities have similar semantics but are distributed in different spaces. Their similarities in the different spaces are not well associated, so these feature vectors are not directly comparable. Hence, it is necessary to further embed them into a shared space (i.e., Z_i and Z_t in Fig. 1). Uni-modal features are characterized by different statistical properties. Therefore, as shown in Fig. 2 (a), a feature in the shared space may be identified as coming from the visual modality with a higher probability P_i (more certain classification) than from the textual modality with a lower probability P_t = 1 − P_i (less certain classification). In other words, these cross-modal features are not intertwined heavily, and as a result, the domain confusion state is not achieved. Conversely, if it cannot be distinguished which modality a given feature originally comes from, the feature has identical probabilities (P_i = P_t) of coming from each modality. In this case, the shared space has the highest uncertainty, and the cross-modal features are intertwined into a domain confusion state, which corresponds to the highest information content. We use information entropy [8] to measure the uncertainty of the shared space. Fig. 2 (b) illustrates that equal probabilities for the two modalities lead to the highest Shannon information entropy and thus the highest information content.
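The binary-entropy relation described above can be verified in a few lines of NumPy; `modality_entropy` is a hypothetical helper for illustration, showing that the entropy peaks at one bit exactly when P_i = P_t = 0.5 and vanishes as one probability approaches 1.

```python
import numpy as np

def modality_entropy(p_i):
    """Shannon entropy (in bits) of the discriminator's two-way output,
    with P_t = 1 - P_i as in the text."""
    p = np.array([p_i, 1.0 - p_i])
    p = np.clip(p, 1e-12, 1.0)             # guard against log(0)
    return float(-(p * np.log2(p)).sum())
```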
Modality uncertainty refers to the unreliability of the classification by which the discriminator assigns image features and text features to the two modalities. It is proportional to Shannon information entropy [8], as shown in Fig. 2 (c). Based on this observation [14], we design the discriminator to measure its output modality uncertainty by using information entropy as a criterion. Maximizing information entropy means that the discriminator becomes least-confident in classifying the original modality of image and text features, resulting in the greatest reduction of the heterogeneity gap.

Adversarial learning and information entropy
To make cross-modal features modality-invariant, we devise a generator and a discriminator, as shown in Fig. 1. The discriminator performs modality classification to identify the visual modality and the textual modality based on cross-modal features. Following [6], we define the modality label as Y*_c for these two modalities (for the visual modality * = i and for the textual modality * = t). Using the output probabilities of the discriminator, we can compute a cross-entropy loss to realize modality classification [6]. Once the network converges under the constraint of this loss function, the visual and textual modalities are clearly identified and classified, thereby minimizing the modality uncertainty.

Fig. 3. KL-divergence for cross-modal feature projection, which considers all features Z_i and Z_t in the shared space. Each paired image feature and text feature share the same instance label, indicated by the same color. The cross-modal feature projection module is critical to explore the similarity between image features and normalized text features. The projection process is formulated in Eqs. 2 and 3.

Fig. 4. The implementation of integrating the information entropy predictor and the modality classifier in Fig. 1 into a unified discriminator. Together with the feature extractors, the whole framework takes the form of a generative adversarial network. For clarity, we omit the feature projector mentioned in Fig. 1, which includes the label classification loss, bi-directional triplet loss, and KL-divergence loss.
Conversely, the generator is designed to maximize the modality uncertainty over the cross-modal feature distributions. To achieve this, the generator learns modality-invariant features to fool the discriminator, maximizing the uncertainty of modality classification the discriminator performs. If the modality uncertainty is maximized, the discriminator is most likely to make an incorrect modality classification and be least-confident about its classification results. In this case, cross-modal features are intertwined into a domain confusion state and become indistinguishable.
To this end, we explore the ways to integrate information entropy and adversarial learning into an end-to-end network, which is introduced in Section 4.1 . For better understanding, we also explore another combining paradigm in the Experimental Section.

KL-Divergence for cross-modal feature projection
To reduce the semantic gap, we use KL-divergence to characterize the differences between the projected cross-modal features (Z_i and Z_t in Fig. 1) and a supervisory matrix computed from their instance labels (see Eq. 9). In this way, the semantic correlations among cross-modal features can be preserved. We illustrate this process in Fig. 3. It is important to note that when using KL-divergence to preserve semantic correlations of cross-modal features, all positive and negative pairs in a mini-batch are considered. As for the supervisory matrix f(Y_l, Y_l), it is computed by matrix multiplication and is normalized to the range from 0 to 1.
We argue that different operations to realize f(Z_i, Z_t) affect similarity preserving. Directly, the operation f(·) can be an inner product on the cross-modal features Z_i and Z_t. However, using the inner product has some implicit drawbacks. First, when multiplying one image feature vector with all text feature vectors, the results of the inner product are not optimally comparable due to the non-normalized text features, and vice versa. Second, the angles between each image feature vector and each text feature vector, as well as their whole feature distributions, change while training the deep network, which makes it problematic for an inner product to measure feature similarity.
To tackle the above limitations, we adopt a cross-modal feature projection to characterize the similarity between features. The idea is related to the work in [12]. Cross-modal feature projection is based on the same distribution and operates on normalized features. For instance, an image feature vector z^i_j ∈ Z_i can be projected onto the distribution of a text feature vector z^t_k ∈ Z_t; each projected feature vector from image to text (termed "i → t") can then be formulated as:

ẑ^{i→t}_{j,k} = (z^i_j · z̄^t_k) z̄^t_k,  with z̄^t_k = z^t_k / ||z^t_k||,   (1)

where "i" and "t" represent the visual and textual modality, respectively, "j" and "k" represent the index of each image feature and text feature in the shared space, respectively, and z̄^t_k denotes the normalized text feature. Therefore, the length of ẑ^{i→t}_{j,k} is s_{j,k} = z^i_j · z̄^t_k, which denotes the similarity between image feature z^i_j and text feature z^t_k. When associating each image feature z^i_j with all text features Z_t, we obtain all the different lengths s_{j,1}, ..., s_{j,N}. Therefore, when projecting all image features onto all text features Z_t, we get a similarity matrix A^{i→t}, formulated as:

A^{i→t}(Z_i, Z_t) = [z^i_j · z̄^t_k], j, k = 1, ..., N.   (2)

Similarly, if projecting all text features onto all image features Z_i, we obtain another similarity matrix A^{t→i}:

A^{t→i}(Z_t, Z_i) = [z^t_k · z̄^i_j], k, j = 1, ..., N.   (3)

In the above two equations, Z_i and Z_t represent the cross-modal features from the two modalities, and N is the number of samples in a mini-batch. These two similarity matrices are normalized by a softmax function. Afterwards, we use KL-divergence to characterize the difference between the normalized matrices and the supervisory matrix f(Y_l, Y_l). The specific objective function is introduced in Section 4.2.
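The projection similarities and their softmax normalization can be sketched in NumPy as follows; `projection_similarity` is our illustrative name, and the row-wise softmax is one plausible reading of the normalization described above.

```python
import numpy as np

def projection_similarity(Zi, Zt):
    """Similarity matrices from cross-modal feature projection.

    Each image feature is projected onto every L2-normalized text
    feature; the projection length (dot product with the normalized
    vector) is the similarity entry. Rows are softmax-normalized.
    """
    def l2norm(Z):
        return Z / np.linalg.norm(Z, axis=1, keepdims=True)
    def softmax_rows(A):
        e = np.exp(A - A.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)
    A_i2t = softmax_rows(Zi @ l2norm(Zt).T)   # project images onto texts
    A_t2i = softmax_rows(Zt @ l2norm(Zi).T)   # project texts onto images
    return A_i2t, A_t2i
```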

Implementation and optimization
We introduce the implementation and optimization of our proposed approach in this section. We employ four convolutional neural networks, such as ResNet-152 [15] and MobileNet [16], to obtain image features, and a Bi-directional LSTM (Bi-LSTM) [17] to extract text features. All the extracted image and text features are uni-modal. Later, we borrow the protocols of the non-shared encoding sub-networks (fully-connected layers) in [12] to get the cross-modal features Z_i and Z_t.
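A rough sketch of such non-shared encoding sub-networks is given below. The dimensionalities (2048 for ResNet-152 pooled features, 1024 for the Bi-LSTM output, 512 for the shared space) and the single ReLU layer per modality are assumptions for illustration, not the exact protocol of [12].

```python
import numpy as np

rng = np.random.default_rng(0)

def fc(in_dim, out_dim):
    """Weights of one fully-connected layer (bias omitted for brevity)."""
    return rng.normal(0.0, 0.02, size=(in_dim, out_dim))

# Non-shared sub-networks: one per modality, mapping into the shared space
W_img = fc(2048, 512)   # e.g. ResNet-152 pooled features -> shared space
W_txt = fc(1024, 512)   # e.g. Bi-LSTM sentence features -> shared space

def encode(x, W):
    """Embed uni-modal features into the shared space with a ReLU layer."""
    return np.maximum(x @ W, 0.0)
```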
Once the cross-modal features are obtained, we use the proposed algorithm to train the networks based on the above theoretical analysis. The algorithm combines information entropy and adversarial learning to mitigate the heterogeneity gap, together with loss function terms (i.e., KL-divergence loss, categorical cross-entropy loss, and bi-directional triplet loss) to preserve semantic correlations between cross-modal features.

Combining information theory with adversarial learning
We combine the information entropy predictor and the modality classifier in Fig. 1 into a unified sub-network, as shown in Fig. 4. In this paradigm, the discriminator D with parameters θ_D performs modality classification and computes the Shannon information entropy. The backbone nets E_1 and E_2 for feature extraction act as the generator G. The whole structure forms a generative adversarial network. The information entropy computed from the discriminator back-propagates to the feature encoders. Specifically, when the discriminator is fixed with parameters θ_D, the information entropy of its output probabilities P_D is H(P_D) = −Σ_m p_m log p_m, and we define the loss L_s = −H(P_D). The negative information entropy L_s is label-free during training, and it regularizes the whole feature distribution to be modality-invariant.
The discriminator consists of several fully-connected layers. The last layer, with two neurons, yields probabilities corresponding to the two modalities. This discriminator classifies whether the input features Z_i and Z_t come from the visual or the textual modality, given the pre-defined modality label Y*_c. In contrast, the generator (i.e., E_1 and E_2) aims at learning modality-invariant features to fool the discriminator into an incorrect modality classification, so that the generator gradually maximizes the output information entropy of the discriminator. Therefore, the learning process of the discriminator affects that of the generator in an indirect way. The objective functions are calculated using the output probabilities P_D. For the discriminator, the cross-entropy loss L_c of modality classification is minimized to clearly classify image and text features into the two modalities during training. For the generator E_1 and E_2, it is expected to maximize the information entropy H(P_D), and subsequently the modality uncertainty, by minimizing L_s = −H(P_D). Note that the gradients calculated from the term L_s are only used to optimize the parameters θ_E1 and θ_E2 of the generator, whereas the gradients from the term L_c only optimize the parameters θ_D of the discriminator, as shown in Fig. 4. Minimizing the losses L_c and L_s in iterative training reduces the heterogeneity gap. The optimization method is straightforward, even though the gradients calculated from L_c do not directly affect the parameters of the feature encoders E_1 and E_2: the output probabilities of the discriminator change when its parameters are updated, which affects the Shannon information entropy and, in the end, the output features from E_1 and E_2.
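The two competing objectives can be written compactly as below. This is a hedged NumPy sketch operating directly on the discriminator's output probabilities; `modality_ce_loss` stands for L_c and `neg_entropy_loss` for L_s (natural-log entropy), while the networks and gradient routing themselves are omitted.

```python
import numpy as np

def modality_ce_loss(p, is_image):
    """L_c: cross-entropy for modality classification.

    p: (N, 2) discriminator output probabilities (image, text).
    Gradients of this term would update only the discriminator
    parameters theta_D.
    """
    y = np.zeros_like(p)
    y[:, 0 if is_image else 1] = 1.0       # one-hot modality label
    return float(-(y * np.log(p + 1e-12)).sum(axis=1).mean())

def neg_entropy_loss(p):
    """L_s: negative Shannon entropy of the discriminator's outputs.

    Minimizing L_s maximizes H(P_D), i.e. the modality uncertainty;
    its gradients would update only the encoder parameters
    theta_E1 and theta_E2.
    """
    return float((p * np.log(p + 1e-12)).sum(axis=1).mean())
```

At the domain confusion state, p ≈ [0.5, 0.5] for every feature and L_s reaches its minimum of −log 2.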

KL-Divergence for similarity preserving
We also compute KL-divergence directly across Z_i and Z_t to further preserve semantic similarity. This KL-divergence focuses on the projections of image and text features and is computed between the normalized similarity matrices and the supervisory matrix f(Y_l, Y_l^⊤), where the superscript ⊤ denotes matrix transpose. L_kl focuses on constraining the whole feature distributions and is complementary to the following bi-directional triplet loss function. We have introduced the process of cross-modal feature projection in Section 3.3. Given the similarity matrices (i.e., A^{i→t}(Z_i, Z_t) and A^{t→i}(Z_t, Z_i)), we use the softmax function to normalize these matrices in Eq. 6 and Eq. 7. The supervisory matrix is normalized after matrix multiplication as in Eq. 8. Similar to [12], since we project features from the visual (or textual) modality onto the textual (or visual) modality, the KL-divergence regularizes the semantics in bi-directional feature projection, which is formulated in Eq. 9 as:

L_kl = KL(A^{i→t} ∥ f(Y_l, Y_l^⊤) + ε) + KL(A^{t→i} ∥ f(Y_l, Y_l^⊤)^⊤ + ε),   (9)

where ε is a small constant to avoid division by zero. The loss L_kl is the KL-divergence between the projections of image-text features and their supervisory matrix. This loss is minimized, and the gradients computed from L_kl are used to update the parameters θ_E1 and θ_E2 of the generator, so that the semantics of image and text features can be associated.
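A minimal NumPy sketch of this bi-directional KL term is given below. The function name is ours, the supervisory matrix S is assumed to be already row-normalized (the exact normalization of Eq. 8 is not reproduced), and ε plays the same division-guard role as in the text.

```python
import numpy as np

def kl_projection_loss(A_i2t, A_t2i, S, eps=1e-8):
    """Sketch of L_kl: KL-divergence between the row-normalized
    projection similarity matrices and the supervisory matrix S
    built from instance labels, in both projection directions.

    A_i2t, A_t2i: (N, N) softmax-normalized similarity matrices.
    S: (N, N) supervisory matrix with rows summing to 1.
    """
    def kl(A, B):
        # elementwise KL over each row, averaged over the mini-batch
        return (A * np.log((A + eps) / (B + eps))).sum(axis=1).mean()
    # the transpose mirrors the text-onto-image projection direction
    return float(kl(A_i2t, S) + kl(A_t2i, S.T))
```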

Categorical cross-entropy loss
Label classification is a popular approach for cross-modal feature learning [12] . We use the instance labels provided with the datasets for label classification. For the categorical cross-entropy loss, we apply the norm-softmax strategy and feature projection of [12] to learn more discriminative cross-modal features. On the one hand, the normalized parameters θ P in the label classifier encourage cross-modal features to distribute more compactly, so that the softmax classifier performs label classification correctly. On the other hand, the projection between image and text features strengthens their similarity association and benefits label classification [12] . Feature projection is computed using Eq. 1 . Given the instance label y l , the categorical cross-entropy loss L ce is defined by Eq. 10 (we omit the bias term for simplicity) and is minimized during training, where N is the number of image-text pairs in a mini-batch, W y l, j and W j represent the y l, j -th and the j-th columns of the weights W in the classifier parameters θ P according to [12] , and ˆ z i → t j and ˆ z t→ i j are the image-to-text and text-to-image projections, respectively, obtained using Eq. 1 .
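A hedged sketch of the norm-softmax cross-entropy over projected features; the scalar-projection form of Eq. 1 used below is an assumption for illustration, as is every variable name.

```python
import numpy as np

def norm_softmax_ce(feats, W, labels, eps=1e-12):
    """Cross-entropy with column-normalized classifier weights (norm-softmax)."""
    W_hat = W / (np.linalg.norm(W, axis=0, keepdims=True) + eps)
    logits = feats @ W_hat
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return float(-np.log(p[np.arange(len(labels)), labels] + eps).mean())

def project(a, b, eps=1e-12):
    """Project each row of a onto the direction of the paired row of b."""
    b_hat = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    return (a * b_hat).sum(axis=1, keepdims=True) * b_hat

rng = np.random.default_rng(1)
N, d, C = 4, 8, 3
W = rng.normal(size=(d, C))          # classifier weights (theta_P)
labels = np.array([0, 1, 2, 0])      # instance labels
z_i = rng.normal(size=(N, d))        # image features
z_t = rng.normal(size=(N, d))        # text features

# L_ce sums the loss over both projection directions.
L_ce = norm_softmax_ce(project(z_i, z_t), W, labels) \
     + norm_softmax_ce(project(z_t, z_i), W, labels)
```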

KL-Divergence for data imbalance
Label classification using the categorical cross-entropy loss can preserve semantic correlations between cross-modal features. However, we argue that there is also a data imbalance issue when training the label classifier, because each image is described by more than one sentence ( e.g. each image has five description sentences in the Flickr30K dataset). As a result, the learned label classifier becomes biased towards text features.
The data imbalance issue in cross-modal retrieval can be mitigated by constructing an augmented semantic space to realign features [18] . In this work, we instead use temperature scaling [19] to tackle the issue. The biased label classifier can be calibrated by re-scaling its output probabilities: re-scaling with temperature τ raises the output entropy, and better image-text matching can be observed [19] . We then use the KL-divergence to measure the difference between the re-scaled image and text probabilities. Since the magnitudes of the gradients produced by the re-scaled probabilities scale as 1 /τ 2 , it is important to multiply them by τ 2 . The resulting KL-divergence loss on the scaled probabilities is formulated as L di in Eq. 11 , where ε is a small constant to avoid division by zero. With τ = 1 , we recover the original KL-divergence. As reported in Table 5 , the parameter τ affects the effectiveness of the loss L di .
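The temperature-scaled KL term can be sketched as follows (a minimal NumPy version, with illustrative names); the τ² factor compensates for the 1/τ² gradient scaling noted above, as in knowledge distillation.

```python
import numpy as np

def softmax_T(logits, tau):
    """Softmax with temperature: higher tau yields softer probabilities."""
    z = logits / tau
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def L_di(img_logits, txt_logits, tau=4.0, eps=1e-8):
    """KL between temperature-scaled image and text class probabilities,
    multiplied by tau**2 to restore the gradient magnitude."""
    p = softmax_T(img_logits, tau)
    q = softmax_T(txt_logits, tau)
    kl = (p * np.log((p + eps) / (q + eps))).sum(axis=1).mean()
    return float(tau ** 2 * kl)

rng = np.random.default_rng(2)
img_logits = rng.normal(size=(4, 5))
txt_logits = rng.normal(size=(4, 5))

loss = L_di(img_logits, txt_logits, tau=4.0)  # tau = 4 as in Table 5
```

With tau = 1 the function reduces to the plain (unscaled) KL-divergence.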
Minimizing the loss L di effectively reduces the influence of the data imbalance issue and improves retrieval accuracy. The final objective for label classification is ( L ce + L di ). The gradients calculated from ( L ce + L di ) optimize the parameters θ E 1 and θ E 2 of the generator and θ P of the label classifier, respectively.

Bi-directional triplet constraint
The triplet constraint is commonly used for feature learning. To establish the baseline performance, we apply this constraint from both an inter-modality and an intra-modality perspective to strengthen the discrimination of cross-modal features.
Given the cross-modal features Z i and Z t in the shared space, the cosine function measures the global similarity between feature vectors, i.e. S jk = (Z i j ) ⊤ Z t k . We adopt a hard sampling strategy to select triplet features from both an inter-modality and an intra-modality viewpoint. The inter-modality and intra-modality triplet losses are combined as L tr = L inter + L intra (Eq. 14), where m is the margin of the bi-directional triplet loss.
For instance, in the inter-modality case, S j,k + = (Z i j ) ⊤ Z t k + , where the anchor features are selected from the visual modality and the positive features from the textual modality. In the intra-modality case, S j, j + = (Z i j ) ⊤ Z i j + , where both the anchor and positive features are selected from the visual modality. Minimizing the bi-directional triplet loss L tr pulls correlated image-text pairs closer together while pushing uncorrelated pairs apart. This loss operates directly on the cross-modal features Z i and Z t , so its gradients optimize the parameters θ E 1 and θ E 2 of the generator.
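A minimal sketch of the hardest-negative, bi-directional inter-modality term; the intra-modality term L intra is formed analogously with label-matched positives within one modality, giving L tr = L inter + L intra. The layout (matched pairs on the diagonal) is an assumption for illustration.

```python
import numpy as np

def normalize(Z, eps=1e-12):
    return Z / (np.linalg.norm(Z, axis=1, keepdims=True) + eps)

def hard_triplet(S, margin=0.5):
    """Hardest-negative triplet loss over a similarity matrix S, where
    S[j, j] is the positive (matched) pair and the hardest negative is
    the largest off-diagonal entry of row j."""
    N = S.shape[0]
    pos = np.diag(S)
    off = np.where(~np.eye(N, dtype=bool), S, -np.inf)
    hardest_neg = off.max(axis=1)
    return float(np.maximum(0.0, margin + hardest_neg - pos).mean())

rng = np.random.default_rng(3)
Z_i = normalize(rng.normal(size=(4, 8)))   # image features
Z_t = normalize(rng.normal(size=(4, 8)))   # text features

# Bi-directional inter-modality loss: image->text and text->image.
L_inter = hard_triplet(Z_i @ Z_t.T, margin=0.5) \
        + hard_triplet(Z_t @ Z_i.T, margin=0.5)
```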
The problem of integrating information theory and adversarial learning for cross-modal retrieval is formally defined in Eq. 15 as a min-max game over the previously defined loss terms. The complete training and optimization procedure is given in Algorithm 1 (whole network training and optimization): image and text features are first embedded into the shared space; then, with θ D fixed, the parameters θ E 1 , θ E 2 , and θ P are updated from the corresponding losses ( e.g. θ P ← θ P − lr 2 · ∇ θ P (L ce + L di ) ); next, with θ P , θ E 1 , and θ E 2 fixed, the parameters θ D of the discriminator are updated; finally, the embedded cross-modal features Z i and Z t are returned. When trained to convergence, the network yields the cross-modal features Z i and Z t in the shared space, as shown in Fig. 1 , and these features are used to perform retrieval.

Datasets and settings
We demonstrate the efficacy of the proposed method on the Flickr8K [20] , Flickr30K [21] , Microsoft COCO [22] , and CUHK-PEDES [23] datasets. Each image in these datasets is described by several sentences. For Flickr8K, we adopt the standard split into a training set (6K), a validation set (1K), and a test set (1K). For Flickr30K, we follow previous work [12] and use 29,783 images for training, 1000 for validation, and 1000 for testing. For MS-COCO, we follow the training protocol of [12] , splitting the dataset into 82,783 training, 30,504 validation, and 5000 test images, and report performance on both the 5K and 1K test sets. CUHK-PEDES contains 40,206 pedestrian images of 13,003 identities. Following [12] , we split it into 11,003 training identities with 34,054 images, 1000 validation identities with 3078 images, and 1000 test identities with 3074 images. Note that all captions for the same image are used as separate image-text pairs to train the network.
Models are trained on GeForce TITAN X and Tesla K40 GPUs. To extract text features, the embedded words are fed into a Bi-LSTM that outputs 1024-dimensional (1024-D) vectors. Following [12] , the Bi-LSTM uses a dropout rate of 0.3. For a fair comparison, we adopt ResNet [15] , MobileNet [16] , and VGGNet [24] as backbones for image feature extraction and fine-tune them with learning rate lr 1 = 2 × 10 −5 , decayed exponentially every 2 epochs. The resulting 2048-D image features and 1024-D text features are projected into a shared space, where the cross-modal features are 512-D vectors ( i.e. Z i and Z t in Fig. 1 ). The batch size is set to 64 or 32 depending on the available GPU memory. For the bi-directional triplet loss, we initially treat inter-modality and intra-modality sampling identically, although each might contribute differently [25] , and we empirically set the margin to m = 0.5. The re-scaling parameter for the data imbalance issue is set to τ = 4 (see Table 5 ). In practice, the discriminator can easily classify the image and text modalities at the start of training, so the generator typically requires multiple ( e.g. 5) update steps per discriminator update step (see Algorithm 1 ).
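The unbalanced update ratio described above can be sketched as a simple schedule (the function and step labels are hypothetical, purely to illustrate the ordering of updates):

```python
def alternating_schedule(num_rounds, k=5):
    """k generator steps per discriminator step: early in training the
    discriminator wins the modality game easily, so the generator gets
    more updates to catch up."""
    steps = []
    for _ in range(num_rounds):
        steps += ["G"] * k   # update theta_E1, theta_E2 (and theta_P)
        steps += ["D"]       # then one discriminator update (theta_D)
    return steps

schedule = alternating_schedule(num_rounds=2, k=5)
```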
Once trained to convergence, the network yields image features Z i and text features Z t , whose similarity we measure with the cosine function. We use Recall@K (K = 1, 5, 10) for evaluation and comparison. We additionally adopt precision-recall curves and mAP for the ablation studies, visualize the feature distributions with t-SNE, and display qualitative cross-modal retrieval results obtained with our method.
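Evaluation can be sketched as follows, assuming query q's ground-truth match sits at gallery index q (as when paired features are aligned by index); the function name is illustrative.

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """Recall@K for a similarity matrix sim[q, g], where the ground-truth
    match for query q is gallery item q."""
    order = np.argsort(-sim, axis=1)  # gallery indices, best match first
    # Rank of the true match within each query's sorted gallery.
    ranks = np.argmax(order == np.arange(len(sim))[:, None], axis=1)
    return {k: float((ranks < k).mean()) for k in ks}

rng = np.random.default_rng(4)
Z_i = rng.normal(size=(20, 8))
Z_i /= np.linalg.norm(Z_i, axis=1, keepdims=True)
Z_t = Z_i + 0.1 * rng.normal(size=(20, 8))  # noisy "paired" text features
Z_t /= np.linalg.norm(Z_t, axis=1, keepdims=True)

scores = recall_at_k(Z_i @ Z_t.T)  # image-to-text cosine retrieval
```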

Results on the flickr30k and MS-COCO datasets
The retrieval results on the Flickr30K and MS-COCO datasets are reported in Table 1 . Hereafter, "Image-to-Text" means using an image as the query to retrieve semantically-relevant text from the textual gallery, and "Text-to-Image" means using a text query to retrieve images from the visual gallery. In most cases, our proposed approach performs best across three different deep networks. For the "Image-to-Text" task on the MS-COCO dataset, the best results are obtained by Zheng et al. [34] , who adopt a deeper network for text feature learning and a two-stage training strategy. However, for the "Text-to-Image" task, and for the "Image-to-Text" task on the Flickr30K dataset, our method performs better. Taking ResNet-152 as an example, the "Text-to-Image" results are R@1 = 43.5% on Flickr30K and R@1 = 48.3% on MS-COCO, and the "Image-to-Text" results are R@1 = 56.5% on Flickr30K and R@1 = 58.5% on MS-COCO.
Besides, we observe that the training strategy is critical for retrieval performance. Taking [34] as an example, the backbone network (ResNet-152) is fixed at stage I (R@1 = 44.2% on the "Image-to-Text" task on Flickr30K) and then fine-tuned with a small learning rate at stage II (R@1 = 55.6% on the same task). In contrast, our network is trained end-to-end in a single stage (we fine-tune the backbone with a small learning rate from the beginning). Our results are close to those of the two-stage dual learning of [34] ; on the Flickr30K "Image-to-Text" task, our recall results are R@1 = 56.5%, R@5 = 82.2%, R@10 = 89.6%, the best among all previous methods.
Clearly, the feature learning capacity of the backbone network affects retrieval performance significantly. As Table 1 shows, the retrieval results based on ResNet-152 are usually higher than those of MobileNet and VGGNet. Our method also performs well with MobileNet: for the "Image-to-Text" task on the Flickr30K dataset, CMPM+CMPC [12] achieves R@1 = 40.3%, while our method reaches R@1 = 46.6%, a significant improvement.
Considering the "Image-to-Text" and "Text-to-Image" branches, we believe the data imbalance issue still influences the performance of each branch. Specifically, for all listed methods the "Image-to-Text" task performs better, which indicates that the network remains biased towards text feature learning as a result of data imbalance. Thus, there is room for further improvement using other strategies, such as data augmentation.

Results on CUHK-PEDES dataset
The "Text-to-Image" retrieval results on the CUHK-PEDES dataset are reported in Table 2 . We evaluate the proposed method using four deep networks, and all results indicate that our method outperforms its counterparts. The best result, R@1 = 55.72%, is achieved with ResNet-152 as the backbone network. The results using MobileNet are sub-optimal but still improved: CMPM+CMPC achieves R@1 = 49.37% and R@10 = 79.27%, while our method obtains R@1 = 51.85% and R@10 = 81.27%. Moreover, our results show that deeper networks achieve better retrieval performance, while the light-weight MobileNet performs similarly to ResNet-50.

Results on flickr8k dataset
The retrieval results on the Flickr8K dataset are reported in Table 3 . The best results, R@1 = 40.6%, R@5 = 67.8%, R@10 = 78.6%, are achieved by joint correlation learning [31] , which uses a batch-based triplet loss over all image-sentence pairs to learn correlations. The second-best results, obtained by our method using ResNet-152 (as in [31] ), are R@1 = 40.1%, R@5 = 67.8%, R@10 = 79.2%, with better R@10 performance than [31] . Our method is thus competitive with its counterparts, and the results indicate room for further improvement.

Ablation studies
To analyze the effect of each component, we conduct ablation studies on the Flickr30K dataset with MobileNet as the backbone. We use the common categorical cross-entropy loss L ce and the bi-directional triplet loss L tr to construct the baseline in Table 4 , and call this Baseline1 configuration "Only L ce + L tr ".

Analysis of KL-divergence for data imbalance
Each image in a dataset ( e.g. Flickr30K) has more than one description sentence. We believe this leads to a data imbalance issue for cross-modal feature learning: the network sees more text data during training, which biases the learned label classifier towards text features. Therefore, we adopt a regularization term L di based on the KL-divergence to calibrate this bias, so that the label classifier is re-calibrated on both image features and text features. In Table 4 , this Baseline2 configuration is named " L ce + L tr + L di ".

[Table 1: Comparison of retrieval results on the Flickr30K [21] and MS-COCO [22] datasets (R@K, K = 1, 5, 10, %). The best results are in bold and the second best are underlined.]
The Recall and mean Average Precision (mAP) results show the effectiveness of this loss. Compared to Baseline1, the scaled KL-divergence loss L di contributes most to Recall@1, for both the "Image-to-Text" (42.3%) and "Text-to-Image" (32.5%) tasks.

Analysis of KL divergence for cross-modal feature projection
Baseline3 adds L kl , which constrains the image and text features in the shared space under the supervision of the supervisory matrix. It focuses on the whole feature distribution and is complementary to the bi-directional triplet loss. We denote Baseline3 as "L ce + L tr + L di + L kl " in Table 4 . Recall@1 of the "Image-to-Text" task improves significantly, by 2.4%, whereas the KL-divergence loss yields only a slight improvement on the "Text-to-Image" task. This indicates that the KL-divergence loss contributes more to image feature learning, which might again be caused by the dataset's data imbalance.

Analysis of adversary combining
The prior loss terms constrain the similarity of the image-text features in the shared space. Intuitively, two-tuple or three-tuple feature exemplars help reduce the "semantic gap" and, at the same time, bring the whole feature distributions closer. However, such constraint loss functions ( e.g. cosine similarity) cannot constrain the discrepancy of the whole distribution because they are symmetric. Focusing on the whole feature distribution, we combine the Shannon information entropy loss L s and the modality classification loss L c in an adversarial training manner to reduce the heterogeneity gap. This full method is named "L ce + L tr + L di + L kl + L s + L c ", and the corresponding results are shown in Table 4 . Compared to the former baselines, our method improves the results significantly. Furthermore, we compare the precision-recall curves of the above configurations and baselines in Fig. 5 ; the larger the area under the curve, the better the algorithm. The improvements differ slightly across tasks, but overall each added component helps improve the retrieval performance.

[Table 6: Comparison of the two combining paradigms on four retrieval datasets (R@1, R@10, and mAP (%)).]

Analysis of temperature τ
We analyze the temperature parameter τ in loss L di in Eq. 11 .
Other loss terms are kept the same as in the full method, i.e.
"L ce + L tr + L di + L kl + L s + L c ". We vary τ from 1 to 6 and report the corresponding results in Table 5 . The optimal results are achieved when the classifier's output probabilities are re-scaled with τ = 4 . As claimed in [19] , temperature scaling with τ > 1 raises the output entropy of the classifier; in our experiments, we find this beneficial for image-text matching.

Distribution visualization
We choose 40 image-text pairs from the Flickr30K dataset and visualize their feature distributions using t-SNE, keeping only the first of the five description sentences per image. In Fig. 6 , circles denote text features, triangles denote image features, and label information is encoded by color.
These distributions indicate the effectiveness of each component ( e.g. the KL-divergence for cross-modal feature projection and the Shannon information entropy trained in an adversarial manner). In Fig. 6 (a), several feature outliers exist and the proximity between pair-wise features is not obvious. With the proposed components, the features distribute much better: in Fig. 6 (d), where all loss functions constrain feature learning, pair-wise features show close proximity, image and text features fall within a smaller range (−60 to 60), and few outliers remain.
Qualitative retrieval results on the Flickr30K and CUHK-PEDES datasets are shown in Fig. 8 . For the "Image-to-Text" task, the proposed method returns almost all paired texts of the query image. The "Text-to-Image" task also performs well: the proposed method retrieves the paired image correctly, and the other retrieved images show content relevant to the query sentence.

[Fig. 6: t-SNE feature distributions for the configurations in Table 4; as each loss function is gradually applied, paired image and text features move closer. Best viewed in color.]

[Fig. 7: Independently combining information entropy and modality classification into an adversary; an intuitive structure of the diagram in Fig. 1. The other loss functions (categorical cross-entropy, KL-divergence, and bi-directional triplet loss) are kept the same but omitted for simplicity. Different from the framework in Fig. 4, the gradients computed from the modality classifier in this paradigm optimize the parameters θ I and θ T of the feature extractor. The feature extractor maximizes the loss L d = L c (Eq. 5) of the modality classifier C (to make image and text features as similar as possible), while the parameters θ c of the modality classifier minimize L d . This process relies on a gradient reversal layer that multiplies gradient values by −1 during back-propagation [39].]

Further exploring
In this paper, we propose integrating Shannon information entropy with the discriminator for cross-modal retrieval: the discriminator performs modality classification and measures the information entropy at the same time (see Fig. 4 ). Here, we explore an alternative paradigm for integrating information entropy with adversarial learning, one that maps more directly onto the structure in Fig. 1 . Concretely, we build two sub-network branches: an uncertainty predictor for modality uncertainty prediction and a modality classifier for modality classification. Adversarial learning is then implemented as an interplay between these two sub-networks with competing objectives: the uncertainty predictor aims to maximize the modality uncertainty of the shared space (measured by information entropy), while the modality classifier aims to identify image and text inputs. We illustrate this paradigm in Fig. 7 . Compared to the former paradigm in Fig. 4 , the optimization in Fig. 7 is different and more complex. The gradients computed by the classifier update the parameters θ I and θ T of the feature extractor. To learn modality-invariant features, the feature extractor minimizes the loss of the uncertainty predictor and maximizes the loss L d of the modality classifier, making image and text features as similar as possible [39] , while the parameters of the modality classifier minimize L d . This training process relies on the gradient reversal layer [39] , which multiplies gradient values by −1 during back-propagation.

[Fig. 8: Qualitative test results on the Flickr30K and CUHK-PEDES datasets, showing Recall@5 for the "Image-to-Text" and "Text-to-Image" tasks from left to right. Correct retrievals are in red (with a red box); failures are in green. For Flickr30K, each image is described by 5 sentences, so each text query has one correct image while the other retrieved images share similar content. For CUHK-PEDES, each category has more than one image, so almost all correct images are retrieved for a text query. Best viewed in color; for interpretation of the color references, the reader is referred to the web version of this article.]

The training procedure is almost the same as in Algorithm 1 , except that the gradients from the modality classification loss also update the backbone network, leading to slower training. The retrieval performance of the two combining methods in Fig. 4 and Fig. 7 (named unified and separate, respectively) is given in Table 6 , with ResNet-152 as the image backbone. The two strategies perform differently on the four datasets. Combining information entropy and modality classification into a unified discriminator (Fig. 4 ) improves performance slightly on the Flickr30K, MS-COCO, and Flickr8K datasets. However, the separate method of Fig. 7 performs better on CUHK-PEDES, which is not a common-objects dataset: its R@1 improves from 65.58% to 67.79%, and its mAP improves by 1.8% compared to the unified method. In summary, the proposed framework of Fig. 4 , which combines information entropy and adversarial learning in a single discriminator, performs better overall and converges faster during training.
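The gradient reversal layer that this separate paradigm relies on is conceptually simple: identity in the forward pass, sign-flipped (and optionally scaled) gradients in the backward pass. A minimal sketch without an autograd framework (class and parameter names are illustrative):

```python
import numpy as np

class GradReverse:
    """Gradient reversal layer (Ganin & Lempitsky style): the forward pass
    is the identity, while the backward pass multiplies incoming gradients
    by -lam, so the feature extractor is driven to *maximize* the loss that
    the downstream modality classifier minimizes."""
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x                      # identity in the forward pass

    def backward(self, grad_out):
        return -self.lam * grad_out   # flip the gradient sign

grl = GradReverse(lam=1.0)
x = np.array([1.0, -2.0, 3.0])
y = grl.forward(x)                    # unchanged activations
g = grl.backward(np.ones_like(x))     # reversed gradients
```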

Conclusion
In this work, we explored how to improve cross-modal retrieval by integrating information theory and adversarial learning, analyzing the relation between information entropy and modality uncertainty. Based on this relation, we explored two different paradigms for combining information entropy maximization and modality classification in an adversarial manner.
Training these two components iteratively reduces the feature distribution discrepancy and, in turn, the heterogeneity gap, while the bi-directional triplet loss and cross-entropy loss preserve semantic similarity between cross-modal features. In addition, we considered the data imbalance issue, which leads to a biased label classifier and affects label classification. KL-divergence is used as an additional loss term to regularize the re-scaled probabilities computed from image and text features; it is also used to constrain the cross-modal feature projections, which helps learn modality-invariant features. The efficacy of the proposed method was demonstrated by thorough experiments on four well-known datasets using four deep models.
Successfully combining information entropy and adversarial learning depends on the competing goals of the information entropy predictor and the modality classifier, which opens challenging directions for further investigation. For example, we used instance labels as supervisory information, so the information entropy loss was computed only over the image and text modalities. However, retrieval performance depends on matching each image-text feature pair, and for large-scale datasets each category may include many image-text pairs. It would therefore be valuable to make the information entropy loss category-specific, so that the discrepancy between the two modalities can be reduced at a finer granularity. Moreover, the biased label classifier caused by data imbalance could also be addressed by training strategies such as data augmentation, or by other loss functions, e.g. a knowledge distillation loss.
In terms of future work, the label-free Shannon information entropy can be applied in unsupervised learning scenarios, and has already been used in tasks such as semantic segmentation [13] . Our study of combining Shannon information entropy with adversarial learning for cross-modal retrieval shows that information entropy can support multimodal feature learning by estimating modality uncertainty. It will be promising to explore Shannon entropy further for other kinds of cross-modal matching aimed at learning modality-invariant representations, such as video-text, audio-video, and audio-text matching.