
1 Introduction

The integration of vision and language has recently gained considerable attention from both the computer vision and NLP communities. As humans, we can seamlessly connect what we see or imagine with what we hear or say, thus building effective bridges between our ability to see and our ability to express ourselves in a common language. In the effort to artificially replicate these connections, new algorithms and architectures have recently emerged for image and video captioning [1, 5, 16] and for visual-semantic retrieval [7, 13, 15]. The former combine vision and language in a generative fashion on the textual side, while the latter build common spaces that integrate the two domains and allow retrieving textual elements given visual queries, and vice versa.

While the standard objective in visual-semantic retrieval is that of associating images and visual sentences (i.e. sentences that visually describe something), the variety of sentences found in textual corpora is far larger and also includes sentences that do not describe the visual content of a scene. Here, we go a step further and extend the task of visual-semantic retrieval to a setting in which the textual domain does not exclusively contain visual sentences, and explore the task of identifying relevant visual sentences given image queries. As such, the task poses two challenges: understanding whether a sentence has visually relevant content, and associating elements between the two domains.

Further, we address a second shortcoming of most visual-semantic works, namely that of dealing only with photo-realistic images and simple texts. As there is a growing need to extend these algorithms to less general semantic and visual domains, we increase the complexity on both the visual and the semantic side. To create an environment in which all the aforementioned challenges coexist, we focus on the case of artistic data, which exhibits more complex and unusual visual and semantic features, and propose a new dataset with visual and contextual sentences for each visual item. In short, visual sentences deal with the visual appearance of the item, while contextual ones describe either the item or its context without addressing its visual appearance.

We also design and evaluate a model that jointly associates visual and textual elements and identifies visual sentences as opposed to contextual ones. Taking inspiration from state-of-the-art models for visual-semantic retrieval, we test both traditional approaches, based on global feature vectors, and approaches that model the latent alignment between visual and textual chunks.

The rest of this paper is organized as follows: after briefly reviewing the related literature in Sect. 2, we present the Artpedia dataset in Sect. 3. Further, in Sect. 4 we propose our model for incorporating visual and contextual sentences into visual-semantic retrieval, which is then evaluated together with different baselines in Sect. 5.

2 Related Work

In this section, we first give an overview of cross-modal retrieval models. Then, we review computer vision works related to the cultural heritage domain with a focus on other relevant datasets for art understanding.

2.1 Cross-Modal Retrieval

Cross-modal retrieval is one of the core challenges in the computer vision and multimedia communities and consists in retrieving visual items given textual queries, and vice versa. In this context, several cross-modal retrieval models have been proposed [7, 13, 15], with the objective of minimizing the distance between matching image-text pairs while maximizing that between non-matching elements. Among them, Faghri et al. [7] introduced a simple modification of standard loss functions, based on the use of hard negatives, which has proven effective in improving cross-modal retrieval performance and has been widely adopted by subsequent methods [6, 10, 11, 15].

Inspired by the use of multiple image descriptors to improve related visual-semantic tasks [1, 25], Lee et al. [15] have recently proposed to match images and corresponding descriptions by inferring a latent correspondence between image regions and single words of the caption. In this work, we exploit a similar attentive mechanism to match each painting with the sentences that actually describe the visual content of the painting itself, and we demonstrate the effectiveness of using multiple image regions in place of a single image descriptor also for visual-semantic artistic data.

Table 1. Overview of the most relevant datasets containing artistic images.

2.2 Computer Vision for Cultural Heritage

In recent years, several efforts have been made to apply computer vision techniques to the cultural heritage domain, resulting in works and applications ranging from generative models to classification and retrieval solutions. On the generation and synthesis side, promising results have been obtained by style transfer models that aim to transfer the style of a painting to a real photo [9] or, conversely, to create a realistic representation of a given painting [23, 24].

On a different note, several large-scale art datasets have been proposed to foster research in this domain, with a particular focus on style and genre recognition [12, 18]. For a comprehensive analysis, Table 1 summarizes the most relevant datasets related to the cultural heritage domain. To the best of our knowledge, only a limited number of works address the problem of retrieving artistic images from textual descriptions, and vice versa [2, 3, 8]. While [2, 3] tackle the problem in a semi-supervised way by exploiting knowledge from large-scale datasets of realistic images, [8] uses additional metadata such as the title, author, genre, and period of the paintings to match images and text. In this paper, we instead propose a visual-semantic model capable of discriminating between visual and contextual sentences for each considered painting and, at the same time, associating the corresponding visual and textual elements.

3 The Artpedia Dataset

To foster research on visual-semantic algorithms that deal with contextual sentences, we propose a novel dataset with visual and contextual sentences describing real paintings. Artpedia contains a collection of 2,930 painting images, each associated with a variable number of textual descriptions. Each sentence is labelled as a visual sentence, if it describes the visual content of the artwork, or as a contextual sentence otherwise. Contextual sentences can describe the historical context of the artwork, its author, its artistic influences, or the place where the painting is exhibited. As in standard cross-modal datasets, the association between sentences and paintings is also provided. A sample of the dataset and its annotations is shown in Fig. 1.

As the name suggests, the dataset has been collected by crawling Wikipedia pages. To this aim, our crawling strategy followed the Wikipedia category hierarchy, navigating all categories containing paintings from the 13th to the 21st century. We then extracted the textual descriptions by taking into account the summary of each Wikipedia page and the description section, whenever present. Finally, we split the text into sentences using the spaCy NLP toolbox and manually annotated each sentence as either visual or contextual. As an additional product of the crawling procedure, we also release the title and year of each painting, together with the URL of each image.
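For reference, a minimal sketch of the sentence-splitting step is shown below; the specific spaCy pipeline is not stated in the paper, so en_core_web_sm is used purely for illustration.

```python
import spacy

# Assumption: the paper does not state which spaCy pipeline was used;
# "en_core_web_sm" is a common default for English sentence segmentation.
nlp = spacy.load("en_core_web_sm")

def split_into_sentences(page_text):
    """Split a Wikipedia summary or description section into candidate sentences."""
    doc = nlp(page_text)
    return [sent.text.strip() for sent in doc.sents]
```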

Fig. 1. Sample paintings from our Artpedia dataset with corresponding visual (green boxes) and contextual (red boxes) sentences. (Color figure online)

Overall, Artpedia contains a total of 28,212 sentences, 9,173 labelled as visual sentences and the remaining 19,039 as contextual sentences. On average, each painting is associated with 3.1 visual and 6.5 contextual sentences. The mean length of the textual items is 21.5 words, considerably longer than that of standard image captioning datasets. For a comprehensive analysis of the visual and semantic content of our Artpedia dataset, we report in Fig. 2 the distribution of paintings over the considered range of centuries, the distribution of sentence lengths, and the most common object classes obtained by running a pre-trained object detector [14, 20].

With respect to other visual-semantic datasets containing artistic images (reported in Table 1), Artpedia provides a larger number of sentences, divided into visual and contextual through a manual annotation procedure. Moreover, to the best of our knowledge, this is the only dataset that contains two types of artistic sentences, describing both the visual content of the paintings and other contextual information. For this reason, we devise a visual-semantic model capable of jointly discriminating between visual and contextual sentences of the same painting, and identifying which visual descriptions, from a subset of textual elements (i.e. a subset of visual descriptions from different paintings), are associated with a specific painting.

Fig. 2. Analyses on our Artpedia dataset. From left to right, we report the painting distribution over centuries, the sentence length distribution, and the most common detection classes.

To allow the training of our model and foster research in this domain, we also provide training, validation and test splits obtained by proportionally dividing the paintings. Splits have been created with the constraint of balancing the distribution over centuries and the number of visual sentences, so as to maintain consistent statistics across the subsets. Table 2 reports the number of paintings for each split along with the corresponding number of visual and contextual sentences.

Table 2. Number of paintings, visual and contextual sentences for each Artpedia split.

4 Aligning Visual and Contextual Sentences with Images

Cross-modal retrieval is characterized by two main tasks: when the query is a textual sentence, the objective is to retrieve the most relevant images, while when the query is an image, the objective is to retrieve the most relevant sentences. The goal is to maximize recall at K, i.e. the fraction of queries for which the most relevant item is ranked among the top K retrieved ones. In addition, our setting leverages the presence of visual and contextual sentences and takes this difference into account when computing the latent alignment within a single page. In the following, we refer to a page as an element of our Artpedia dataset comprising an image and its visual and contextual sentences. Our goal is therefore not only to maximize recall, but also to distinguish the two types of sentences associated with a painting.

In a nutshell, our model first maps image regions and sentence words into a joint embedding space. Then, it applies a cross-attention mechanism divided into two branches, where one attends to words with respect to each image region, while the other attends to image regions with respect to each word. This mechanism computes a similarity score between an image and a sentence for each branch. During training, the similarity score is used to minimize two loss functions: our intra-page loss, which strives to rank the sentences associated with a single image, pulling its visual sentences closer and pushing its contextual ones away, and the inter-page triplet ranking loss, which takes into account all images and their visual sentences as in standard cross-modal retrieval settings.

4.1 Similarity Function

As mentioned before, the similarity is computed with a cross-attention mechanism that comprises two distinct branches, image-to-text and text-to-image attention, inspired by [15, 25]. Since the two branches are symmetric, differing only in the order of their inputs, we only describe the first one.

Firstly, given an image I, we extract salient regions such that each of them encodes an object or other entities, and project them into the joint embedding space, obtaining a final set of regions \(\{\varvec{v}_{1}, \dots , \varvec{v}_{k}\}, \varvec{v}_{i} \in \mathbb {R}^{D}\). Also, given a sentence T composed of n words, encoded with a word embedding strategy, we project each word into the joint embedding space thus obtaining a vector \(\varvec{e}_{j} \in \mathbb {R}^{D}\) for each word j. Therefore, given an image I with k detected regions and a sentence T with n words, we compute the similarity matrix for all possible region-word pairs:

$$\begin{aligned} s_{i j}=\varvec{v}_{i}^\top \varvec{e}_{j}\quad i \in [1, k], j \in [1, n] \end{aligned}$$
(1)

where \(s_{i j}\) represents the similarity between the region i and the word j. Since region and word features are \(\ell _2\) normalized, this product corresponds to a cosine similarity.
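As a reference, Eq. (1) can be sketched in a few lines of PyTorch; tensor names are illustrative and the snippet is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def region_word_similarity(v, e):
    """Region-word similarity matrix of Eq. (1).

    v: (k, D) projected region embeddings, e: (n, D) projected word embeddings.
    Both sets are l2-normalized, so the dot products are cosine similarities.
    """
    v = F.normalize(v, dim=-1)
    e = F.normalize(e, dim=-1)
    return v @ e.t()  # (k, n) matrix with entry [i, j] = v_i^T e_j
```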

To attend words with respect to each image region, we compute a sentence-context vector for each region. The sentence-context vector \(\varvec{a}_{i}\) is a weighted representation of the sentence with respect to the region i of the image, where the similarities between the region i and the sentence words are used to weight each word as follows:

$$\begin{aligned} \varvec{a}_{i}=\sum _{j=1}^{n} \alpha _{i j} \varvec{e}_{j} \end{aligned}$$
(2)

where

$$\begin{aligned} \alpha _{i j}=\frac{\exp \left( \lambda _{s} s_{i j}\right) }{\sum _{j'=1}^{n} \exp \left( \lambda _{s} s_{i j'}\right) } \end{aligned}$$
(3)

and \(\lambda _{s}\) is a temperature parameter [4].

Finally, to evaluate the similarity of each image region given the sentence-context, we compute the cosine similarity between the attended sentence vector \(\varvec{a}_{i}\) and each image region feature \(\varvec{v}_{i}\):

$$\begin{aligned} R\left( \varvec{v}_{i}, \varvec{a}_{i}\right) =\frac{\varvec{v}_{i}^\top \varvec{a}_{i}}{\left\| \varvec{a}_{i}\right\| } \end{aligned}$$
(4)

To summarize the similarity between an image I and a sentence T, we employ average pooling between all image regions and the sentence-context vector:

$$\begin{aligned} R_{A V G}(I, T)=\frac{\sum _{i=1}^{k} R\left( \varvec{v}_{i}, \varvec{a}_{i}\right) }{k} \end{aligned}$$
(5)

Likewise, the other branch follows the same procedure with image regions and sentence words swapped, computing a region-context vector for each sentence word, evaluating their cosine similarities, and summarizing the final branch score in the same way. Finally, by averaging the similarity scores of the two branches, we obtain the final similarity score \(S(I, T)\) between an image I and a sentence T.
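For clarity, the image-to-text branch and the final score \(S(I, T)\) can be sketched as follows. This is a compact PyTorch rendition of Eqs. (1)-(5), assuming both sets of embeddings are already \(\ell _2\)-normalized; it is not the reference implementation.

```python
import torch

def image_to_text_score(v, e, lambda_s=6.0):
    """Image-to-text branch of the cross-attention similarity (Eqs. 1-5).

    v: (k, D) l2-normalized region embeddings, e: (n, D) l2-normalized word embeddings.
    """
    s = v @ e.t()                               # region-word cosine similarities (Eq. 1)
    alpha = torch.softmax(lambda_s * s, dim=1)  # attention over words per region (Eq. 3)
    a = alpha @ e                               # sentence-context vector per region (Eq. 2)
    r = (v * a).sum(dim=1) / a.norm(dim=1)      # R(v_i, a_i); v_i has unit norm (Eq. 4)
    return r.mean()                             # average pooling over regions (Eq. 5)

def similarity(v, e, lambda_s=6.0):
    """Final score S(I, T): average of the two symmetric branches."""
    return 0.5 * (image_to_text_score(v, e, lambda_s)     # attend words per region
                  + image_to_text_score(e, v, lambda_s))  # attend regions per word
```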

4.2 Training

Intra-page Loss. With the objective of correctly ranking visual and contextual sentences of a given image, we propose an intra-page loss function that learns the latent alignment between an image and its corresponding visual sentences within a single page of the dataset. Given an image I, a visual sentence \(T_{V}\) and a contextual sentence \(T_{C}\), our intra-page loss is computed by taking into account the similarity score \(S(I, T_{V})\) between the image and the visual sentence and the similarity score \(S(I, T_{C})\) between the image and the contextual one:

$$\begin{aligned} L_{intra}(I, T_{V}, T_{C})=[\alpha - S(I, T_{V}) + S(I, T_{C})]_{+} \end{aligned}$$
(6)

where \([x]_{+} = \max (x,0)\) and \(\alpha \) is the margin. Note that, since this loss function is computed within a single page, both the visual and the contextual sentence are taken from the sentences of the given image I.
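A one-line sketch of Eq. (6), assuming the two similarity scores are given as tensors:

```python
import torch

def intra_page_loss(s_visual, s_contextual, margin=0.2):
    """Intra-page hinge loss of Eq. (6).

    s_visual: S(I, T_V), similarity between an image and one of its visual sentences.
    s_contextual: S(I, T_C), similarity between the same image and a contextual sentence.
    """
    return torch.clamp(margin - s_visual + s_contextual, min=0.0)
```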

Inter-page Triplet Ranking Loss. Since our final objective is not only to identify visual and contextual sentences of the same image, but also to associate matching image-visual sentence pairs within the entire dataset, we define an inter-page triplet ranking loss, which is typical of cross-modal retrieval methods.

As proposed in [7], we focus solely on the hardest negatives in the mini-batch. Thus, our final inter-page triplet ranking loss with margin \(\alpha \) is defined as follows:

$$\begin{aligned} L_{inter}(I, T)=\max _{\hat{T}}\left[ \alpha -S(I, T) + S(I, \hat{T})\right] _{+} + \max _{\hat{I}}\left[ \alpha -S(I, T) + S(\hat{I}, T) \right] _{+} \end{aligned}$$
(7)

where only the hardest negative sentence \(\hat{T}\) and the hardest negative image \(\hat{I}\) for each positive pair \((I, T)\) are taken into account. In our case, a negative sentence \(\hat{T}\) is a visual sentence of another image. Since this loss function aims to associate images and visual sentences across the entire dataset, contextual sentences are only used by our intra-page loss.
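A possible batched sketch of Eq. (7), following the common VSE++-style implementation in which the mini-batch similarity matrix has matching pairs on its diagonal; how batches are actually formed is an assumption, not the authors' exact code.

```python
import torch

def inter_page_loss(S, margin=0.2):
    """Inter-page triplet ranking loss with hardest negatives (Eq. 7).

    S: (B, B) similarity matrix for a mini-batch, where S[i, i] scores the matching
    image/visual-sentence pairs and off-diagonal entries are non-matching pairs.
    """
    pos = S.diag().view(-1, 1)                                         # (B, 1) positives
    mask = torch.eye(S.size(0), dtype=torch.bool, device=S.device)
    cost_s = (margin - pos + S).clamp(min=0).masked_fill(mask, 0)      # vary the sentence
    cost_i = (margin - pos.t() + S).clamp(min=0).masked_fill(mask, 0)  # vary the image
    # keep only the hardest negative per row/column, then average over the batch
    return cost_s.max(dim=1)[0].mean() + cost_i.max(dim=0)[0].mean()
```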

Final Training Objective. The final training loss is obtained by a linear combination of the two loss functions, i.e. \( L= \lambda _w L_{inter} + (1 - \lambda _w) L_{intra} \), where \(\lambda _w \in [0,1]\) is a parameter that weights the contribution of the two losses. When \(\lambda _w\) is equal to 0, the training procedure only minimizes our intra-page loss, while when \(\lambda _w\) is equal to 1, only the inter-page triplet ranking loss is minimized.
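Reusing the two loss sketches above, the combined objective could read as follows (function and argument names are illustrative):

```python
def training_loss(S_batch, s_visual, s_contextual, lambda_w, margin=0.2):
    """Weighted combination of the inter-page and intra-page losses."""
    l_inter = inter_page_loss(S_batch, margin)
    l_intra = intra_page_loss(s_visual, s_contextual, margin).mean()
    return lambda_w * l_inter + (1.0 - lambda_w) * l_intra
```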

5 Experimental Evaluation

In this section, we experimentally evaluate the effectiveness of our approach by comparing it with different baselines. First, we provide all implementation details used in our experiments.

5.1 Implementation Details

To encode image regions, we use Faster R-CNN [20] trained on Visual Genome [1, 14], thus obtaining 2048-dimensional feature vectors. For each image, we exploit the top 20 detected regions with the highest class confidence scores. To project regions into the visual-semantic embedding space, we use a fully connected layer with a size of 512.
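A minimal sketch of the region projection described above; the placement of the \(\ell _2\) normalization follows Sect. 4.1 and is otherwise an assumption.

```python
import torch.nn as nn
import torch.nn.functional as F

class RegionEncoder(nn.Module):
    """Projects 2048-d Faster R-CNN region features into the 512-d joint space."""
    def __init__(self, feat_dim=2048, embed_dim=512):
        super().__init__()
        self.fc = nn.Linear(feat_dim, embed_dim)

    def forward(self, regions):                   # regions: (k, 2048), top-20 detections
        return F.normalize(self.fc(regions), dim=-1)
```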

For the textual counterpart, we compare GloVe [19] with word embeddings learned from scratch. In both cases, the word embedding size is set to 300. Then, with the aim of capturing the semantic context of the sentence, we employ a bi-directional GRU with a hidden size of 512: given a sentence with n words, the GRU reads the context forward from word 1 to n and backwards from word n to 1, and the two hidden states are averaged to obtain the final embedding vector for each word.
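A possible PyTorch sketch of this text encoder; the handling of padding and the GloVe initialization are omitted and left as assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class WordEncoder(nn.Module):
    """300-d word embeddings followed by a bi-directional GRU; the forward and
    backward states of each word are averaged to obtain its final embedding."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # optionally initialized with GloVe
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, tokens):                    # tokens: (B, n) word indices
        out, _ = self.gru(self.embed(tokens))     # (B, n, 2 * hidden_dim)
        fwd, bwd = out.chunk(2, dim=-1)           # split the two directions
        return F.normalize((fwd + bwd) / 2, dim=-1)
```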

To train our model, we use the Adam optimizer with an initial learning rate of \(10^{-6}\) decreased by a factor of 10 after 15 epochs. In all our experiments, we use a batch size of 128 and clip the gradients at 2. Finally, the margin \(\alpha \) and the temperature parameter \(\lambda _s\) are respectively set to 0.2 and 6.
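A sketch of this training schedule; whether gradients are clipped by norm or by value is not specified, so norm-based clipping is our assumption, and all arguments are placeholders.

```python
import torch

def train(model, train_loader, compute_loss, num_epochs):
    """Training loop sketch: Adam at 1e-6, 10x decay after 15 epochs, clipping at 2."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-6)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[15], gamma=0.1)
    for _ in range(num_epochs):
        for batch in train_loader:
            loss = compute_loss(model, batch)     # lambda_w-weighted intra/inter-page loss
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 2.0)  # norm clipping assumed
            optimizer.step()
        scheduler.step()
```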

5.2 Baselines

To evaluate our solution, we build different baselines to quantify both the effectiveness of using a cross-attention model and that of our intra-page loss. To this aim, we first exploit global features to encode images and sentences in place of multiple feature vectors for each image or sentence. In particular, to encode images, we extract 2048-dimensional feature vectors from the average pooling layer of a ResNet-152, while, to encode sentences, we feed word embeddings through a bi-directional GRU network and average the outputs of the last hidden state in both directions. After projecting both images and sentences into a common embedding space, the final similarity score between an image and a sentence is given by the cosine similarity between the two \(\ell _2\)-normalized embedding vectors.
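A compact sketch of this global-feature baseline; the separate projection layers and their input sizes are illustrative assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class GlobalBaseline(nn.Module):
    """Global-feature baseline: one vector per image (ResNet-152 pooling) and one per
    sentence (averaged bi-GRU states), compared with cosine similarity."""
    def __init__(self, img_dim=2048, txt_dim=512, embed_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)
        self.txt_proj = nn.Linear(txt_dim, embed_dim)

    def forward(self, img_feat, txt_feat):        # (B, img_dim) and (B, txt_dim)
        v = F.normalize(self.img_proj(img_feat), dim=-1)
        t = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return (v * t).sum(dim=-1)                # cosine similarity per pair
```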

Furthermore, we compare the proposed intra-page loss function with a binary cross-entropy baseline. In this case, visual and contextual sentences are not projected into the same embedding space, but fed through a binary classification branch. In practice, each sentence is classified as either visual or contextual by concatenating the image and sentence embeddings and feeding them through two fully connected layers of size 512 and 1, respectively. For the cross-attention model, the image embedding is obtained by averaging the image region embedding vectors, while the sentence embedding is obtained by averaging the last hidden states of the bi-directional GRU in the two directions.
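A sketch of this classification branch; the activation between the two fully connected layers is not specified in the paper and is assumed here to be a ReLU.

```python
import torch
import torch.nn as nn

class VisualSentenceClassifier(nn.Module):
    """BCE baseline: concatenated image/sentence embeddings are classified as
    visual vs. contextual through two FC layers of size 512 and 1."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.fc1 = nn.Linear(2 * embed_dim, 512)
        self.fc2 = nn.Linear(512, 1)

    def forward(self, img_embed, sent_embed):     # both (B, embed_dim)
        x = torch.cat([img_embed, sent_embed], dim=-1)
        return self.fc2(torch.relu(self.fc1(x)))  # logits for a BCE-with-logits loss
```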

Table 3. Intra-page results in terms of Average Precision (AP).

For both baselines, all other hyper-parameters and training details are the same as those used in our complete model.

5.3 Cross-Modal Retrieval Results

We first evaluate the effectiveness of our model in identifying and distinguishing visual sentences from contextual ones. Table 3 shows the results on the Artpedia test set in terms of average precision (AP). In particular, the results are obtained by training the models with \(\lambda _w\) equal to 0 (i.e. by only minimizing the intra-page loss or the binary cross-entropy). As can be seen, our intra-page loss function always performs better than the binary cross-entropy baseline, both when exploiting global features to embed images and sentences and when using the cross-attention approach described in Sect. 4. Regarding the word embedding strategy, GloVe vectors achieve better results than word embeddings learned from scratch, probably due to the presence of peculiar words typical of the artistic domain.

Table 4. Cross-modal retrieval results with a different number N of retrievable items and with respect to different \(\lambda _w\) weights.
Fig. 3. Comparison between visual-semantic embedding spaces obtained by training the model with different \(\lambda _w\) weights. Visualizations are obtained by running the t-SNE algorithm [17] on top of embedding vectors representing images and sentences (both visual and contextual).

In Table 4, we show the performance of our complete model trained with various \(\lambda _w\) weights to balance the contribution of the two loss functions differently. In this case, the goal is not only to correctly distinguish between visual and contextual sentences of a given image, but also to find the corresponding visual sentences within a subset of other textual elements (i.e. visual sentences of different images). Results are reported in terms of recall@K (\(K=1,5\)) using a different number N of items from which to perform retrieval. In detail, given an image as a query, a textual element is retrieved from a subset of visual sentences of N different images (i.e. the visual sentences of the query and those of \(N-1\) other randomly selected images). Conversely, given a textual query, an image is retrieved from a subset of N different images (i.e. the image linked to the query and \(N-1\) other images randomly selected from the Artpedia test set). We also report the results of identifying visual sentences with respect to contextual ones in terms of average precision. As can be noticed, increasing the \(\lambda _w\) weight yields an increase in recall metrics with a slight drop in average precision, for almost all considered combinations of features and word embeddings. Also in this case, the cross-attention mechanism and the GloVe word embeddings achieve better results than global features and learned word embeddings.

Finally, Fig. 3 shows the embedding spaces learned by the best model (i.e. cross-attention with GloVe word embeddings) with different \(\lambda _w\) weights. Since in this case images and sentences are represented by one embedding vector per image region and per word, we represent each image or sentence by summing the \(\ell _2\)-normalized embedding vectors of its regions or words and \(\ell _2\)-normalizing the result again. This strategy has been widely used in image and video retrieval works and is known to preserve the information of the original vectors in a compact representation with fixed dimensionality [22]. To obtain a suitable two-dimensional representation of the 512-dimensional space, we run the t-SNE algorithm [17], which iteratively finds a non-linear projection that preserves pairwise distances from the original space. As can be observed, the higher the \(\lambda _w\) weight, the greater the distance between images and visual sentences in the embedding space, thus confirming the drop in average precision when decreasing the importance of our intra-page loss during training.
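For reference, the pooling strategy and the t-SNE projection can be sketched as follows, using scikit-learn's TSNE; array shapes are illustrative.

```python
import numpy as np
from sklearn.manifold import TSNE

def pooled_representation(vectors):
    """Sum the l2-normalized region/word embeddings and re-normalize the result."""
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)   # (m, 512) -> unit rows
    pooled = v.sum(axis=0)
    return pooled / np.linalg.norm(pooled)                         # (512,) unit vector

def project_2d(embeddings):
    """Project pooled 512-d image/sentence vectors to 2-D with t-SNE [17]."""
    return TSNE(n_components=2).fit_transform(embeddings)          # (N, 512) -> (N, 2)
```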

6 Conclusion

In this paper, we have addressed the problem of cross-modal retrieval of images and sentences coming from the artistic domain. To this aim, we have collected and manually annotated a new visual-semantic dataset with visual and contextual sentences for each collected painting. Further, we have designed and evaluated a cross-modal retrieval model that jointly associates visual and textual elements, and discriminates between visual and contextual sentences of the same image. Experimental evaluations conducted with respect to different baselines have shown promising results and have demonstrated the effectiveness of our solution on both considered visual-semantic retrieval tasks.