Cross-Modal Image Retrieval Considering Semantic Relationships With Many-to-Many Correspondence Loss

A cross-modal image retrieval method that explicitly considers semantic relationships between images and texts is proposed. Most conventional cross-modal image retrieval methods retrieve the target images by directly measuring the similarities between the candidate images and query texts in a common semantic embedding space. However, such methods tend to focus on a one-to-one correspondence between a predefined image-text pair during the training phase, and other semantically similar images and texts are ignored. By considering the many-to-many correspondences between semantically similar images and texts, a common embedding space can be constructed that preserves semantic relationships, which allows users to accurately find more images that are related to the input query texts. Thus, in this paper, we propose a cross-modal image retrieval method that considers semantic relationships between images and texts. The proposed method calculates the similarities between texts as semantic similarities to acquire the relationships. Then, we introduce a loss function that explicitly constructs the many-to-many correspondences between semantically similar images and texts from their semantic relationships. We also propose an evaluation metric to assess whether each method can construct an embedding space considering the semantic relationships. Experimental results demonstrate that the proposed method outperforms conventional methods in terms of this newly proposed metric.


I. INTRODUCTION
With the recent spread of digital storage devices, the number of images stored in personal databases, e.g., smartphones and personal computers, has increased [1], [2]. Therefore, a convenient and user-friendly image retrieval system is required to help users find their desired images from a huge number of images [3]. Among various image retrieval systems, image retrieval from query text (also referred to as text-to-image retrieval) is one of the most convenient retrieval methods for users.
The associate editor coordinating the review of this manuscript and approving it for publication was Ramakrishnan Srinivasan.
Traditionally, text-to-image retrieval has been realized by labeling candidate images manually [8], which is referred to as text-based image retrieval. Here, the candidate images in the database are assigned several text labels, and the retrieval process is performed by calculating the similarities between the input text query and the text labels [9], [10], [11]. Although such methods can realize image retrieval from a text query, a laborious labeling process is required. Recently, cross-modal image retrieval methods that can retrieve the target images from a database with unlabeled images have been proposed [12], [13], [14]. These methods embed the images and texts in a common semantic embedding space where the distances between the embedded features can be calculated directly [15]. These methods can achieve high accuracy for detailed queries; however, users do not always clearly remember the specific details of the target images, which can result in ambiguous queries. For a more flexible retrieval, it is important to construct an embedding space that facilitates accurate retrieval even when ambiguous queries are input. Since an ambiguous query usually contains a wide range of meanings, it is helpful to leverage adequate information from a database. For this purpose, an embedding space where semantically similar images and texts are close is desired. By constructing such an embedding space, the wide range of meanings can be considered more accurately, and more images that are relevant to the ambiguous text query can be retrieved. Since a query can be related to multiple images, semantically similar images utilized in the training phase are beneficial in terms of constructing the embedding space.
VOLUME 11, 2023. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
However, most conventional cross-modal image retrieval methods ignore the distances between non-paired semantically similar images and texts, which should be close [16], [17]. These methods primarily focus on the one-to-one correspondences between images and texts predefined in general open datasets. Specifically, conventional methods adopt a loss function that makes the similarity between predefined ground-truth pairs larger than the similarity to other samples in the embedding space, and evaluation metrics (e.g., Recall@k) that give a higher score to such an embedding space are used [18], [19], [20]. As a result, these methods do not focus on the many-to-many correspondences between semantically similar images and texts; thus, non-paired but semantically similar images and texts will likely be distant in the embedding space [21]. In such an embedding space, even though images and texts representing an exact match may be located accurately, images that are similar to the query text are not located adequately [22], [23].
To realize image retrieval using an ambiguous query in an embedding space where semantically similar images and texts are close, the relationships between these images and texts must be considered explicitly in the training phase. However, semantically similar images and texts are not predefined in general open datasets. Considering that there are many and various semantically similar images and texts, annotating all corresponding semantic relationships manually would be unreasonable. Therefore, a similarity calculation procedure that focuses on the relationships between texts is beneficial [24]. The key point of this procedure is to calculate the similarity between text captions and utilize this similarity as a proxy for the semantic similarity between samples. Following this procedure, cross-modal retrieval that considers semantic relationships can be realized.
In this paper, we propose a cross-modal image retrieval method that can build an embedding space that preserves the semantic relationships among images and texts. Figure 1 illustrates the many-to-many correspondence concept compared to the one-to-one correspondence used in the conventional cross-modal retrieval method. We propose the sentence-based semantic loss function to achieve our objective. The proposed loss function utilizes the semantic relationship as a basis to construct the many-to-many correspondence in the embedding space. The semantic relationship is calculated from the similarity between text captions in the training phase. As a result, the proposed method attempts to mitigate the limitations of conventional one-to-one image retrieval methods. In addition, to evaluate the proposed method, we introduce the semantic relationship distance (SRD) metric, which evaluates whether semantic relationships are preserved.
FIGURE 1. Construction of many-to-many correspondence compared to one-to-one correspondence.
Our primary contributions are summarized as follows.
• We propose a cross-modal image retrieval method that considers the semantic relationships between images and texts by minimizing the distances between semantically similar images and texts in the embedding space.
• We introduce the SRD metric to confirm whether a method constructs an embedding space in consideration of semantic relationships by comparing rankings calculated from image-text similarity and semantic similarity.

II. RELATED WORK
In the following, we review work related to cross-modal retrieval (Section II-A), semantic relationships (Section II-B), and evaluation metrics that consider semantic relationships (Section II-C).

A. CROSS-MODAL RETRIEVAL
The goal of cross-modal retrieval is to retrieve samples of one modality from a query of another modality. It is desirable for humans to retrieve images using a text query of natural language [25]; however, it is difficult to fill the semantic gap between images and texts [26]. To this end, a popular approach is to map images and texts into a common embedding space. In early works, canonical correlation analysis was widely adopted to construct such embedding spaces [27]. With the rapid development of deep learning, convolutional neural networks (CNN) and recurrent neural networks (RNN) are frequently used to extract image and text features [28], [29]. Karpathy and Fei-Fei [30] combined CNN and RNN methods to map image and text features to a common embedding space for cross-modal retrieval. In addition, Faghri et al. [31] applied the hard negative mining strategy, which increased retrieval accuracy effectively.
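To make the one-to-one training objective concrete, the following is a minimal sketch of a bidirectional triplet loss with hardest-negative mining, in the spirit of the strategy described for [31]. The function name, the margin value of 0.2, and the convention that `s[i][j]` is the similarity between image `i` and text `j` (with pairs on the diagonal) are assumptions made for illustration and are not taken from this paper.

```python
def hardest_negative_triplet_loss(s, margin=0.2):
    """Bidirectional hinge loss that uses only the hardest negative per pair.

    s[i][j] is the similarity between image i and text j; the annotated
    pairs are assumed to lie on the diagonal (illustrative convention).
    """
    n = len(s)
    total = 0.0
    for i in range(n):
        positive = s[i][i]
        # Hardest negative text for image i (largest off-diagonal in row i).
        hardest_text = max(s[i][j] for j in range(n) if j != i)
        # Hardest negative image for text i (largest off-diagonal in column i).
        hardest_image = max(s[j][i] for j in range(n) if j != i)
        total += max(0.0, margin + hardest_text - positive)
        total += max(0.0, margin + hardest_image - positive)
    return total
```

Note that every non-paired sample is treated purely as a negative here, regardless of how close its meaning is to the anchor, which is the behavior the many-to-many formulation aims to relax.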
Since a single global feature is not sufficiently representative for cross-modal retrieval, researchers began to match local features (e.g., objects, actions, and properties) and global features from images and texts to improve retrieval accuracy [32]. To correctly match images and texts, the attention mechanism was implemented in cross-modal retrieval to better capture semantically related local features [33], and the semantic consistency between images and texts was considered to improve the alignment between images and texts [17]. In addition to a single attention module, Song and Soleymani [34] utilized a multi-head self-attention network to exploit polysemous meanings. Furthermore, the graph convolutional network (GCN) has been employed in several methods to consider the relationship between local features, and these methods demonstrated good performance [35], [36].
The above methods have achieved impressive performance in retrieving a predefined ground truth image from large-scale public datasets. However, to the best of our knowledge, few existing methods consider the many-to-many correspondence between images and texts.

B. SEMANTIC RELATIONSHIP
Many conventional methods have focused on the one-to-one correspondence in the training phase by applying contrastive loss or triplet loss [37], [38]. Such methods can derive representative features and retrieve similar images; however, they do not exploit the many-to-many correspondence between images and texts. To this end, some uni-modal retrieval methods do consider the semantic relationship between samples. For example, Gordo and Larlus [16] indicated that a human-annotated text caption is semantically informative for images; they selected images whose text captions have high similarity as positive samples and mapped their features close together in the embedding space. Despite the usage of captions, Gordo and Larlus considered all selected images as equally important to the loss calculation; thus, they failed to explicitly quantify the relationship between images. To consider the importance of the images with different similarities, Kim et al. [21] proposed a method that constrains the similarity between images to be consistent with the similarity between text labels. Even though the above methods performed well in uni-modal image retrieval, they did not realize cross-modal image retrieval.
In the cross-modal retrieval field, the fact that text captions are not exclusively related to only a single image has also been considered recently. Li et al. [39] proposed reducing the loss caused by forcing the paired samples to be the same. However, this method did not consider processing natural language queries. Similar to our work, Yu et al. [40] and Zhen et al. [41] attempted to bring semantically similar samples together in the embedding space. However, these methods required additional labels containing high-semantic information to find the semantic relationships. To avoid the usage of additional labels, Chun et al. [20] introduced a probabilistic model into the cross-modal retrieval model, expecting queries to retrieve more semantically similar samples of another modality. The method proposed by Chun et al. [20] is somewhat similar to our proposed method; however, some differences should be highlighted. Chun et al. [20] expanded the range where samples are distributed in the embedding space. In contrast, in our method, the loss function is modified to quantify the relationship between semantically similar samples. Thus, we expect that our proposed method can construct many-to-many correspondences between samples more accurately.

C. EVALUATION METRICS CONSIDERING SEMANTIC RELATIONSHIPS
In most cross-modal retrieval works, retrieval performance is measured using the Recall@k, median rank, and mean rank evaluation metrics. These metrics can assess whether annotated ground truth image-text pairs are matched. However, these metrics focus on the one-to-one correspondence between images and texts and ignore the fact that text captions can describe multiple images in a given dataset. As a result, they do not exploit the many-to-many correspondence between images and texts. Thus, these conventional metrics cannot evaluate whether semantic relationships are preserved, and they cannot fairly assess methods when the retrieved targets are reasonably related to the query.
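The limitation of Recall@k described above can be illustrated with a short sketch. The function and the toy rankings below are hypothetical; they only show that a query counts as a hit when its single annotated ground-truth image appears in the top k, no matter how relevant the other retrieved images are.

```python
def recall_at_k(ranked_candidates, ground_truth, k):
    """Fraction of queries whose annotated ground-truth item is in the top k."""
    hits = sum(
        1
        for ranking, target in zip(ranked_candidates, ground_truth)
        if target in ranking[:k]
    )
    return hits / len(ground_truth)


# Hypothetical rankings of image indices for three text queries.
rankings = [[2, 0, 1], [1, 2, 0], [0, 1, 2]]
targets = [0, 1, 2]  # the single annotated ground-truth image of each query
```

For the first query, image 2 is ranked first. Even if image 2 were a reasonable match for the caption, Recall@1 would record a miss because only image 0 is annotated as the ground truth.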
These metrics do not offer a fair evaluation of retrieval considering many-to-many correspondence; thus, Chun et al. [20] proposed the Plausible-Match R-Precision (PMRP) metric. The PMRP metric computes the ratio of plausibly positive samples ranked in the top-k, where plausibly positive samples are defined using pre-annotated object labels in the dataset. However, the object information is not sufficiently salient to reflect the semantic relationships between images and texts due to a lack of relationship representation between objects [16]. In addition, the PMRP metric requires a hyperparameter to compute the retrieval score, which makes it difficult to evaluate retrieval performance correctly. The evaluation metric we propose in this paper is parameter-free and is more reliable in terms of reflecting semantic relationships between images and texts.
In the video retrieval field, Wray et al. [24] proposed a semantic similarity calculation procedure using text captions. Inspired by Wray et al. [24], we propose a many-to-many evaluation metric based on the similarity between text captions. Note that there are several ways to compute the similarity between sentences. In our work, we adopt the transformer-based Sentence-BERT [42] method to compute the similarity between text captions because it exhibits effective text representation abilities and fast calculation speeds.

III. PROPOSED CROSS-MODAL IMAGE RETRIEVAL METHOD
Here, we present the proposed cross-modal image retrieval method. The proposed method involves three main steps, i.e., STEP I: semantic similarity calculation; STEP II: cross-modal similarity calculation; and STEP III: loss calculation. Figure 2 shows an overview of the proposed method. The dataset used in the conventional method comprises images $I_n$ ($n = 1, \ldots, N$, where $N$ is the number of training samples) and texts $T_m$ ($m = 1, \ldots, N$). Here, $I_n$ and $T_m$ for $n = m$ are the paired image and text in the dataset. First, we calculate the semantic similarities $s^{ss}_{n,m}$ by computing the similarities between text captions $T_n$ and $T_m$. We then calculate the embedded image and text features ($f^{img}_n, f^{txt}_m \in \mathbb{R}^{D_C}$, where $D_C$ is the dimension of the embedded features) and compute their cross-modal similarity $s_{n,m}$ following the conventional method. Finally, a many-to-many correspondence loss based on the semantic similarity is fed back to each module.

A. STEP I: SEMANTIC SIMILARITY CALCULATION
In the first step, we calculate the semantic similarities using the text captions $T_n$ to construct the many-to-many correspondences between semantically similar image and text samples. This process is shown as STEP I in Fig. 2. Inspired by [42], we extract the semantic features $f^{ss}_n \in \mathbb{R}^{D_S}$ from $T_n$ using a trained semantic encoder $E^{ss}(\cdot)$, where $D_S$ represents the dimension of the semantic features. The extracted features $f^{ss}_n$ are used to calculate similarities $s^{ss}_{n,m}$ between $T_n$ and $T_m$. The above procedure can be expressed as follows:

$$f^{ss}_n = E^{ss}(T_n),$$

$$s^{ss}_{n,m} = \mathrm{sim}(f^{ss}_n, f^{ss}_m),$$

where $\mathrm{sim}(\cdot, \cdot)$ denotes the similarity between two feature vectors. By using the calculated semantic similarities $s^{ss}_{n,m}$ in the training phase, the proposed method can keep the many-to-many correspondences between semantically similar images and texts. We construct an embedding space that can consider semantic relationships by adjusting the embedding space to follow $s^{ss}_{n,m}$.
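STEP I can be sketched as follows, under stated assumptions. The caption features are taken as precomputed vectors (in practice they would come from the semantic encoder), cosine similarity is assumed as the similarity function, and the linear rescaling from [-1, 1] to [0, 1] is one possible normalization chosen for illustration. The toy feature values are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_similarity_matrix(features):
    """Pairwise similarities s_ss[n][m] between caption features.

    ASSUMPTION: cosine similarity rescaled linearly from [-1, 1] to [0, 1].
    """
    return [[(cosine(fn, fm) + 1.0) / 2.0 for fm in features] for fn in features]


# Hypothetical 3-D caption features standing in for the encoder output.
toy_features = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 0.0, 1.0]]
s_ss = semantic_similarity_matrix(toy_features)
```

Because the matrix depends only on the captions, it can be computed once before training and reused in every epoch.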

B. STEP II: CROSS-MODAL SIMILARITY CALCULATION
In the second step, I n and T m are embedded into the common semantic embedding space following the conventional method. This process is shown as STEP II in Fig. 2. Theoretically, an arbitrary cross-modal image retrieval method can be applied in this step; thus, we explain the proposed method in reference to the most basic cross-modal image retrieval architecture [44].
First, using the two embedding encoders $E^{img}(\cdot)$ and $E^{txt}(\cdot)$, which are provided by the conventional method, the proposed method calculates $f^{img}_n$ and $f^{txt}_m$ from $I_n$ and $T_m$ as follows:

$$f^{img}_n = E^{img}(I_n), \qquad f^{txt}_m = E^{txt}(T_m).$$

The proposed method then calculates the similarities $s_{n,m}$ between $f^{img}_n$ and $f^{txt}_m$ as follows:

$$s_{n,m} = \mathrm{sim}(f^{img}_n, f^{txt}_m),$$

where $\mathrm{sim}(\cdot, \cdot)$ denotes the similarity between two feature vectors. Conventional methods train the two embedding encoders $E^{img}(\cdot)$ and $E^{txt}(\cdot)$ to make $s_{n,m}$ for $n = m$ larger than $s_{n,m}$ for $n \neq m$. Although the training strategy in conventional methods allows the encoders to preserve the one-to-one correspondence, the many-to-many correspondence between the semantically similar samples is not guaranteed explicitly.
To address this issue, the proposed method preserves both one-to-one and many-to-many correspondences using the semantic similarities $s^{ss}_{n,m}$ and cross-modal similarities $s_{n,m}$. Specifically, the proposed method trains $E^{img}(\cdot)$ and $E^{txt}(\cdot)$ to follow $s^{ss}_{n,m}$. With this procedure, the constructed embedding space is expected to preserve the semantic relationships between the images and texts.

C. STEP III: LOSS CALCULATION
In the third step, we calculate the proposed sentence-based semantic loss $L_{sbs}$ to fine-tune the embedding encoders. This process is shown as STEP III in Fig. 2. The sentence-based semantic loss $L_{sbs}$ is calculated by combining the text-to-image many-to-many correspondence loss $L^{t2i}_{sbs}$ and the image-to-text many-to-many correspondence loss $L^{i2t}_{sbs}$ as follows:

$$L_{sbs} = L^{t2i}_{sbs} + L^{i2t}_{sbs}.$$
Note that each loss focuses on preserving both the one-to-one and many-to-many correspondences from the text-to-image view and the image-to-text view, respectively. Although the goal of the proposed method is to retrieve desired images from a query text, we introduce both text-to-image and image-to-text directional losses following the conventional cross-modal image retrieval methods [45]. The introduced losses are constructed based on the combination of the triplet loss and log-ratio loss [21]. Generally, the triplet loss is used in cross-modal image retrieval to preserve the one-to-one correspondence, and the log-ratio loss is used to ensure the many-to-many correspondence. By combining these two loss functions, the proposed text-to-image sentence-based semantic loss $L^{t2i}_{sbs}$ is calculated using the cross-modal similarity ratio

$$\nu_{n,m} = \frac{s_{n,m}}{s_{n,n}}.$$

In this loss, $\lambda$, $\delta$, and $\hat{s}_n$ are the threshold to determine similar text, the margin hyperparameter, and the minimum of $s^{ss}_{n,m}$ for $s^{ss}_{n,m} \geq \lambda$, respectively. The proposed text-to-image sentence-based semantic loss $L^{t2i}_{sbs}$ decreases as the cross-modal similarity between semantically similar samples becomes closer to the corresponding semantic similarity. In other words, by training the embedding encoders $E^{txt}(\cdot)$ and $E^{img}(\cdot)$ to minimize $L^{t2i}_{sbs}$, the embedding space constructed by the embedding encoders preserves the semantic relationships between images and texts.
In addition, following the conventional cross-modal image retrieval procedure, we introduce the image-to-text sentence-based semantic loss $L^{i2t}_{sbs}$, whose term for pairs below the threshold is

$$\max\{0,\ \delta - \hat{s}_n + s_{m,n}\} \quad (s^{ss}_{m,n} < \lambda), \qquad \nu_{n,m} = \frac{s_{n,m}}{s_{n,n}}.$$
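The structure of the loss can be sketched as below. This is an illustration under several loudly stated assumptions and should not be read as the exact loss of the proposed method. First, the term for semantically similar pairs is assumed to be a squared difference between the logarithm of the cross-modal ratio and the logarithm of the semantic similarity, in the spirit of the log-ratio loss [21]. Second, the reference similarity in the hinge term is read as the cross-modal similarity of the least semantically similar sample among those above the threshold, an interpretation chosen so that a threshold of 1.0 yields the triplet form max{0, δ − s_{n,n} + s_{n,m}}. Third, the image-to-text loss is assumed to be the same computation on transposed matrices, and terms are simply summed. All function names are hypothetical.

```python
import math


def sbs_loss_t2i(s, s_ss, lam=0.75, delta=0.1):
    """Sketch of a text-to-image sentence-based semantic loss.

    s[n][m]    : cross-modal similarity, assumed to lie in (0, 1]
    s_ss[n][m] : semantic similarity, assumed to lie in (0, 1], with 1.0
                 on the diagonal so that every row has a similar sample
    """
    total = 0.0
    size = len(s)
    for n in range(size):
        # Samples whose captions are semantically similar to sample n.
        similar = [m for m in range(size) if s_ss[n][m] >= lam]
        # ASSUMPTION: reference similarity taken at the least similar of them.
        weakest = min(similar, key=lambda m: s_ss[n][m])
        s_hat = s[n][weakest]
        for m in range(size):
            if s_ss[n][m] >= lam:
                # ASSUMPTION: squared log-ratio term for similar pairs.
                nu = s[n][m] / s[n][n]
                total += (math.log(nu) - math.log(s_ss[n][m])) ** 2
            else:
                # Hinge term keeping dissimilar samples below s_hat by delta.
                total += max(0.0, delta - s_hat + s[n][m])
    return total


def sbs_loss(s, s_ss, lam=0.75, delta=0.1):
    """Sum of both directions; the i2t term reuses the t2i code on transposes."""
    s_t = [list(column) for column in zip(*s)]
    s_ss_t = [list(column) for column in zip(*s_ss)]
    return sbs_loss_t2i(s, s_ss, lam, delta) + sbs_loss_t2i(s_t, s_ss_t, lam, delta)
```

Under these assumptions, the loss is zero when every similar pair has a cross-modal ratio equal to its semantic similarity and every dissimilar pair lies at least δ below the reference similarity.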
As is known in the cross-modal image retrieval field, the overall loss can be constrained in the text-to-image and image-to-text directions by introducing both losses. These constraints treat images and texts fairly, which leads to the construction of an accurate embedding space [45]. Using the model trained by $L_{sbs}$, the retrieval task is performed by simply calculating the cross-modal similarity between the candidate images and query text, and then ranking the candidate images by the cross-modal similarity. Here, there is no need to calculate semantic similarity for the retrieval task.
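The retrieval step described above reduces to a similarity computation followed by a sort. A minimal sketch is given below, assuming that the embedded features are already available and that cosine similarity serves as the cross-modal similarity. The function names and the toy features are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_images(query_feature, image_features):
    """Candidate image indices sorted by descending cross-modal similarity."""
    scores = [cosine(query_feature, feature) for feature in image_features]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical embedded query text and three embedded candidate images.
query = [1.0, 0.0]
candidates = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
ranking = rank_images(query, candidates)  # [1, 2, 0]
```

No semantic encoder appears in this sketch, which reflects the point that semantic similarity is needed only during training.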

IV. EXPERIMENTS
We conducted experiments on two frequently used open datasets to evaluate the effectiveness of the proposed method. The experimental settings and results are described in the following subsections.

A. EXPERIMENTAL SETTINGS
1) DATASETS
In our experiments, we used the large-scale MSCOCO dataset [46] and Flickr30K dataset [47], which are adopted by most cross-modal image retrieval methods. The two datasets contain images and corresponding texts that describe the contents of the paired image. For MSCOCO, following the widely used data splits provided by [44], 123,287 and 5,000 images were used for the training and test sets, respectively. For Flickr30K, following the data splits provided by [31], 29,783, 1,000, and 1,000 images were used for the training, validation, and test sets, respectively. After training, we evaluated the retrieval performance of the proposed method by retrieving the target image from the test set using each corresponding text as a query.
FIGURE 4. Calculation process of proposed SRD metric. $r_{q,n}$ and $r^{ss}_{q,n}$ are the ranks of candidate images calculated from $s_{q,n}$ and $s^{ss}_{q,n}$, respectively.

2) IMPLEMENTATION DETAILS
To evaluate the effectiveness of our sentence-based semantic loss function, we introduced our loss to the training of recently proposed cross-modal image retrieval methods [31], [34], [35], [36]. We compared the cross-modal retrieval methods trained with our loss against the original ones. In addition, we compared the models fine-tuned with the proposed loss to PCME [20], which accounts, in a different way, for the fact that images and texts are not exclusively related. Technical details of these methods are listed in Table 1. All comparative methods adopted an RNN, in the form of a gated recurrent unit (GRU) [48], to extract the text features. For VSE++, PVSE, and PCME, a CNN was adopted to extract image features. Specifically, VGG was utilized in VSE++, and ResNet was utilized in PVSE and PCME. For the SGRAF and CGMN methods, the Faster-RCNN [49] object detector was employed to calculate the image features, and then the GCN [50] was adopted to realize the image-text matching. These methods were implemented based on the open-source code provided by each author. Note that the trained weights of all the models are also provided by each author, and we fine-tuned these models using our proposed loss function. In the fine-tuning process, we used the Adam optimizer [51], and the models were fine-tuned for 10 epochs using our proposed loss function with an initial learning rate of 2e-5 and a batch size of 64. For the hyperparameters, we experimentally set $\lambda = 0.75$ and $\delta = 0.1$, and the cosine similarity was normalized to the range [0, 1]. In our method, considering semantic information of the relationships between words is crucial for calculating semantic similarity. For this reason, we follow Sentence-BERT [42] for constructing the semantic feature encoder $E^{ss}(\cdot)$.
Compared with other sentence similarity calculation methods such as Bag-of-words, Word2Vec [52], and Sent2Vec [53], Sentence-BERT can accurately calculate semantic similarity considering the relationships between words in the full sentence. This is because Sentence-BERT is trained on datasets with huge amounts of annotated similar sentence pairs. By extracting the text features based on Sentence-BERT, semantic information can be accurately considered in our method.
FIGURE 7. Top-10 retrieval results of the PVSE (K = 1) w $L_{sbs}$ and the original PVSE (K = 1). The queries contain ambiguous words and phrases, e.g., something and some sports. Images with the red frame show that these images are less semantically consistent with the query.

B. EVALUATION CONSIDERING SEMANTICALLY SIMILAR SAMPLES
Here, we describe evaluations that focus on semantic relationships. Recall@k is commonly used to evaluate the performance of cross-modal image retrieval; thus, evaluation metrics for semantic relationships have not been considered extensively. In addition, the MSCOCO and Flickr30K datasets do not provide multiple ground truth images that correspond to a single sentence. These factors may result in underestimation when evaluating a cross-modal retrieval method, as shown in Fig. 3. Thus, we introduce the SRD evaluation metric to assess whether the semantic relationships are preserved. The calculation process is shown in Fig. 4. SRD@k simply calculates the distance between the ranking $r_{q,n}$ ($q = 1, \ldots, Q$) and the ranking $r^{ss}_{q,n}$ calculated from the cross-modal similarity and the semantic similarity, respectively, where $r_{q,n}$ denotes the rank of the $n$-th candidate image calculated from the $q$-th query. SRD@k is defined as follows:

$$\mathrm{SRD@}k = \frac{1}{Qk} \sum_{q} \sum_{n} \begin{cases} |r_{q,n} - r^{ss}_{q,n}| & (r^{ss}_{q,n} < k) \\ 0 & (\text{otherwise}). \end{cases}$$
The value of SRD@k becomes smaller as $r_{q,n}$ and $r^{ss}_{q,n}$ become closer. Considering that $r^{ss}_{q,n}$ is calculated based on the semantic similarity, SRD@k can be used to evaluate whether the semantic relationships are preserved. Note that a small SRD value indicates better many-to-many retrieval performance. As shown in Fig. 3, SRD considers the semantic relationships between images and texts in the evaluation procedure; thus, retrieval performance can be evaluated more reasonably.
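The SRD@k calculation can be sketched as follows. The sketch assumes that ranks start at 0 for the most similar candidate, so that the condition on the semantic rank selects exactly the k semantically closest candidates, and that ties are broken by candidate order. Neither convention is specified in the text. The function names and toy scores are hypothetical.

```python
def ranks_from_scores(scores):
    """Rank of each candidate under descending score (0 = most similar)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, index in enumerate(order):
        ranks[index] = position
    return ranks


def srd_at_k(cross_modal_scores, semantic_scores, k):
    """Mean absolute rank difference over the k semantically closest candidates.

    Both arguments hold one list of per-candidate scores for each query.
    """
    total = 0
    for s_q, s_ss_q in zip(cross_modal_scores, semantic_scores):
        r = ranks_from_scores(s_q)
        r_ss = ranks_from_scores(s_ss_q)
        total += sum(abs(a - b) for a, b in zip(r, r_ss) if b < k)
    return total / (len(cross_modal_scores) * k)


# One hypothetical query with three candidate images.
cross_modal = [[0.9, 0.2, 0.5]]  # ranks 0, 2, 1
semantic = [[0.9, 0.8, 0.1]]     # ranks 0, 1, 2
```

In this toy case, the second candidate is semantically the second closest but is ranked last by the model, so it contributes a rank difference of 1 and SRD@2 evaluates to 0.5. A model whose ranking matches the semantic ranking would obtain 0.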

1) CONVERGENCE ANALYSIS
We show the convergence curve of the proposed $L_{sbs}$ in each method on the MSCOCO dataset in Fig. 5. As shown in Fig. 5, $L_{sbs}$ successfully converged in all methods.

2) QUANTITATIVE RESULTS
The experimental results obtained on the MSCOCO dataset are shown in Table 2 and Fig. 6. Note that a small SRD value indicates better many-to-many retrieval performance. As shown in Table 2, each method with the proposed $L_{sbs}$ (noted as w $L_{sbs}$) outperforms the original method in terms of both SRD@5 and SRD@10. In addition, Fig. 6 shows the SRD@k of the methods w $L_{sbs}$ and the comparative methods at different $k$ values. As shown in Fig. 6, each method w $L_{sbs}$ achieves better SRD values than the original one when $k > 2$. In particular, PVSE (K = 1) w $L_{sbs}$ outperforms PCME by 29.0 and 66.5 in terms of SRD@5 and SRD@10, respectively, on the MSCOCO dataset. Considering that the major difference between the two methods is that PVSE (K = 1) w $L_{sbs}$ considers the semantic relationships between samples, while PCME considers the distribution for a single sample, we confirmed that the usage of the proposed $L_{sbs}$ enables the model to consider more semantically related images. These results show the effectiveness of considering the semantic relationship between samples in the training phase. It is also notable that the PCME method exhibits a gentler upward trend than the other comparative methods shown in Fig. 6(a). From this result, we consider that SRD@k is a reasonable metric to evaluate many-to-many retrieval performance.
Here, since there are no completely identical sentences in this dataset, the similarity between a certain sentence and the other sentences in the dataset cannot reach 1.0. Thus, SRD@1 only evaluates whether the one-to-one correspondences between images and texts are preserved. In addition, our proposed $L_{sbs}$ utilized semantically similar samples rather than one pair of samples for training. This somewhat weakened the correspondence in the annotated pairs, but strengthened the semantic relationship between samples. On the other hand, despite the usage of distributions, PCME still considers one single pair. For the above reasons, it is reasonable that PVSE (K = 1) w $L_{sbs}$ performed worse than PCME in terms of SRD@1, which is equivalent to an evaluation metric for the one-to-one retrieval task only. These reasons can also account for the degradation in SRD@1 for other methods.
Furthermore, the experimental results obtained on the Flickr30K dataset are shown in Table 3. As shown in Table 3, for the Flickr30K dataset, we obtained the same trend of SRD results as for the MSCOCO dataset. In addition, some methods even obtained a gain in SRD@1. Considering that the training set of Flickr30K (29,783 images) is smaller than that of MSCOCO (123,287 images), we infer that the usage of semantically similar samples can boost the one-to-one retrieval performance when the training set is small. These results demonstrate the effectiveness of the proposed $L_{sbs}$ in considering the semantic relationships between images and texts.

3) QUALITATIVE RESULTS
To evaluate the influence of the proposed $L_{sbs}$ on ambiguous query retrieval performance, we conducted a qualitative experiment on PVSE (K = 1) w $L_{sbs}$ and the original PVSE (K = 1) trained while keeping the other conditions the same. Since PVSE (K = 1) is the most typical cross-modal retrieval method using CNN and RNN with the attention mechanism, we analyze the qualitative results of this method.
Here, we input queries including ambiguous pronouns instead of particular nouns. Figure 7 shows the retrieval results obtained by each version of PVSE (K = 1). As shown in Fig. 7, when the query "A man is riding something in a mountain" was used, PVSE (K = 1) w $L_{sbs}$ retrieved images containing information about people riding skis and about mountains. In comparison, PVSE (K = 1) paid more attention to the mountain information and ignored the "riding something" information. For the query "people doing some sports," PVSE (K = 1) w $L_{sbs}$ retrieved more images including sports information than PVSE (K = 1). For the "something is flying in the sky" query, the retrieval results obtained by PVSE (K = 1) included some images of kites, and the results obtained by PVSE (K = 1) w $L_{sbs}$ were all images of planes. When the query "a man is holding something in the kitchen" was input, PVSE (K = 1) retrieved some images that failed to include the "holding something" information. In comparison, PVSE (K = 1) w $L_{sbs}$ successfully retrieved images related to all information given in the query. These results demonstrate that PVSE (K = 1) w $L_{sbs}$ achieved better retrieval performance with ambiguous queries and constructed an embedding space that considers semantic relationships more effectively.

4) ABLATION STUDY
We also conducted an ablation study with different values for $\lambda$ to analyze its complexity and sensitivity. For the same reason mentioned in the qualitative result analysis, we conducted the study on PVSE (K = 1). $\lambda$ is the most important hyperparameter in $L_{sbs}$; it determines how many samples should be considered semantically similar to the target sample. For a large $\lambda$ value, fewer samples would be selected as being semantically similar to the anchor sample. Here, we set $\lambda \in \{0.0, 0.25, 0.5, 0.75, 1.0\}$, and the SRD@k results are shown in Fig. 8. For $\lambda = 0.0$, all samples were used in the log-ratio calculation, so the sentence-based semantic loss consists of the log-ratio term computed from

$$\nu_{n,m} = \frac{s_{n,m}}{s_{n,n}}.$$

When $\lambda = 1.0$, the sentence-based semantic loss reduces to a triplet loss, which is expressed as follows:

$$L_{sbs} = \max\{0,\ \delta - s_{n,n} + s_{n,m}\}.$$
We found that the best SRD@5 and SRD@10 results were obtained when $\lambda$ was approximately 0.75. In addition, an acceptable SRD@1 was obtained at the same time. This means that when $\lambda$ was approximately 0.75, the retrieval performance considering many-to-many correspondence is guaranteed while maintaining fairly stable one-to-one retrieval performance. Thus, we selected $\lambda = 0.75$ for the proposed $L_{sbs}$.
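The role of the threshold in the ablation can be illustrated with a short sketch that counts how many samples are treated as semantically similar to one anchor at each tested value. The semantic similarity values in the toy row are hypothetical and are not taken from either dataset.

```python
def count_similar(s_ss_row, lam):
    """Number of samples whose semantic similarity to the anchor is >= lam."""
    return sum(1 for value in s_ss_row if value >= lam)


# Hypothetical semantic similarities between one anchor caption and six
# captions (the first entry is the anchor itself).
row = [1.0, 0.9, 0.8, 0.6, 0.3, 0.1]
thresholds = [0.0, 0.25, 0.5, 0.75, 1.0]
counts = [count_similar(row, lam) for lam in thresholds]  # [6, 5, 4, 3, 1]
```

At a threshold of 0.0 every sample enters the log-ratio calculation, while at 1.0 only the anchor's own pair remains and all other samples are handled by the margin term, which matches the two limiting cases described above.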

5) LIMITATIONS
Several limitations should be discussed. First, as with other machine learning methods that rely on hyperparameters, the performance of methods using $L_{sbs}$ is sensitive to the threshold parameter $\lambda$ (Fig. 8). It is difficult to define the extent to which two texts are truly similar from a human perspective, making it difficult to determine an appropriate value for $\lambda$. Improving the design of the semantic similarity calculation procedure may reduce such difficulty. Second, the performance of methods using $L_{sbs}$ exhibited an undesired drop in both one-to-one retrieval and many-to-many retrieval performance when $\lambda$ was set to approximately 0.5. This may result from the mixed use of the ratio calculation and the addition calculation in the loss function. A more carefully designed parameter-free sentence-based semantic loss function may reduce the impact of these limitations. In addition, the best SRD@10 value obtained by the proposed method was over 300 on the MSCOCO dataset (Table 2), which indicates that the semantically similar samples were still not ranked high enough in the retrieval process. Thus, in the future, we plan to construct a more reasonable architecture to better satisfy the many-to-many retrieval objective.

V. CONCLUSION
In this paper, we have proposed a cross-modal image retrieval method that can consider the many-to-many correspondence between images and texts. We achieved this objective by introducing a novel sentence-based semantic loss function that can be applied to an arbitrary cross-modal image retrieval method. The effectiveness of our proposed loss was evaluated experimentally, and the results showed that methods using our proposed loss function outperformed those without it in terms of the proposed SRD metric, which was designed to evaluate many-to-many correspondences.
In addition, the results of the qualitative experiment indicate that introducing our proposed loss function enables the retrieval of semantically similar images using ambiguous queries. We expect that this work can trigger further research on semantic meanings in the embedding space. In the future, we plan to improve both our loss function and the design of a model architecture that can better utilize the loss function.