Recallable Question Answering-Based Re-Ranking Considering Semantic Region for Cross-Modal Retrieval

Question answering (QA)-based re-ranking methods for cross-modal retrieval have recently been proposed to further narrow down similar candidate images. Conventional QA-based re-ranking methods provide questions to users by analyzing the candidate images, and the initial retrieval results are re-ranked based on the user's feedback. However, focusing only on performance improvement makes it difficult to efficiently elicit the user's retrieval intention. To realize more useful QA-based re-ranking, user interaction for eliciting the retrieval intention must be considered. In this paper, we propose a QA-based re-ranking method that considers two factors important for eliciting the user's retrieval intention: query-image relevance and recallability. Considering the query-image relevance allows the method to focus only on the candidate images related to the provided query text, while focusing on recallability enables users to easily answer the provided questions. With these procedures, our method can efficiently and effectively elicit the user's retrieval intention. Experimental results on Microsoft Common Objects in Context and a computationally constructed dataset containing similar candidate images show that our method improves the performance of both cross-modal retrieval methods and QA-based re-ranking methods.


I. INTRODUCTION
Multimedia information, especially images, has become familiar with the recent spread of wearable cameras and smartphones. We frequently record our lives as images, and opportunities for sharing these images have been increasing [1]. On the other hand, with these opportunities, manually managing and finding images on personal devices takes considerable effort [2]. Recently, to support such situations, cross-modal retrieval methods that use a text as a query have been proposed as an effective means of image retrieval [3], [4], [5], [6], [7], [8]. Since we use texts in our daily life, using them as queries is convenient and has a wide range of applications [9]. Specifically, cross-modal retrieval methods embed the provided query text and each candidate image in a shared space, and the embedded features are used to retrieve the relevant images. By focusing especially on refining the embedding procedures, conventional methods have improved image retrieval performance.
Although recent cross-modal retrieval methods can match texts and images accurately [10], [11], they cannot sufficiently narrow down similar candidate images when the query text does not contain enough information. Since users generally do not perfectly understand all content of the candidate images, it is difficult for them to provide a query text that uniquely identifies the desired images. Namely, no matter how much the matching performance between texts and images is improved, users may fail to find the desired images depending on the content of the query text. Furthermore, this problem becomes critical when the database contains multiple similar images, such as those taken by lifelog cameras. For example, it is difficult to retrieve the desired image using the query text "lunch with my friends" from a database with many similar food scene images. To deal with this problem, we have proposed question answering (QA)-based re-ranking methods that provide questions for clarifying the user's intent to enhance the retrieval performance [12], [13]. The task of QA-based re-ranking is shown in "our task" of Fig. 1. Specifically, the conventional QA-based re-ranking methods search for clue information that can narrow down the candidate images. By asking users whether or not the clue information is included in the desired images, these methods conduct re-ranking. With this procedure, these methods can complement the insufficient information for narrowing down similar candidate images, and users can enhance the retrieval performance simply by answering the presented questions.

FIGURE 1. The task and concept of our paper. The task of our paper is QA-based re-ranking, and we mainly focus on the recallability of the QA. With recallable QA-based re-ranking, users can easily answer the provided questions for re-ranking the retrieval results.
Despite these QA-based re-ranking developments, there is still room for improvement regarding one experimental and two methodological concerns. The first concern is that previous papers do not quantitatively examine the effect of similar candidate images on the initial cross-modal retrieval performance. By examining it, the importance of QA-based re-ranking can be further confirmed. The second concern is that query-image relevance is not explicitly considered. Since the query text is the most important clue information provided by users, considering the relevance between the query text and candidate images affects the re-ranking performance [14], [15]. However, the conventional methods do not explicitly consider this relevance when generating questions, and candidate images that were already screened out in the initial retrieval are treated as important information. The third concern is that the recallability of the provided questions is not considered. In interactive re-ranking, since users provide additional information by recalling their desired images, the topics used for re-ranking should be easily recallable by users [16]. If the provided questions are not recallable, the QA-based re-ranking cannot be conducted effectively since users cannot answer the questions. However, the conventional methods do not focus on recallability, and information that is not in the user's memory may be used for question answering. To boost the value of QA-based re-ranking, the examination of similar candidate images, the query-image relevance, and the recallability should be further considered.
In this paper, we improve QA-based re-ranking from experimental and methodological perspectives. From the experimental perspective, we quantitatively examine the effect of similar candidate images on the initial retrieval performance. From the methodological perspective, we propose a QA-based re-ranking method that considers two important factors: query-image relevance and recallability. The focus of our method is shown in Fig. 1. Our method introduces two types of weights for question generation. First, the weight for query-image relevance is constructed based on the fact that QA-based re-ranking builds on the initial retrieval. By using the similarities between the query text and the candidate images calculated in the initial retrieval, our method can generate questions that focus only on the candidate images related to the query text. Next, we introduce the weight for recallability. Considering that users recall their desired images when answering questions, the recallability of a question depends on the memorability of the semantic information used for QA. The memorability and the region ratio of each piece of semantic information are known to be correlated [17], [18]. Also, as a preliminary study, we confirmed that Pearson's correlation coefficient between the memorability and the region ratio is 0.47 (p-value < 0.01) on the dataset provided by [17], and a coefficient of 0.47 is generally regarded as a moderate correlation for our data size [19], [20], [21]. Based on these facts, by considering the region ratio of the topic in the questions, our QA-based re-ranking method is expected to provide questions that are recallable by users.
The contributions of this paper are summarized as follows.
Validating the effects of similar candidate images: By quantitatively validating the effect of similar candidate images on the initial retrieval performance, the effectiveness of QA-based re-ranking is further confirmed.
Query-image relevance for QA-based re-ranking: For focusing mainly on the candidate images related to the user's intent, our method uses the similarities between the query text and each candidate image calculated in the initial retrieval for question generation.
Recallability for QA-based re-ranking: To provide questions whose topics are recallable by users, our method uses the region ratio of each piece of semantic information for question generation.

II. RELATED WORK
A. CROSS-MODAL INITIAL RETRIEVAL
Conventionally, various image retrieval methods that use multiple types of query data have been proposed, such as single-modality image retrieval using an image query [22], multimodal image retrieval using queries of multiple modalities [23], and cross-modal image retrieval using a query of a non-image modality [3], [5], [6], [8]. Among them, our method focuses on re-ranking, via QA, the retrieval results calculated by cross-modal initial retrieval methods. The major cross-modal initial retrieval procedure is to embed a query text and each candidate image into a shared space E. That is, these methods train two encoders M_L(·): L → E and M_V(·): V → E, where L and V are the text and image spaces, respectively. By embedding a query text and each candidate image via the encoders M_L(·) and M_V(·), similarities between the query text and each candidate image are calculated in the space E.
Conventional cross-modal retrieval was first realized based on statistical correlation analysis. Recently, deep neural network technologies have been successfully utilized in representation learning for obtaining the encoders M_L(·) and M_V(·). Based on the hinge loss, Kiros et al. [3] trained M_L(·) and M_V(·) so that the similarities between correct text-image pairs are higher than those between other pairs. Faghri et al. [5] improved the method of Kiros et al. [3] by focusing on the hard negative samples between a text query and the corresponding image.
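As a minimal sketch (not the authors' implementation), the bidirectional hinge objective of [3] and the hard-negative refinement of [5] can be written as follows; the margin value and the toy similarity matrix are illustrative assumptions.

```python
import numpy as np

def contrastive_hinge_loss(sim, margin=0.2, hard_negative=False):
    """Bidirectional hinge loss over a similarity matrix.

    sim[i, j] is the similarity between text i and image j; the diagonal
    holds the correct text-image pairs. With hard_negative=True only the
    most violating negative is kept, as in the refinement of Faghri et al.
    """
    n = sim.shape[0]
    pos = np.diag(sim)                      # similarities of correct pairs
    # cost of ranking a wrong candidate above the paired one
    cost_img = np.maximum(0.0, margin + sim - pos[:, None])  # text -> images
    cost_txt = np.maximum(0.0, margin + sim - pos[None, :])  # image -> texts
    mask = 1.0 - np.eye(n)                  # ignore the positive pairs
    cost_img, cost_txt = cost_img * mask, cost_txt * mask
    if hard_negative:
        return float(cost_img.max(axis=1).sum() + cost_txt.max(axis=0).sum())
    return float(cost_img.sum() + cost_txt.sum())

# A similarity matrix whose diagonal dominates by the margin incurs no loss.
perfect = np.array([[1.0, 0.0], [0.0, 1.0]])
print(contrastive_hinge_loss(perfect))  # 0.0
```

With `hard_negative=False` the loss sums over all violating negatives as in [3]; setting it to `True` keeps only the hardest negative per query, which is the key change introduced in [5].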
Unlike the conventional cross-modal retrieval methods, our method can improve the retrieval performance for similar candidate images. With QA-based re-ranking focused on clarifying similar candidate images, the conventional cross-modal retrieval performance is expected to be improved even for such images.

B. RE-RANKING FOR CROSS-MODAL RETRIEVAL
Re-ranking has attracted attention in diverse retrieval tasks, such as object-based retrieval, person re-identification, and text-based image retrieval [24], [25], [26], [27], [28]. Similarly, in the cross-modal retrieval task, various re-ranking approaches have also been examined. The re-ranking for the cross-modal retrieval can be classified based on the necessity of user interaction: self re-ranking and feedback-based re-ranking.
The former improves cross-modal retrieval performance by estimating key information from the higher-ranked images of the initial cross-modal retrieval results. Several self-re-ranking methods [29] use the initially ranked candidate images as queries for ranking text labels and re-rank the initial retrieval results to consider the consistency between retrieved texts and retrieved images. These methods can re-rank the initial retrieval results without feedback information; however, they cannot obtain additional information from users. Therefore, these self-re-ranking methods cannot adequately deal with the problem of similar candidate images.
Feedback-based re-ranking aims to improve retrieval performance based on user feedback. Several learning-based methods [12], [13], [30], [31] allow users to provide natural language-based feedback on retrieval results. Specifically, a reinforcement learning-based re-ranking method for fashion-domain retrieval was proposed by Guo et al. [30]. By learning the texts that represent differences between images, they can estimate the user's desired images from natural language-based feedback on the top-ranked image. Although the method of Guo et al. [30] enables users to provide natural language-based feedback, there is no guarantee that the provided feedback effectively clarifies similar candidate images. Besides, users are required to devise additional natural language-based queries for the re-ranking. Also, as the most relevant work, Yanagi et al. [12], [13], [31] proposed re-ranking methods that receive information about objects in the target image. The methods proposed in [13], [31] calculate the entropy of each piece of object information based on its existing proportion, and the object information with the largest entropy is used for QA. The method proposed in [12] uses a trainable fully-connected layer to estimate the object information with the largest entropy, and the estimated information is used for QA-based re-ranking. Although these methods receive additional information for re-ranking, they do not focus on query-image relevance and recallability. By considering them, the proposed method realizes a more useful QA-based re-ranking.
Similarly, QA-based retrieval methods have been proposed in the general information retrieval field (text-to-document retrieval) [32], which is one of the most relevant lines of research. These methods present a question for re-ranking based on universal information retrieval histories and receive feedback from users. They can clarify ambiguities contained in the query text when retrieving Web content; however, it is difficult for them to retrieve the desired information from a database without sufficient retrieval histories. Besides, they cannot present questions that are adequate for each database.

III. RECALLABLE QA-BASED RE-RANKING
Our method consists of three steps: initial retrieval, question generation, and QA-based re-ranking. Fig. 2 shows an overview of our method. In the first step, our method calculates similarities between the query text and each candidate image in the shared space, and the initial retrieval results are determined based on the similarities. In the next step, our method extracts pixel-by-pixel semantic information for each candidate image, and the semantic information with the largest region and the largest entropy in the query-related images is calculated. In the last step, our method asks users whether the calculated semantic information is included in the desired image. Based on the answers from users, our method determines the re-ranking results.

A. INITIAL RETRIEVAL
For providing the candidate images I_n (n = 1, . . . , N; N being the number of candidate images) related to the query text Q, the initial retrieval results are calculated based on the cross-modal retrieval procedure. Although diverse cross-modal retrieval methods can in principle be utilized for calculating the initial retrieval results, our first step is described following the most primal cross-modal retrieval method [3].

FIGURE 2. Overview of the proposed method. At first, our method calculates similarities between the query text and each candidate image in the shared space, and the initial retrieval results are determined based on the similarities. Next, our method extracts pixel-by-pixel semantic information for each candidate image, and the semantic information with the largest region and the largest entropy in the query-related images is calculated. Finally, our method asks users whether the calculated semantic information is included in the desired image. Based on the answers from users, our method determines the re-ranking results.
At first, the text and image features (f^L ∈ R^{D_L} and f^V_n ∈ R^{D_V}) are respectively calculated from the query text Q and the candidate images I_n, where D_L and D_V denote the dimensions of each feature. The extracted text and image features are then embedded into the features (f^E ∈ R^{D_E} and f^E_n ∈ R^{D_E}) of the shared space via the encoders M_L(·) and M_V(·) as follows:

f^E = M_L(f^L), (1)
f^E_n = M_V(f^V_n), (2)

where D_E denotes the dimension of the embedded features.
The embedded features f^E and f^E_n are used for calculating the similarities s_n between the query text and the candidate images via the cosine similarity as follows:

s_n = (f^E · f^E_n) / (||f^E|| ||f^E_n||). (3)

By sorting the candidate images I_n in descending order of s_n, we obtain the initial retrieval results I_{r_k} (k = 1, 2, . . . , N), where r_k represents the index of the kth-ranked image (e.g., I_{r_5} represents the 5th-ranked candidate image).
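The initial ranking step can be sketched as follows; the feature values are toy assumptions standing in for the outputs of the encoders M_L(·) and M_V(·).

```python
import numpy as np

def initial_retrieval(f_e, f_e_n):
    """Rank candidate images by cosine similarity to the query text.

    f_e   : (D_E,) embedded query-text feature
    f_e_n : (N, D_E) embedded candidate-image features
    Returns the similarities s_n and the ranking r_k (r_k[0] is the
    index of the top-ranked candidate image).
    """
    f_e = f_e / np.linalg.norm(f_e)
    f_e_n = f_e_n / np.linalg.norm(f_e_n, axis=1, keepdims=True)
    s_n = f_e_n @ f_e            # cosine similarities in the shared space
    r_k = np.argsort(-s_n)       # descending order of s_n
    return s_n, r_k

# Candidate 1 points in the same direction as the query, so it ranks first.
query = np.array([1.0, 0.0])
candidates = np.array([[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]])
s_n, r_k = initial_retrieval(query, candidates)
print(list(r_k))  # [1, 2, 0]
```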

B. QUESTION ANSWERING
If users are not satisfied with the initial retrieval results, they can re-rank them by answering the questions provided by our method. Specifically, from the semantic information l_d (d = 1, 2, . . . , D; D being the number of detectable pieces of semantic information), our method explores the semantic information l_d̂ that is suitable in terms of performance improvement and recallability. After that, our method provides the question "Is l_d̂ included in the desired image?" and receives an answer ("yes" or "no") from the user. For improving the retrieval performance via questions, our method estimates questions that can clarify the initial retrieval results. Here, since our system does not know the user's desired image, the estimated questions should be effective whether the user's answer is "yes" or "no". To provide such questions, we search for the semantic information that is included in half of the not-yet-screened candidate images, following the conventional methods [12], [31]. Here, whether each candidate image is initially screened or not is reflected in the query-image relevance score s_n in (3). Therefore, by using the semantic information that is included in half of the candidate images with higher s_n, our method can equally and effectively screen the candidate images whether the user's answer is "yes" or "no". Namely, our method can effectively improve the retrieval performance even though the user's desired image is unknown. Furthermore, if the provided question is not recallable for users, they cannot answer it. To provide questions with semantic information recallable by users, following recently reported human perception studies [17], [18], our method also searches for the semantic information with a large region in the query-related images. To sum up, in this step, our method explores the semantic information l_d̂ with the largest entropy and the largest region in the query-related images.
Firstly, our method extracts pixel-by-pixel semantic information from each candidate image. For each candidate image I_{r_k}, the number of pixels x_{d,r_k} with the semantic information l_d (d = 1, 2, . . . , D; D being the number of detectable pieces of semantic information) is calculated using the trained semantic segmentation model M_seg(·). Then the one-hot variable o_{d,r_k} is calculated as follows:

o_{d,r_k} = 1 if x_{d,r_k} > 0, and o_{d,r_k} = 0 otherwise. (4)

Then our method computes the proportion p_{d,r_k} (k = 1, 2, . . . , N) of candidate images including the semantic information l_d above the kth rank as follows:

p_{d,r_k} = (1/k) Σ_{k'=1}^{k} o_{d,r_{k'}}. (5)

Finally, our method calculates the score e_d for each semantic information l_d and the index d̂ with the highest score as follows:

e^ret_{d,r_k} = −p_{d,r_k} log p_{d,r_k} − (1 − p_{d,r_k}) log(1 − p_{d,r_k}), e^rec_{d,r_k} = x_{d,r_k} / X_{r_k}, (6)

e_d = Σ_{k=1}^{N} s_{r_k} (w e^ret_{d,r_k} + (1 − w) e^rec_{d,r_k}), d̂ = arg max_d e_d, (7)

where X_{r_k} denotes the total number of pixels of I_{r_k} and w adjusts the importance of retrieval effectiveness and recallability. Higher e^ret_{d,r_k} and higher e^rec_{d,r_k} respectively indicate that the entropy and the region of l_d are higher. In other words, e^ret_{d,r_k} and e^rec_{d,r_k} respectively affect the re-ranking performance and the question recallability. Also, in (7), s_{r_k} acts as a weight that can focus only on query-related candidate images. By using s_{r_k}, our method can generate questions that focus only on the candidate images related to the user's retrieval intention. Namely, by using the semantic information with the higher e_d, our method can provide recallable questions that can effectively narrow down the query-related candidate images.
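The question-selection step above can be sketched as follows. This is a reading of the described procedure, not a verified reproduction: the binary-entropy form of the retrieval term and the region-ratio form of the recallability term are assumptions consistent with the text.

```python
import numpy as np

def select_question(o, region, s, w=0.5):
    """Pick the index d_hat of the semantic label to ask about.

    o      : (D, N) presence o_{d, r_k} of label d in the k-th ranked
             image (columns already sorted by the initial ranking)
    region : (D, N) region ratio of label d in each ranked image
    s      : (N,) initial similarities s_{r_k}, used as query-image
             relevance weights
    w      : trade-off between retrieval effectiveness and recallability
    """
    k = np.arange(1, o.shape[1] + 1)
    p = np.cumsum(o, axis=1) / k             # proportion above rank k
    eps = 1e-12                               # avoid log(0)
    # retrieval term: binary entropy of p, maximal when p is near 0.5
    e_ret = -(p * np.log2(p + eps) + (1 - p) * np.log2(1 - p + eps))
    e_rec = region                            # recallability: larger region
    e_d = (s * (w * e_ret + (1 - w) * e_rec)).sum(axis=1)
    return int(np.argmax(e_d))

# Label 1 appears in half of the images with large regions, so it is
# more informative and recallable than the ubiquitous label 0.
o = np.array([[1, 1, 1, 1], [1, 0, 1, 0]], dtype=float)
region = np.array([[0.05, 0.05, 0.05, 0.05], [0.6, 0.0, 0.5, 0.0]])
s = np.ones(4)
print(select_question(o, region, s))  # 1
```

A label present in every candidate has zero entropy and thus no screening power, which is why the half-present label wins regardless of w here.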

C. QA-BASED RE-RANKING
In this step, our method first asks the user the question "Is l_d̂ included in the desired image?" and receives a "yes" or "no" answer. After the QA, our method simply conducts a similarity-based re-ranking that takes the initial similarities s_n into account, following the conventional re-ranking methods [27]. By defining a binary value b that is 1 (resp. 0) on "yes" (resp. "no"), our method calculates the similarities ŝ_n as follows:

ŝ_n = s_n + β (b o_{d̂,n} + (1 − b)(1 − o_{d̂,n})), (8)

where β (> 0) balances the initial retrieval and the re-ranking, and o_{d̂,n} indicates whether the candidate image I_n contains l_d̂. Finally, by sorting the candidate images I_n in descending order of ŝ_n, the re-ranked results are obtained. With this procedure, users can re-rank the initial retrieval results simply by answering the provided recallable questions with "yes" or "no". Although our re-ranking scheme is simple, our method can effectively elicit important information from users and narrow down the initial retrieval results.
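A minimal sketch of this update, under the assumption (our reading of the rule) that candidates consistent with the answer receive a bonus of β:

```python
import numpy as np

def qa_rerank(s_n, o_dhat, answer_yes, beta=0.3):
    """Re-rank candidates from a yes/no answer about the asked label.

    s_n        : (N,) initial similarities
    o_dhat     : (N,) 1 if candidate n contains the asked label, else 0
    answer_yes : the user's answer to "Is l_dhat in the desired image?"
    """
    b = 1.0 if answer_yes else 0.0
    # candidates consistent with the answer get the bonus beta
    consistent = b * o_dhat + (1.0 - b) * (1.0 - o_dhat)
    s_hat = s_n + beta * consistent
    return np.argsort(-s_hat), s_hat

# Answering "yes" promotes candidate 1, which contains the asked label.
s_n = np.array([0.8, 0.7, 0.6])
o_dhat = np.array([0.0, 1.0, 0.0])
order, s_hat = qa_rerank(s_n, o_dhat, answer_yes=True)
print(list(order))  # [1, 0, 2]
```

Because β only shifts, rather than replaces, the initial similarities, a "no" answer likewise promotes the candidates that do not contain the label while preserving the initial ordering within each group.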

IV. EXPERIMENTS
For evaluating the effectiveness of our re-ranking method, we conducted experiments on a dataset for object memorability [17], the widely used MSCOCO dataset [33], and an automatically generated dataset including similar candidate images. In this section, the general settings of the experiments are described first, and then each experimental result is described.

A. EXPERIMENTAL SETTINGS
Dataset for memorability: The experiments confirming the correlation between the region ratio and the memorability were conducted using the following dataset. Note that, since it has no text annotations, it cannot be used for confirming the cross-modal image retrieval performance. Dataset provided by [17]: The dataset provided by [17] (hereinafter referred to as the Memorability dataset) is an open dataset including 850 images with 3,412 semantic segmentations and their memorability scores. The memorability scores were collected through a visual memory game played by 1,823 workers from Amazon Mechanical Turk. In our experiments, using all 850 images and their memorability scores, we confirmed the correlation between the region ratio and the memorability score.
Dataset for cross-modal image retrieval: The experiments for confirming the cross-modal image retrieval performance are conducted using the following two datasets.

Microsoft Common Objects in Context (MSCOCO) [33]:
MSCOCO is an open dataset including various image-text pairs and 172 semantic labels. It is widely used for the evaluation of cross-modal initial retrieval. In our experiments, following [3], 123,287 and 5,000 images were used for training and testing, respectively. Each test image and its paired text were used as the target image and the text query.
Biased MSCOCO dataset: Although the effectiveness of our method can be verified on the MSCOCO dataset, its effectiveness for a database with many similar images cannot be guaranteed. To evaluate this, we computationally constructed a database with similar candidate images based on the MSCOCO dataset. Specifically, we identified the object label contained in the most images of the MSCOCO dataset and reconstructed the dataset so that every image contains this object label. The label and the number of corresponding images are "person" and 2,628, respectively. These extracted images are defined as the Biased MSCOCO dataset. With this procedure, all candidate images in the Biased MSCOCO dataset contain "person".
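The construction of the Biased MSCOCO subset can be sketched as follows; the toy annotation dictionary is an illustrative assumption standing in for the MSCOCO object annotations.

```python
from collections import Counter

def build_biased_subset(annotations):
    """Keep only the images containing the most frequent object label.

    annotations: {image_id: set of object labels in that image}
    Returns the most frequent label and the ids of images containing it.
    """
    counts = Counter(l for labels in annotations.values() for l in labels)
    top_label, _ = counts.most_common(1)[0]
    biased = sorted(i for i, labels in annotations.items() if top_label in labels)
    return top_label, biased

# On MSCOCO the most frequent label is "person"; the toy data mirrors that.
toy = {1: {"person", "pizza"}, 2: {"person", "dog"}, 3: {"car"}}
label, subset = build_biased_subset(toy)
print(label, subset)  # person [1, 2]
```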
Baseline and comparative methods: Basic cross-modal initial retrieval methods [3], [6], [7], [8], [10], [11], [34], [35], [36], [37], [38], [39] were used as baseline methods. By comparing these baseline methods with our method applied on top of them, we confirm that our re-ranking method can improve the initial retrieval performance. Also, basic re-ranking methods [12], [13], [31], [40], [41], [42], [43], [44] were utilized as comparative methods. By using these various types of re-ranking methods, we confirm that our re-ranking method can effectively improve the retrieval performance compared with traditional rule-based re-ranking and recent deep learning-based re-ranking.
Implementation details: In our method, we first calculate the embedded text and visual features (f^E and f^E_n) via the encoders M_L(·) and M_V(·) following each initial cross-modal retrieval method. A brief explanation of the encoders (M_L(·) and M_V(·)) is shown in Table 1. The parameters of each encoder were trained on the training set of the MSCOCO dataset. Also, SenFormer [45] trained on the MSCOCO-Stuff 10K dataset was used as the semantic segmentation model M_seg(·). SenFormer is composed of a Swin Transformer and basic Transformer-based decoders. Since the MSCOCO-Stuff 10K dataset contains 171 semantic-level categories, D, which indicates the number of detectable pieces of semantic information in Section III-B, equals 171. The training procedure and hyperparameters of M_L(·), M_V(·), and M_seg(·) follow each paper. Note that the source codes released by each author were used for implementing the comparative methods. Furthermore, each experimental result is reported using w = 0.5 and β = 0.3 unless otherwise noted. Each hyperparameter was determined based on the ablation studies in Section IV-F.
Answer preparation: Answers to the questions generated by our method are required for evaluating our method and the comparative re-ranking methods. Meanwhile, preparing answers to all generated questions is costly. Furthermore, if we manually prepared the answers for each question, experimental objectivity and repeatability would not be guaranteed. For fairly evaluating the re-ranking performance, the semantic labels attached to each candidate image are used as the user's knowledge, and the answers to each question are prepared using these labels. Here, the semantic labels are only used for answer preparation and not for any other retrieval processes.
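The answer preparation above amounts to the following simulated user; the label sets are illustrative assumptions.

```python
def simulate_answer(target_labels, asked_label):
    """Answer "yes" iff the asked semantic label is attached to the
    target image's ground-truth labels; used only for evaluation,
    never inside the retrieval process itself."""
    return "yes" if asked_label in target_labels else "no"

print(simulate_answer({"person", "pizza"}, "pizza"))  # yes
print(simulate_answer({"person", "pizza"}, "dog"))    # no
```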

B. EVALUATING CROSS-MODAL RETRIEVAL PERFORMANCE FOR SIMILAR CANDIDATE IMAGES
QA-based re-ranking methods focus on improving the retrieval performance for similar candidate images; however, there has been no experimental validation that initial cross-modal retrieval for such a database is difficult. By validating this, tackling QA-based re-ranking becomes a more valuable challenge. For this validation, we first compared the initial retrieval performance on the Biased MSCOCO dataset and the MSCOCO dataset. Here, the Biased MSCOCO dataset contains fewer candidate images than the MSCOCO dataset, so we cannot simply compare the two performances. Therefore, in this experiment, we randomly extracted candidate images from the MSCOCO dataset so that the numbers of images in the Biased MSCOCO dataset and the MSCOCO dataset are equal. Namely, we compare the retrieval performance on 2,628 randomly selected candidate images and 2,628 similar candidate images. Following the evaluations of the conventional cross-modal retrieval methods, we use Recall@k (R@k) as the evaluation metric, defined as follows:

R@k = T_k / T,

where T and T_k are the number of query texts and the number of query texts whose relevant image is ranked in the top-k retrieval results, respectively. Note that, since there is a single relevant image for each query text, other general evaluation metrics in the image retrieval field, such as MAP and NDCG, are not used in the cross-modal image retrieval setting. Table 2 shows the experimental results. In Table 2, "random" and "similar" respectively show the retrieval performance on randomly selected candidate images and on similar candidate images. As shown in Table 2, for each method, the retrieval performance on "similar" underperforms that on "random". From these results, we confirmed that initial cross-modal retrieval for similar candidate images is more difficult than for non-similar candidate images.
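R@k as defined above can be computed as follows; the toy rankings are illustrative.

```python
def recall_at_k(rankings, relevant, k):
    """R@k: fraction of query texts whose single relevant image is
    ranked within the top-k retrieval results.

    rankings : one ranked list of image ids per query text
    relevant : the single relevant image id per query text
    """
    hits = sum(1 for ranking, rel in zip(rankings, relevant)
               if rel in ranking[:k])
    return hits / len(relevant)

# Two of the three queries rank their relevant image within the top-2.
rankings = [[3, 1, 2], [2, 3, 1], [1, 2, 3]]
relevant = [1, 2, 3]
print(recall_at_k(rankings, relevant, 2))
```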
To bring the retrieval performance on "similar" up to that on "random," retrieval methods that can distinguish these similar images are desirable. QA-based re-ranking is considered one of the suitable solutions for such a situation since it interactively elicits additional information from users by analyzing the candidate images.

C. EVALUATING THE CORRELATION BETWEEN THE REGION RATIO AND THE MEMORABILITY
The core idea of our QA-based re-ranking method is based on the correlation between the region ratio and memorability. Although the memorability and the region ratio of each piece of semantic information are known to be correlated [17], [18], quantitatively validating this correlation is important for our method. Therefore, we calculated Pearson's correlation coefficient between the memorability and the region ratio on the Memorability dataset. At first, for each semantic segmentation of the Memorability dataset, we calculated the percentage of the semantic segmentation pixels among the total number of pixels (hereinafter referred to as the region ratio). Next, we collected the pairs of the memorability score and the region ratio of each semantic segmentation, obtaining 3,412 pairs. Finally, we calculated Pearson's correlation coefficient over these pairs. Scatter plots of the region ratio and the memorability scores are shown in Fig. 4. With the above experiments, we confirmed that Pearson's correlation coefficient between the memorability score and the region ratio is 0.47 with p-value < 0.01. A Pearson's correlation coefficient of 0.47 is widely regarded as a moderate correlation for our data size [19], [20], [21]. Therefore, by using the semantic information with a higher region ratio, our QA-based re-ranking method is expected to provide questions that are memorable and recallable for users. Based on these results, we construct our method and conduct our experiments by focusing on the region ratio.
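The correlation check can be reproduced with a plain implementation of Pearson's r; the paired samples below are toy stand-ins for the 3,412 (region ratio, memorability) pairs.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two paired samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Region ratios and memorability scores that rise together correlate positively.
region_ratio = [0.05, 0.10, 0.30, 0.60]
memorability = [0.40, 0.55, 0.70, 0.90]
print(round(pearson_r(region_ratio, memorability), 2))
```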

D. COMPARISON WITH CROSS-MODAL INITIAL RETRIEVAL METHODS
Next, we evaluated whether our method can improve the initial retrieval performance. The experiments were conducted on the MSCOCO dataset and the Biased MSCOCO dataset. By using each dataset, the effectiveness for both various and similar candidate images can be confirmed. R@k, mean rank (mean r), and median rank (median r) are used for evaluating the retrieval performance.
Experimental results on the MSCOCO and the Biased MSCOCO datasets are shown in Tables 3 and 4. From Tables 3 and 4, the cross-modal retrieval methods adapted with our method outperform those without it. Specifically, we can see that our method improves the retrieval performance at R@1, which suggests that our method clarifies similar candidate images. Additionally, examples of retrieval results are shown in Fig. 3. From these quantitative and qualitative results, our re-ranking procedure improves the initial cross-modal retrieval performance. Furthermore, from the effectiveness on the Biased MSCOCO dataset, we confirm that our re-ranking procedure can extract important clue information for narrowing down similar candidate images.

E. COMPARISON WITH FEEDBACK-BASED RE-RANKING APPROACHES
Finally, we evaluated whether our method improves the retrieval performance more effectively than the conventional feedback-based re-ranking methods. It is widely known that fairly comparing feedback-based re-ranking methods is difficult [46]. Therefore, we conducted our experiments so as to maximize the performance of each feedback-based re-ranking method [12], [13], [31], [40], [41], [42], [43], [44]. Here, [12], [13], [31] are the conventional QA-based re-ranking methods. Note that, since [13] is the journal version of [31] and there is no methodological difference between them, we treated them as the same method. Specifically, following [12], [13], [31], the semantic label information is used for preparing the feedback of each re-ranking method. For example, in the methods that ask users to select related images for re-ranking, the re-ranking is conducted by treating images that include the same semantic labels as the target image as the related images. Since VSRN [37] achieves the best median rank performance, we used it for calculating the initial retrieval results. Based on these initial retrieval results, we conducted re-ranking with each comparative method. For evaluating the retrieval performance, R@k, mean rank, and median rank are used as evaluation metrics. Also, for evaluating the recallability of the generated questions, we calculated the region ratio (RR) of l_d̂ in the target image following the recently reported human perception studies [17], [18]. Specifically, we calculated the RR of l_d̂ in each target image and took the average of these region ratios.
Experimental results on the MSCOCO and Biased MSCOCO datasets are shown in Tables 5, 6, and 7. Note that the RR can be calculated only for our method and the QA-based re-ranking methods [12], [13], [31]. From Tables 5 and 6, our method outperforms the other comparative feedback-based re-ranking methods, which confirms the effectiveness of our method in terms of retrieval performance. Furthermore, from Table 7, our method outperforms the other QA-based re-ranking methods. Given the relationship between the RR and the recallability, these results confirm that our method provides more recallable questions than the conventional methods.

F. ABLATION STUDY
To validate the effectiveness of each component and adjust each hyperparameter, we conducted the following three types of ablation studies.
1) Evaluating the effectiveness of considering the query-image relevance
2) Evaluating the effects of the hyperparameter w
3) Evaluating the effects of the hyperparameter β
Since VSRN [37] achieves the best median rank performance in Sections IV-IV-D, we used this method for calculating the initial retrieval results.
Evaluating the effectiveness of considering the query-image relevance: Since our method considers the query-image relevance via s_k^r in (7), we compare the performance of our method with and without (w/o) s_k^r. Experimental results are shown in Tables 8 and 9. From Tables 8 and 9, "Ours" outperforms "Ours w/o s_k^r" on each evaluation metric and dataset. These results confirm the effectiveness of considering the query-image relevance.
Evaluating the effects of the hyperparameter w: The hyperparameter w adjusts the relative importance of the retrieval weight and the recallability weight; thus, w can be regarded as a trade-off parameter between retrieval and recallability performance. To validate its effect, we measured each performance while changing w. Experimental results are shown in Tables 10 and 11. From Tables 10 and 11, a higher (resp. lower) w yields better retrieval (resp. recallability) performance. These results confirm that w is a trade-off parameter between retrieval and recallability performance.
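The trade-off that w controls can be illustrated as a weighted combination of a retrieval weight and a recallability weight per candidate semantic label. The exact forms of these weights are defined in the paper; the combination below, the function name, and the toy scores are only our hypothetical sketch.

```python
def select_label(retrieval_weight, recallability_weight, w):
    # Hypothetical combined score: w emphasizes the retrieval weight,
    # (1 - w) emphasizes the recallability weight.
    combined = {
        label: w * retrieval_weight[label] + (1 - w) * recallability_weight[label]
        for label in retrieval_weight
    }
    return max(combined, key=combined.get)

# Toy weights for three candidate semantic labels.
ret = {"dog": 0.9, "frisbee": 0.4, "grass": 0.2}
rec = {"dog": 0.3, "frisbee": 0.5, "grass": 0.9}

print(select_label(ret, rec, w=0.9))  # high w favors retrieval -> "dog"
print(select_label(ret, rec, w=0.1))  # low w favors recallability -> "grass"
```

Under this sketch, sweeping w from low to high shifts the selected question label from the most recallable one toward the one most useful for retrieval, mirroring the trade-off observed in Tables 10 and 11.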
Evaluating the effects of the hyperparameter β: The hyperparameter β adjusts the relative importance of the initial retrieval and the re-ranking. A lower (resp. higher) β mainly focuses on the initial retrieval (resp. the QA-based re-ranking). To validate its effects, we measured the retrieval performance while changing β. Since β only affects the retrieval performance and the recallability performance is unchanged for each β, we mainly report the retrieval performance. Experimental results are shown in Tables 12 and 13. From Tables 12 and 13, a higher β yields better retrieval performance, especially in R@1, R@10, and median rank. However, the mean rank becomes worse as β increases. We consider that these results come from false detections produced by the semantic segmentation model. Hence, we believe that the performance of our method would further improve as the performance of the semantic segmentation model improves.
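The role of β can be illustrated as interpolating between the initial retrieval score and the re-ranking score when producing the final ranking. The paper's exact formulation may differ; the convex combination, names, and toy scores below are only our assumed sketch.

```python
def final_score(initial_score, rerank_score, beta):
    # Lower beta leans on the initial retrieval;
    # higher beta leans on the QA-based re-ranking.
    return (1 - beta) * initial_score + beta * rerank_score

# Toy (initial, re-ranking) scores for two candidate images.
candidates = {"img_a": (0.8, 0.2), "img_b": (0.5, 0.9)}

for beta in (0.1, 0.9):
    ranked = sorted(candidates,
                    key=lambda c: final_score(*candidates[c], beta),
                    reverse=True)
    print(beta, ranked)  # beta=0.1 ranks img_a first; beta=0.9 ranks img_b first
```

This also illustrates why a segmentation false detection matters more at high β: a wrong re-ranking score then dominates the final ranking, which is consistent with the degraded mean rank reported above.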

V. LIMITATION
The first limitation of our approach is that we mainly focus on the relationship between the region ratio and the recallability. Although the region ratio is related to the recallability, other factors may also be related to it. Analyzing these relationships further requires cooperation among the fields of retrieval, computer vision, and human perception. In future work, we will further examine the relationships among the questions, the recallability, and human perception.
The second limitation of our approach is that we did not conduct experiments on a real database including similar candidate images. Although there is no open cross-modal image retrieval dataset that includes similar candidate images, it would be beneficial to examine the effectiveness of our QA-based re-ranking method on a non-artificial database including similar candidate images. In future work, we will collect such real databases and conduct experiments on them toward practical applications.

VI. CONCLUSION
In this paper, we improved QA-based re-ranking from both experimental and methodological perspectives. From the experimental perspective, we quantitatively examined the effects of similar candidate images on the initial retrieval performance. From the methodological perspective, we proposed a QA-based re-ranking method that considers the query-image relevance and the recallability. Our method generates questions by focusing on the recallability of the semantic label in the query-related candidate images. Through this procedure, our method can effectively elicit important information from users and narrow down the initial retrieval results. Experimental results demonstrate that our idea contributes to improvements in both retrieval performance and recallability. We believe that our idea of interaction with users can lead to further developments in the image retrieval field.