Abstract

Due to their low storage cost and fast retrieval efficiency, deep hashing methods are widely used in cross-modal retrieval. In practice, images are usually accompanied by corresponding text descriptions rather than labels, so unsupervised methods have attracted wide attention. However, because of the modality gap and semantic differences, existing unsupervised methods cannot adequately bridge the modality differences, leading to suboptimal retrieval results. In this paper, we propose CLIP-based cycle alignment hashing for unsupervised vision-text retrieval (CCAH), which aims to exploit the semantic link between the original features of each modality and the reconstructed features. First, we design a modal cyclic interaction method that aligns semantics within each modality, where the features of one modality reconstruct the features of the other, thus fully accounting for both intramodal and intermodal semantic similarity. Second, we introduce GAT into the cross-modal retrieval task: we consider the influence of text neighbour nodes and add an attention mechanism to capture the global features of the text modality. Third, we use the CLIP visual encoder to extract fine-grained image features. Finally, hash codes are learned through hash functions. Experiments on three widely used datasets demonstrate that our proposed CCAH achieves satisfactory results in total retrieval accuracy. Our code can be found at https://github.com/CQYIO/CCAH.git.

1. Introduction

As the internet and social networks grow rapidly, multimedia data such as images and texts are increasing dramatically, and retrieving these data efficiently is a great challenge. Cross-modal retrieval aims to use a query from one modality to search for semantically similar data in another, heterogeneous modality. Hashing methods [18] are widely used in retrieval tasks to improve storage and computational efficiency. Cross-modal hashing methods attempt to represent heterogeneous modal data as compact binary codes while maintaining semantic similarity between different modalities in a common latent space.

Cross-modal hashing methods fall into two broad categories: supervised and unsupervised. Existing supervised hashing methods [2, 7, 9–13] have demonstrated significant performance. Their principle is to use hand-labeled annotations or precomputed similarity matrices to guide model training and the learning of binary codes. Unfortunately, in real-world and more challenging scenarios, images are often accompanied by textual descriptions, but their labels, categories, or tags are difficult to obtain.

Recently, unsupervised cross-modal hashing has become an increasingly active research topic. Unsupervised hashing methods [1, 14–18] attempt to remove the model's reliance on manually annotated data during training, relying solely on the features of the data itself, and have demonstrated superior performance. However, a common drawback of these unsupervised approaches is that, lacking the guidance of label information, the co-occurrence information inherent in vision-text pairs is easily overlooked during high-level semantic feature extraction (Figure 1). As a result, unsupervised models cannot accurately capture the semantic connections between different modalities, making retrieval accuracy suboptimal. In view of this, we argue that the hash codes of images and text that appear in pairs should have the minimum Hamming distance, or equivalently the maximum semantic similarity.

In addition, most existing cross-modal methods focus on aligning semantic features across modalities (e.g., with GANs [19]). Oversimplifying the semantic association between the reconstructed features within a modality and the original features means that the generated hash codes are not fully suited to cross-modal retrieval. Inevitably, an inherent modality gap remains in high-level semantic interaction: such methods can attend to neither intramodal and intermodal semantic information at the same time, nor bridge the alignment between modal features and hash codes, so the retrieval results do not reach an optimal solution.

To solve the above problems, in this paper we propose a novel deep unsupervised cyclic semantic alignment cross-modal hashing method, termed CLIP-based cycle alignment hashing for unsupervised vision-text retrieval (CCAH). CCAH is an end-to-end learning framework that simultaneously attends to intramodal and intermodal semantic features and hash code consistency. Our CCAH network consists of three components: deep feature extraction, cycle alignment, and hash encoding learning. Previous unsupervised models have suffered from low accuracy in text-to-image retrieval. It is well known that in image-text pairs, images contain richer semantic information and allow higher-level semantic representations to be extracted at a finer granularity. Compared with the corresponding text description (e.g., BOW), the text carries relatively little semantic information, and often only a few keywords can be matched to the described image regions (attention points). Moreover, text is contextual, and the same word may carry different semantic information, so text-to-image retrieval is often less accurate than image-to-text retrieval. We propose to treat the text as data with a graph structure, transforming text features into node information in a graph, further fusing sparse text features with a GAT network, and fusing related neighboring nodes with the original nodes through an attention scoring mechanism, where the attention score indicates the closeness of the connection between nodes: the higher the score, the closer the relation. An auto-encoder is then used to encode and decode the extracted modal features. Our contributions are as follows:
(i) We propose a new deep hash network model called CCAH. CLIP is used as a visual encoder to extract fine-grained image features, and a GAT network is used for feature extraction of the text modality.
(ii) We propose a cyclic alignment method that aligns image features with the features extracted by the auto-encoder and then aligns the features after mapping them to the text modality space, and vice versa, ensuring semantic links between modalities.
(iii) Experiments demonstrate that our model achieves satisfactory results in terms of final total retrieval accuracy on three commonly used multimodal datasets.

2. Related Work

Currently, cross-modal hash retrieval is broadly divided into supervised and unsupervised hashing. Supervised hashing methods achieve better performance than unsupervised methods with the aid of labels or similarity matrices, which help avoid interference from redundant information.

2.1. Supervised Hashing Methods

Supervised hashing methods use manually annotated label information or predefined similarity matrices to guide the training of binary codes across different modalities and have shown excellent performance in multimodal data retrieval. Recently, many supervised hashing methods have continuously raised the retrieval accuracy benchmarks. TDH [20] uses triplets to flexibly capture a variety of higher-level similarities, rather than the simple similarity or dissimilarity of binary pairs, and sorts to optimize intraclass and interclass variation; SCM [13] learns the hash function bit by bit using supervised information in linear time complexity; DOH [21] learns ordinal representations to generate ranking-based hash codes by leveraging the ranking structure of the feature space from both local and global views; SePH [3] approximates a probability distribution by minimizing the Kullback–Leibler divergence to learn hash codes in Hamming space; QCH [9] proposes to simplify the optimization process by transforming the multimodal objective function into a unimodal form; MCSCH [12] proposes a multiscale association mining strategy, a multiscale feature-guided sequence hashing method; DLFH [11] introduces a discrete learning algorithm that learns binary hash codes directly, without continuous relaxation. However, these methods require considerable manual effort and cost to label the dataset during hash function learning, which is often unrealistic in real-life scenarios, and without label information the retrieval accuracy inevitably degrades.

2.2. Unsupervised Hashing Methods

To reduce the need for manual annotation during model training, unsupervised cross-modal hashing methods have been proposed. CVH [1] learns binary codes by minimizing the similarity-weighted Hamming distance; IMH [6] builds two intramodal similarity matrices based on neighbor relations; CMFH [16] uses matrix decomposition to address the semantic relevance of different modalities and maps heterogeneous modal data into a latent space; UDCMH [17] learns features and hash codes under Laplacian and discrete constraints; DJSRH [14] fuses semantic information into an affinity matrix to calculate potential correlations between modalities; DSAH [22] aligns intramodal and intermodal data by fusing semantic similarity alignment and heterogeneous modal data reconstruction; JIMFH [23] combines intramodal and intermodal hash codes to obtain the final hash code; DBRC [24] proposes a framework with adaptive binary reconstruction that allows discrete hash codes to be learned directly; HNH [25] weights the original similarities using Hadamard products and creates a joint similarity matrix using linear combinations. Although these unsupervised cross-modal hashing models have achieved good results with respect to the co-occurrence information of image-text pairs, they still ignore part of the image information, resulting in poor accuracy of text-to-image retrieval.

3. Problem Formulation

3.1. Problem Definition

Suppose we have $m$ image-text pairs. We denote the image set as $V=\{v_i\}_{i=1}^{m}$ and the text set as $T=\{t_i\}_{i=1}^{m}$, where $v_i$ represents the $i$-th image and $t_i$ represents the $i$-th text, so each image-text pair instance can be represented as $o_i=(v_i,t_i)$. The semantic features extracted by the visual feature encoder are denoted $F_V\in\mathbb{R}^{m\times d_v}$, where $d_v$ is the dimension of the high-level image representation obtained by passing the original image through the image encoder. Likewise, the feature representation of the text after the text encoder is denoted $F_T\in\mathbb{R}^{m\times d_t}$, where $d_t$ is the dimension of the high-level text representation and $m$ is the number of sample instances. In addition, we define the hash code matrices $B_V,B_T\in\{-1,+1\}^{m\times k}$, where $k$ denotes the length of the hash code and $b_i$ denotes the hash code of the $i$-th instance. We further define $\cos(\cdot,\cdot)$ as the cosine similarity used for paired image-text features, $\operatorname{sign}(\cdot)$ as the element-wise sign function, and $\|\cdot\|_F$ as the Frobenius norm of a vector or matrix.
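As a concrete illustration of this notation, the following sketch computes a pairwise cosine similarity matrix between modal feature matrices and binarizes real-valued features with the sign function. It is a minimal PyTorch example; the tensor shapes and helper names are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch: pairwise cosine similarity and sign-based binarization.
import torch
import torch.nn.functional as F

def cosine_similarity_matrix(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarity between rows of a (m x d) and b (m x d)."""
    a_n = F.normalize(a, dim=1)
    b_n = F.normalize(b, dim=1)
    return a_n @ b_n.t()                     # (m x m) similarity matrix

def to_hash_codes(features: torch.Tensor) -> torch.Tensor:
    """Element-wise sign function mapping real-valued features to {-1, +1}."""
    return torch.sign(features)

# Example with random stand-ins for image/text features (m = 4, k = 16 bits).
img_feat = torch.randn(4, 16)
txt_feat = torch.randn(4, 16)
S = cosine_similarity_matrix(img_feat, txt_feat)       # cross-modal similarity
B_img, B_txt = to_hash_codes(img_feat), to_hash_codes(txt_feat)
```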

3.2. Model

In Figure 2, we show all the components of our model. The CLIP-based cycle alignment hashing for unsupervised vision-text retrieval (CCAH) consists of three parts, namely, the feature extraction part, the cycle semantic alignment part, and the hash coding learning part.

Graph networks [26] represent node information as a graph, transforming the graph topology into an adjacency matrix and aggregating node-to-node associations, fusing the information of each node and its neighbors into a new node. With attention [27] showing strong performance in NLP and CV, the attention mechanism has been introduced into graph networks: instead of a simple fusion, the attention algorithm gives each node an attention score and then fuses the information of different nodes. Less relevant feature words receive a lower score, while feature words that are more relevant receive a higher attention score. In fusing this information, the influence of different feature words on a node is reinforced and better semantic information can be extracted.

Since our text is represented as a 1386-dimensional feature vector, we treat these features as node data, so each text can be represented as a set of graph nodes. To obtain sufficient expressive power, the input features are first transformed into higher-level features by a learnable weight matrix, and self-attention is then applied to the nodes. The resulting attention coefficient denotes the importance of node j to node i and is computed for every neighboring node of node i. To make the coefficients easily comparable between different nodes, we normalize them over all neighboring nodes with the softmax function.

By performing this operation for all nodes, the node information of the adjacency matrix is transformed into new node vectors containing the attention-weighted features of each neighboring node, supplementing the semantic information that the text modality lacks and leading to a more powerful representation of the text modality. The graph attention network thus fuses each feature word with its associated feature words using attention, and this attention-weighted fusion yields a new semantic feature representation that contains the information of neighboring nodes.
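The following is a minimal single-head graph attention layer in the spirit of GAT [31], sketching the learnable linear transform, the attention coefficients over neighboring nodes, the softmax normalization, and the attention-weighted aggregation described above. The dense adjacency mask and layer sizes are simplifying assumptions, not the paper's exact configuration.

```python
# Minimal single-head graph attention layer (dense adjacency, for illustration).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGATLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # learnable weight matrix
        self.a = nn.Linear(2 * out_dim, 1, bias=False)    # attention scoring vector

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (n, in_dim) node features; adj: (n, n) 0/1 adjacency with self-loops
        h = self.W(x)                                      # (n, out_dim)
        n = h.size(0)
        # Pairwise concatenation [h_i || h_j] to score the importance of node j to node i
        h_i = h.unsqueeze(1).expand(n, n, -1)
        h_j = h.unsqueeze(0).expand(n, n, -1)
        e = F.leaky_relu(self.a(torch.cat([h_i, h_j], dim=-1)).squeeze(-1),
                         negative_slope=0.2)
        # Mask non-neighbors, then softmax-normalize over each node's neighbors
        e = e.masked_fill(adj == 0, float('-inf'))
        alpha = torch.softmax(e, dim=1)                    # attention scores
        return alpha @ h                                   # weighted fusion of neighbor features

# Usage example with 5 nodes and a toy adjacency matrix (self-loops included).
nodes = torch.randn(5, 1386)
adj = torch.eye(5) + torch.diag(torch.ones(4), 1) + torch.diag(torch.ones(4), -1)
out = SimpleGATLayer(1386, 256)(nodes, adj)               # (5, 256) fused node features
```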

3.2.1. Deep Feature Extraction

To extract richer high-level semantic representations of the modalities, we design different encoders for different modal data. Image modalities contain richer semantic information than text modalities, and a single-stream model (e.g., ViLT [28]) cannot bridge the inherent modality gap, cannot perform optimal feature extraction for each modality, and has limited ability to mine semantic consistency information for heterogeneous data. We therefore adopt a dual-stream model to extract semantic features for the different modalities, which shows excellent results throughout the training phase.

(1) Image Feature Extraction. CLIP [29] is trained with a contrastive learning objective on a huge dataset and, compared with ViT [30], yields high-quality results on several datasets. We use the pretrained CLIP model as the feature extractor for the image modality. In the image branch, we feed the original image into CLIP's image encoder (encode-image) (Figure 3) and obtain a 1024-dimensional high-level semantic vector, which serves as the image feature representation.
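A hedged sketch of this step with the openai CLIP package is shown below; the RN50 backbone is assumed here only because its image embedding is 1024-dimensional, matching the dimensionality stated above, and the actual backbone and preprocessing used by the authors may differ. The image path is a placeholder.

```python
# Sketch: extracting a 1024-d image feature with CLIP's image encoder (RN50 assumed).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder path
with torch.no_grad():
    img_feat = model.encode_image(image)      # (1, 1024) high-level semantic vector
img_feat = img_feat.float()                   # cast from fp16 when running on GPU
```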

(2) Text Feature Extraction. Text modal data do not contain as much high-level semantic information as image data, but text semantics are contextually relevant. We therefore treat the features of the text as nodes of a graph and use a graph attention network (GAT [31]) to extract aggregated semantic information from the text. GAT treats text features as nodes and converts the input features into higher-level features to obtain more expressive power; it introduces an attention mechanism, performs self-attention on the nodes, and finds the attention weight coefficients between nodes. By a weighted summation over the surrounding neighboring nodes, information from all surrounding nodes is aggregated, making the connections within the text information more realistic (Figure 4). The text features are constructed as an adjacency matrix whose entries represent the linkage of the text modality, and the semantic representation of the text can be better processed by weighting the features. The resulting representation serves as the text feature.

For simplicity, we denote the feature extractor of each modality as a function that maps the original image or text, together with the parameters of the corresponding encoder, to its high-level features. In this way, we can extract semantically rich high-level representation features for each modality, which can be used to fully explore the semantic relationships between the data and further guide modal alignment and hash code learning.

3.2.2. Cycle Alignment

To facilitate intramodal semantic feature alignment and to maintain cross-modal semantic interaction, we propose a cyclic semantic alignment method. Semantically similar vision-text pairs are encouraged to be close in the common representation space, and dissimilar ones to be far apart. To further align text and images, we use intramodal and intermodal loss terms. We use an auto-encoder to compress the high-level semantic features into low-dimensional semantic representations and to reconstruct these underlying semantic features into features of the heterogeneous modality. The encoder takes the original features of the image or text, together with the parameters of the corresponding modality, and compresses the high-level semantic representation.

The high-level semantic features extracted by the feature extractor are encoded and compressed by the encoder into a real-valued representation with strong representational power that retains high-level semantics; we then reconstruct this representation into features of the heterogeneous modality by means of a decoder.
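A minimal sketch of this encode/decode step is given below: each modality's high-level features are compressed to a hidden code, and the decoder maps that code into the feature space of the other modality. The layer sizes and activation choices are assumptions.

```python
# Sketch: per-modality auto-encoder that decodes into the other modality's feature space.
import torch
import torch.nn as nn

class ModalityAE(nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int, other_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.Tanh())
        # Decoder reconstructs features in the *other* modality's space
        self.decoder = nn.Sequential(nn.Linear(hidden_dim, other_dim), nn.Tanh())

    def forward(self, x: torch.Tensor):
        h = self.encoder(x)          # compressed real-valued hidden code
        recon = self.decoder(h)      # reconstruction in the other modality's space
        return h, recon

img_ae = ModalityAE(in_dim=1024, hidden_dim=128, other_dim=1386)  # image -> text space
txt_ae = ModalityAE(in_dim=1386, hidden_dim=128, other_dim=1024)  # text  -> image space
```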

We input the features of the image (text) into the decoder, and the semantic information obtained is mapped by the decoder into the feature space of the text (image) to achieve semantic alignment between modalities. After obtaining the reconstructed features of the heterogeneous data, to facilitate cross-modal information interaction, we semantically align the original image features with the text features reconstructed by the decoder, and vice versa. To ensure that the resulting compressed feature vector represents the original high-dimensional feature representation, we also align the high-dimensional features with the encoded features, achieving intramodal semantic alignment.

(1) Intermodal. To facilitate information interaction between different data and achieve cross-modal semantic interaction, the semantic features obtained by the feature extractor of one modality are decoded by the auto-encoder and mapped into the semantic space of the other modality. From the decoded text features mapped into the image feature space and the decoded image features mapped into the text feature space, we construct the cross-modal semantic feature matrices. Alignment of the different modalities is achieved by minimizing the cross-modal semantic loss.

The total intermodal loss is the sum of these cross-modal alignment terms in both directions.

We can leverage the high-level semantic feature representations between the two modalities for cross-modal alignment, and we achieve cross-modal heterogeneous data alignment by minimizing this loss.

(2) Intramodal. To ensure the representativeness of the semantic information within each modality and to reduce semantic feature loss, we also impose intramodal constraints: we align the features extracted from the original image with the higher-level semantic representation encoded by the auto-encoder, ensuring the representability and completeness of the high-level semantic information within the modality. For the image modality, we construct one feature matrix from the original extracted features and another from the hidden-state features produced by the auto-encoder; the text features are likewise represented by the original extracted features and the features encoded by the auto-encoder. The intramodal loss minimizes the discrepancy between these original and encoded representations for each modality.

Therefore, we construct a semantic alignment scheme with both intramodal and intermodal alignment. Intramodal semantic alignment is achieved by aligning the high-level semantic representations extracted by the visual encoder and the text encoder with the compressed semantic features produced by the auto-encoder, ensuring that the high-dimensional modal data can be recovered from a small number of high-level features. The heterogeneous data are aligned with the original modal features through the mapping of the decoder, enabling information interaction across modalities and achieving both intramodal and intermodal alignment. The total cycle-alignment loss combines these intramodal and intermodal terms.
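The sketch below illustrates one plausible form of this cycle-alignment objective under stated assumptions: the intermodal terms compare a modality's original features with the cross-modal features decoded from the other modality, and the intramodal terms compare the similarity structure of the original features with that of their own hidden codes. The exact loss form in the paper may differ.

```python
# Sketch of a cycle-alignment loss: intermodal (original vs. cross-modal reconstruction)
# plus intramodal (similarity structure of original features vs. hidden codes).
import torch
import torch.nn.functional as F

def cos_align(a, b):
    """Mean (1 - cosine similarity) between corresponding rows of a and b."""
    return (1.0 - F.cosine_similarity(a, b, dim=1)).mean()

def sim_matrix(x):
    x = F.normalize(x, dim=1)
    return x @ x.t()

def cycle_alignment_loss(f_img, f_txt, h_img, h_txt, rec_img, rec_txt):
    # rec_img: image-space features decoded from the text hidden code (text -> image)
    # rec_txt: text-space features decoded from the image hidden code (image -> text)
    inter = cos_align(f_img, rec_img) + cos_align(f_txt, rec_txt)
    # Intramodal: original features and their own hidden codes share similarity structure
    intra = torch.norm(sim_matrix(f_img) - sim_matrix(h_img), p='fro') \
          + torch.norm(sim_matrix(f_txt) - sim_matrix(h_txt), p='fro')
    return inter + intra
```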

3.2.3. Hash Encoding Learning

After feature extraction and cyclic semantic alignment, the semantic information of the text and visual data can be extracted and interlinked with high quality. In cross-modal retrieval, we aim to make semantically similar heterogeneous data more closely related, finding semantically related samples of one modality in the dataset from query points of another modality according to a defined similarity metric. By converting the query points into hash codes, the corresponding modal information can be retrieved more quickly. With the auto-encoder (AE) mapping, we can fully extract the high-dimensional feature encoding corresponding to each modality during the training phase. We map the AE-generated feature vectors to hash codes; owing to the feature extraction and reconstruction operations, we use the real-valued features to construct the hash encoding and generate hash codes via the sign function. From the generated hash matrices we compute a pairwise cosine similarity matrix. The visualization of generating hash codes from features is shown in Figure 5. The hash matrix of the text modality and the hash matrix of the image modality are compared element-wise using cosine similarity.

In addition, inspired by [22], to make fuller use of the semantic information jointly described by image-text pairs, we construct a cross-modal hash code similarity matrix in which co-occurring image-text pairs have the most similar labels or categories compared with other modal data; the elements on the diagonal should therefore be as close to 1 as possible, and we minimize the loss on these co-occurring instances accordingly.

Regarding the other elements, we use a diagonal similarity loss to bridge the connection between the different modalities; for example, the similarity of the same image-text pair should be independent of position information and related only to feature information. We bridge the semantics of image-text pairs by minimizing this diagonal loss.

The total hash-code similarity loss combines these two terms.
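The following is a hedged sketch of such a hash-similarity objective: the diagonal of the cross-modal hash similarity matrix (co-occurring image-text pairs) is pushed toward 1, and the off-diagonal structure is kept consistent between the two retrieval directions. The specific weighting and form are assumptions.

```python
# Sketch: hash-code similarity loss with a co-occurrence (diagonal) term and a
# cross-direction consistency term for the off-diagonal entries.
import torch
import torch.nn.functional as F

def hash_sim(a, b):
    return F.normalize(a, dim=1) @ F.normalize(b, dim=1).t()   # (m x m)

def hash_similarity_loss(B_img, B_txt):
    S_it = hash_sim(B_img, B_txt)          # image query -> text database
    S_ti = hash_sim(B_txt, B_img)          # text query  -> image database
    m = S_it.size(0)
    eye = torch.eye(m, device=S_it.device)
    # Co-occurring pairs: diagonal entries should approach 1
    colinear = ((S_it.diagonal() - 1.0) ** 2).mean() + ((S_ti.diagonal() - 1.0) ** 2).mean()
    # Off-diagonal structure should agree across the two retrieval directions
    off = (((S_it - S_ti) * (1 - eye)) ** 2).mean()
    return colinear + off
```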

After auto-encoder encoding, we map the obtained features to hash codes through hash functions and use these hash codes to construct the similarity matrices of the two modalities. In addition, we introduce a further similarity matrix obtained through the hash function mapping, which is constructed from image-text labels. We do not use labels for guidance in the training phase; label information is introduced mainly to calculate the hash loss.

Whereas hashing methods speed up the retrieval process, mapping real-valued features to hash codes still loses some information, leading to suboptimal retrieval. In hash encoding learning, we also need to pay attention to the semantic relationships between data from different modalities, and cross-modal similarity information is a central concern in cross-modal retrieval. Based on this, we align the features within each individual modality with the generated hash codes, ensuring that the generated hash codes are more faithful representations of the original data. A modal adjustment parameter allows additional flexibility in preserving semantic similarity.

We construct a joint feature matrix that integrates the text feature matrix and the image feature matrix in a weighted way into a single common matrix. The hyperparameter α weights the feature matrices of the images and the text.
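A small sketch of this weighted combination is shown below. Because the image and text feature matrices have different dimensions, the sketch assumes the combination is performed on their cosine similarity matrices, whose shapes match; the value alpha = 0.8 is taken from Section 4.5.

```python
# Sketch: weighted joint matrix combining image and text similarity structures.
import torch
import torch.nn.functional as F

def joint_similarity(f_img: torch.Tensor, f_txt: torch.Tensor, alpha: float = 0.8):
    S_img = F.normalize(f_img, dim=1) @ F.normalize(f_img, dim=1).t()
    S_txt = F.normalize(f_txt, dim=1) @ F.normalize(f_txt, dim=1).t()
    return alpha * S_img + (1.0 - alpha) * S_txt   # single common matrix
```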

We optimize our hash encoding based on this matrix alignment.

The total intermodal loss for hash learning combines the above alignment terms.

Require: Image set V; text set T;
   batch size, hash code length k, max epoch.
Ensure: Deep feature extraction functions for the image and text modalities;
   encoder and decoder functions for each modality;
   hash coding functions for each modality.
(1) Initialize the pretrained extractor parameters.
(2) While the current epoch is less than the max epoch do
(3)    Sample a minibatch of image-text pairs;
(4)    Extract the deep features of each modality;
(5)    Encode the features to obtain the hidden states;
(6)    Use the hidden states to generate the real-valued similarity matrices and hash codes;
(7)    Decode the hidden states to generate the heterogeneous (cross-modal) features;
(8)    Calculate the objective function;
(9)    Back-propagate the gradients with the chain rule;
(10)   Update all parameters;
(11) end while
3.3. Optimization

We combine the losses described above to construct our total objective function.

Moreover, during training, the cyclic semantic interaction module uses real-valued codes: converting them into hash codes during training would lose some information, the real-valued features are more conducive to model training, and the real-valued codes generated after multiple modal interactions are closer to the final hash codes. However, hash codes produced by the sign function cannot be optimized by gradient descent because they are discrete values. To solve this problem, inspired by prior work, we transform the real-valued codes into binary hash codes via a differentiable relaxation of the sign function.
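One common differentiable relaxation is a scaled tanh whose steepness grows during training, sketched below under that assumption; the paper's exact relaxation may differ.

```python
# Sketch: scaled tanh as a differentiable stand-in for the sign function.
import torch

def soft_sign(x: torch.Tensor, beta: float) -> torch.Tensor:
    """tanh(beta * x) approaches sign(x) as beta grows."""
    return torch.tanh(beta * x)

# During training, beta can be increased gradually (e.g., with the epoch index);
# at retrieval time, the hard sign function is used to obtain binary codes.
codes_train = soft_sign(torch.randn(4, 128), beta=5.0)
codes_eval = torch.sign(torch.randn(4, 128))
```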

The proposed CCAH algorithm is shown in Algorithm 1.

4. Experiment

Datasets: our experiments were conducted on three cross-modal retrieval datasets, MIRFlickr-25K [32], NUS-WIDE [33], and MS COCO [34], to validate the effectiveness of the proposed model. The datasets are described as follows:

MIRFlickr-25K: MIRFlickr contains 25,000 image-text pairs collected from the Flickr website. Each image-text pair is saved as an instance. For the text modality, following DJSRH [14], each text is sorted and tagged with occurrence statistics and transformed into a BOW (bag-of-words) vector.

NUS-WIDE: NUS-WIDE consists of 269,648 pairs of multimodal data covering 81 categories, with each multimodal instance containing an image and corresponding label. For simplicity, we selected the 10 most frequent categories from the original 81 categories and the 186,577 tagged instances among all pairs. The text of each instance is represented as a 500-dimensional bag-of-words (BOW) vector. We collated the index vector of the 1,000 most frequent text labels.

MS COCO: MS COCO was originally collected for the image understanding task and contains 123,287 images. For each image, a text description and a 91-dimensional semantic label are given. The experiment uses 87,081 images with category information and a 2,000-dimensional bag-of-words vector to represent the textual information. Of these, 5,000 image-text pairs were randomly selected as the query set and the remaining image-text pairs were used as the retrieval set. For the training set, 10,000 pairs were randomly sampled from the retrieval set.

4.1. Implementation Details

We used CLIP as the feature extractor for the image modality and GAT as the feature extractor for the text modality. We used cyclic modal interaction to achieve semantic alignment within and between modalities (intramodal and intermodal), using the hidden features of one modality to reconstruct the features of the other, and we carefully set several hyperparameters to assist learning. We analyze the sensitivity of these parameters experimentally in Section 4.5. The batch size is 16, the learning rate is 0.005 for both the image and text modalities, the SGD optimization strategy is used, and weight decay is applied.
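A minimal sketch of this optimization setup is given below. The batch size and learning rate follow the values stated above; the module definitions are stand-ins, and the momentum and weight-decay values are placeholders, since they are not specified in the text.

```python
# Sketch of the reported SGD setup (lr 0.005 for both modality branches, batch size 16).
import torch

image_branch = torch.nn.Linear(1024, 128)   # stand-in for the image-side networks
text_branch = torch.nn.Linear(1386, 128)    # stand-in for the text-side networks

optimizer = torch.optim.SGD(
    [
        {"params": image_branch.parameters(), "lr": 0.005},
        {"params": text_branch.parameters(), "lr": 0.005},
    ],
    momentum=0.9,          # assumption; momentum is not specified in the text
    weight_decay=1e-5,     # placeholder value; the exact setting is not given here
)
batch_size = 16
```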

4.2. Baseline and Validation

Evaluation criteria: we use three common cross-modal datasets, MIRFlickr-25K, NUS-WIDE, and MS COCO, to validate our model. For MIRFlickr-25K and NUS-WIDE, we follow [14, 16, 17] and sample 2,000 instances as query points, with the remainder forming the retrieval database. Because of the large amount of data in MIRFlickr-25K and NUS-WIDE, we randomly sample instances from the database set for training. For fairness in training, we take some instances from each class in the first round of training and randomly sample in the remaining stages. On the MS COCO dataset, we take 10,000 instances as the retrieval set and the remainder makes up the database set. In our experiments, we use MAP and precision@top-K curves as the evaluation criteria.

To validate our CCAH model, we compare it with several common cross-modal approaches. Shallow cross-modal hashing: CVH [1], IMH [15], LCMH [35], CMFH [16], LSSH [9], RFDH [36], FSH [37], and STMH [38]. Deep cross-modal hashing: DBRC [24], UDCMH [17], DJSRH [14], DSAH [22], JDSH [39], MGAH [40], JIMFH [23], HNH [25], and DUCH [41]. The results of CCAH compared with these models are shown in Figure 6.

We compare with previous work on the MIRFlickr-25K and NUS-WIDE datasets using MAP@50 as the benchmark. As shown in Table 1, the total retrieval accuracy of our CCAH model is better than previous work at different coding lengths.

As can be seen, our experiments show excellent results on two widely used datasets, with significant gains in both image-to-text and text-to-image retrieval on MIRFlickr-25K. On the NUS-WIDE dataset the results for image-to-text retrieval are slightly worse, but there are significant gains in text-to-image retrieval accuracy and in overall retrieval accuracy. We used the NUS-WIDE (tc-10) dataset, taking the 10 most common classes to compose the dataset. Because the NUS-WIDE dataset is relatively large, it is not possible to ensure that the classes of the sampled points are balanced, and the data are sparser when constructing the adjacency matrix, which reduces the accuracy of image-to-text retrieval. To validate this explanation, guided by DAEH [42], we tested again on the MS COCO dataset, which uses 81 classes. We used MAP@5000 to evaluate our model, and the results are shown in Table 2.

4.3. Ablation Experiment

We experimentally validate the effect of the different modules on accuracy, validating the model on the MIRFlickr dataset with 128-bit codes. We have also made other attempts: in the encoding and compression phase, we adopt a two-way model in which the compressed vector reconstructs both its own original features and the original features of the heterogeneous data, rather than only the features of the heterogeneous data. We validated this on the MIRFlickr and NUS-WIDE datasets. The results show that adding homogeneous feature reconstruction brings a relative improvement in image-to-text retrieval, but the accuracy of text-to-image retrieval decreases (Table 3).

In Table 4, we perform ablation experiments on different modules to demonstrate the effectiveness of our proposed method.

4.4. Visualization of the Learned Representation

To visualize the effectiveness of the proposed CCAH, we use t-SNE to visualize the learned representations of images and text on the MIRFlickr-25K dataset (Figure 5). The original feature representations of the images and text are shown in Figures 5(a) and 5(c), respectively. It can be seen that the distributions of these modalities differ considerably and it is difficult to distinguish the samples from the original representations. Figures 5(b) and 5(d) give the distributions of the learned representations of the images and text, respectively. The figures show that the proposed CCAH method helps to distinguish samples with different semantic classes, and some clusters show clear separation.

4.5. Hyperparameter Sensitivity

We further validated our hyperparameters on the three datasets using a 128-bit coding length. The first parameter is the influence factor with which we align the real-valued feature matrix with the hash codes; we find that the best results are obtained when it is set to 1.5. The second parameter weights the alignment of images and text across modalities. Since image modalities contain richer semantic features than text modalities (Figure 1), the image component is weighted more heavily than the text, and our model achieves the best results when this parameter is set to 0.8. The third parameter balances the hash encoding with the original features and also adjusts the intramodal and intermodal coefficients. The visualization of hyperparameter sensitivity is shown in Figure 7.

4.6. Comparing Other Models

On the three common cross-modal datasets mentioned above, our results are significantly improved compared with other models, and our total retrieval accuracy at top-k exceeds previous methods in all cases. We added the GAT network, which constructs adjacency matrices over graph neighbors and uses attention to strengthen the semantically sparse text modality, achieving higher accuracy than traditional bag-of-words features. Using CLIP to extract image features, the large-scale pretrained CLIP model can extract features from images at a finer granularity. We construct a cyclic semantic alignment module that builds the semantic features of the heterogeneous modalities from the hidden-state vector of each modality produced by the auto-encoder; compared with constructing features from binary codes, the real-valued information is more representative of the modal features, whereas much useful information is lost when binary codes are used.

We validate our model on the MS COCO dataset and manually mark regions on the retrieved images. In text-to-image retrieval, text marked in red indicates the feature words of the text (corresponding to the marked image regions); in image-to-text retrieval, text marked in red indicates that the retrieval result does not quite match the description of the image (Figure 8).

5. Conclusion

In this paper, we propose a novel deep unsupervised cross-modal hashing method, CLIP-based cycle alignment hashing (CCAH) for unsupervised vision-text retrieval. We construct a cycle alignment module that allows for more flexible exploitation of high-level semantic information within and across modalities. To further bridge the gap between the two modalities, we use the hidden state vector of one modality to reconstruct the features of the other modality, enabling cross-modal data to be mutually characterized. Extensive experiments on three benchmark datasets show that CCAH outperforms several state-of-the-art methods in multimodal data retrieval tasks.

Data Availability

The data and code that support the results of this study are openly available in CCAH at https://github.com/CQYIO/CCAH.git.

Disclosure

A preprint has previously been released [43].

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Authors’ Contributions

Mingyong Li proposed the model idea and guided the writing of the paper and participated in the revision of the paper. Longfei Ma is responsible for the paper and experimental implementation. Yewen Li is responsible for model result validation. Mingyuan Ge is responsible for data validation.

Acknowledgments

This work was partially supported by Chongqing Natural Science Foundation of China (Grant nos. CSTB2022NSCQ-MSX1417), the Science and Technology Research Program of Chongqing Municipal Education Commission (Grant no. KJZD-K202200513), and Chongqing Normal University Fund (Grant no. 22XLB003).