Article

Unsupervised Deep Pairwise Hashing

Ye Ma, Qin Li, Xiaoshuang Shi and Zhenhua Guo
1 Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen 518055, China
2 School of Software Engineering, Shenzhen Institute of Information Technology, Shenzhen 518172, China
3 School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
* Author to whom correspondence should be addressed.
Electronics 2022, 11(5), 744; https://doi.org/10.3390/electronics11050744
Submission received: 19 January 2022 / Revised: 21 February 2022 / Accepted: 21 February 2022 / Published: 28 February 2022

Abstract: Although unsupervised deep hashing is potentially very useful for tackling many large-scale tasks, its performance is still far from satisfactory. Its performance might be significantly improved by effectively exploiting the pairwise similarity relationships among training data, but the attained similarity matrix usually contains noisy information, which often largely decreases the model performance. To alleviate this issue, in this paper we propose a novel unsupervised deep pairwise hashing method to effectively and robustly exploit the similarity information between training samples and multiple anchors. We first create an ensemble anchor-based pairwise similarity matrix to enhance the robustness of the similarity and dissimilarity relations between training samples and anchors. Afterwards, we propose a novel loss function that directly and robustly takes advantage of the similarity and dissimilarity information via a weighted cross-entropy loss, uses a square loss to reduce the gap between latent binary vectors and binary codes, and uses another square loss to form consensus predictions of latent binary vectors. Extensive experiments on large-scale benchmark databases demonstrate the effectiveness of the proposed method, which outperforms recent state-of-the-art unsupervised hashing methods with significantly better ranking performance.


1. Introduction

Hashing has attracted considerable attention for tackling large-scale tasks because it can encode originally high-dimensional data into short binary codes while maintaining the similarity of neighbors, thereby yielding significant savings in computation and storage [1,2]. Hashing methods can be roughly divided into two main classes, supervised and unsupervised, based on whether semantic labels are used. Supervised hashing [3,4] usually requires a large number of labels to achieve satisfactory performance; however, label annotation is usually time-consuming and expensive. By contrast, unsupervised hashing [5] does not need semantic labels and aims to discover the significant intrinsic patterns or structures hidden in data and, meanwhile, encode them into binary codes. Thus, unsupervised hashing has great potential for large-scale applications.
Early efforts focused on data-independent hashing methods [1,6], which utilize random projections or permutations to construct hash functions; they usually require long bits to attain high precision per hash table and multiple tables to improve the recall. Data-dependent hashing usually produces more compact binary codes with higher precision and recall. Although numerous data-dependent hashing methods have been proposed and have achieved promising performance under various similarity measures, such as the Euclidean distance and the $\ell_1$-norm distance [7], they are still far from satisfactory for many tasks measured by semantic similarity. Most of them [5,8,9] learn hash functions using hand-crafted features, which might not represent the image content optimally [10].
Recently, because convolutional neural networks (CNNs) exhibit a powerful capability for automatically learning feature representations, several unsupervised deep hashing methods [11,12,13,14,15] adopt CNNs to learn image representations and hash functions. Most of these methods [11,12,13] utilize either the quantization loss or the data reconstruction error to train models without considering the similarity relationship among data, thereby decreasing their retrieval performance in some applications. To address this problem, similarity-adaptive deep hashing (SADH) [15] takes into account the pair similarity among training data and alternates over three major modules: training deep hash models, updating a similarity graph, and learning binary codes. Although SADH achieves better performance than most previous hashing algorithms, its performance might still be restricted by the noisy information in the similarity graph, i.e., positive values are often given to some dissimilar pairs. In addition, anchor-based models have made significant advances in semi-supervised deep hashing [16], and a scalable optimization mechanism has been proposed [17].
Motivated by the observations mentioned earlier, in this paper, we propose a novel, robust, yet straightforward method, unsupervised deep pairwise hashing (UDPH), to effectively and robustly utilize the pair similarity between training data and unlabeled anchors. The framework of the proposed UDPH is presented in Figure 1. The major contributions of this paper are listed as follows:
  • Different from existing anchor-based methods, we enhance the robustness of similarity relations between training data and unlabeled anchors by creating an anchor-based pairwise similarity matrix to preserve their similarity and by building a robust ensemble matrix via a weighted average;
  • We propose a novel loss function composed of three terms: a weighted cross-entropy loss to exploit the similarity information between training data and multiple anchors, a mean square loss to reduce the gap between latent binary vectors and desired codes, and another mean square loss to form consensus predictions of latent binary vectors;
  • Extensive experiments on large-scale benchmark databases illustrate the superior performance of UDPH over recent state-of-the-art methods. Additionally, ablation experiments also demonstrate the effectiveness of the three terms in the proposed loss function.
Figure 1. The framework of the proposed method, UDPH, which utilizes the VGG-16 model as the backbone network. MSE denotes the mean square error/loss, and WCE denotes the weighted cross-entropy loss. WCE uses the similarity obtained from the ensemble pairwise similarity matrix as the weight; the MSE with $\gamma_1$ measures the loss between the absolute values of latent binary vectors and vectors of ones, and the MSE with $\gamma_2$ measures the loss between latent binary vectors and target latent binary vectors. $f_\theta(\cdot)$ represents a convolutional neural network and $H$ denotes the latent binary vectors. $\alpha_1$ and $\alpha_2$ are non-negative values that adjust the weights of $S$ and $H$, respectively.

2. Related Work

In this section, we briefly review some popular unsupervised non-deep and deep hashing algorithms and describe how they differ from the proposed method, UDPH.
Unsupervised non-deep hashing usually adopts hand-crafted features to learn hash functions. Popular hashing algorithms [5,8,18,19] learn binary codes via the strategy of "relaxation + thresholding", which might degrade their performance due to the quantization error accumulated between the binary codes and their relaxed counterparts. To alleviate this issue, numerous discrete hashing algorithms [9,20,21,22,23] have been proposed to directly learn binary codes.
Unsupervised deep hashing usually utilizes CNNs to learn image features and hash functions. Deep hashing (DH) [11], DeepBit [13], and unsupervised deep binary descriptors (UDBD) [24] mainly adopt the quantization loss to learn image representations and hash functions. Unsupervised hashing with a binary deep neural network (UH-BDNN) [12] utilizes the reconstruction loss to encourage the similarity among samples. Discriminative attributes representations (DAR) [25] first trains a CNN coupled with unsupervised discriminative clustering and then utilizes the cluster membership as soft supervision to learn hash functions. Unsupervised triplet hashing (UTH) [26] exploits an unsupervised triplet loss to minimize the distance between an anchor image and its rotated version while maximizing the distance between the anchor image and a random image. HashGAN [14] adopts three networks (a generator, a discriminator, and an encoder) to learn hash functions. Similarity-adaptive deep hashing (SADH) [15] constructs a similarity graph using the pair similarity between real training data and anchors, and learns hash functions by alternating over three major modules: training deep hash models, updating a similarity graph, and learning binary codes. Because it effectively explores the similarity information among training data, SADH has achieved state-of-the-art retrieval performance on the CIFAR-10 [27] and NUS-WIDE [28] databases. However, its performance might still be restricted by the noisy information in the similarity graph. Compared to SADH, UDPH takes advantage of more robust similarity information by creating a strong ensemble anchor-based pairwise similarity matrix. Additionally, unlike SADH, which approximately solves an NP-hard problem to learn binary codes for model training, the proposed UDPH directly utilizes a novel loss function to robustly train models for exploring the semantic similarity information among training data and anchors.

3. Methodology

In this section, we first define an anchor-based pairwise similarity matrix and its ensemble version, and then propose a novel loss function to effectively and robustly exploit the similarity information between training samples and anchors for model training.

3.1. Anchor-Based Pairwise Similarity Matrix

Given training data $X = \{x_i\}_{i=1}^n$ and an $L$-layer deep hashing network $f_\theta(\cdot)$ (see Figure 1), $n$ is the number of samples and $\theta$ denotes the network parameters. Note that we utilize $f_\theta^l(x_i)$ ($1 \le l \le L$) to represent the output of the $l$-th layer for the sample $x_i$. Because $n$ is usually very large, it is inefficient or even impractical to calculate the pairwise similarity of every two training samples. Fortunately, the similarity relationship among training data can be propagated through multiple anchors [8,29]. We randomly select $m$ ($m \ll n$) samples from $X$ as anchors $A = \{a_j\}_{j=1}^m$ ($A \subset X$) to construct a scalable anchor-based pairwise similarity matrix $S \in \mathbb{R}^{n \times m}$. Additionally, because samples that are close in the feature space should be close in the output space (local consistency) [30], we select the $p(t)$ closest anchors of each training sample as neighbors and then calculate their similarities, where $t$ is the current training epoch and $p(t)$ is a piecewise linear function of $t$ that gradually exploits more useful information. However, when only the similarity information is used to train models, the features easily collapse together. To avoid this issue and, meanwhile, exploit more useful information, we select the $p(t)$ farthest anchors of each training sample as non-neighbors and then calculate their dissimilarities.
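The paper fixes only the shape of $p(t)$ (piecewise linear in the epoch $t$); as a concrete illustration, a minimal sketch of such a schedule is given below, where the breakpoints `p_start`, `p_end`, and `ramp_epochs` are our own hypothetical choices, not the authors' values.

```python
def p(t, p_start=5, p_end=50, ramp_epochs=30):
    """Hypothetical piecewise linear schedule for the number of neighbors
    (and non-neighbors) selected per training sample at epoch t."""
    if t >= ramp_epochs:
        return p_end  # constant segment after the linear ramp
    return int(p_start + (p_end - p_start) * t / ramp_epochs)
```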
Specifically, let $x^f = f_\theta^{(L-1)}(x) \in \mathbb{R}^d$ and $A^f = f_\theta^{(L-1)}(A) \in \mathbb{R}^{m \times d}$ denote the feature representations of one training sample $x$ and the anchors $A$ at the $(L-1)$-th layer, respectively. Following [9,15], their similarities can be calculated by leveraging a nonlinear data-to-anchor mapping $(\mathbb{R}^d \to \mathbb{R}^m)$:
$$ s_s(x) = \left[ \delta_{s1} e^{-D^2(x^f, a_1^f)/\varrho_s},\ \delta_{s2} e^{-D^2(x^f, a_2^f)/\varrho_s},\ \ldots,\ \delta_{sm} e^{-D^2(x^f, a_m^f)/\varrho_s} \right]^T / M_s, $$
where $\delta_{sj} \in \{0, 1\}$, and $\delta_{sj} = 1$ if $a_j$ is one of the $p(t)$ closest anchors of $x$ based on the distance function $D(\cdot)$ (e.g., Euclidean distance); $\varrho_s$ is a bandwidth parameter for the similarity calculation; and $M_s = \sum_{j=1}^m \delta_{sj} e^{-D^2(x^f, a_j^f)/\varrho_s}$ so that $\| s_s(x) \|_1 = 1$. Similarly, we calculate their dissimilarities by:
$$ s_d(x) = \left[ \delta_{d1} e^{-D^2(x^f, a_1^f)/\varrho_d},\ \delta_{d2} e^{-D^2(x^f, a_2^f)/\varrho_d},\ \ldots,\ \delta_{dm} e^{-D^2(x^f, a_m^f)/\varrho_d} \right]^T / M_d, $$
where $\delta_{dj} \in \{0, 1\}$ and $\delta_{dj} = 1$ if $a_j$ is one of the $p(t)$ farthest anchors of $x$ according to $D(\cdot)$; $\varrho_d$ is a bandwidth parameter for the dissimilarity calculation; and $M_d = \sum_{j=1}^m \delta_{dj} e^{-D^2(x^f, a_j^f)/\varrho_d}$ ensures $\| s_d(x) \|_1 = 1$. We then obtain the anchor-based pairwise similarity matrix $S = [s(x_1), s(x_2), \ldots, s(x_n)]^T \in \mathbb{R}^{n \times m}$, where $s(x) = s_s(x) - s_d(x)$. Let $\mathcal{M}$ represent the neighbor set and $\mathcal{C}$ denote the non-neighbor set. For clarity, $S$ can be calculated by:
$$ s_{ij} = \begin{cases} e^{-D^2(x_i^f, a_j^f)/\varrho_s} / M_s & (x_i, a_j) \in \mathcal{M}, \\ -\, e^{-D^2(x_i^f, a_j^f)/\varrho_d} / M_d & (x_i, a_j) \in \mathcal{C}, \\ 0 & \text{otherwise}. \end{cases} \quad (1) $$
To attain a robust relationship between the training data and anchors, we create a strong ensemble anchor-based pairwise similarity matrix $\tilde{S}$ by applying a weighted average to $S$ over multiple previous training epochs, i.e., $\tilde{S} = \alpha_1 \tilde{S} + (1 - \alpha_1) S$, where $\alpha_1 \in [0, 1)$ is a momentum term that determines how far the ensemble reaches into the training history.
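For concreteness, the following NumPy sketch shows how Equation (1) and the ensemble update could be implemented; the bandwidths `rho_s` and `rho_d`, and the helper name itself, are our assumptions, and the feature matrices are the $(L-1)$-th-layer outputs described above.

```python
import numpy as np
from scipy.spatial.distance import cdist

def build_similarity_matrix(X_f, A_f, p_t, rho_s=1.0, rho_d=1.0):
    """Sketch of Equation (1). X_f: (n, d) features of training samples;
    A_f: (m, d) features of anchors; p_t: number of closest (neighbor) and
    farthest (non-neighbor) anchors kept per sample."""
    D2 = cdist(X_f, A_f, metric="sqeuclidean")     # D^2(x_i^f, a_j^f)
    order = np.argsort(D2, axis=1)                 # anchors sorted by distance
    n, m = D2.shape
    rows = np.arange(n)[:, None]
    near, far = order[:, :p_t], order[:, -p_t:]
    s_s = np.zeros((n, m)); s_s[rows, near] = np.exp(-D2 / rho_s)[rows, near]
    s_d = np.zeros((n, m)); s_d[rows, far] = np.exp(-D2 / rho_d)[rows, far]
    s_s /= s_s.sum(axis=1, keepdims=True)          # ||s_s(x)||_1 = 1
    s_d /= s_d.sum(axis=1, keepdims=True)          # ||s_d(x)||_1 = 1
    return s_s - s_d                               # s(x) = s_s(x) - s_d(x)

# Ensemble update after each epoch (alpha1 = 0.9 in the experiments):
# S_tilde = alpha1 * S_tilde + (1 - alpha1) * S
```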

3.2. Formulation and Procedure

Hashing projects original data from a high-dimensional space into a low-dimensional binary space while preserving their similarity relations. Specifically, given the $L$-layer hashing network $f_\theta(\cdot)$, for any sample $x$, let $f_\theta^L(x) \in \mathbb{R}^r$ be the output of the $L$-th layer of the network, where $r$ is the number of hash bits. The hash function is defined as $h(x) = \mathrm{sgn}(f_\theta^L(x))$, where $\mathrm{sgn}(\cdot)$ is the sign function with $\mathrm{sgn}(z) = 1$ for $z > 0$ and $\mathrm{sgn}(z) = -1$ otherwise.
For one training sample $x_i$ and one anchor $a_j$, we have $-r \le h(x_i) \circ h(a_j) \le r$, where $\circ$ denotes the inner product. To encourage $h(x_i) \circ h(a_j) > 0$ if $(x_i, a_j) \in \mathcal{M}$ and $h(x_i) \circ h(a_j) < 0$ if $(x_i, a_j) \in \mathcal{C}$, we define $u_{ij} = \sigma(\lambda\, h(x_i) \circ h(a_j)) = \frac{1}{1 + e^{-\lambda\, h(x_i) \circ h(a_j)}}$ to represent the similarity between $x_i$ and $a_j$ in the low-dimensional binary space, so that $u_{ij} \to 1$ if $(x_i, a_j) \in \mathcal{M}$ and $u_{ij} \to 0$ if $(x_i, a_j) \in \mathcal{C}$, where $\sigma(\cdot)$ is the sigmoid function and $\lambda > 0$ is a positive constant that regularizes the magnitude of $h(x_i) \circ h(a_j)$.
Because $\mathrm{sgn}(\cdot)$ is non-differentiable, it is usually replaced by a relaxed differentiable function for model training. Among the many possible choices, for simplicity we adopt the differentiable hyperbolic tangent function $\tanh(\cdot)$ in the hashing network, which yields a latent binary vector $h = \tanh(f_\theta^L(x)) \in [-1, 1]^r$ for $x$. Let $B$ and $B_a$ be the index sets of mini-batches randomly selected from the training data $X$ and the anchors $A$, respectively. To exploit the similarity information contained in the ensemble pairwise similarity matrix $\tilde{S}$, one common strategy is to first learn binary codes by solving a non-differentiable optimization problem and then utilize them for model training. However, solving such a problem is usually difficult and time-consuming. To avoid this issue, we propose a novel strategy that uses a weighted cross-entropy loss to directly train the network. Because $\tilde{S}$ encodes the similarity weight of each pair, we introduce a matrix $W$ for the cross-entropy loss to directly indicate whether a pair is similar, where $w_{ij}$ is determined by $\tilde{s}_{ij}$, i.e., $w_{ij} = 1$ when $\tilde{s}_{ij} > 0$ and $w_{ij} = 0$ when $\tilde{s}_{ij} < 0$. The weighted cross-entropy loss is then:
$$ J_{wce} = -\frac{1}{\sum_{i \in B, j \in B_a} |\tilde{s}_{ij}|} \sum_{i \in B, j \in B_a} |\tilde{s}_{ij}| \big( w_{ij} \log u_{ij} + (1 - w_{ij}) \log(1 - u_{ij}) \big), \quad (2) $$
$$ \text{s.t.} \quad h_i = \tanh(f_\theta^L(x_i)), \quad h_j = \tanh(f_\theta^L(a_j)), \quad u_{ij} = \frac{1}{1 + e^{-\lambda h_i h_j^T}}, $$
where $|\cdot|$ is the absolute value function. Note that when $\tilde{s}_{ij} \ne 0$ ($j \in B_a$), we do not round $\tilde{s}_{ij}$ to $1$ or $-1$, in order to decrease the effect of noisy similarity information.
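A minimal PyTorch sketch of Equation (2) is given below, assuming mini-batch tensors of latent binary vectors; the small epsilon added inside the logarithms is a numerical-stability assumption on our part.

```python
import torch

def weighted_cross_entropy(h_b, h_a, s_tilde, lam=0.8, eps=1e-12):
    """Sketch of Equation (2). h_b: (|B|, r) latent vectors tanh(f(x_i));
    h_a: (|Ba|, r) latent vectors of anchors; s_tilde: (|B|, |Ba|) block of
    the ensemble similarity matrix S_tilde."""
    u = torch.sigmoid(lam * (h_b @ h_a.t()))   # u_ij in (0, 1)
    w = (s_tilde > 0).float()                  # w_ij = 1 iff s_tilde_ij > 0
    weight = s_tilde.abs()                     # |s_tilde_ij|; zeros drop out
    ce = -(w * torch.log(u + eps) + (1 - w) * torch.log(1 - u + eps))
    return (weight * ce).sum() / weight.sum().clamp_min(eps)
```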
However, there exists a gap between the latent binary vector $h$ and the desired binary codes $h(x)$, which potentially decreases the model performance [15]. To reduce this gap, we utilize a square loss as follows:
$$ J_{hmse} = \big\| \, |h_i| - \mathbf{1}_r \big\|_2^2, \quad (3) $$
where $\mathbf{1}_r \in \mathbb{R}^r$ is a row vector with all entries equal to one.
Recent semi-supervised methods [31,32,33] illustrate that forming consensus predictions under different configurations (such as training epochs, dropout, and augmentation conditions) for each training sample can improve the model performance when exploring the semantic information of unlabeled data. Inspired by these methods, we aim to form a consensus prediction of the latent binary vector for each training sample in order to boost the model performance. Specifically, similar to [32], we create a target latent binary vector $\tilde{h}_i$ for $x_i$ by applying an exponential moving average (EMA) to the $h_i$ of multiple previous training epochs, i.e., we first accumulate $h_i$ into an ensemble vector $h_i^e$ by $h_i^e = \alpha_2 h_i^e + (1 - \alpha_2) h_i$, and then calculate $\tilde{h}_i = h_i^e / (1 - \alpha_2^t)$, where $\alpha_2 \in [0, 1)$ is a momentum term that determines how much $h_i^e$ is affected by previous training epochs, $1 - \alpha_2^t$ corrects the start-up bias, and $t$ is the current training epoch. Then, we minimize the difference between $h_i$ and $\tilde{h}_i$ with the following square loss:
$$ J_{smse} = \| h_i - \tilde{h}_i \|_2^2. \quad (4) $$
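The EMA target update itself is short; a small sketch follows, with $\alpha_2 = 0.6$ taken from the experimental settings and the detach call our assumption that no gradient flows into the target, in the spirit of temporal ensembling [32].

```python
import torch

def update_targets(h_e, h, t, alpha2=0.6):
    """One per-epoch EMA step on the latent binary vectors h, followed by
    start-up bias correction (Section 3.2)."""
    h_e = alpha2 * h_e + (1 - alpha2) * h.detach()  # accumulate history
    h_tilde = h_e / (1 - alpha2 ** t)               # correct zero-init bias
    return h_e, h_tilde
```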
To simultaneously exploit the pairwise similarity information, reduce the gap between $h$ and $h(x)$, and form a consensus prediction for each training sample, we integrate the three terms (Equations (2)–(4)) to learn the model parameters $\theta$ as follows:
$$ J(\theta) = J_{wce} + \frac{1}{l_B r} \sum_{i \in B} \big( \gamma_1 J_{hmse} + \gamma_2 J_{smse} \big) = -\frac{1}{\sum_{i \in B, j \in B_a} |\tilde{s}_{ij}|} \sum_{i \in B, j \in B_a} |\tilde{s}_{ij}| \big( w_{ij} \log u_{ij} + (1 - w_{ij}) \log(1 - u_{ij}) \big) + \frac{1}{l_B r} \sum_{i \in B} \Big( \gamma_1 \big\| |h_i| - \mathbf{1}_r \big\|_2^2 + \gamma_2 \| h_i - \tilde{h}_i \|_2^2 \Big), \quad (5) $$
$$ \text{s.t.} \quad h_i = \tanh(f_\theta^L(x_i)), \quad h_j = \tanh(f_\theta^L(a_j)), \quad u_{ij} = \frac{1}{1 + e^{-\lambda h_i h_j^T}}, $$
where $l_B$ is the size of $B$, and $\gamma_1 \ge 0$ and $\gamma_2 \ge 0$ weight the two corresponding regularization terms, respectively.
Based on Equation (5), we can adopt any optimizer, e.g., Adam [34], to learn the model parameters $\theta$. For clarity, we present the detailed procedure for solving Equation (5) in Algorithm 1. Note that, to boost the model performance, training samples are usually augmented; we utilize $g(\cdot)$ to denote the augmentation function. Additionally, $f_\theta(g(x_{i \in B}), g(a_{j \in B_a}))$ denotes the output of the $L$-th layer followed by $\tanh(\cdot)$ for the sample $x_i$ and the anchor $a_j$. After obtaining $\theta$, we attain the binary codes of any training or query sample $x$ by $h(x) = \mathrm{sgn}(f_\theta^L(x))$.
Algorithm 1: UDPH
Input: data $X = \{x_i\}_{i=1}^n$, anchors $A = \{a_j\}_{j=1}^m$, bit number $r$, parameters $\lambda$, $\gamma_1$, and $\gamma_2$, piecewise linear function $p(t)$, ensembling momentums $\alpha_1$ and $\alpha_2$, training epoch number $T$, network with parameters $\theta$: $f_\theta(\cdot)$, stochastic input augmentation function $g(\cdot)$
Output: parameters $\theta$
 1: Initialization: initialize $\theta$ with the VGG-16 model pre-trained on ImageNet; construct $S$ by Equation (1); $\tilde{S} \leftarrow S$ (ensemble pairwise similarity matrix); $H^e \leftarrow 0^{n \times r}$ (ensemble latent binary vectors); $\tilde{H} \leftarrow 0^{n \times r}$ (target latent binary vectors)
 2: for $t$ in $(1, T)$ do
 3:   for each mini-batch $B$ and $B_a$ do
 4:     $h_{i \in B}, h_{j \in B_a} \leftarrow f_\theta(g(x_{i \in B}), g(a_{j \in B_a}))$
 5:     loss $\leftarrow$ Equation (5)
 6:     update $\theta$ using an optimizer, e.g., Adam [34]
 7:   end for
 8:   construct $S$ by Equation (1)
 9:   $\tilde{S} \leftarrow \alpha_1 \tilde{S} + (1 - \alpha_1) S$
10:   $H^e \leftarrow \alpha_2 H^e + (1 - \alpha_2) H$
11:   $\tilde{H} \leftarrow H^e / (1 - \alpha_2^t)$
12: end for
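Putting the pieces together, the following PyTorch sketch condenses Algorithm 1, reusing the `p`, `build_similarity_matrix`, `weighted_cross_entropy`, and `update_targets` sketches above. The learning rate, batch size, epoch count, and the `features` helper (assumed to return the $(L-1)$-th-layer outputs) are our assumptions, and the augmentation $g(\cdot)$ is omitted for brevity; $\gamma_1$, $\gamma_2$, $\alpha_1$, $\alpha_2$, and $\lambda$ use the paper's reported values.

```python
import torch

def train_udph(f_theta, X, A, T=60, r=32, batch=64,
               alpha1=0.9, alpha2=0.6, gamma1=0.01, gamma2=0.1, lam=0.8):
    """Condensed sketch of Algorithm 1. f_theta maps images to r unbounded
    outputs; f_theta.features is an assumed helper giving the (L-1)-th
    layer features used by Equation (1)."""
    opt = torch.optim.Adam(f_theta.parameters(), lr=1e-4)
    n, m = len(X), len(A)
    S_tilde = None
    h_e = torch.zeros(n, r)          # ensemble latent binary vectors H^e
    h_tilde = torch.zeros(n, r)      # target latent binary vectors H~
    h_all = torch.zeros(n, r)        # latest latent binary vectors H
    for t in range(1, T + 1):
        with torch.no_grad():        # steps 1/8: construct S by Equation (1)
            S = torch.as_tensor(build_similarity_matrix(
                f_theta.features(X).cpu().numpy(),
                f_theta.features(A).cpu().numpy(), p(t)), dtype=torch.float32)
        S_tilde = S if S_tilde is None else alpha1 * S_tilde + (1 - alpha1) * S
        for idx in torch.randperm(n).split(batch):   # mini-batches B
            ja = torch.randperm(m)[:batch]           # anchor mini-batch B_a
            h_i = torch.tanh(f_theta(X[idx]))
            h_j = torch.tanh(f_theta(A[ja]))
            loss = (weighted_cross_entropy(h_i, h_j, S_tilde[idx][:, ja], lam)
                    + (gamma1 * (h_i.abs() - 1.0).pow(2).sum(1)
                       + gamma2 * (h_i - h_tilde[idx]).pow(2).sum(1)).mean() / r)
            opt.zero_grad(); loss.backward(); opt.step()
            h_all[idx] = h_i.detach()
        h_e, h_tilde = update_targets(h_e, h_all, t, alpha2)  # steps 10-11
    return f_theta

def encode(f_theta, x):
    # Binary codes for retrieval: h(x) = sgn(f_theta(x)).
    return torch.sign(f_theta(x))
```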

4. Experiments

To evaluate the proposed UDPH, we conduct extensive experiments on two large-scale benchmark databases: CIFAR-10 [27] and NUS-WIDE [28]. CIFAR-10 contains 60,000 32 × 32 color images evenly distributed over 10 classes. These images are split into a training set of 50,000 images and a testing set of 10,000 images. NUS-WIDE is composed of 269,648 images collected from Flickr [28]. There are 81 semantic concepts in total, and each image may contain multiple labels. Similar to [8,15], we select the 21 most frequent labels for evaluation and obtain around 195,834 color images in total. Following [15], for these two databases, we randomly select 100 images from each class to construct the query set and use the rest as the training/gallery set.

4.1. Experimental Settings

We compare UDPH against six popular non-deep unsupervised hashing algorithms (LSH [1], SH [5], AGH [8], ITQ [21], SpH [18], and SGH [19]) and eight state-of-the-art unsupervised deep hashing algorithms (DH [11], DAR [25], UH-BDNN [12], DeepBit [13], HashGAN [14], UTH [26], UDBD [24], and SADH [15]). For the non-deep hashing algorithms, we evaluate them using hand-crafted features. Specifically, each image in CIFAR-10 is represented by a 512-dimensional GIST vector [35], and each image in NUS-WIDE is represented as a 500-dimensional bag-of-words (BoW) feature vector. Additionally, we also report their performance on deep features extracted from the fc7 layer of the VGG-16 model pre-trained on the ImageNet database. Among the eight deep hashing algorithms, DeepBit, UTH, UDBD, and SADH utilize the VGG-16 model as their backbone networks. For the proposed UDPH, we empirically set the piecewise linear function $p(t)$ and adopt the parameters $\gamma_1 = 0.01$, $\gamma_2 = 0.1$, $\alpha_1 = 0.9$, $\alpha_2 = 0.6$, and $\lambda = 0.8$ on both CIFAR-10 and NUS-WIDE, except $\lambda = 1.6$ on CIFAR-10 at 16-bit. We randomly select 500 images from the training data of the two databases as anchors, i.e., $m = 500$.
Following [12,15], we adopt semantic similarity as the ground truth in our experiments. For NUS-WIDE, two images are neighbors if they share at least one common label. We evaluate the performance of the aforementioned hashing algorithms using mean average precision (MAP), MAP@1000, Precision@5000, and Precision@1000. Here, MAP denotes the mean of the average precision of query images over all images in the gallery set; MAP@1000 is the MAP calculated over the top 1000 returned images from the gallery set; Precision@5000 denotes the proportion of correctly retrieved samples among the top 5000 ranked images, and Precision@1000 is defined similarly. We run the experiments five times and report the average results.
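As a reference for how these metrics are computed, the sketch below ranks the gallery by Hamming distance and evaluates average precision and Precision@k for a single query; this is a common formulation, and the exact tie-breaking and top-k normalization conventions used in the paper are not specified, so treat those details as assumptions.

```python
import numpy as np

def average_precision(query_code, gallery_codes, relevant, k=None):
    """Average precision of one query; `relevant` is a boolean array marking
    gallery items that share a label with the query. Passing k=1000 gives
    the per-query term of MAP@1000."""
    dist = (query_code != gallery_codes).sum(axis=1)   # Hamming distances
    rel = relevant[np.argsort(dist, kind="stable")]
    rel = rel[:k] if k is not None else rel
    if rel.sum() == 0:
        return 0.0
    prec = np.cumsum(rel) / (np.arange(len(rel)) + 1)  # precision at each rank
    return float((prec * rel).sum() / rel.sum())

def precision_at_k(query_code, gallery_codes, relevant, k):
    """Proportion of correct results among the top-k, e.g., Precision@5000."""
    dist = (query_code != gallery_codes).sum(axis=1)
    return float(relevant[np.argsort(dist, kind="stable")][:k].mean())
```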

4.2. Experimental Results and Analysis

Table 1 presents the retrieval results of the non-deep and deep hashing algorithms at 16-, 32-, and 64-bit, selecting 100 query images per class and using the remaining ones as the training/gallery set of the CIFAR-10 and NUS-WIDE databases, respectively. Note that we do not present the results of DH, DAR, HashGAN, UTH, and UDBD on NUS-WIDE, because there are no publicly reported results. As Table 1 shows, the non-deep hashing algorithms with deep features extracted from the pre-trained VGG-16 model achieve better performance than with hand-crafted features, probably because the data distribution of ImageNet is similar to that of CIFAR-10 and NUS-WIDE. Additionally, SADH outperforms all the other non-deep and deep hashing algorithms except UDPH, because it effectively exploits the adaptive pairwise similarity among data. Moreover, UDPH obtains superior performance over all the non-deep and deep hashing algorithms, especially in terms of MAP. Specifically, on CIFAR-10, the gain of UDPH in MAP over the best competitor, SADH, ranges from a relative 2.87% at 16-bit ((39.81 - 38.70)/38.70) to a relative 14.78% at 64-bit ((43.25 - 37.68)/37.68); on NUS-WIDE, the relative gain of UDPH in MAP over the best competitor ranges from 1.01% to 12.46%. UDPH also obtains better performance in terms of MAP@1000, Precision@5000, and Precision@1000 on the two databases, except for Precision@5000 at 16-bit on NUS-WIDE. These results demonstrate the effectiveness and strength of the proposed UDPH. Note that SADH usually achieves its best MAP with short binary codes, e.g., 16-bit, while UDPH obtains better performance with an increasing number of bits. This might be because SADH can effectively preserve the similarity information of the low-rank graph matrix using short binary codes, whereas UDPH with longer binary codes can better preserve the similarity relationship between training data and anchors.
On CIFAR-10, some popular methods, including DBD-MQ [36] and GraphBit [37], utilize 50,000 images as the training/gallery set and 10,000 images as the query set, and they adopt relatively shallow CNNs. Following the experimental protocols of DBD-MQ and GraphBit, to better illustrate the strength of the proposed UDPH, we adopt the shallow network AlexNet [38], pre-trained on the ImageNet database, as the backbone network. Additionally, UDPH with AlexNet adopts the same parameter settings as with VGG-16. Table 2 presents the ranking performance in terms of MAP@1000, which further demonstrates the superior performance of UDPH. Specifically, its relative gain over the best competitor, GraphBit, is 17.40%, 18.84%, and 18.35% at 16-, 32-, and 64-bit, respectively.

4.3. Ablation Studies

Here, we analyze the influence of the ensemble anchor-based pairwise similarity matrix $\tilde{S}$ and of the two terms, $J_{hmse}$ and $J_{smse}$, on UDPH using VGG-16 as the backbone network. We randomly select 100 images per class from the CIFAR-10 database to construct a query set, and 500 images per class from the remaining ones to construct a set for training and retrieval. Figure 2a shows the MAP of UDPH under four different settings of $\tilde{S}$. Specifically, (i) "$\alpha_1 = 0.9$, weight" means using the entries of $\tilde{S}$ as the weights in $J_{wce}$; (ii) "$\alpha_1 = 0.9$, equal" means setting $\tilde{s}_{ij} = 1$ if $\tilde{s}_{ij} > 0$ and $\tilde{s}_{ij} = -1$ if $\tilde{s}_{ij} < 0$; (iii) "$\alpha_1 = 0$, update $S$" denotes $\tilde{S} = S$ with $S$ updated every training epoch; (iv) "$\alpha_1 = 0$, fix $S$" denotes $\tilde{S} = S$ without updating $S$ during training. Figure 2a suggests that $\tilde{S}$ can boost the model performance and smooth the training process, i.e., the best or second-best MAP is achieved in the last several training epochs. Figure 2b presents the MAP of UDPH without $J_{hmse}$, i.e., $\gamma_1 = 0$; it illustrates the effectiveness of $J_{hmse}$ and the importance of reducing the gap between latent binary vectors and binary codes. Figure 2c displays the result of UDPH without $J_{smse}$, i.e., $\gamma_2 = 0$; it demonstrates that forming consensus predictions of latent binary vectors also improves the model performance.

5. Conclusions

In this paper, we propose a novel unsupervised deep pairwise hashing method that effectively and robustly takes advantage of the similarity information between training samples and anchors. We first construct an anchor-based pairwise similarity matrix, upon which we create a strong and robust ensemble pairwise similarity matrix to preserve the similarity and dissimilarity relations. Then, we propose a novel loss function consisting of a weighted cross-entropy loss, which uses the similarity and dissimilarity between training samples and anchors as weights to explore their semantic similarity relationship; a square loss to reduce the gap between latent binary vectors and binary codes; and another square loss to form consensus predictions of latent binary vectors for boosting the model performance. Experiments on benchmark databases demonstrate the strength of the proposed method and the effectiveness of each term in the proposed loss function. In the future, it is promising to apply the robust ensemble pairwise similarity matrix and the weighted cross-entropy loss to unsupervised or semi-supervised deep methods, because they can effectively explore the semantic similarity information hidden in unlabeled data. Exploring advanced backbones, such as ResNet, to further improve performance is another research direction.

Author Contributions

Conceptualization, X.S.; methodology, X.S. and Y.M.; writing—original draft preparation, Y.M.; writing—review and editing, X.S. and Z.G.; supervision, Q.L. and Z.G.; funding acquisition, Q.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the Intelligent perception and computing innovation platform of the Shenzhen Institute of Information Technology (No. PT2019E001), and the Guangdong v2x data security key technology and the expanded application R&D Industry Education Integration Innovation Platform (No. PT2021C002).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: CIFAR-10, https://www.cs.toronto.edu/~kriz/cifar.html; NUS-WIDE, https://lms.comp.nus.edu.sg/wp-content/uploads/2019/research/nuswide/NUS-WIDE.html (accessed on 18 January 2022).

Conflicts of Interest

The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

References

  1. Gionis, A.; Indyk, P.; Motwani, R. Similarity Search in High Dimensions via Hashing. In Proceedings of the 25th International Conference on Very Large Data Bases (VLDB '99), Edinburgh, Scotland, UK, 7–10 September 1999; pp. 518–529.
  2. Torralba, A.; Fergus, R.; Weiss, Y. Small codes and large image databases for recognition. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; pp. 1–8.
  3. Liu, W.; Wang, J.; Ji, R.; Jiang, Y.G.; Chang, S.F. Supervised hashing with kernels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 2074–2081.
  4. Xia, R.; Pan, Y.; Lai, H.; Liu, C.; Yan, S. Supervised hashing for image retrieval via image representation learning. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, Quebec City, QC, Canada, 27–31 July 2014.
  5. Weiss, Y.; Torralba, A.; Fergus, R. Spectral hashing. In Proceedings of the 21st International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–10 December 2008; pp. 1753–1760.
  6. Broder, A.Z.; Charikar, M.; Frieze, A.M.; Mitzenmacher, M. Min-wise independent permutations. J. Comput. Syst. Sci. 2000, 60, 630–659.
  7. Shen, F.; Liu, W.; Zhang, S.; Yang, Y.; Shen, H.T. Learning Binary Codes for Maximum Inner Product Search. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 11–18 December 2015; pp. 4148–4156.
  8. Liu, W.; Wang, J.; Kumar, S.; Chang, S.F. Hashing with Graphs. In Proceedings of the 28th International Conference on Machine Learning, Washington, DC, USA, 28 June–2 July 2011; pp. 1–8.
  9. Liu, W.; Mu, C.; Kumar, S.; Chang, S.F. Discrete Graph Hashing. Adv. Neural Inf. Process. Syst. 2014, 27, 3419–3427.
  10. Li, W.J.; Wang, S.; Kang, W.C. Feature learning based deep supervised hashing with pairwise labels. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI '16), New York, NY, USA, 9–15 July 2016; pp. 1711–1717.
  11. Liong, V.E.; Lu, J.; Wang, G.; Moulin, P.; Zhou, J. Deep hashing for compact binary codes learning. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 2475–2483.
  12. Do, T.T.; Doan, A.D.; Cheung, N.M. Learning to Hash with Binary Deep Neural Network. In European Conference on Computer Vision 2016; Springer: Cham, Switzerland, 2016; pp. 219–234.
  13. Lin, K.; Lu, J.; Chen, C.S.; Zhou, J. Learning Compact Binary Descriptors with Unsupervised Deep Neural Networks. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1183–1192.
  14. Ghasedi Dizaji, K.; Zheng, F.; Sadoughi, N.; Yang, Y.; Deng, C.; Huang, H. Unsupervised deep generative adversarial hashing network. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 3664–3673.
  15. Shen, F.; Xu, Y.; Liu, L.; Yang, Y.; Huang, Z.; Shen, H.T. Unsupervised deep hashing with similarity-adaptive and discrete optimization. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 3034–3044.
  16. Shi, X.; Guo, Z.; Xing, F.; Liang, Y.; Yang, L. Anchor-Based Self-Ensembling for Semi-Supervised Deep Pairwise Hashing. Int. J. Comput. Vis. 2020, 128, 2307–2324.
  17. Shi, X.; Xing, Z.; Zhang, Z.; Sapkota, M.; Guo, Z.; Yang, L. A Scalable Optimization Mechanism for Pairwise Based Discrete Hashing. IEEE Trans. Image Process. 2021, 30, 1130–1142.
  18. Heo, J.P.; Lee, Y.; He, J.; Chang, S.F.; Yoon, S.E. Spherical hashing. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 2957–2964.
  19. Jiang, Q.Y.; Li, W.J. Scalable graph hashing with feature transformation. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, Buenos Aires, Argentina, 25–31 July 2015.
  20. Kulis, B.; Darrell, T. Learning to hash with binary reconstructive embeddings. In Proceedings of the 22nd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 7–10 December 2009; pp. 1042–1050.
  21. Gong, Y.; Lazebnik, S.; Gordo, A.; Perronnin, F. Iterative Quantization: A Procrustean Approach to Learning Binary Codes for Large-Scale Image Retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 2916–2929.
  22. Li, X.; Hu, D.; Nie, F. Large graph hashing with spectral rotation. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31.
  23. Do, T.T.; Le Tan, D.K.; Pham, T.T.; Cheung, N.M. Simultaneous feature aggregating and hashing for large-scale image search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6618–6627.
  24. Lin, K.; Lu, J.; Chen, C.S.; Zhou, J.; Sun, M.T. Unsupervised deep learning of compact binary descriptors. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 1501–1514.
  25. Huang, C.; Change Loy, C.; Tang, X. Unsupervised learning of discriminative attributes and visual representations. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 5175–5184.
  26. Huang, S.; Xiong, Y.; Zhang, Y.; Wang, J. Unsupervised Triplet Hashing for Fast Image Retrieval. In Proceedings of the Thematic Workshops of ACM Multimedia 2017, Mountain View, CA, USA, 23–27 October 2017; pp. 84–92.
  27. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; Technical Report; Citeseer: Pennsylvania, PA, USA, 2009.
  28. Chua, T.S.; Tang, J.; Hong, R.; Li, H.; Luo, Z.; Zheng, Y. NUS-WIDE: A real-world web image database from National University of Singapore. In Proceedings of the ACM International Conference on Image and Video Retrieval, Santorini Island, Greece, 8–10 July 2009; pp. 1–9.
  29. Shi, X.; Xing, F.; Xu, K.; Sapkota, M.; Yang, L. Asymmetric discrete graph hashing. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31.
  30. Zhou, D.; Bousquet, O.; Lal, T.; Weston, J.; Schölkopf, B. Learning with local and global consistency. Adv. Neural Inf. Process. Syst. 2003, 16, 321–328.
  31. Rasmus, A.; Berglund, M.; Honkala, M.; Valpola, H.; Raiko, T. Semi-supervised Learning with Ladder Networks. Adv. Neural Inf. Process. Syst. 2015, 28, 3546–3554.
  32. Laine, S.; Aila, T. Temporal ensembling for semi-supervised learning. arXiv 2016, arXiv:1610.02242.
  33. Tarvainen, A.; Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. arXiv 2017, arXiv:1703.01780.
  34. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
  35. Oliva, A.; Torralba, A. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Comput. Vis. 2001, 42, 145–175.
  36. Duan, Y.; Lu, J.; Wang, Z.; Feng, J.; Zhou, J. Learning deep binary descriptor with multi-quantization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1183–1192.
  37. Duan, Y.; Wang, Z.; Lu, J.; Lin, X.; Zhou, J. GraphBit: Bitwise interaction mining via deep reinforcement learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8270–8279.
  38. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2012; pp. 1097–1105. Available online: https://proceedings.neurips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html (accessed on 22 February 2022).
Figure 2. Ranking performance in terms of MAP of UDPH under different settings across training epochs at 32-bit on CIFAR-10, using VGG-16 as the backbone: (a) different settings of $\tilde{S}$; (b) with/without the term $J_{hmse}$ in Equation (5); (c) with/without the term $J_{smse}$ in Equation (5).
Table 1. Retrieval results (%) in terms of MAP, MAP@1000, Precision@5000, and Precision@1000 on CIFAR-10 and NUS-WIDE, reported at 16/32/64 bits. † denotes results obtained by running the provided codes, * denotes results copied from Shen et al., 2018 [15], and the other results are copied from the corresponding publications; "–" indicates no publicly reported result.

CIFAR-10
Method       | MAP               | MAP@1000          | Precision@5000    | Precision@1000
LSH †        | 13.18/14.00/14.92 | 19.05/21.10/23.83 | 14.30/15.82/17.07 | 15.82/18.21/20.53
SH †         | 12.85/12.65/12.51 | 20.69/21.03/20.52 | 14.24/14.06/14.12 | 16.66/16.98/16.97
AGH †        | 14.31/13.52/13.44 | 22.74/22.12/23.54 | 16.11/15.57/15.58 | 19.02/19.30/20.48
ITQ †        | 15.52/15.94/16.49 | 24.13/26.00/27.56 | 17.41/18.17/18.89 | 20.06/22.12/23.39
SpH †        | 14.28/14.53/15.27 | 21.68/23.31/26.38 | 15.94/16.80/17.92 | 18.34/20.02/22.55
SGH †        | 14.51/15.05/15.37 | 22.97/24.92/26.38 | 15.82/17.67/17.86 | 19.79/21.34/22.34
LSH+VGG †    | 13.71/15.81/19.54 | 20.45/26.13/34.03 | 15.09/18.37/23.09 | 17.64/22.45/29.54
SH+VGG †     | 22.14/19.65/18.18 | 40.26/38.89/38.48 | 25.84/23.41/22.07 | 34.05/32.58/31.76
AGH+VGG †    | 31.43/28.26/26.55 | 45.05/47.24/48.79 | 34.02/32.21/30.79 | 42.77/43.75/44.90
ITQ+VGG †    | 31.93/32.21/33.76 | 45.58/50.53/53.86 | 35.24/35.75/37.17 | 42.31/45.94/48.93
SpH+VGG †    | 19.84/24.23/26.00 | 33.48/41.34/45.02 | 22.88/28.13/30.09 | 28.72/36.78/40.26
SGH+VGG †    | 23.93/24.30/27.15 | 42.01/44.12/48.48 | 27.40/28.52/31.48 | 36.49/38.74/43.14
DH           | 16.17/16.62/16.96 | –                 | –                 | 23.79/26.00/27.70
DAR          | 16.82/17.01/17.21 | –                 | –                 | 24.54/26.62/28.06
UH-BDNN *    | 30.10/30.89/31.18 | –                 | 33.97/34.48/35.00 | –
HashGAN      | 29.94/31.47/32.53 | 44.65/46.34/48.12 | –                 | 41.76/43.62/45.51
DeepBit *    | 15.95/19.16/20.96 | –                 | 18.02/22.27/24.36 | –
UTH          | –                 | 28.66/30.66/32.41 | –                 | –
UDBD         | 21.70/20.64/23.07 | 26.36/27.92/34.05 | –                 | –
SADH *       | 38.70/38.49/37.68 | –                 | 41.80/41.56/41.15 | –
UDPH         | 39.81/40.68/43.25 | 46.11/52.52/58.17 | 42.08/43.06/44.31 | 42.95/49.72/54.23

NUS-WIDE
Method       | MAP               | MAP@1000          | Precision@5000    | Precision@1000
LSH †        | 36.06/36.19/37.16 | 39.95/40.98/43.01 | 39.17/39.63/41.43 | 39.48/40.32/42.28
SH †         | 34.47/34.96/34.97 | 41.78/43.00/41.94 | 37.80/38.77/38.07 | 40.22/41.40/40.41
AGH †        | 35.74/35.87/35.75 | 42.30/43.26/44.41 | 40.29/41.44/41.71 | 41.56/42.36/43.46
ITQ †        | 38.25/38.64/38.75 | 45.35/46.36/47.05 | 42.97/43.85/44.20 | 44.25/45.23/45.71
SpH †        | 37.26/37.42/37.83 | 43.58/44.82/46.30 | 41.82/42.69/43.77 | 42.76/43.93/45.25
SGH †        | 37.39/37.33/37.41 | 45.66/45.79/45.82 | 43.11/42.91/43.02 | 44.60/44.56/44.62
LSH+VGG †    | 38.95/39.56/44.53 | 46.55/50.57/62.82 | 43.42/45.64/55.69 | 45.28/48.65/60.32
SH+VGG †     | 44.74/42.59/41.54 | 67.21/66.40/66.69 | 58.65/55.52/53.93 | 64.84/63.01/62.54
AGH+VGG †    | 49.91/49.77/48.38 | 70.59/72.67/73.94 | 66.48/67.79/67.64 | 69.65/71.49/72.63
ITQ+VGG †    | 54.76/55.20/55.55 | 70.21/74.14/76.32 | 68.92/70.43/71.55 | 70.22/74.01/74.69
SpH+VGG †    | 47.40/50.44/51.33 | 64.28/70.18/72.94 | 58.97/63.83/66.10 | 62.64/68.24/70.90
SGH+VGG †    | 46.98/47.71/50.01 | 69.43/71.49/74.89 | 61.81/63.07/66.67 | 67.30/69.22/72.70
UH-BDNN *    | 39.22/40.32/42.06 | –                 | 45.54/51.34/57.72 | –
DeepBit *    | 54.26/51.72/54.74 | –                 | 70.18/69.60/72.74 | –
SADH *       | 60.14/57.99/56.33 | –                 | 71.45/73.88/75.04 | –
UDPH         | 60.75/61.29/63.35 | 71.89/76.60/77.87 | 70.49/74.52/75.56 | 71.31/75.65/76.88
Table 2. Retrieval results (%) in terms of MAP@1000 on CIFAR-10 with AlexNet as the backbone, reported at 16/32/64 bits. The results of DBD-MQ and GraphBit are copied from the corresponding publications.

Method   | MAP@1000 (16/32/64)
DBD-MQ   | 21.53/26.50/31.85
GraphBit | 32.15/36.74/39.90
UDPH     | 37.81/43.66/47.22
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
