Optimized Feature Space Learning for Generating Efficient Binary Codes for Image Retrieval

In this paper, we propose an approach for learning a low-dimensional optimized feature space with minimum intra-class variance and maximum inter-class variance. We address the problem of the high dimensionality of feature vectors extracted from neural networks by taking the global statistics of the feature space into account. The classical approach of Linear Discriminant Analysis (LDA) is generally used for generating an optimized low-dimensional feature space for single-labeled images. Since image retrieval involves both multi-labeled and single-labeled images, we utilize the equivalence between LDA and Canonical Correlation Analysis (CCA) to generate an optimized feature space for single-labeled images and use CCA to generate an optimized feature space for multi-labeled images. Our approach correlates the projections of feature vectors with label vectors in our CCA-based network architecture. The neural network minimizes a loss function which maximizes the correlation coefficients. We binarize our generated feature vectors with the popular Iterative Quantization (ITQ) approach and also propose an ensemble network to generate binary codes of the desired bit length for image retrieval. Our measurements of mean average precision show competitive results compared with other state-of-the-art methods on single-labeled and multi-labeled image retrieval datasets.


INTRODUCTION
Images often convey information which words can barely express. If we consider famous paintings, we recognize that we are not capable of exactly describing all our visual impressions. Still, one typically tries to find words to describe the visual content and the higher semantics of images. At present, automatic understanding of images is of high practical relevance. It is used in medicine for disease detection or in autonomous driving for scene understanding [1], [2]. Image retrieval can help us to cope with the continuously increasing image datasets in our big data era. In image retrieval, we are given a query image, and we search for the most similar images in a vast gallery dataset. The most common approach for image retrieval is to find a feature space where similar images are mapped closely together based on a given metric, e.g., the Euclidean distance [3]. Thus, querying for similar images becomes equivalent to a nearest neighbor search in this feature space. Classical methods such as Bag of Visual Words [4] and Fisher vectors [5] can generate such a feature space.
These methods can generate a global descriptor which can describe the images but are sometimes incapable of capturing semantic concepts. The problem of capturing higher semantic information is addressed in classification as well.
In image classification, AlexNet [6] by Krizhevsky et al. outperformed all existing image classification methods significantly leading to the rise of Convolutional Neural Networks (CNNs). CNNs are the basis for all modern state-of-the-art methods in image classification. Furthermore, it was shown that features extracted from CNNs are suitable for image retrieval and outperform classical feature descriptors [7], [8]. As these features were first extracted from CNNs trained for classification, the CNN's loss did not accurately fit the purpose of image retrieval. In contrast to a standard classification situation, images are often labeled by several concepts in image retrieval. Consequently, a classification loss is unreasonable or cannot be applied here. Therefore, novel methods usually incorporate a pairwise or a triplet loss to generate a good feature space for a retrieval system [9].
In content-based image retrieval, we analyze the content of the images to find similar images in a vast database which can contain millions of images [10]. The picture we use for the query is referred to as the query image, and the database is called the gallery dataset. Note that when devising image retrieval methods, one typically expects that labeled data is given for a few images. This is necessary to decide which semantics of the images should be captured by the image retrieval method [11]. This portion of the dataset will be referred to as the training dataset. It is important to derive appropriate features suitable for an efficient and effective retrieval system. It turns out that feature vectors generated by CNNs provide much better representations for the semantics of the images [12]. For instance, the convolutional layers 1-5 of AlexNet derive low-level and mid-level information, while the final fully connected layers capture higher semantics of the images. It is necessary to understand which semantics of the pictures should be captured by a retrieval system. As we often cannot describe images by a single concept, it is evident that image retrieval methods have to deal with multi-labeled images as well. While for the simple case of single-labeled images a classification could basically capture the general semantic information, this is not the case for multi-labeled images. For multi-labeled images, we could predict the category labels. This will not lead to a compact feature space suitable for image retrieval, especially if the number of categories is large, as it is difficult to obtain a compact feature space when images contain multiple concepts. The reason why we want to find an appropriate feature space is that finding similar images simplifies to a nearest neighbor search on such a feature space [13].
arXiv:2001.11400v1 [eess.IV] 30 Jan 2020
Besides the quality of the retrieved results, the efficiency of the retrieval system is a crucial aspect. The challenging sizes of image datasets demand an efficient retrieval system [13]. The typical solution for speeding up the system is to use binary codes instead of the original feature vectors. As a result, we can compute Hamming distances to rank the gallery images for a query item. The binary codes for the feature vectors shall give an approximation of the original Euclidean distances. The main advantage here is that Hamming distances are much faster to compute than Euclidean distances. Especially on modern CPU architectures, the Hamming distance can be computed by an XOR operation followed by a popcnt instruction, i.e., a bit count.
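As a minimal sketch of the XOR-plus-bit-count idea, the following Python snippet computes Hamming distances and ranks a small gallery for a query. The function names `hamming_distance` and `rank_gallery` are illustrative and not part of the paper, and NumPy's bit packing stands in here for a hardware popcnt instruction.

```python
import numpy as np

def hamming_distance(a: int, b: int) -> int:
    """Hamming distance of two codes stored as integers: XOR, then count set bits."""
    return bin(a ^ b).count("1")

def rank_gallery(query_bits, gallery_bits):
    """Rank gallery codes by Hamming distance to the query.

    query_bits: array of shape (c,) with entries in {0, 1}
    gallery_bits: array of shape (n, c) with entries in {0, 1}
    """
    q = np.packbits(np.asarray(query_bits, dtype=np.uint8))
    g = np.packbits(np.asarray(gallery_bits, dtype=np.uint8), axis=1)
    xor = np.bitwise_xor(g, q)                       # differing bits, packed in bytes
    dists = np.unpackbits(xor, axis=1).sum(axis=1)   # bit count per gallery item
    order = np.argsort(dists, kind="stable")         # nearest codes first
    return order, dists

gallery = np.array([[1, 0, 1, 1],
                    [0, 0, 1, 0],
                    [1, 1, 1, 1]])
order, dists = rank_gallery(np.array([1, 0, 1, 1]), gallery)
```

Both the query and the gallery codes are zero-padded identically by `np.packbits`, so the padding bits cancel in the XOR and do not affect the distances.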
Hashing methods, which provide such a binarization, are therefore indispensable for image retrieval. In hashing, we try to find a hash function that maps a data item to a hash value. Several hash functions are typically used to form the final binary code, which represents an image for the approximate nearest neighbor search. The hash mapping is in general not injective, i.e., different data items may be mapped to the same hash code [13]. From the image retrieval perspective, we try to map feature vectors to short hash codes while preserving the important similarity information among the original feature vectors. Furthermore, as the probability of collision for similar images is desired to be very high, the similarity-preserving hashing discussed here is different from traditional hashing methods. Conventional methods typically try to avoid collisions, i.e., they try to map different data items to different hash codes. In image retrieval, we want to assign similar binary codes to semantically similar images and distinct binary codes to dissimilar images.

Related Work
Hashing methods can be divided into data-independent and data-dependent approaches [13], [14]. Data-independent approaches do not fit their hashing functions onto the data. Instead, they have to rely on general objectives and are therefore practically insufficient for an effective image retrieval system. Locality-Sensitive Hashing (LSH) schemes are dominant in this field. They look at the approximate nearest neighbor problem from a probabilistic point of view. It is natural to wonder if we can find better projections when we know the underlying distribution of our data. Data-dependent hashing methods use the data, i.e., the feature vectors in image retrieval, to find better hash functions. They can be further subdivided into supervised and unsupervised methods. Supervised methods exploit additional information, e.g., label information, to improve the performance of the generated hash codes. On the other hand, unsupervised methods utilize solely the data distribution with no additional information. Indeed, it turns out that we can find better projections by data-dependent approaches. Thus, research efforts moved to learning-based and data-dependent methods. A well-known unsupervised data-dependent hashing method is spectral hashing proposed by Weiss et al. [15], which is closely related to spectral clustering. They show how hashing projections can be learned from data distributions. The binary codes generated by spectral hashing show decent performance, yet especially for a low number of bits, spectral hashing was shown to be inferior to Iterative Quantization (ITQ) [16]. The latter is a simple and very effective unsupervised data-dependent binarization method and can be interpreted as fitting an i-dimensional hypercube to the i-dimensional feature vectors in such a way that the quantization loss is minimized. This approach is computationally efficient and scales well with the data size.
Because of its appealing characteristics, it is incorporated in our proposed binarization scheme after learning an optimized feature space.
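To make the data-independent case discussed above concrete, the sketch below hashes feature vectors with random hyperplanes, one classic LSH-style scheme. It only illustrates the general idea and is not a method used in this paper; the function name `random_hyperplane_codes` and its parameters are hypothetical.

```python
import numpy as np

def random_hyperplane_codes(features, n_bits, seed=0):
    """Data-independent hashing: one random hyperplane per bit.

    The hyperplanes are drawn without looking at the data, so the codes
    cannot adapt to the distribution of the feature vectors.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((features.shape[1], n_bits))  # hyperplane normals
    return (features @ W >= 0).astype(np.uint8)           # side of each hyperplane

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 16))
codes = random_hyperplane_codes(X, n_bits=8)
```

Because each bit only depends on the side of a hyperplane through the origin, the code is invariant to positive rescaling of a feature vector, which is one reason such schemes approximate angular rather than semantic similarity.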
Recently, supervised deep learning based hashing methods such as Deep Discrete Supervised Hashing (DDSH) [12], Deep Supervised Hashing with Triplet Labels (DTSH) [11], Deep Supervised Hashing with Pairwise Labels (DPSH) [9], Data Sensitive Hashing (DSH) [17], and Deep Hashing Network (DHN) [18] have been very successful in image retrieval. Instead of using traditional feature descriptors which are binarized in a second step by classical approaches, deep hashing frameworks usually learn the feature vectors and the binary codes simultaneously. The main problem of the deep hashing approach is that the gradient of the sign function used for binarization is zero everywhere except at the point zero, where it is undefined. Thus, plain backpropagation through the sign function cannot be used to train the CNN end-to-end. It is noteworthy that unsupervised deep hashing methods exist as well [19], [20], [21], [22]. However, supervised hashing methods are dominant in this field and provide more effective binary codes. The supervised information can be either class label information, pairwise information or triplet labels. Another example of a successful deep hashing method is HashNet [23], which is also often used as a baseline for benchmarking deep hashing frameworks. HashNet uses pairwise similarity information. In the HashNet approach, the slope of a hyperbolic tangent is gradually increased during training until it approximates the sign function, so that binary codes are generated. If one directly started with a high slope, however, the network could not train well as the gradient would almost vanish. We will benchmark our proposed method against this approach, which is the current state of the art.
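The following sketch illustrates the trade-off described above for a scaled hyperbolic tangent tanh(beta * x): a larger slope beta approximates the sign function better, but the gradient away from zero vanishes. The function names and the chosen values of beta are illustrative assumptions, not HashNet's actual training schedule.

```python
import numpy as np

def soft_sign(x, beta):
    """Smooth surrogate of sign(x); beta controls the slope at zero."""
    return np.tanh(beta * x)

def soft_sign_grad(x, beta):
    """Derivative of tanh(beta * x) with respect to x."""
    return beta * (1.0 - np.tanh(beta * x) ** 2)

def max_gap(x, beta):
    """Largest deviation between the surrogate and the true sign function."""
    return float(np.max(np.abs(soft_sign(x, beta) - np.sign(x))))

x = np.array([-1.0, -0.1, 0.1, 1.0])
gaps = [max_gap(x, beta) for beta in (1.0, 10.0, 100.0)]
grad_high_slope = soft_sign_grad(x, 100.0)   # close to zero at |x| = 1
```

The gap to the sign function shrinks as beta grows, while `grad_high_slope` shows that a network started at a high slope would receive almost no gradient for inputs that are not close to zero.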

This Paper
In this paper, we propose to generate an optimized feature space which takes care of the global statistics of the feature space such that the between-class scatter is maximum and the within-class scatter is minimum, in order to generate a low-dimensional representation space. Linear Discriminant Analysis (LDA) [24] is a popular approach to learn an optimized low-dimensional representation space, but it is applicable only to single-labeled images. We utilize the relationship between Canonical Correlation Analysis (CCA) [25] and LDA and propose a novel feature space learning approach for both single-labeled and multi-labeled images. The generated features are binarized using the popular ITQ approach. We also propose a network ensemble method to concatenate the binary codes to generate codewords of the desired bit length. We denote the proposed method as the Deep Canonical Correlation Feature (DCCF) extraction method in this paper. The retrieval results are measured on standard datasets such as CIFAR-10 [26], which is single-labeled, and NUS-WIDE [27] and MS-COCO [28], which are multi-labeled datasets. We have compared the performance with other state-of-the-art hashing methods. We also visualize the feature space using t-SNE [29] and plot the training curves to show that the training of our feature space learning network reaches the lower bound, which basically corresponds to the sum of the correlation coefficients of CCA.
The major contributions of this paper are summarized below:
• Our approach learns an optimized low-dimensional feature space for single-labeled images by utilizing the equivalence between LDA and CCA and addresses the problems of the DeepLDA [30] approach (Section 2.3);
• We extend the optimized feature space learning to multi-labeled images with the CCA approach by designing appropriate category indicator matrices (Sections 3.1, 3.2);
• We elaborate from a linear algebra point of view to show how the correlation maximization between the two vectors of CCA can be interpreted for multi-labeled images (Section 3.3);
• We design the loss function based on the correlation coefficients of CCA for training our network to learn the optimized feature space (Section 4.1);
• We propose an ensemble technique for generating binary codes of the desired bit length from the features obtained from our proposed feature extraction technique followed by ITQ (Section 6).
The paper is organized as follows. Section 2 explains the theoretical fundamentals of LDA and CCA. Section 3 explains CCA in detail. In Section 4, we discuss the proposed feature extraction approach. Sections 5 and 6 explain how we generate the binary codes from the learned low-dimensional feature space. Experimental results are summarized in Section 7, and conclusions and future work are discussed in Section 8.

PRELIMINARIES
In this section, we explain how Linear Discriminant Analysis (LDA) [24] and Canonical Correlation Analysis (CCA) [25] can generate a low-dimensional feature space. We also explain the equivalence between LDA and CCA and briefly introduce the neural network based approach for learning LDA, called DeepLDA [30]. This approach is not efficient for retrieval in the case of multi-labeled images, as DeepLDA learns the LDA projection, which is a dimensionality reduction for single-labeled images. We address this problem by using CCA for correlating the projections of the feature vectors and the label vectors. For single-labeled images, we utilize the equivalence between LDA and CCA and find the C − 1 correlation coefficients, where C is the number of classes. For multi-labeled images, we project to a low-dimensional space of dimension C (see footnote 1).

Linear Discriminant Analysis
In Section 1, we briefly mentioned deep hashing as a possibility for generating highly efficient and effective binary codes for image retrieval. Deep hashing networks learn effective feature representations internally. The loss functions applied are often built on pairwise or triplet information, so it is questionable whether they can reflect the global statistics of the data. This motivated us to design an LDA-based retrieval system. LDA is a method for dimensionality reduction such that the low-dimensional feature space has high inter-class variance and small intra-class variance. LDA observes the global structure of the data by estimating covariance matrices [30], [31]. Fisher proposed this approach for the two-class problem and learned the projection matrix which separates the two classes by maximizing the ratio of the between-class scatter to the within-class scatter [31]. For a general multi-class problem, the between-class scatter matrix $S_B$ and the within-class scatter matrix $S_W$ are defined in terms of the following quantities: $C$ denotes the number of classes, $x_{ci}$ is the $i$-th feature vector of class $c$, $n_c$ is the number of images belonging to class $c$, and $\bar{x}_c$ is the mean of the feature vectors belonging to class $c$. The total number of images is denoted as $n$, and $\bar{x}$ represents the global mean. LDA finds the projection matrix $A^* \in \mathbb{R}^{t \times i}$ that maximizes the ratio of the between-class scatter to the within-class scatter in the projected space. Here $t$ denotes the dimension of the high-dimensional feature vector and $i$ corresponds to the number of LDA dimensions. Maximizing this ratio generates the matrix $A^*$, which consists of those eigenvectors $a_i$ of the matrix $S_W^{-1} S_B$ corresponding to the largest eigenvalues. The number of discriminant directions $i$ is at most $C - 1$, which is the rank of $S_W^{-1} S_B$. Thus, we can only project to a space with at most $C - 1$ dimensions [32].
1. Due to memory restrictions of the GPU, we cannot ensure that all categories are present in the training batches. Therefore, we project to a low-dimensional space of dimension < C, as explained in Subsection 7.3.
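For reference, a standard formulation of the scatter matrices and the LDA objective that matches the symbols defined above is sketched below. The 1/n normalization and the determinant-ratio form of the objective are assumptions; other common variants differ only by a scaling factor or use a trace ratio, and they lead to the same eigenvectors of $S_W^{-1} S_B$.

```latex
S_B = \frac{1}{n}\sum_{c=1}^{C} n_c\,(\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})^T,
\qquad
S_W = \frac{1}{n}\sum_{c=1}^{C}\sum_{i=1}^{n_c} (x_{ci} - \bar{x}_c)(x_{ci} - \bar{x}_c)^T

A^* = \arg\max_{A} \frac{\lvert A^T S_B A \rvert}{\lvert A^T S_W A \rvert},
\qquad
S_W^{-1} S_B\, a_i = \lambda_{LDA,i}\, a_i
```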
The very first attempt at learning the LDA projection using neural networks, DeepLDA, was proposed by Dorfer et al. [30] in 2016. DeepLDA applies an LDA-based loss to train a CNN end-to-end. By doing this, a feature space can be found which is suitable for classification. The authors propose to maximize the eigenvalues in all C − 1 eigenvector directions such that discrimination is achieved in all directions. Moreover, all the classes have to be mutually exclusive. In the context of image retrieval, this means that only single-labeled images can be used when DeepLDA is applied for the feature extraction. For multi-labeled images, however, we can use CCA, which is closely related to LDA. Generating an optimized feature space for multi-labeled images is the main problem we are addressing in this paper. For that, we introduce the concept of CCA next.

Canonical Correlation Analysis
Hotelling first proposed Canonical Correlation Analysis (CCA) in 1936 [25]. It is a very general concept in statistics, and many tests in statistics can be regarded as special forms of CCA. CCA can be viewed as a generalization of multiple regression [32], where we want to find linear projections to maximally correlate data vectors to a one-dimensional predictor variable. In CCA, this is extended such that we maximize the correlation between the data vectors and vectors from a second dataset. Given two datasets, we try to find the best linear projections such that the correlation in the projected space for the two data vectors is maximized. To describe CCA, we start with two column vectors representing the random variables for the two data vectors, $X$ and $Y$, as:
$$X = (x_1, x_2, \ldots, x_q)^T, \qquad Y = (y_1, y_2, \ldots, y_q)^T,$$
where $q$ denotes the number of random variables, which will be the batch size in the context of neural networks. In CCA, we want to find linear projections of the two data vectors such that the correlation $\operatorname{corr}(a^T X, b^T Y)$ is maximized, which can be expressed as:
$$\rho = \max_{a, b}\, \operatorname{corr}(a^T X, b^T Y), \qquad (a^*, b^*) = \arg\max_{a, b}\, \operatorname{corr}(a^T X, b^T Y), \tag{2}$$
where $\rho$ is the correlation coefficient and $a^*$ and $b^*$ are the optimal projections.

Equivalence between Linear Discriminant Analysis and Canonical Correlation Analysis
Maurice Bartlett first proved the equivalence between LDA and CCA in 1938 [33]. In our paper, exactly this relationship is used to generate an optimum feature space using a CNN architecture. Bartlett precisely described the relationship between the eigenvalues ($\lambda_{LDA}$) of LDA and the eigenvalues ($\lambda_{CCA}$) of CCA as:
$$\lambda_{CCA} = \frac{\lambda_{LDA}}{1 + \lambda_{LDA}}. \tag{3}$$
Knowing this exact relationship between the eigenvalues is useful for designing the loss function for a network generating an optimized feature space, as described in Subsection 4.1. A problem of the DeepLDA [30] approach is that the eigenvalues are not bounded. Hence, the network tends to produce trivial solutions when the negative sum of all eigenvalues is used as the loss, as proposed in the DeepLDA approach. A network trained with such an objective tries to push one eigenvalue to a very large value, while the other eigenvalues stay close to zero. This is basically the trivial solution of LDA. It means that only classes which are already well separated are pushed further apart in the feature space.
From (3), we can see that the eigenvalues of CCA are upper bounded by one (λ CCA ≤ 1). Therefore, a network trained with an objective built upon the CCA eigenvalues cannot push the eigenvalue of one dimension to an arbitrarily large value since it will never get larger than one. Instead, once a dimension is well separated, i.e., if the corresponding CCA eigenvalue is close to one, the eigenvalues of the other dimensions have to be increased by the network for minimizing the loss. This is due to the fact that the loss function is designed as the sum of the correlation coefficients and the network tries to maximize this correlation in all the discriminant dimensions (cf. Sec. 4.1). This is a significant advantage of the CCA based loss function in a CNN architecture, because the network will generate a feature space with a good separation in all dimensions.
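The relationship between the two sets of eigenvalues can be checked numerically. The sketch below, on synthetic single-labeled data, compares the eigenvalues of $S_W^{-1} S_B$ with the squared canonical correlations between the features and a one-hot label matrix, assuming the standard unnormalized scatter matrices. The helper names, the synthetic data, and the choice to drop one redundant indicator column (so that the label covariance is invertible) are illustrative assumptions.

```python
import numpy as np

def lda_eigenvalues(X, labels):
    """Eigenvalues of S_W^{-1} S_B, sorted in descending order."""
    mean = X.mean(axis=0)
    d = X.shape[1]
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        S_W += (Xc - mc).T @ (Xc - mc)                     # within-class scatter
        S_B += len(Xc) * np.outer(mc - mean, mc - mean)    # between-class scatter
    ev = np.linalg.eigvals(np.linalg.solve(S_W, S_B)).real
    return np.sort(ev)[::-1]

def cca_eigenvalues(X, Y):
    """Squared canonical correlations between X and Y, descending."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    S11, S12, S22 = Xc.T @ Xc, Xc.T @ Yc, Yc.T @ Yc
    M1 = np.linalg.solve(S11, S12 @ np.linalg.solve(S22, S12.T))
    ev = np.linalg.eigvals(M1).real
    return np.sort(ev)[::-1]

rng = np.random.default_rng(0)
C, n_per_class, d = 3, 50, 4
means = 3.0 * rng.standard_normal((C, d))
labels = np.repeat(np.arange(C), n_per_class)
X = means[labels] + rng.standard_normal((C * n_per_class, d))
Y = np.eye(C)[labels][:, :-1]   # one-hot labels without the redundant last column

lam_lda = lda_eigenvalues(X, labels)[: C - 1]
lam_cca = cca_eigenvalues(X, Y)[: C - 1]
```

On such data, `lam_cca` agrees with `lam_lda / (1 + lam_lda)` up to numerical precision, and every entry of `lam_cca` stays below one no matter how large the LDA eigenvalues become.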

GENERATION OF DATA VECTORS OF CANONICAL CORRELATION ANALYSIS
In this section, we explain how we can generate the two data vectors of CCA for the case of single-labeled and multi-labeled images.

Canonical Correlation Analysis for Single-labeled Images
First, we discuss how we generate the two data vectors for single-labeled images. We use the notation $x \in \mathbb{R}^C$ for an image feature vector, where $C$ is the number of classes. The matrix $X \in \mathbb{R}^{q \times C}$ contains the feature vectors $x$ as its rows, where $q$ denotes the batch size. Now we construct the second data vector of CCA, i.e., the label vector $y \in \mathbb{R}^C$, by:
$$y_c = \begin{cases} 1, & \text{if the image belongs to class } c, \\ 0, & \text{otherwise,} \end{cases} \qquad c = 1, \ldots, C.$$
We denote by $Y \in \mathbb{R}^{q \times C}$ the matrix which contains the sampled versions of the label vectors $y$ as its rows. For our training images, the label information is given, so we can construct the second data vector $Y$ by stacking the label vectors of the $q$ images in a batch, such that each row of $Y$ contains a single one in the column of the corresponding class. We call this second data vector the category indicator matrix.

Canonical Correlation Analysis for Multi-labeled Images
In this subsection, we explain how we can generate the two data vectors for multi-labeled images. By using a category indicator matrix as the second data vector and by utilizing the relationship (3), we obtain the same excellent properties of LDA by applying CCA to single-labeled images. The problem of the DeepLDA approach for feature extraction is that it cannot be applied to multi-labeled images since the classes are not mutually exclusive.
This is where the advantage of CCA lies, as it can be easily extended to multi-labeled images. We use the straightforward extension of modifying the category indicator matrix. A modified category indicator matrix Y is shown in Fig. 2. Each row of this matrix contains a one for every category that the particular image belongs to. The output of our CNN, X, is still used as the first data vector for CCA in the case of multi-labeled images. Thus, we can easily extend the CCA approach in the CNN architecture to multi-labeled images to generate a good feature space.
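As a small illustration of the two kinds of category indicator matrices, the sketch below builds Y for a single-labeled and a multi-labeled batch. The function names and the toy labels are hypothetical; only the construction rule (a one in each column whose category applies to the image) is taken from the text.

```python
import numpy as np

def single_label_indicator(labels, num_classes):
    """One row per image with exactly one entry set to one."""
    return np.eye(num_classes, dtype=np.float32)[np.asarray(labels)]

def multi_label_indicator(label_sets, num_classes):
    """One row per image with a one for every category of that image."""
    Y = np.zeros((len(label_sets), num_classes), dtype=np.float32)
    for row, categories in enumerate(label_sets):
        Y[row, list(categories)] = 1.0
    return Y

Y_single = single_label_indicator([0, 2, 1], num_classes=3)
Y_multi = multi_label_indicator([{0, 2}, {1}, {0, 1, 2}], num_classes=3)
```

In the single-labeled case every row sums to one, whereas in the multi-labeled case the row sums equal the number of concepts depicted in each image.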

Canonical Correlation Analysis - Linear Algebra Point of View
Now we elaborate how the correlation between the two data vectors, which is maximized by CCA, can also be interpreted from a linear algebra point of view. It can be easily derived that the covariance satisfies the properties of an inner product in a vector space [34]. For the random variables $U$ and $V$, the angle between the two vectors is given by:
$$\cos \angle(U, V) = \frac{\operatorname{cov}(U, V)}{\sigma_U\, \sigma_V} = \operatorname{corr}(U, V) = \rho,$$
where corr indicates the correlation, cov indicates the covariance, and the standard deviations of $U$ and $V$ are denoted by $\sigma_U$ and $\sigma_V$. We see that the correlation coefficient $\rho$ is exactly the cosine of the angle between the two random variables. Note that the linear projections found by CCA essentially project to new random variables, which can be written as $U = a^T X$ and $V = b^T Y$. Therefore, we can interpret CCA in such a way that we are trying to minimize the angle between our first data vector and our second data vector in the projected space. This explains why our extension to multi-labeled images is meaningful. In Fig. 1, a cube for 3-dimensional feature vectors is depicted. In this case, the different colors correspond to images depicting the concepts listed in the columns of the category indicator matrix. The rows of this matrix are essentially label vectors, which are shown in Fig. 1. The label vectors point to the different corners of the cube.
Figure caption (partially recovered): The first data vector X is generated from feature vectors and the second data vector Y from labels. The CCA is then computed on these random variables. The network is fine-tuned using the loss function (8).
From the linear algebra point of view, it makes sense to apply CCA to multi-labeled images, as we minimize the angles between the projected network output and the projected label vectors by maximizing the correlation. When the feature vectors from the network output point in the same direction as the label vectors, we have a well-separated feature space. The intuition with the given cube also explains why it is reasonable to combine our feature space generation with Iterative Quantization (ITQ) for the final binarization part. This combination and the specific reason for choosing ITQ will be explained in detail in Section 5.
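The identity between the correlation coefficient and the cosine of an angle can be verified on samples: after centering, the sample correlation of two variables equals the cosine between their sample vectors. The sketch below uses synthetic data, and the function name `cosine_of_centered` is illustrative.

```python
import numpy as np

def cosine_of_centered(u, v):
    """Cosine of the angle between the centered sample vectors of u and v."""
    uc = u - u.mean()
    vc = v - v.mean()
    return float(uc @ vc / (np.linalg.norm(uc) * np.linalg.norm(vc)))

rng = np.random.default_rng(0)
u = rng.standard_normal(200)
v = 0.7 * u + 0.3 * rng.standard_normal(200)   # correlated with u by construction

cosine = cosine_of_centered(u, v)
rho = float(np.corrcoef(u, v)[0, 1])           # sample correlation coefficient
```

Maximizing `rho` is therefore the same as minimizing the angle between the two centered sample vectors, which is the geometric picture used above.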

PROPOSED FEATURE EXTRACTION TECHNIQUE
Now, we explain in detail the framework of Deep Canonical Correlation Feature (DCCF) extraction. We have already explained the equivalence of LDA and CCA in Subsection 2.3 and how the two data vectors for CCA can be generated for single-labeled and multi-labeled images. Now, we show how we can apply CCA in a neural network architecture by incorporating the category indicator matrix as the second data vector. Andrew et al. [35] have already used CCA in a neural network architecture. They use their framework to correlate speech and articulatory data. We correlate the output of a CNN, which is given an image as input, with specially designed category indicator matrices as the second data vector. The structure of our framework is depicted in Fig. 2. The first data vector, X, is obtained from the final layer of the neural network, the second data vector, Y, is formed from the category indicator matrix, and the correlation between the two data vectors is computed. The loss function $\mathcal{L}$ with which the network is trained is explained in the next subsection. With our proposed method, we can exploit the equivalence of CCA and LDA for single-labeled images and simultaneously make it applicable to multi-labeled images. Eventually, this procedure yields an optimized feature space which can be easily binarized by ITQ to obtain binary codes.

Design of Loss Function for Neural Network
The critical aspect for training our network end-to-end is the loss computation. To discuss the derivation of the loss function, we first present how we can solve CCA as a generalized eigenvalue problem, which could theoretically be implemented as the loss function for training the CNN architecture. For solving CCA, we want to find the optimal projections $a^*$ and $b^*$ which maximize the correlation coefficient $\rho$ in (2). First we define the sampled covariance matrices $\Sigma_{11} = \operatorname{cov}(X, X)$, $\Sigma_{22} = \operatorname{cov}(Y, Y)$, and $\Sigma_{12} = \operatorname{cov}(X, Y)$, so that (2) can be expressed as:
$$\rho = \max_{a, b} \frac{a^T \Sigma_{12}\, b}{\sqrt{a^T \Sigma_{11}\, a}\,\sqrt{b^T \Sigma_{22}\, b}}. \tag{4}$$
$\rho$ is invariant to rescaling the projections $a$ and $b$ [36]. Therefore, we can constrain them such that $a^T \Sigma_{11} a = 1$ and $b^T \Sigma_{22} b = 1$.
Mardia et al. [32] have already derived the eigenvalue solution of CCA by using Singular Value Decomposition (SVD), which is briefly summarized here. First, we define the matrices $K$, $N_1$, $N_2$, $M_1$ and $M_2$ as:
$$K = \Sigma_{11}^{-1/2} \Sigma_{12} \Sigma_{22}^{-1/2}, \quad N_1 = K K^T, \quad N_2 = K^T K, \quad M_1 = \Sigma_{11}^{-1} \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}, \quad M_2 = \Sigma_{22}^{-1} \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12}, \tag{5}$$
where $\Sigma_{21} = \Sigma_{12}^T$. The eigenvectors of the matrix $M_1$ are the solutions for the optimal linear projections $a_i^*$, where $i$ indexes the CCA dimensions. It is easy to see that the eigenvectors of $M_2$ are the solutions for the optimal $b_i^*$ due to the symmetry of CCA, i.e., we can interchange the two data vectors. Now, we deviate from the proof in [32] and directly show that if $\alpha$ is an eigenvector of $N_1$ with eigenvalue $\lambda_{CCA}$, then $a = \Sigma_{11}^{-1/2} \alpha$ is an eigenvector of $M_1$, as shown below:
$$M_1 a = \Sigma_{11}^{-1} \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \Sigma_{11}^{-1/2} \alpha = \Sigma_{11}^{-1/2} K K^T \alpha = \Sigma_{11}^{-1/2} N_1 \alpha = \lambda_{CCA}\, \Sigma_{11}^{-1/2} \alpha = \lambda_{CCA}\, a. \tag{6}$$
Thus, if we find the eigenvectors $\alpha$ of $N_1$, we can easily compute the eigenvectors $a = \Sigma_{11}^{-1/2} \alpha$ of $M_1$, which are the solutions for the optimal linear projections for the first data vector. The same holds for $b = \Sigma_{22}^{-1/2} \beta$ due to the symmetry of CCA, where $\beta$ is the corresponding eigenvector of $N_2$. The important aspect is that we can easily compute the eigenvectors of $N_1$ by an SVD of $K$ as:
$$K = (\alpha_1, \ldots, \alpha_k)\, D\, (\beta_1, \ldots, \beta_k)^T, \qquad D = \operatorname{diag}(\sigma_1, \ldots, \sigma_k),$$
where $k$ is the rank of the matrix. The diagonal matrix $D$ is of special interest, as it reveals the maximum value of the objective (4) for the optimal linear projections $a_i^*$ and $b_i^*$, since:
$$a_i^{*T} \Sigma_{12}\, b_i^* = \alpha_i^T \Sigma_{11}^{-1/2} \Sigma_{12} \Sigma_{22}^{-1/2} \beta_i = \alpha_i^T K \beta_i = \sigma_i,$$
where $i$ indexes the correlation coefficients of CCA.
Thus, the singular value $\sigma_i$ is the result of the maximization for the optimal linear projections $a_i^*$ and $b_i^*$. This means that the singular values obtained are exactly the maximum correlation coefficients and are therefore bounded by one ($\sigma_i \leq 1$). The correlation coefficients are generally bounded by $-1 \leq \rho \leq 1$. This can also be inferred from the linear algebra point of view, since the cosine of the angle between two projected random variables has the same limits (cf. Sec. 3.3). From (6), we know that the eigenvalues of $N_1$ and $M_1$ are the same, $\lambda_{CCA}$. This means that the squared singular values derived by the SVD of $K$ are also the eigenvalues of $M_1$. By using the relationship in (3), we can express the canonical correlation coefficients in terms of the eigenvalues for mutually exclusive classes by:
$$\rho_i = \sigma_i = \sqrt{\lambda_{CCA,i}} = \sqrt{\frac{\lambda_{LDA,i}}{1 + \lambda_{LDA,i}}}.$$
Therefore, to compute the loss for the proposed neural network framework, we first calculate the matrix $K$ in (5). For the computation of $K$, the inverse square roots of the sampled covariance matrices are necessary. They can be computed by means of a symmetric eigenvalue decomposition of the sampled covariance matrices. Then, we compute the SVD of $K$ to obtain the singular values $\sigma_i$, which correspond to the correlation coefficients $\rho_i$. Finally, we return the negative sum of the singular values as the loss. The negative sign is needed because the neural network loss function is minimized, whereas the network aims to maximize the correlation coefficients: maximizing them aligns the projected network output with the projected label vectors. Therefore, at the end of training, the singular values, or equivalently the correlation coefficients, are maximized. The loss $\mathcal{L}$ which is minimized to maximize the correlation coefficients is:
$$\mathcal{L} = -\sum_{i=1}^{k} \sigma_i. \tag{8}$$
The algorithm for feature space learning using the proposed DCCF method is summarized in Algorithm 1.
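The loss computation described above (covariance matrices, inverse square roots via a symmetric eigendecomposition, SVD of K, negative sum of singular values) can be sketched in NumPy as follows. This is a forward-pass illustration only, not the authors' implementation: the function names and the small ridge term `reg` added to the covariance matrices for numerical stability are assumptions.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse square root of a symmetric positive definite matrix via eigh."""
    lam, V = np.linalg.eigh(S)
    return (V * lam ** -0.5) @ V.T     # V diag(lam^{-1/2}) V^T

def dccf_loss(X, Y, reg=1e-4):
    """Negative sum of canonical correlations between the rows of X and Y.

    X: (q, d1) network outputs, Y: (q, d2) category indicator matrix.
    reg is a hypothetical ridge term that keeps the covariances invertible.
    """
    q = X.shape[0]
    Xc = X - X.mean(axis=0)            # center both data vectors
    Yc = Y - Y.mean(axis=0)
    S11 = Xc.T @ Xc / (q - 1) + reg * np.eye(X.shape[1])
    S22 = Yc.T @ Yc / (q - 1) + reg * np.eye(Y.shape[1])
    S12 = Xc.T @ Yc / (q - 1)
    K = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)
    sigma = np.linalg.svd(K, compute_uv=False)   # correlation coefficients
    return -float(sigma.sum()), sigma

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 4))
A = 2.0 * np.eye(4) + 0.1
loss_aligned, sigma_aligned = dccf_loss(X, X @ A)            # Y is a linear map of X
loss_random, sigma_random = dccf_loss(X, rng.standard_normal((2000, 4)))
```

When Y is an invertible linear function of X, all singular values approach one and the loss approaches minus the number of dimensions, which is the lower bound; for unrelated data the singular values stay near zero.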

Gradient Computation
The important aspect of the loss function is that it must be differentiable with respect to its input. Andrew et al. [35] provide a proof which is vital for our work, as they prove that CCA is differentiable with respect to its inputs. In our case, we do not need the derivative with respect to our category indicator matrix, which constitutes the second data vector for CCA. Still, we need the gradient with respect to the first data vector X, i.e., the network output. In the gradient expression, $\bar{X}$ and $\bar{Y}$ are the centered data vectors, and $\nabla_{11}$ and $\nabla_{12}$ are defined with respect to the singular value decomposition $K = \alpha \operatorname{diag}(\sigma) \beta^T$ as in [35].
Algorithm 1 (recoverable steps): Calculate the symmetric eigenvalue decomposition of the sampled covariance matrices, where $\lambda^{-1/2}$ denotes the element-wise root inverse of the eigenvalues. 5. Compute the SVD of K to get the largest k singular values σ.
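For orientation, the gradient of the sum of singular values given by Andrew et al. [35] can be written as follows. This sketch uses the convention of [35], in which the centered data matrices hold one sample per column (the transpose of the row-wise matrices used in this paper), and it assumes that all singular values of K enter the sum. The gradient of the loss (8) is the negative of this expression.

```latex
\frac{\partial \sum_i \sigma_i}{\partial X}
  = \frac{1}{q-1}\left( 2\,\nabla_{11}\,\bar{X} + \nabla_{12}\,\bar{Y} \right),
\qquad K = \alpha\,\operatorname{diag}(\sigma)\,\beta^T

\nabla_{12} = \Sigma_{11}^{-1/2}\,\alpha\,\beta^T\,\Sigma_{22}^{-1/2},
\qquad
\nabla_{11} = -\tfrac{1}{2}\,\Sigma_{11}^{-1/2}\,\alpha\,\operatorname{diag}(\sigma)\,\alpha^T\,\Sigma_{11}^{-1/2}
```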

Baseline Architecture
So far, we have elaborated how CCA can be applied in our case in a CNN architecture in Section 4. Our baseline architecture extracts the first data vector in our proposed approach. The common choice in the deep hashing community is AlexNet [6] or CNN-F [12], [23], [37]. The convolutional layers of the CNN-F architecture are presented in Table 1. Local Response Normalization is applied on conv1 and conv2 layers. The network uses rectified linear unit as non-linearity and the convolutional layers are followed by two fully connected layers of dimension 4096. The final fully connected layer consists of C nodes, which is the number of classes. AlexNet and CNN-F are usually pretrained on ImageNet such that they can be used as a feature extractor for images which depict a variety of concepts. Then the network is used for generating the first data vector X. The second data vector Y is formed by the category indicator matrix and the network is trained with the derived loss function, as discussed in Section 4.1, to learn the low-dimensional feature space.

GENERATION OF BINARY CODES
Once we obtain an optimized low-dimensional feature space as discussed in Section 4, the next step is to binarize the feature vectors. We use the Iterative Quantization (ITQ) approach proposed by Gong et al. [16] for the binarization of our feature vectors and briefly explain this approach here. In particular, we propose a combination of our devised feature extraction framework, Deep Canonical Correlation Analysis Feature (DCCF) extraction, with ITQ. A crucial part of ITQ is the dimensionality reduction. Typically, PCA is used to project to a lower-dimensional space where a hypercube is fitted onto the feature vectors. However, CCA can also be used, as was already proposed in the initial ITQ paper. The important criterion is that the projection is onto an orthogonal basis, which is fulfilled by CCA as the projected features are uncorrelated. Since we train our CNN with a CCA loss, our feature extraction framework performs a CCA projection onto an orthogonal basis, and the criteria for the ITQ scheme are fulfilled. The central idea of this paper is that, instead of applying ITQ after a linear CCA, ITQ can be naturally extended to our neural-network-based feature extraction method. The proposed DCCF method can therefore be considered as the dimensionality reduction step before ITQ. In short, our DCCF extraction method provides:
1) An optimized feature space suitable for image retrieval.
2) A dimensionality reduction suitable for ITQ.

Iterative Quantization
Here we briefly explain the ITQ scheme [16]. We start with $q$ zero-centered feature vectors$^2$ $g_r \in \mathbb{R}^i$. After the CCA projection, these feature vectors form the rows of the data matrix $G \in \mathbb{R}^{q \times i}$. Here $q$ is the batch size and $i$ is the length of the feature vector, which corresponds to the number of correlation coefficients. Starting from this data matrix, we want to generate a binary code matrix $B \in \{-1, +1\}^{q \times c}$, where $c$ is the length of the generated code. As in the data-independent case, the $p$-th bit belonging to the $r$-th feature vector can be expressed as:

$$b_{r,p} = h_p(g_r) = \mathrm{sign}(g_r w_p), \tag{9}$$

where $h$ denotes the hashing function. The vectors $w_p$ denote the projections we want to learn. The sign function on a matrix is applied element-wise, and we can express (9) for all bits, i.e., the entire binary code, in matrix form as:

$$B = \mathrm{sign}(GWR), \tag{10}$$

where $W$ denotes the matrix which contains the hashing hyperplanes. $R$ is the optimal rotation matrix which rotates the data such that the quantization loss $Q$ of the feature vectors is minimized:

$$Q(B, R) = \lVert B - GWR \rVert_F^2. \tag{11}$$

Here, $\lVert \cdot \rVert_F$ denotes the Frobenius norm of the matrix. Gong et al. initialize the rotation matrix $R$ with a random matrix since it is proven that a random projection distributes the variance effectively [38]. In this iterative scheme, we first look for an optimal $B$ given the random initialization of $R$ in (10). Once we have updated $B$, the matrix $R$ is updated such that (11) is minimized. Finding the optimal $R$ according to (11) becomes easy when we recall the orthogonal Procrustes problem, in which we try to find an optimal rotation to align two sets of points. For a fixed $B$, (11) is exactly this orthogonal Procrustes problem, as we are looking for a rotation matrix that aligns the matrices $B$ and $GWR$. Consequently, the solution can be found by the SVD $B^T GW = U \Sigma V^T$, and we can directly update $R$ as $VU^T$. These two steps are repeated iteratively to find the final binary code $B$.

2. As we will use ITQ in the context of image retrieval, we directly consider the feature vectors $\{g_1, g_2, \ldots, g_q\}$, $g_r \in \mathbb{R}^i$, from the gallery dataset.
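The alternating ITQ updates can be illustrated with a short NumPy sketch. It is a simplified version under stated assumptions rather than a reference implementation: the input `V` is taken to be the already projected, zero-centered data (GW in the notation above), the rotation is initialized from the QR decomposition of a Gaussian matrix, and the function name `itq` and its parameters are hypothetical.

```python
import numpy as np


def itq(V, n_iter=50, seed=0):
    """Alternating minimization of Q(B, R) = ||B - V R||_F^2.

    V: (q, c) zero-centered projected features (assumed to equal G W).
    Returns the binary codes B in {-1, +1}, the rotation R and the loss history.
    """
    rng = np.random.default_rng(seed)
    c = V.shape[1]
    # Random orthogonal initialization of R.
    R, _ = np.linalg.qr(rng.standard_normal((c, c)))
    losses = []
    for _ in range(n_iter):
        # Fix R, update B: element-wise sign of the rotated data.
        B = np.where(V @ R >= 0, 1.0, -1.0)
        # Fix B, update R: orthogonal Procrustes, B^T V = U S V_p^T, R = V_p U^T.
        U, _, Vpt = np.linalg.svd(B.T @ V)
        R = Vpt.T @ U.T
        losses.append(float(np.linalg.norm(B - V @ R, "fro") ** 2))
    B = np.where(V @ R >= 0, 1.0, -1.0)
    return B, R, losses
```

Because each of the two updates minimizes (11) with the other variable held fixed, the recorded quantization loss cannot increase from one iteration to the next, which is a convenient sanity check for an implementation.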

Combination with ITQ
Once we find an optimized feature space with our Deep Canonical Correlation Feature (DCCF) extraction, we can combine it with an unsupervised binarization scheme to generate effective and efficient binary codes for image retrieval. The main advantage of this two-step approach is that, while we need labels for the supervised feature learning, the quantization loss can also be minimized on unlabeled feature vectors. Consequently, we only need a small labeled subset of the gallery dataset to train our CNN with our proposed loss function. For the binarization step, we can use all gallery images. We can directly find low-dimensional feature representations of the images by our proposed method, instead of first generating a high-dimensional feature space followed by a dimensionality reduction. The dimensionality of our final features is bounded by the number of classes C and depends on whether the classes are mutually exclusive. The maximum dimensionality of our final feature vector is:

$$i_{\max} = \begin{cases} C - 1, & \text{if the classes are mutually exclusive}, \\ C, & \text{otherwise}. \end{cases}$$
Note that this is only an upper bound for the dimensionality. Of course, we can always project to a lower-dimensional space. In the following section, we assume that our feature vectors are in an i-dimensional space. By applying ITQ on the gallery images, we try to fit an i-dimensional hypercube to our feature vectors. We can find i bits with a low correlation since the hashing hyperplanes are orthogonal [16]. In our case they are orthogonal since the rotation matrix R found by ITQ is orthogonal, as are the projected feature vectors generated by the network. Our final binarization method with ITQ is depicted in Fig. 3. For the images, we compute the feature embedding by our DCCF method. Then, we rotate the feature vector matrix with the rotation matrix learned by ITQ. Finally, we apply the sign thresholding of ITQ to generate the binary codes for the given input image. The combination of our DCCF method with ITQ is referred to in this paper as Deep Canonical Correlation Hashing (DCCH).
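As a rough illustration of the DCCH encoding step, the sketch below applies a rotation learned by ITQ to DCCF feature vectors and thresholds the result. The helper names (`max_feature_dim`, `encode`, `hamming_distances`) are hypothetical, bits are stored as {0, 1} instead of {−1, +1}, and centering the features with the gallery mean is an assumption made here because ITQ operates on zero-centered data.

```python
import numpy as np


def max_feature_dim(num_classes, mutually_exclusive):
    """Upper bound on the feature dimensionality: C - 1 or C."""
    return num_classes - 1 if mutually_exclusive else num_classes


def encode(features, gallery_mean, R):
    """Rotate centered feature vectors by R and apply sign thresholding."""
    rotated = (features - gallery_mean) @ R
    return (rotated >= 0).astype(np.uint8)      # 1 for non-negative, 0 otherwise


def hamming_distances(query_bits, gallery_bits):
    """Hamming distance between one query code and every gallery code."""
    return np.count_nonzero(gallery_bits != query_bits, axis=1)
```

For CIFAR-10, `max_feature_dim(10, True)` gives 9, which matches the 9-dimensional feature space discussed for this dataset.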

NETWORK ENSEMBLE
The combination with ITQ leads to the generation of efficient binary codes for image retrieval. However, as we use our DCCF method, we are bound by the number of mutually exclusive classes. For instance, in the case of CIFAR-10, we can only project to a 9-dimensional feature space. Thus, we can generate only 9 bits by ITQ. Since the benchmarks of the current state-of-the-art methods use a higher number of bits such as 12, 16, 24, 32, 48 and 64 [12], we need to generate more bits for a fair comparison. We propose an ensemble technique to increase the bit size. The idea is that randomness is involved during the training. The sampled covariance matrices which are needed in the loss calculation of our DCCF method are constructed from random batches. Since the images within a batch are selected randomly, the estimation of the covariance matrices differs in each iteration. Consequently, the learned feature vectors differ. The binarization through ITQ also involves randomness due to the initial random projection R, which is used for distributing the variance across the different feature dimensions effectively (cf. Sec. 5.1). Therefore, repeated training runs on the same training data, each followed by binarization, result in different binary codes. The same observation is commonly utilized for classification [39], where different networks vote for the classes and the majority vote is the final classification result. For image retrieval, ensemble techniques were proposed by Huang et al. [40]. These ensemble techniques mainly aim at finding features by an ensemble of multiple networks with different structures. Our proposal is different, as we use an ensemble of the same baseline architecture to generate the final sub-codes, which are simply concatenated afterwards. In this way, our ensemble technique can be interpreted as a voting among networks.

Fig. 3. The feature vectors X are extracted from the fully connected layer and, after applying CCA, we get the projected feature vectors G. These features are then binarized by ITQ to obtain the binary codes B. The full binarization pipeline is referred to in this paper as Deep Canonical Correlation Hashing (DCCH).
If we want to generate more than C − 1 bits with our DCCH method, we need to use an ensemble technique. However, if C − 1 is at least as large as the desired number of bits, no ensemble technique is required. Our ensemble approach for an ensemble of two networks is depicted in Fig. 4. We know by the properties of ITQ that the bits generated by each network are uncorrelated with the other bits generated by the same network. However, if we concatenate the bits of different networks, we cannot ensure that the bits are uncorrelated across networks. Therefore, we propose to adapt the dimensionality reduction for binary features by Rublee et al. [41] to our ensemble technique. Rublee et al. discuss the case where we have many binary tests and want to choose only a few of them, which is a dimensionality reduction of the form $B^i \rightarrow B^n$. The selected bits should be uncorrelated to create an effective subset of the original bits. The authors propose a greedy search scheme which we adapt to our method as explained in Algorithm 2. If the search finds fewer bits than the desired number of bits, the threshold for the correlation among the bits has to be raised, so that more bits are accepted. Note that the procedure for selecting the bits only has to be done once on the gallery dataset. For a query image, we can then generate the bits from the outputs of the networks corresponding to the bits which were chosen by the greedy search.

Algorithm 2. Greedy search scheme for generating longer codes
1. Choose the i bits from any network of the ensemble and compute the matrix $M \in \mathbb{R}^{q \times i}$, where q is the batch size.
2. Choose the next bit from any other network and form a vector $v \in \mathbb{R}^{q \times 1}$.
3. Measure the correlation of v with each column of M.
4. If the correlation is below a certain threshold, concatenate the bit to the bit code.
5. If we have found the desired number of bits, then stop.
6. Else, continue with step 2.
By doing this, we reduce the correlation among the bits generated by our ensemble network. In the case of CIFAR-10, we could, e.g., train five networks for the feature space generation and the subsequent ITQ. We then choose the bits of one of these networks and select bits from the other networks which are uncorrelated with our already chosen bits according to the greedy search scheme. Note that if, for example, our 10th bit is chosen from a particular network, we should first look for further bits from the same network, as these bits are uncorrelated with the 10th bit. The greedy search scheme can also be performed on the data of all gallery images because no labels are required for estimating the correlation among the bits.
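The greedy search can be sketched as follows. This is only one possible reading of Algorithm 2, under a few assumptions: the codes of every network are given as arrays with entries in {−1, +1} computed on the same images, the absolute Pearson correlation is used as the correlation measure, constant bits are rejected, and the function names and the default threshold are hypothetical.

```python
import numpy as np


def abs_correlations(M, v):
    """Absolute Pearson correlation of the bit vector v with each column of M."""
    Mc = M - M.mean(axis=0)
    vc = v - v.mean()
    denom = np.linalg.norm(Mc, axis=0) * np.linalg.norm(vc)
    corr = np.ones(M.shape[1])          # constant bits count as fully correlated
    ok = denom > 0
    corr[ok] = np.abs(Mc[:, ok].T @ vc) / denom[ok]
    return corr


def greedy_select_bits(codes_per_network, n_bits, threshold=0.2):
    """Keep all bits of the first network, then add weakly correlated bits.

    codes_per_network: list of (q, i) arrays with entries in {-1, +1}.
    Returns the chosen (network, bit) indices and the selected code matrix.
    """
    M = np.asarray(codes_per_network[0], dtype=float).copy()
    chosen = [(0, j) for j in range(M.shape[1])]
    for n in range(1, len(codes_per_network)):
        codes = np.asarray(codes_per_network[n], dtype=float)
        # All bits of one network are tried before moving on to the next one.
        for j in range(codes.shape[1]):
            if len(chosen) >= n_bits:
                break
            v = codes[:, j]
            if abs_correlations(M, v).max() < threshold:
                M = np.column_stack([M, v])
                chosen.append((n, j))
    return chosen[:n_bits], M[:, :n_bits]
```

If the function returns fewer than `n_bits` indices, it can be called again with a larger `threshold`. The returned index pairs are all that is needed at query time, since they state which output bit of which network enters the final code.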

EXPERIMENTS
Deep Discrete Supervised Hashing (DDSH) [12] is one of the most recent state-of-the-art hashing methods, and we follow exactly its experimental setup for the datasets CIFAR-10 (single-labeled) and NUS-WIDE (multi-labeled). To have a fair comparison, we use the same baseline architecture, CNN-F, as the feature extractor. We also use the same data splits for the training, query, and gallery datasets as described by the authors of DDSH. All benchmarks involving a GPU were conducted on a GeForce GTX TITAN-X with 12 GB memory. The parameters used for our experiments are given in Table 2 and were found by a random grid search [42]. The conducted benchmarks are especially interesting in the case of multi-labeled images, since our method extends DeepLDA to multi-labeled images. Therefore, we also conduct experiments on another multi-labeled dataset, MS-COCO [28]. DDSH does not provide results for MS-COCO, so we use the commonly used HashNet [23] as the baseline for the comparison in this case.

Evaluations
In this section, we summarize the experimental evaluations. We plot the training loss for all experiments and visualize the feature space to inspect the distribution of the feature embeddings learned by our feature extraction technique. On all three datasets, we measure the retrieval performance by the mean average precision.

Training Loss and Visualization of Feature Space
One crucial aspect of our method is its feature extraction. To learn a good feature space, we train our model with the training objective given in (8). Since each correlation coefficient is bounded by one, the loss is bounded from below by the negative number of correlation coefficients used. When the training loss gets close to this lower bound, we can infer that the model has been trained close to the optimum. Thus, we inspect the training loss for the mentioned datasets to evaluate our results. Once the model is trained until the loss reaches the lower bound, a feature space suitable for image retrieval is generated. We use t-Distributed Stochastic Neighbor Embedding (t-SNE) [29], a popular visualization tool, to visualize the feature space of the training images in 2D.

Mean Average Precision
Our final objective is the generation of binary codes suitable for image retrieval. Thus, we need benchmarks which can assess the performance of image retrieval frameworks. The essential characteristics of an image retrieval system for a given query image are its precision and recall. We measure these characteristics with respect to the Hamming ranking of the gallery images for a given query image [3]. For single-labeled images, a retrieved image is relevant if it belongs to the same class as the query image [10]. For multi-labeled images, a retrieved image is usually defined as relevant if it shares at least one category label with the query image [12]. Precision is defined as the ratio of the number of relevant retrieved images to the total number of retrieved images. Besides the precision of our retrieved results, recall is the second important measure [3]. Recall is the ratio of the relevant retrieved images to all relevant images in our gallery dataset [10]. The Average Precision (AP) score is defined as the average of the precision when recall varies from 0 to 1, and the mAP-score is the mean of the AP over all query images. In general, we report the average over five repetitions of each experiment for a more stable mAP-score.
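To make the evaluation protocol concrete, the following sketch computes the mAP under Hamming ranking. It uses the common discrete form of AP, i.e., the mean of the precision values at the ranks of the relevant images, and it assumes multi-hot label vectors, so that the "shares at least one label" rule also covers the single-labeled case with one-hot labels. The function names and the stable tie-breaking among equal Hamming distances are choices made for this illustration, not details taken from the benchmarks.

```python
import numpy as np


def average_precision(relevance_ranked):
    """AP of one ranked list of 0/1 relevance flags."""
    rel = np.asarray(relevance_ranked, dtype=float)
    n_rel = rel.sum()
    if n_rel == 0:
        return 0.0                      # no relevant image in the gallery
    precision_at_k = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    return float((precision_at_k * rel).sum() / n_rel)


def mean_average_precision(query_bits, query_labels, gallery_bits, gallery_labels):
    """mAP over all queries; gallery images are ranked by Hamming distance."""
    aps = []
    for bits, labels in zip(query_bits, query_labels):
        dist = np.count_nonzero(gallery_bits != bits, axis=1)
        order = np.argsort(dist, kind="stable")
        # Relevant if the gallery image shares at least one label with the query.
        relevant = (gallery_labels[order] @ labels) > 0
        aps.append(average_precision(relevant))
    return float(np.mean(aps))
```

For example, a ranked list with relevance flags 1, 0, 1, 0 has precision 1 at the first relevant image and 2/3 at the second, which gives an AP of 5/6.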

Experimental Results on the CIFAR-10 Dataset
The CIFAR-10 [26] dataset is widely used for benchmarking the performance of machine learning algorithms. It comprises 60,000 images of size 32 × 32 from 10 mutually exclusive classes. The typical training/test split consists of 50,000 training images and 10,000 test images. However, using such a large training dataset in comparison to a small test dataset is unreasonable for retrieval, since the retrieval method should only fine-tune an existing feature extractor on a small training subset. If all training images are used for the training and later constitute the gallery dataset, image retrieval could be substituted by a simple classification [43]. Consequently, the current image retrieval benchmarks use only 5,000 training images for CIFAR-10 [11], [12], [23]. Following DDSH [12], 500 images per class are randomly sampled from the dataset, resulting in 5,000 training images. Furthermore, 100 images per class are randomly sampled for the query dataset, which comprises 1,000 images in total. We adopt the usual approach that the gallery dataset includes the training dataset, as in the case of DDSH [12].

CIFAR-10 contains single-labeled images, so the equivalence to LDA holds and we can project to at most a 9-dimensional space. Hence, we use the ensemble technique proposed in Section 6 for generating longer binary codes.

In Fig. 5a, the training loss for our proposed method is depicted. For single-labeled images, the equivalence between LDA and CCA holds, and therefore the number of correlation coefficients is C − 1 = 9. The maximum value of each correlation coefficient is 1. We can clearly observe that we reach the lower bound of the loss, $\mathcal{L}_{\min} = -9$. As we use 9 correlation coefficients by projecting to 9 dimensions, this is the minimum value which can be achieved. It indicates that our training has converged properly, which substantiates the generation of an optimized feature space. In Fig. 6a, the 2-dimensional t-SNE visualization of our 9-dimensional feature space is illustrated. We depict the feature space for the training images based on category information. Different colors represent different classes. Note that the 2-dimensional embedding by t-SNE cannot capture the complete neighborhood structure of the 9-dimensional space which we use for binarization. Still, we can see that a nearest neighbor search on such a feature space would lead to excellent retrieval results. Furthermore, this feature embedding shares the properties of DeepLDA due to the equivalence of CCA and LDA for single-labeled images, i.e., we have a high inter-class variance and a small intra-class variance.

Mean Average Precision
In Table 3, the mAP-scores are summarized for different bit lengths. Deep Discrete Supervised Hashing (DDSH) [12], Deep Supervised Discrete Hashing (DSDH) [44], Deep Supervised Hashing with Triplet Labels (DTSH) [11], Deep Supervised Hashing with Pairwise Labels (DPSH) [9], Data Sensitive Hashing (DSH) [17], and Deep Hashing Network (DHN) [18] are deep hashing based methods. Neighborhood Discriminant Hashing (NDH) [45], Column Sampling Based Discrete Supervised Hashing (COSDISH) [46], Supervised Discrete Hashing (SDH) [47], Fast Supervised Hashing (FastH) [48], and Latent Factor Hashing (LFH) [49] are supervised learning methods. Our method outperforms all baseline approaches. Note that all deep hashing baseline models and our method use CNN-F as the feature extractor for a fair comparison. Furthermore, the mAP-score of the proposed DCCF method for 9 bits, i.e., our method without an ensemble technique, is mAP$_{9\text{-bits}}$ = 0.7796. This mAP-score is the mean of 20 runs, and the 0.95 confidence interval for this mean is [0.7758, 0.7835]. Thus, even with only 9 bits our method outperforms the state-of-the-art methods for 12 bits. This shows that the gain in performance does not come from the ensemble technique alone: our proposed feature space generation followed by the binarization already generates efficient binary codes.

Experimental Results on the NUS-WIDE Dataset
So far, we presented the experimental results on a single-labeled image dataset. In this case, our method can exploit the equivalence to LDA. For multi-labeled images, finding a suitable feature space is inherently more difficult. We developed our feature space optimization method such that it is generally applicable to image retrieval. The parameters (see Table 2) remain almost unchanged in comparison to CIFAR-10. However, NUS-WIDE is much more complex due to the larger number of categories and the fact that the images are annotated with multiple labels. To capture this complexity of the data, we use a larger batch size to obtain a better estimation of the covariance matrices. Note that we could project to a higher-dimensional space to avoid the use of the ensemble technique. In particular, we could design a loss based on the sum of up to 81 correlation coefficients for NUS-WIDE. In such a situation, at least as many categories as correlation coefficients have to be present in a batch for the estimation of the covariance matrices during the loss computation. This implies that we need a larger batch size. In general, we cannot ensure that all of these categories are present in each batch due to the limitation of the GPU memory. Therefore, we only project to a 10-dimensional space and use our proposed ensemble technique for generating longer codes.

Table 3. mAP-scores on the CIFAR-10 dataset [12]. The scores for DTSH are cited from [11]. The same experimental setup was used for all methods (see [11]).

Training Loss and Learned Feature Space
The training loss for our 10-dimensional projections is depicted in Fig. 5b. We know through the bounds of the correlation coefficient that the minimal achievable loss is −10. The training loss comes close to this theoretical bound of −10.
For the t-SNE visualization of multi-labeled images, each data point would have to be highlighted by multiple colors to indicate the different labels of each image. Therefore, we present t-SNE visualizations in which we highlight only some exemplary categories to evaluate whether an appropriate feature space was found by our method. In Fig. 6b, we highlight the pictures which contain the category 'person'. We can see that images sharing this category are close in the resulting feature space. As the number of pictures containing this category is large, it is likely that some of these images also share a more dominant category, e.g., a car in the foreground. This explains why we see different sub-clusters and why not all person images lie within one big cluster. In Fig. 6c, we highlight the images which contain the category 'animal'. Again, we see a separation of the animal images from the other images, and we can see sub-clusters as well. These sub-clusters can indicate images belonging to sub-categories, e.g., images depicting a dog.

Mean Average Precision
In Table 4, the mAP-scores for NUS-WIDE are summarized. For a lower number of bits, our method based on category information has slightly lower mAP values than the state-of-the-art methods. When calculating the mAP, images sharing multiple categories are treated as equally similar as images which share only one category. Our method based on category information tends to preserve more information, resulting in a feature space where images sharing more categories are closer to each other than images sharing only a few categories. This property is not considered in the benchmarks. The achieved mAP-scores are consistent with this assumption.

Experimental Results on the MS-COCO Dataset
To further assess the performance of our proposed method, we use the MS-COCO dataset [28], which is a multi-labeled dataset where each image is labeled with some of 80 object categories. Jiang et al. [12] do not provide results for this dataset in their DDSH paper. Hence, we follow the experimental setup of the state-of-the-art HashNet [23] method by Cao et al. We do not use the ensemble technique in this case, in order to assess the performance of our method based solely on the proposed feature extraction in combination with ITQ. Following Cao et al., we prune the images with no category information from the dataset and obtain 122,218 images by combining the training and validation images. We randomly sample 5,000 images for the query set. The remaining images constitute the gallery dataset. We randomly sample 10,000 training images from this gallery dataset. The parameters for the training were found by a random search and are summarized in Table 2. To have a fair comparison to HashNet, we use the same baseline model, i.e., AlexNet instead of CNN-F in this case. Since we evaluate the performance of our model without the ensemble technique in this case, we set the projection dimension during training equal to the number of bits.

Table 4. mAP-scores on the NUS-WIDE dataset [12]. The same experimental setup was used for all methods (see [12]). Each column corresponds to one code length.
Method          mAP
DDSH [12]       0.7911  0.8165  0.8217  0.8259
DSDH [44]       0.7916  0.8059  0.8063  0.8180
DPSH [9]        0.7882  0.8085  0.8167  0.8234
DSH [17]        0.7622  0.7940  0.7968  0.8081
DHN [18]        0.7900  0.8101  0.8092  0.8180
NDH [45]        0.7015  0.7351  0.7447  0.7449
COSDISH [46]    0.7303  0.7643  0.7868  0.7993
SDH [47]        0.7385  0.7616  0.7697  0.7720
FastH [48]      0.7412  0.7830  0.7948  0.8085
LFH [49]        0.7049  0.7594  0.7778  0.7936

Training Loss and Learned Feature Space
We consider the training loss for the different code lengths of 16, 32, 48, and 64 bits. In Fig. 5c, the loss for our 16-dimensional space is depicted. We can see that the loss reaches approximately −15, while the theoretical minimum is −16. In Fig. 5d, the loss for the 32-dimensional space corresponding to the 32-bit variant is plotted. Here as well, the loss is lower, which is expected since the theoretical minimum is −32 in this case. Since MS-COCO is a multi-labeled image dataset, the projections obtained by CCA cannot bring the angle between the label vectors and the feature vectors exactly to zero, which would have resulted in maximum correlation coefficients. So the lower bound cannot be reached exactly in this case, in contrast to the single-labeled CIFAR-10 dataset. Similar observations were made when training the networks for the 48-bit and 64-bit variants.

For the t-SNE visualization, we consider the 64-dimensional feature space. As MS-COCO is a multi-labeled dataset, we cannot easily present all category information within one t-SNE visualization. Therefore, we chose three sample categories to give an impression of the learned feature space. In Fig. 6d, we highlight the feature embeddings of the images which share the category 'car'. We can see that car images are in general grouped closely together. However, we also have different clusters which separate car images. The reason for these sub-clusters is the presence of more dominant categories within these images. In Fig. 6e, we can see an example of such a dominant category. The feature embeddings corresponding to plane images are marked in red. In this case, we see a clear separation of the corresponding cluster from the rest of the feature embeddings. If we look at Fig. 6d again, we see that although some car images also depict planes, the plane category is dominant in the feature space. The reason for this is that many images from various scenes depict cars, whereas planes are usually the dominant object within a picture due to their size. In Fig. 6f, the third sample category is highlighted. The visualizations indicate that a feature space suitable for image retrieval was found, since images sharing the same categories are close together.
For 32, 48, and 64 bits, our method performs better than the state-of-the-art. For 16 bits, our DCCH method has comparable performance to HashNet and DHN (ESR). The gain in performance through our method comes mainly from our effective feature extraction by DCCF and its combination with ITQ.

Example Queries
Finally, we present a few example queries based on our proposed method. To visualize an example of our DCCH method based on category labels, the retrieval results for 64 bits on MS-COCO are shown in Fig. 7. In Fig. 8, we show a case with wrong retrievals, probably because the query image contains 'road' information but the images are not labeled with that concept. We can see that our method deals with multi-labeled query images very well. Overall, the example queries reflect the results of the mAP-scores, which verify that the generated binary codes are well suited for image retrieval. Fig. 9 shows an example query result on the NUS-WIDE dataset. Interestingly, the retrieved images also contain images with flying birds, similar to the semantic content of the query image. The retrieved images have more or less the same background information, which is the blue color of sky, ice, and water for the top three retrieved images, respectively.

Table 5. mAP-scores on the MS-COCO dataset [23]. The same experimental setup was used for all methods (see [23]).
Method          16 bits  32 bits  48 bits  64 bits
[51]            0.5932   0.6034   0.6045   0.6099
CNNH [52]       0.5642   0.5744   0.5711   0.5671
SDH [47]        0.5545   0.5642   0.5723   0.5799
KSH [53]        0.5212   0.5343   0.5343   0.5361
ITQ-CCA [16]    0.5659   0.5624   0.5297   0.5019
ITQ [16]        0.5818   0.6243   0.6460   0.6574
BRE [54]        0.5920   0.6224   0.6300   0.6336
SH [15]         0.4951   0.5071   0.5099   0.5101
LSH [55]        0.4592   0.4856   0.5440   0.5849

CONCLUSIONS AND FUTURE WORK
In this paper, we propose a novel approach for learning a low-dimensional feature space. We utilize the global statistics of the feature space, which pairwise and triplet based approaches fail to use. Canonical Correlation Analysis (CCA) is used to correlate the projections of the feature vectors and the label vectors to obtain a low-dimensional feature space. The training loss curves show that we are able to maximize the correlation in all directions, achieving maximum class separability in all discriminant directions, which is also evident from the feature space visualizations. For single-labeled images, the obtained feature space has maximum between-class scatter and minimum within-class scatter, as we utilize the equivalence between Linear Discriminant Analysis and CCA here. Once we generate a compact low-dimensional feature space, we binarize it by using the popular Iterative Quantization (ITQ) approach. Since the length of our binary codes is bounded by the number of image categories, we propose an ensemble network technique to generate codes of the desired bit length. The measured mAP-scores show that our method outperforms state-of-the-art binarization schemes for image retrieval on CIFAR-10 and, for 32, 48, and 64 bits, on MS-COCO, while it remains competitive on NUS-WIDE. A possible future extension of our method is multi-modal search. Here, we could incorporate label information from another source, e.g., textual information. For instance, captions in the form of sentences might be available for the training images instead of category labels. In this case, we could find a word embedding for representing the textual annotations as vectors. These vectors could directly be used as the second data vector within our DCCF learning such that we correlate the output of the CNN with this embedding. It is reasonable to expect that our proposed method would also perform well on multi-modal image retrieval.