Towards Simultaneous Image Compression and Indexing for Scalable Content-Based Retrieval in Remote Sensing

Due to rapidly growing remote sensing (RS) image archives, images are usually stored in a compressed format to reduce their storage size. Thus, most of the existing content-based RS image retrieval systems require fully decoding (i.e., decompressing) images, which is computationally demanding for large-scale archives. To address this issue, we introduce a novel approach for simultaneous RS image compression and indexing for scalable content-based image retrieval (denoted as SCI-CBIR). The proposed SCI-CBIR eliminates the need for decoding RS images before image search and retrieval. To this end, it includes two main steps: 1) deep-learning-based compression and 2) deep-hashing-based indexing. The first step effectively compresses RS images by employing a pair of deep encoder and decoder neural networks together with an entropy model. The second step produces hash codes with a high discrimination capability for RS images by employing pairwise, bit-balancing, and classification loss functions. For the training of SCI-CBIR, we also introduce a novel multistage learning procedure with automatic loss weighting techniques to characterize RS image representations that are suitable for both RS image indexing and compression. The proposed learning procedure automatically weights the different loss functions of our approach instead of relying on a computationally demanding grid search. Experimental results show the effectiveness of the proposed approach compared to widely used approaches in RS. The code of the proposed approach is available at https://git.tu-berlin.de/rsim/SCI-CBIR.

the large-scale approximate nearest neighbor search problems for RS CBIR due to its high time efficiency (in terms of both storage and speed) and accurate search capability within huge image archives. Hashing methods map high-dimensional image features into compact binary hash codes [4]. Then, image retrieval can be achieved by calculating Hamming distances with simple bitwise XOR operations [5]. Several hashing methods have been presented in RS [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16]. Traditional hashing methods extract hand-crafted image features and map them into low-dimensional binary codes by using hashing functions [6], [7], [8]. In these methods, image feature extraction and hash code generation are applied separately. Thus, they are not capable of simultaneously optimizing feature learning and hash code learning, which limits the capability of the generated hash codes to represent the high-level semantic content of RS images. Recently, several deep-hashing-based indexing methods have been introduced in RS to address this issue. As an example, in [10] a deep hashing neural network (DHNN) is introduced to learn high-level semantic features and compact hash codes in an end-to-end manner. To improve the training stability of deep neural networks (DNNs) while learning hash codes, DHNN generates continuous approximations of hash codes during training while exploiting a quantization loss to push the approximated hash codes towards discrete values. In greater detail, a likelihood pairwise loss is utilized in DHNN to preserve the similarity of images in their hash codes. However, the pairwise loss can lead similar images to cluster together in a small portion of the Hamming space, which prevents generating discriminative hash codes.
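The Hamming-distance retrieval described above reduces to a XOR followed by a bit count. A minimal sketch (packing ±1 codes into Python integers is an implementation choice for illustration, not from the paper):

```python
import numpy as np

def to_int(bits):
    # Pack a ±1 (or 0/1) hash code into a Python integer for bitwise ops.
    bits01 = (np.asarray(bits) > 0).astype(np.uint8)
    return int("".join(map(str, bits01)), 2)

def hamming(a, b):
    # XOR the packed codes and count the differing bits (popcount).
    return bin(a ^ b).count("1")

# Two hypothetical 8-bit hash codes.
h1 = to_int([1, -1, 1, 1, -1, -1, 1, -1])
h2 = to_int([1, -1, -1, 1, -1, 1, 1, -1])
print(hamming(h1, h2))  # → 2
```

In practice, codes stored in a hash table are compared this way against the query code, which is far cheaper than distances over real-valued features.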
To avoid this problem, in [11], a deep-hashing convolutional neural network (DHCNN) is introduced to employ image labels for learning more discriminative hash codes. To this end, DHCNN learns to predict image labels together with generating hash codes by jointly optimizing a cross-entropy loss with pairwise and quantization losses. Despite the success of the pairwise loss in these methods, the triplet loss has been found more effective than the pairwise loss by introducing a margin threshold between similar and dissimilar images [17]. Accordingly, in [12], a metric-learning-based deep hashing network (MiLaN) is introduced to combine quantization loss with triplet loss. In addition, MiLaN employs a bit-balancing loss for maximizing code variance and information by forcing the hash codes to contain equal numbers of −1 and 1.

A preliminary version of this work has been briefly presented in [19] with limited experimental analysis. This article extends our work by introducing a detailed description of the proposed approach together with a detailed experimental analysis on two large-scale benchmark archives. Furthermore, several new experiments are conducted and their results are commented on. The main contributions of this work are summarized as follows.

• For the first time in RS, the proposed SCI-CBIR approach simultaneously applies RS image compression and indexing and thus does not require RS image decoding before CBIR, which can save a significant amount of time in operational applications.

• The proposed multistage learning procedure automatically weights all the considered loss functions, which allows us to: 1) learn appropriate RS image representations for both image compression and indexing; 2) eliminate computationally demanding grid search; and 3) automatically achieve different rate-distortion tradeoff points.

• The proposed SCI-CBIR approach is independent of the selected image compression and indexing methods and can operate with any DNN-based method.

The rest of this article is organized as follows. Section II presents the related works on RS image compression and RS CBIR in the compressed domain. Section III introduces the proposed SCI-CBIR approach. Section IV describes the considered RS image archives and the experimental setup, while Section V provides the experimental results. Section VI concludes our article.

In this section, we survey the existing methods for RS image compression and RS CBIR in the compressed domain. Traditional RS image compression methods are categorized into three groups: 1) prediction-based methods, which predict each spectral band based on the other bands and encode the prediction residuals into bitstreams (e.g., the CCSDS-123 multi- and hyperspectral image compression standard [20]); 2) vector quantization methods, which reduce redundancy by grouping image pixels with similar characteristics into clusters and representing each cluster with a codebook entry (e.g., mean-normalized vector quantization [21]); and 3) transform-based methods, which map RS images to transform-domain representations (e.g., the Karhunen–Loève transform [22], discrete cosine transform [23], discrete wavelet transform [24], etc.) and thus reduce the correlation among image pixels. Although prediction-based compression methods apply lossless compression and have a low computational complexity, their compression ratio is generally low, which makes them infeasible for large-scale RS archives. Vector quantization methods provide a higher compression ratio than prediction-based methods. However, training these methods and generating the required codebooks can be computationally demanding. Transform-based methods generally provide a high compression ratio and speed of computation and thus are widely used for RS image compression in operational archives. Among several transform-based methods, JPEG 2000 [25] became very popular in RS due to its multiresolution paradigm, scalability, and high compression ratio. The JPEG 2000 algorithm is widely used to compress RS images acquired by most recent satellites (such as Sentinel-2 [26]).
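The transform-based idea above can be illustrated with a toy orthonormal DCT applied to an 8×8 block. This is a generic sketch of transform coding, not the JPEG 2000 pipeline (which uses wavelets and entropy coding, both omitted here):

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    # Orthonormal DCT-II basis matrix: C @ C.T == identity.
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)
    return c

rng = np.random.default_rng(0)
block = rng.random((8, 8))          # a hypothetical 8x8 image block
C = dct_matrix(8)
coeffs = C @ block @ C.T            # forward 2-D DCT (decorrelates pixels)
coeffs[np.abs(coeffs) < 1e-3] = 0   # crude "compression": drop tiny coefficients
recon = C.T @ coeffs @ C            # inverse 2-D DCT
print(np.abs(recon - block).max())  # small reconstruction error
```

Because the transform concentrates energy into few coefficients, discarding the small ones loses little information, which is the basis of the high compression ratios noted above.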
Although this system reduces the retrieval time significantly compared to those that require full decoding, it still requires a partial decompression that may take significant time for operational CBIR applications.

As mentioned above, DL-based image compression methods are much more successful at preserving the perceptual quality of images at lower bit-rate values compared to JPEG 2000 [27]. To the best of our knowledge, our SCI-CBIR approach is the first study in the framework of scalable CBIR in the DL-based compressed domain in RS.

Let X = {x_1, . . . , x_M} be an RS image archive that includes M noncompressed images, where x_t is the tth image in the archive. We assume that a training set T ⊂ X is available, where each x_i ∈ T is associated with a set of class labels l_i ∈ {0, 1}^K and K is the number of classes.

The proposed SCI-CBIR approach aims to achieve accurate CBIR in a scalable way without any need for decompression of RS images before CBIR. Accordingly, SCI-CBIR simultaneously: 1) compresses each image x_i ∈ X into a bitstream and 2) indexes each image through a q-bit hash code b_i (which is stored in a hash table for scalable CBIR). This is achieved based on two steps: 1) DL-based compression and 2) deep-hashing-based indexing. For the training of SCI-CBIR, we introduce a multistage learning procedure to automatically define the different loss weights and rate-distortion tradeoff points. Fig. 1 shows an illustration of the proposed SCI-CBIR approach, which is explained in detail in Sections III-A–III-C.

The DL-based compression step of the proposed SCI-CBIR approach aims to compress each RS image into a minimum-length bitstream, which is efficiently stored and utilized for reconstructing the image with a minimum amount of distortion. Following the recent advances in DL-based image compression, this step employs a pair of encoder-decoder DNNs for learning to reconstruct RS images and an entropy model for reducing the length of the bitstreams (i.e., bit-rate optimization). Accordingly, this step includes three main blocks: 1) image encoding; 2) compression decoding; and 3) entropy modeling.
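The two simultaneous outputs described above (a bitstream for storage and a hash code for the hash table) can be sketched with toy linear stand-ins for the learned components; the shapes and operators below are hypothetical and are not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the learned components (hypothetical, linear for brevity).
W_enc = rng.standard_normal((16, 8))    # "encoder": 16-dim image -> 8-dim latent
W_idx = rng.standard_normal((8, 4))     # "index decoder": latent -> 4-dim embedding

def compress_and_index(x):
    y = np.round(x @ W_enc)             # quantized latent (step 1)
    e = y @ W_idx                       # indexing-specific embedding (step 2)
    b = np.where(e >= 0, 1, -1)         # q-bit hash code for the hash table
    return y, b                         # y would be entropy-coded to a bitstream

y, b = compress_and_index(rng.standard_normal(16))
print(b)
```

The key design point is that the hash code is computed from the latent y itself, so retrieval never needs the reconstructed image.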

Let f : X → Y be an image encoder that maps the image x_i to its latent representation y_i, where Y is the set of all latents for X. The first block of this step transforms x_i into its quantized latent representation ŷ_i as follows:

ŷ_i = Q(f(x_i; θ_f)),      (1)

where Q(a) rounds a to its nearest integer (i.e., quantization) and θ_f denotes the encoder parameters. Since rounding has zero gradients almost everywhere, Q(a) is replaced by a differentiable approximation during training. Let g : Y → X̂ be a decoder that maps the quantized latent ŷ_i to the reconstructed image x̂_i, where X̂ is the set of reconstructed images. The second block of this step reconstructs x_i from its quantized representation as follows:

x̂_i = g(ŷ_i; θ_g),      (2)

where θ_g denotes the decoder parameters. The third block of this step learns an entropy model q_ŷ over the quantized latents, where the image distribution p_x is approximated over the images of T. The rate term L_R is the cross entropy between the entropy model q_ŷ and the distribution of the quantized latents, while the distortion term L_D is measured with a distortion metric d between each image and its reconstruction, for which we utilize the multiscale structural similarity index (MS-SSIM). The overall compression objective combines both terms as

L_C = L_D + λ L_R,      (3)

where λ controls the rate-distortion tradeoff points.
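The quantization block can be sketched as follows. The training-time replacement shown here, additive uniform noise in [−0.5, 0.5), is a common relaxation in the DL-based compression literature (e.g., Ballé et al.) and is an assumption rather than necessarily the paper's exact choice:

```python
import numpy as np

def quantize(y, training: bool, rng=None):
    # Inference: hard rounding Q(y) = round(y).
    # Training: rounding is non-differentiable, so a common relaxation
    # replaces it with additive uniform noise in [-0.5, 0.5).
    if training:
        rng = rng or np.random.default_rng()
        return y + rng.uniform(-0.5, 0.5, size=np.shape(y))
    return np.round(y)

y = np.array([0.2, 1.7, -0.6])
print(quantize(y, training=False))  # → [ 0.  2. -1.]
```

The noisy proxy keeps gradients flowing to the encoder while matching the statistics of hard rounding.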

The deep-hashing-based indexing step of the proposed SCI-CBIR approach aims to map the latent representation of each RS image (which is characterized in the first step) into a discriminative hash code that preserves the semantic image content. Then, the hash codes of all RS images in the archive are indexed in a hash table, where semantically similar images fall into the same hash bucket. To this end, this step includes three main blocks: 1) index decoding; 2) hash code generation; and 3) class prediction. Let t : Y → E be a decoder that maps the latent y_i into the corresponding image embedding e_i for indexing as follows:

e_i = t(y_i; θ_t),

where θ_t denotes the decoder parameters. The index decoding block employs t to characterize image embeddings by extracting and decoding semantically informative features specific to indexing from the latent representations of the images. Accordingly, t is composed of the attention layer introduced in [33] followed by convolutional layers. To preserve the semantic similarity of images in their hash codes, we consider two types of pairwise similarity: hard similarity and soft similarity. An image pair shares either no common labels or all its labels for hard similarity, while an image pair shares only some of its labels for soft similarity.
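The hard/soft similarity definition above can be sketched for multi-label vectors. The soft score `s_ij` below is a hypothetical Jaccard-style choice for illustration, since the excerpt does not give the paper's exact formula:

```python
import numpy as np

def pair_similarity(l_i, l_j):
    # Hard similarity (m_ij = 1): the pair shares no labels or all its labels.
    # Soft similarity (m_ij = 0): the pair shares only some of its labels.
    shared = int(np.sum(l_i & l_j))
    union = int(np.sum(l_i | l_j))
    m_ij = 1 if shared == 0 or shared == union else 0
    s_ij = shared / union if union else 0.0  # hypothetical soft score in [0, 1]
    return m_ij, s_ij

l_i = np.array([1, 1, 0, 0])  # multi-label vectors with K = 4 classes
l_j = np.array([1, 0, 1, 0])
print(pair_similarity(l_i, l_j))  # → (0, 0.3333333333333333)
```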
Let P = {(x_i, x_j) | x_i ∈ T, x_j ∈ T, i ≠ j} be the set of all image pairs in T. The SPL function L_P is defined over these pairs, where s^o_ij and s^h_ij are the pairwise similarities between x_i and x_j and between their hash codes, respectively. m_ij defines whether (x_i, x_j) is associated with soft similarity (m_ij = 0) or hard similarity (m_ij = 1), and γ is a weighting parameter between the two types of similarity. To balance the distribution of the hash code bits by maximizing their variance, we adapt the bit-balancing loss [41] to image pairs as follows:

L_B = Σ_{(x_i, x_j) ∈ P} ((b_i^T 1)^2 + (b_j^T 1)^2),

where 1 is a vector with all elements equal to 1. L_B enforces the hash codes to contain equal numbers of −1 and 1. To further enhance the discriminative capability of the hash codes, we formulate a classification loss L_N over image pairs, which compares the predicted class labels of both images with their true labels. By considering the above-mentioned losses, the final hashing objective is formulated as follows:

L_H = w_P L_P + w_B L_B + w_N L_N,

where w_P, w_B, and w_N are the loss weights.

2) Bit-Rate Optimization: To accurately achieve different rate-distortion points, in the second stage, L_D continues to be optimized together with the bit-rate loss L_R with a learning rate η_2. Most of the existing DL-based image compression methods require multiple trainings with different λ values to achieve different tradeoff points for (3). Unlike them, in this stage, we reformulate (3) as a multiobjective optimization problem and employ the multiple-gradient descent algorithm (MGDA) [42] to automatically obtain the set of optimal tradeoff points as the set of Pareto optimal solutions.
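The bit-balancing behavior described above can be illustrated numerically; the exact normalization below (mean of squared bit sums over a batch) is a common form and an assumption, not the paper's exact loss:

```python
import numpy as np

def bit_balance_loss(codes):
    # codes: (batch, q) relaxed hash codes in [-1, 1].
    # Penalizes the squared sum of bits per code (b^T 1)^2, pushing each
    # code toward an equal number of -1s and +1s.
    return float(np.mean((codes @ np.ones(codes.shape[1])) ** 2))

balanced = np.array([[1, -1, 1, -1], [-1, 1, -1, 1]], dtype=float)
skewed = np.array([[1, 1, 1, 1], [1, 1, 1, -1]], dtype=float)
print(bit_balance_loss(balanced))  # → 0.0
print(bit_balance_loss(skewed))    # → 10.0
```

Balanced codes maximize per-bit variance, so every bit carries information.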
Let g_D = ∇_{θ_C} L_D and g_R = ∇_{θ_C} L_R be the gradient vectors of L_D and L_R, respectively, over the compression parameters θ_C. The gradient descent direction u = w_D g_D + w_R g_R for a Pareto optimal solution (which leads to an optimal tradeoff point) is obtained by optimizing the following problem:

min_{w_D, w_R} ||w_D g_D + w_R g_R||^2  s.t.  w_D + w_R = 1, w_D ≥ 0, w_R ≥ 0,      (8)

where w_D and w_R are estimated as follows:

w_D = 1, if g_D^T g_R ≥ g_D^T g_D;
w_D = 0, if g_D^T g_R ≥ g_R^T g_R;
w_D = ((g_R − g_D)^T g_R) / ||g_D − g_R||^2, otherwise,      (9)

with w_R = 1 − w_D.
After obtaining u by solving (8), the parameters θ_C are updated as follows:

θ_C ← θ_C − η_2 u.      (10)

Since the distortion loss has already converged in the learning reconstruction stage, w_D ≈ 1 and w_R ≈ 0 at the beginning of this stage. As the stage continues, L_R decreases until the first Pareto solution is found by (9). Then, by increasing η_2, w_R is gradually increased to reach another Pareto solution. Thus, by adjusting the learning rate itself, this stage allows obtaining the set of optimal rate-distortion tradeoff points without running multiple trainings or applying a computationally demanding grid search over λ.
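For two objectives, the MGDA min-norm problem in (8) and the closed-form weights in (9) can be sketched as follows. This is a generic two-task implementation of Désidéri's MGDA, not the paper's code:

```python
import numpy as np

def mgda_two_task(g_d, g_r):
    # Min-norm point on the segment between the two gradients: find
    # w in [0, 1] minimizing ||w * g_d + (1 - w) * g_r||.
    diff = g_d - g_r
    denom = float(diff @ diff)
    if denom == 0.0:
        w = 0.5  # gradients coincide; any convex combination works
    else:
        w = float((g_r - g_d) @ g_r) / denom
        w = min(max(w, 0.0), 1.0)  # clip to the constraint set (the
                                   # piecewise cases of the closed form)
    u = w * g_d + (1.0 - w) * g_r  # common descent direction
    return w, u

w, u = mgda_two_task(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
print(w, u)  # → 0.5 [0.5 0.5]
```

For orthogonal gradients of equal norm the weights split evenly; when one gradient dominates, the clipping reproduces the w_D = 0 or w_D = 1 cases.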

3) Learning Hash Codes:
The last stage involves optimizing all the losses associated with both steps of our approach to learn RS image latents compatible with both RS image indexing and compression. To this end, this stage employs two learning rates, η_3^C and η_3^H, for the losses of the first and the second step, respectively. It is worth noting that since the losses L_D and L_R are already optimized in the first two stages, we keep η_3^C < η_3^H to prevent the domination of image compression over image indexing. Since the different rate-distortion points are achieved in the second stage, the overall objective for a given rate-distortion point is written as follows:

L = w_D L_D + w_R L_R + w_P L_P + w_B L_B + w_N L_N,

where w_D and w_R are estimated for the specific rate-distortion point in the previous stage. To automatically find the weights w_P, w_B, and w_N instead of relying on a time-demanding grid search, we utilize automatic loss weighting techniques. Accordingly, the parameters of the first and the second step are updated with the learning rates η_3^C and η_3^H, respectively.

We trained the proposed approach by using a stochastic gradient descent algorithm.
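As an example of the automatic loss weighting techniques mentioned above, dynamic weight average (DWA) weights each loss by its recent rate of descent. A minimal sketch (the loss values are hypothetical; the temperature of 2 follows the original DWA paper):

```python
import numpy as np

def dwa_weights(prev_losses, prev2_losses, temperature=2.0):
    # Dynamic Weight Average (Liu et al.): each loss is weighted by its
    # descent rate over the last two epochs, softmax-normalized with a
    # temperature and scaled so the weights sum to the number of tasks.
    r = np.asarray(prev_losses) / np.asarray(prev2_losses)  # slower descent -> larger r
    e = np.exp(r / temperature)
    return len(r) * e / e.sum()

# Losses L_P, L_B, L_N at the two previous epochs (hypothetical values).
print(dwa_weights([0.9, 0.5, 0.2], [1.0, 1.0, 1.0]))
```

The loss that descends slowest receives the largest weight, which keeps all objectives progressing without any grid search.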

As discussed in Section III-C, the training of the proposed approach is divided into three stages. In the first stage, the first step of the proposed approach was optimized for the distortion loss only, and η_1 was updated according to the MS-SSIM value averaged on the validation set V. The second stage starts when the distortion loss reaches convergence. The learning rate η_2 was set to 10^−5 at the beginning of this stage. After the first Pareto point was obtained, η_2 was increased to 9 × 10^−5. In the third stage, the second step of the proposed approach was jointly trained with the first step, while the learning rate η_3^H was set to 10^−4. η_3^C was varied as η_3^C ∈ {0, 10^−8, 10^−4}, while the automatic loss weighting technique was varied among projecting conflicting gradients (PCGrad) [47], dynamic weight average (DWA) [48], and equal weighting. All the experiments were conducted on NVIDIA Tesla V100 GPUs. Experimental results are provided in terms of MS-SSIM and bit rate (in bits per pixel, bpp) for compression performance, while precision (P), recall (R), mean average precision (mAP), and retrieval time were used for comparing retrieval performance. It is worth noting that we mapped MS-SSIM values to the decibel (dB) scale as suggested in [33]. The retrieval metrics P, R, and mAP were averaged over the 15 most similar images.

We conducted experiments to: 1) perform a sensitivity analysis and 2) compare the proposed SCI-CBIR approach with standard approaches. In detail, we compare the results of the first step of SCI-CBIR with those obtained by applying image compression with a recurrent neural network (denoted as IC-RNN) [49] and JPEG 2000 [25]. We compare the results of the second step of SCI-CBIR with those obtained by the second step of our approach trained on fully decompressed data (denoted as SI-CBIR).
We compare the results of the proposed SCI-CBIR trained with our multistage learning procedure with those obtained by training with a standard learning procedure. For IC-RNN, we utilized MS-SSIM as the distortion measure and updated the learning rate using the same MS-SSIM-based rule as for the first step of our approach. It was trained with six RNN iterations for 280 epochs. For SI-CBIR, we trained the second step of our approach together with the image encoder of the first step, with the same hyperparameters and the loss functions L_P, L_B, and L_N. SI-CBIR is not capable of simultaneous compression and indexing and thus requires decoding before indexing. For the standard learning procedure, we jointly trained all the losses required for compression and indexing in a single learning stage. For the loss weights, we varied the weight of the distortion loss L_D and kept the rest equal to control the rate-distortion tradeoff.

A similar behavior of the attention layer has been observed for the MLRSNet archive (not reported for space constraints).
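The dB mapping of MS-SSIM values mentioned above can be sketched as follows; we assume the common convention −10·log10(1 − MS-SSIM) used in learned-compression papers, which spreads out values close to 1:

```python
import math

def msssim_db(msssim: float) -> float:
    # Common dB mapping for MS-SSIM; higher is better, and differences
    # near MS-SSIM = 1 become visible on this scale.
    return -10.0 * math.log10(1.0 - msssim)

print(round(msssim_db(0.99), 2))  # → 20.0
```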

In the third set of trials, we analyzed the effect of different activation functions in the hash code generation block. Table III shows the corresponding retrieval results for BigEarthNet-S2. One can observe from the table that using the Greedy hash activation function achieves the highest precision and mAP scores with comparable recall scores. This is due to the fact that the Greedy hash function does not require applying the quantization loss to the discrete hash codes. Accordingly, this function minimizes the quantization error compared to the other activation functions [46]. Thus, we set Greedy hash as the activation function for the rest of the experiments. We observed similar behavior for the MLRSNet archive (not reported for space constraints).
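The Greedy hash activation can be sketched as a sign function with a straight-through gradient; this is a generic sketch of the idea from [46], not the paper's implementation:

```python
import numpy as np

class SignSTE:
    # Greedy-hash-style activation: the forward pass is a hard sign(),
    # while the backward pass copies the gradient through unchanged
    # (straight-through estimator), so no quantization loss on the
    # discrete codes is needed.
    @staticmethod
    def forward(e):
        return np.where(e >= 0, 1.0, -1.0)

    @staticmethod
    def backward(grad_out):
        return grad_out  # identity: d(sign)/de approximated as 1

e = np.array([0.3, -1.2, 0.0, 2.5])
print(SignSTE.forward(e))  # → [ 1. -1.  1.  1.]
```

Because the forward pass already emits exact ±1 codes, the quantization error that other relaxations (e.g., tanh) must penalize never arises.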

In the fourth set of experiments, we assessed the effect of different automatic loss weighting techniques (which are applied in the third stage of our learning procedure) on the retrieval performance. Table IV shows the corresponding results.

TABLE IV: Results obtained by the proposed SCI-CBIR for different automatic loss weighting techniques (BigEarthNet-S2 archive).

In the third set of trials, we analyzed the effectiveness of the proposed multistage learning procedure by comparing it with the standard learning procedure. Table VII shows the compression and retrieval results obtained by the proposed SCI-CBIR approach trained with the proposed multistage and the standard learning procedures for the BigEarthNet-S2 archive. By assessing the table, one can see that the proposed SCI-CBIR approach with our multistage procedure provides higher CBIR metric scores and MS-SSIM values compared to SCI-CBIR with the standard learning procedure at similar bpp values. This is due to the fact that when a single learning procedure with equal loss weights is utilized, as in the standard learning procedure, the learning objectives for indexing and compression conflict with each other independently of the rate-distortion tradeoff point (which is controlled by λ in the standard learning procedure). This prevents accurately learning RS image compression together with RS image indexing. Unlike the standard learning procedure, thanks to the proposed multistage learning procedure, our approach is capable of simultaneously learning both tasks in an effective way by automatically: 1) weighting the different loss functions and 2) finding the rate-distortion tradeoff points. Similar behavior of the proposed multistage learning procedure has been observed for the MLRSNet archive (not reported for space constraints).

This article introduces a novel approach (denoted as SCI-CBIR) to simultaneously compress and index RS images for scalable CBIR. The SCI-CBIR approach is characterized by two steps that are simultaneously applied based on a novel multistage learning procedure. The first step is the DL-based compression step, where RS images are first mapped into their latent representations and then reconstructed back from the latents by exploiting a pair of encoder and decoder DNNs. An entropy model is utilized to generate bitstreams for a given rate-distortion tradeoff point. The second step is the deep-hashing-based indexing step, where the hash codes of RS images are generated from their latent representations. With the proposed multistage learning procedure, all the parameters of SCI-CBIR are learned within three consecutive stages: 1) minimizing a distortion loss to learn image reconstruction; 2) optimizing the bit rate to obtain different rate-distortion tradeoff points; and 3) learning hash codes jointly with the compression losses. Thanks to this design, SCI-CBIR achieves CBIR without decoding compressed RS images (which is required for most of the CBIR systems in RS). We underline that this is a very important advantage, partic-