
Effective triplet mining improves training of multi-scale pooled CNN for image retrieval

  • Special Issue Paper
  • Published in: Machine Vision and Applications

Abstract

In this paper, we address the problem of content-based image retrieval (CBIR) by learning image representations based on the activations of a convolutional neural network. We propose an end-to-end trainable network architecture that exploits a novel multi-scale local pooling based on the trainable aggregation layer NetVLAD (Arandjelović et al., in Proc. CVPR, 2016) and bags of local features obtained by splitting the activations, reducing the dimensionality of the descriptor while increasing retrieval performance. Training uses an improved triplet mining procedure that selects samples based on their difficulty, yielding an effective image representation and reducing the risk of overfitting and loss of generalization. Extensive experiments show that our approach, which can be used with different CNN architectures, obtains state-of-the-art results on standard and challenging CBIR datasets.
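The difficulty-based triplet selection described in the abstract can be sketched as follows. This is an illustrative semi-hard mining routine in NumPy, not the authors' exact procedure; the function name, the margin value, and the fallback to the hardest negative are assumptions made for the sake of a runnable example.

```python
import numpy as np

def mine_triplets(embeddings, labels, margin=0.3):
    """Illustrative semi-hard triplet mining on L2-normalized embeddings.

    For each anchor, pick the hardest positive (the farthest same-label
    sample) and a semi-hard negative: a different-label sample that is
    farther than the positive but within the margin, so gradients stay
    informative without overfitting to the most confusing (possibly noisy)
    negatives.
    """
    triplets = []
    # Pairwise squared Euclidean distances, shape (N, N).
    dists = np.sum((embeddings[:, None, :] - embeddings[None, :, :]) ** 2, axis=-1)
    n_samples = len(labels)
    for a in range(n_samples):
        pos_idx = np.where((labels == labels[a]) & (np.arange(n_samples) != a))[0]
        neg_idx = np.where(labels != labels[a])[0]
        if len(pos_idx) == 0 or len(neg_idx) == 0:
            continue  # anchor has no valid positive or negative
        p = pos_idx[np.argmax(dists[a, pos_idx])]  # hardest positive
        d_ap = dists[a, p]
        # Semi-hard band: farther than the positive, but within the margin.
        semi_hard = neg_idx[(dists[a, neg_idx] > d_ap) &
                            (dists[a, neg_idx] < d_ap + margin)]
        if len(semi_hard) > 0:
            n = semi_hard[np.argmin(dists[a, semi_hard])]
        else:
            n = neg_idx[np.argmin(dists[a, neg_idx])]  # fall back to hardest negative
        triplets.append((a, p, n))
    return triplets
```

Restricting negatives to the semi-hard band is what makes mining "effective" in the sense discussed here: trivially easy negatives produce no useful gradient, while the very hardest ones tend to amplify label noise and hurt generalization.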



Notes

  1. The code and models are released at the following address: https://github.com/fede-vaccaro/NetVlad.

  2. In our experiments, this is performed on the training dataset.

  3. https://github.com/cvdfoundation/google-landmark.

  4. https://www.kaggle.com/confirm/cleaned-subsets-of-google-landmarks-v2.

References

  1. Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T., Sivic, J.: NetVLAD: CNN architecture for weakly supervised place recognition. In: Proceedings of CVPR (2016)

  2. Azizpour, H., Razavian, A.S., Sullivan, J., Maki, A., Carlsson, S.: From generic to specific deep representations for visual recognition. In: Proceedings of CVPR Workshops (2015)

  3. Babenko, A., Lempitsky, V.: Aggregating deep convolutional features for image retrieval. In: Proceedings of ICCV (2015)

  4. Babenko, A., Lempitsky, V.: Aggregating local deep features for image retrieval. In: Proceedings of ICCV, pp. 1269–1277 (2015)

  5. Babenko, A., Slesarev, A., Chigorin, A., Lempitsky, V.: Neural codes for image retrieval. In: Proceedings of ECCV (2014)

  6. Ballan, L., Bertini, M., Uricchio, T., Del Bimbo, A.: Social media annotation. In: Proceedings of CBMI, pp. 229–235 (2013)

  7. Cao, B., Araujo, A., Sim, J.: Unifying deep local and global features for image search. In: Proceedings of ECCV, Springer, pp. 726–743 (2020)

  8. Castells, T., Weinzaepfel, P., Revaud, J.: Superloss: a generic loss for robust curriculum learning. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 4308–4319. Curran Associates Inc, New York (2020)


  9. Delhumeau, J., Gosselin, P.-H., Jegou, H., Pérez, P.: Revisiting the VLAD image representation. In: Proceedings of ACM MM (2013)

  10. El-Nouby, A., Neverova, N., Laptev, I., Jégou, H.: Training vision transformers for image retrieval. arXiv preprint arXiv:2102.05644 (2021)

  11. Ercoli, S., Bertini, M., Del Bimbo, A.: Compact hash codes for efficient visual descriptors retrieval in large scale databases. IEEE Trans. Multimedia (TMM) 19(11), 2521–2532 (2017)


  12. Gong, Y., Wang, L., Guo, R., Lazebnik, S.: Multi-scale orderless pooling of deep convolutional activation features. In: Proceedings of ECCV (2014)

  13. Gordo, A., Almazán, J., Revaud, J., Larlus, D.: Deep image retrieval: learning global representations for image search. In: Proceedings of ECCV (2016)

  14. Gordo, A., Almazan, J., Revaud, J., Larlus, D.: End-to-end learning of deep visual representations for image retrieval. Int. J. Comput. Vis. 124(2), 237–254 (2017)


  15. Gu, Y., Li, C., Xie, J.: Attention-aware generalized mean pooling for image retrieval. arXiv preprint arXiv:1811.00202 (2018)

  16. Hausler, S., Garg, S., Xu, M., Milford, M., Fischer, T.: Patch-NetVLAD: multi-scale fusion of locally-global descriptors for place recognition. In: Proceedings of CVPR, pp. 14141–14152 (2021)

  17. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR, pp. 770–778 (2016)

  18. Hu, Z., Bors, A.G.: Conditional attention for content-based image retrieval. In: Proceedings of BMVC (2020)

  19. Iscen, A., Tolias, G., Gosselin, P.-H., Jégou, H.: A comparison of dense region detectors for image search and fine-grained classification. IEEE Trans. Image Process. 24(8), 2369–2381 (2015)


  20. Jégou, H., Perronnin, F., Douze, M., Sánchez, J., Pérez, P., Schmid, C.: Aggregating local image descriptors into compact codes. IEEE Trans. Pattern Anal. Mach. Intell. 34(9), 1704–1716 (2012)


  21. Jégou, H., Douze, M., Schmid, C.: Improving bag-of-features for large scale image search. Int. J. Comput. Vis. 87(3), 316–336 (2010)


  22. Kalantidis, Y., Mellina, C., Osindero, S.: Cross-dimensional weighting for aggregated deep convolutional features. In: Hua, G., Jégou, H. (eds.) Proceedings of ECCV Workshops, Springer, Cham, pp 685–701 (2016)

  23. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: Proceedings of ICLR (2015)

  24. Li, X., Uricchio, T., Ballan, L., Bertini, M., Snoek, C.G.M., Bimbo, A.D.: Socializing the semantic gap: a comparative survey on image tag assignment, refinement, and retrieval. ACM Comput. Surv. 49(1), 1–39 (2016)


  25. Li, X., Jin, K., Long, R.: End-to-end semantic-aware object retrieval based on region-wise attention. Neurocomputing 359, 219–226 (2019)


  26. Martínez-Cortés, T., González-Díaz, I., Díaz de María, F.: Training deep retrieval models with noisy datasets: bag exponential loss. Pattern Recognit. 112, 107811 (2021)

  27. Mikulik, A., Perdoch, M., Chum, O., Matas, J.: Learning vocabularies over a fine quantization. Int. J. Comput. Vis. 103(1), 163–175 (2013)


  28. Mishkin, D., Radenović, F., Matas, J.: Repeatability is not enough: learning affine regions via discriminability. In: Proceedings of ECCV (2018)

  29. Mohedano, E., McGuinness, K., O’Connor, N.E., Salvador, A., Marques, F., Giro-i Nieto, X.: Bags of local convolutional features for scalable instance search. In: Proceedings of ACM ICMR (2016)

  30. Morère, O., Lin, J., Veillard, A., Duan, L.-Y., Chandrasekhar, V., Poggio, T.: Nested invariance pooling and RBM hashing for image instance retrieval. In: Proceedings of ACM ICMR (2017)

  31. Ng, T., Balntas, V., Tian, Y., Mikolajczyk, K.: SOLAR: second-order loss and attention for image retrieval. In: Proceedings of ECCV (2020)

  32. Ong, E.-J., Husain, S., Bober, M.: Siamese network of deep Fisher-vector descriptors for image retrieval. arXiv preprint arXiv:1702.00338 (2017)

  33. Ozaki, K., Yokoo, S.: Large-scale landmark retrieval/recognition under a noisy and diverse dataset. arXiv preprint arXiv:1906.04087 (2019)

  34. Perronnin, F., Sánchez, J., Mensink, T.: Improving the Fisher kernel for large-scale image classification. In: Proceedings of ECCV (2010)

  35. Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Lost in quantization: improving particular object retrieval in large scale image databases. In: Proceedings of CVPR (2008)

  36. Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Object retrieval with large vocabularies and fast spatial matching. In: Proceedings of CVPR (2007)

  37. Radenović, F., Iscen, A., Tolias, G., Avrithis, Y., Chum, O.: Revisiting Oxford and Paris: large-scale image retrieval benchmarking. In: Proceedings of CVPR (2018)

  38. Radenović, F., Tolias, G., Chum, O.: CNN image retrieval learns from BoW: unsupervised fine-tuning with hard examples. In: Proceedings of ECCV (2016)

  39. Radenović, F., Tolias, G., Chum, O.: Fine-tuning CNN image retrieval with no human annotation. IEEE Trans. Pattern Anal. Mach. Intell. 41(7), 1655–1668 (2018)


  40. Razavian, A.S., Azizpour, H., Sullivan, J., Carlsson, S.: CNN features off-the-shelf: an astounding baseline for visual recognition. In: Proceedings of CVPR Workshop of DeepVision (2014)

  41. Razavian, A., Sullivan, J., Maki, A., Carlsson, S.: A baseline for visual instance retrieval with deep convolutional networks. ITE Trans. Media Technol. Appl. 4, 12 (2014)


  42. Mopuri, K.R., Babu, R.V.: Object level deep feature pooling for compact image representation. In: Proceedings of CVPR Workshops (2015)

  43. Revaud, J., Almazan, J., Rezende, R.S., de Souza, C.R.: Learning with average precision: training image retrieval with a listwise loss. In: Proceedings of ICCV (2019)

  44. Schuster, R., Wasenmuller, O., Unger, C., Stricker, D.: SDC-Stacked dilated convolution: a unified descriptor network for dense matching tasks. In: Proceedings of CVPR, pp. 2556–2565 (2019)

  45. Shi, X., Qian, X.: Exploring spatial and channel contribution for object based image retrieval. Knowl. Based Syst. 186, 104955 (2019)


  46. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)

  47. Sivic, J., Zisserman, A.: Video Google: a text retrieval approach to object matching in videos. In: Proceedings of ICCV (2003)

  48. Smeulders, A.W.M., Worring, M., Santini, S., Gupta, A., Jain, R.: Content-based image retrieval at the end of the early years. IEEE Trans. Pattern Anal. Mach. Intell. 22(12), 1349–1380 (2000)


  49. Tolias, G., Sicre, R., Jégou, H.: Particular object retrieval with integral max-pooling of CNN activations. In: Proceedings of ICLR (2016)

  50. Uricchio, T., Bertini, M., Seidenari, L., Del Bimbo, A.: Fisher encoded convolutional Bag-of-Windows for efficient image retrieval and social image tagging. In: Proceedings of ICCV International Workshop on Web-Scale Vision and Social Media (VSM) (2015)

  51. Vaccaro, F., Bertini, M., Uricchio, T., Del Bimbo, A.: Image retrieval using multi-scale CNN features pooling. In: Proceedings of ACM ICMR (2020)

  52. Wang, X., Zhu, L., Yang, Y.: T2VLAD: global-local sequence alignment for text-video retrieval. In: Proceedings of CVPR, pp. 5079–5088 (2021)

  53. Wu, C.-Y., Manmatha, R., Smola, A.J., Krähenbühl, P.: Sampling matters in deep embedding learning. In: Proceedings of ICCV (2017)

  54. Xie, L., Hong, R., Zhang, B., Tian, Q.: Image classification and retrieval are ONE. In: Proceedings of ACM ICMR (2015)

  55. Xu, J., Shi, C., Qi, C., Wang, C., Xiao, B.: Unsupervised part-based weighting aggregation of deep convolutional features for image retrieval. In: Proceedings of AAAI (2018)

  56. Xu, J., Wang, C., Shi, C., Xiao, B.: Weakly supervised soft-detection-based aggregation method for image retrieval. In: CoRR. arXiv preprint arXiv:1811.07619 (2018)

  57. Yue-Hei Ng, J., Yang, F., Davis, L.S.: Exploiting local features from deep networks for image retrieval. In: Proceedings of CVPR Workshops (2015)

  58. Zhang, H., Wu, C., Zhang, Z., Zhu, Y., Zhang, Z., Lin, H., Sun, Y., He, T., Mueller, J., Manmatha, R., Li, M., Smola, A.: ResNeSt: split-attention networks. arXiv preprint arXiv:2004.08955 (2020)

  59. Zheng, L., Yang, Y., Tian, Q.: SIFT meets CNN: a decade survey of instance retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 40(5), 1224–1244 (2017)


  60. Zheng, L., Zhao, Y., Wang, S., Wang, J., Tian, Q.: Good practice in CNN feature transfer. arXiv preprint arXiv:1604.00133 (2016)

  61. Zhou, J., Ying, W.: Learning visual instance retrieval from failure: Efficient online local metric adaptation from negative samples. IEEE Trans. Pattern Anal. Mach. Intell. 42(11), 2858–2873 (2019)



Acknowledgements

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research. This work was supported by the European Union's Horizon 2020 research and innovation programme under grant number 951911 (AI4Media).

Author information

Correspondence to Tiberio Uricchio.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Vaccaro, F., Bertini, M., Uricchio, T. et al. Effective triplet mining improves training of multi-scale pooled CNN for image retrieval. Machine Vision and Applications 33, 16 (2022). https://doi.org/10.1007/s00138-021-01260-z

