MapReduce-based clustering for near-duplicate image identification

Zhao, Wanqing; Luo, Hangzai; Peng, Jinye; Fan, Jianping

doi:10.1007/s11042-016-4060-4

MapReduce-based clustering for near-duplicate image identification

Published: 16 November 2016

Volume 76, pages 23291–23307, (2017)
Cite this article

Multimedia Tools and Applications Aims and scope Submit manuscript

Wanqing Zhao¹,
Hangzai Luo¹,
Jinye Peng¹ &
…
Jianping Fan²

433 Accesses
2 Citations
Explore all metrics

Abstract

In this paper, an effective algorithm is developed for tackling the problem of near-duplicate image identification from large-scale image sets, where the LLC (locality-constrained linear coding) method is seamlessly integrated with the maxIDF cut model to achieve more discriminative representations of images. By incorporating MapReduce framework for image clustering and pairwise merging, the near duplicates of images can be identified effectively from large-scale image sets. An intuitive strategy is also introduced to guide the process for parameter selection. Our experimental results on large-scale image sets have revealed that our algorithm can achieve significant improvement on both the accuracy rates and the computation efficiency as compared with other baseline methods.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Microsoft COCO: Common Objects in Context

Image segmentation evaluation: a survey of methods

Article 18 April 2020

A comprehensive survey of image segmentation: clustering methods, performance parameters, and benchmark datasets

Article 09 February 2021

References

Bayardo RJ, Ma Y, Srikant R (2007) Scaling up all pairs similarity search. In: Proceedings of the 16th international conference on World Wide Web, pp. 131–140. ACM
Broder AZ (1997) On the resemblance and containment of documents. In: Compression and Complexity of Sequences 1997. Proceedings, pp. 21–29. IEEE
Broder AZ, Glassman SC, Manasse MS, Zweig G (1997) Syntactic clustering of the web. Computer Networks and Isdn Systems 29(8-13):1157–1166
Article Google Scholar
Cherian A, Morellas V, Papanikolopoulos N (2012) Robust sparse hashing. In: Proceedings / ICIP... International Conference on Image Processing, pp. 2417–2420
Chum O, Perdoch M, Matas J (2009) Geometric min-hashing: Finding a (thick) needle in a haystack. In: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 17–24. IEEE
Chum O, Philbin J, Zisserman A, et al. (2008) Near duplicate image detection: min-hash and tf-idf weighting. In: BMVC, vol. 810, pp. 812–815
Datar M, Immorlica N, Indyk P, Mirrokni VS (2004) Locality-sensitive hashing scheme based on p-stable distributions. In: Proceedings of the twentieth annual symposium on Computational geometry, pp. 253–262. ACM
Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113
Article Google Scholar
Dong W, Wang Z, Charikar M, Li K (2012) High-confidence near-duplicate image detection. In: Proceedings of the 2nd ACM International Conference on Multimedia Retrieval
Elsayed T, Lin J, Oard DW (2008) Pairwise document similarity in large collections with mapreduce. In: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, pp. 265–268. Association for Computational Linguistics
Foo JJ, Zobel J, Sinha R (2007) Clustering near-duplicate images in large collections. In: Proceedings of the international workshop on Workshop on multimedia information retrieval, pp. 21–30
Hama H, Zin TT, Tin P (2009) A hybrid ranking of link and popularity for novel search engine. International Journal of Innovative Computing. Inf Control 5 (11):4041–4049
Google Scholar
Hsieh LC, Wu GL, Hsu YM, Hsu W (2014) Online image search result grouping with mapreduce-based image clustering and graph construction for large-scale photos. J Vis Commun Image Represent 25(2):384–395
Article Google Scholar
Hsieh LC, Wu GL, Lee WY, Hsu W (2012) Two-stage sparse graph construction using minhash on mapreduce. In: IEEE International Conference on Acoustics, pp. 1013–1016
Kim S, Wang XJ, Zhang L, Choi S (2015) Near duplicate image discovery on one billion images. In: 2015 IEEE Winter Conference on, Applications of Computer Vision (WACV), pp. 943–950
Lee DC, Ke Q, Isard M (2010) Partition min-hash for partial duplicate image discovery. In: European Conference on Computer Vision, pp. 648–662. Springer
Liu T, Rosenberg C, Rowley H, et al. (2007) Clustering billions of images with large scale nearest neighbor search. In: Applications of Computer Vision, 2007. WACV’07. IEEE Workshop on, pp. 28–28. IEEE
Peng J, Shen Y, Fan J (2013) Cross-modal social image clustering and tag cleansing. J Vis Commun Image Represent 24(7):895–910
Article Google Scholar
Salakhutdinov R, Hinton GE (2007) Learning a nonlinear embedding by preserving class neighbourhood structure. In: International Conference on Artificial Intelligence and Statistics, pp. 412–419
Sivic J, Zisserman A (2003) Video google: A text retrieval approach to object matching in videos. In: Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on, pp. 1470–1477. IEEE
Thomee B, Shamma DA, Friedland G, Elizalde B, Ni K, Poland D, Borth D, Li LJ (2015) The new data and new challenges in multimedia research. arXiv preprint. arXiv:1503.01817
Vonikakis V, Jinda-Apiraksa A, Winkler S (2014) Photocluster: A multi-clustering technique for near-duplicate detection in personal photo collections. In: Computer Vision Theory and Applications (VISAPP), 2014 International Conference on, pp. 153–161
Wang H, Zhu F, Xiao B, Wang L, Jiang YG (2014) Gpu-based mapreduce for large-scale near-duplicate video retrieval. Multimedia Tools & Applications 74(23):10,515–10,534
Article Google Scholar
Wang J, Kumar S, Chang SF (2012) Semi-supervised hashing for large-scale search. Pattern Analysis and Machine Intelligence. Tran IEEE 34(12):2393–2406
Google Scholar
Wang J, Yang J, Yu K, Lv F, Huang T, Gong Y (2010) Locality-constrained linear coding for image classification. In: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 3360–3367. IEEE
Wang XJ, Zhang L, Liu C (2013) Duplicate discovery on 2 billion internet images. In: Computer Vision and Pattern Recognition Workshops (CVPRW), 2013 IEEE Conference on, pp. 429–436
Weiss Y, Torralba A, Fergus R (2009) Spectral hashing. In: Advances in neural information processing systems, pp. 1753–1760
Xie L, Tian Q, Zhou W, Zhang B (2014) Fast and accurate near-duplicate image search with affinity propagation on the imageweb. Comput Vis Image Underst 124:31–41
Article Google Scholar
Yang C, Peng J, Fan J (2012) Image collection summarization via dictionary learning for sparse representation. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1122– 1129
Zheng L, Wang S, Liu Z, Tian Q (2013) Lp-norm idf for large scale image search. In: Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pp. 1626–1633. IEEE

Download references

Acknowledgments

This research is partly supported by National Science Foundation of China under Grant 61272285, National High-Technology Program of China (863 Program, Grant No.2014AA015201), Program for Changjiang Scholars and Innovative Research Team in University (No.IRT13090), and Program of Shaanxi Province Innovative Research Team (No.2014KCT-17).

Author information

Authors and Affiliations

School of Information and Technology, Northwest University of China, Xi’an, China
Wanqing Zhao, Hangzai Luo & Jinye Peng
Department of Computer Science, UNC-Charlotte, Charlotte, NC, 28223, USA
Jianping Fan

Authors

Wanqing Zhao
View author publications
You can also search for this author in PubMed Google Scholar
Hangzai Luo
View author publications
You can also search for this author in PubMed Google Scholar
Jinye Peng
View author publications
You can also search for this author in PubMed Google Scholar
Jianping Fan
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Hangzai Luo.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Zhao, W., Luo, H., Peng, J. et al. MapReduce-based clustering for near-duplicate image identification. Multimed Tools Appl 76, 23291–23307 (2017). https://doi.org/10.1007/s11042-016-4060-4

Download citation

Received: 24 March 2016
Revised: 29 September 2016
Accepted: 10 October 2016
Published: 16 November 2016
Issue Date: November 2017
DOI: https://doi.org/10.1007/s11042-016-4060-4

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

MapReduce-based clustering for near-duplicate image identification

Abstract

Access this article

Similar content being viewed by others

Microsoft COCO: Common Objects in Context

Image segmentation evaluation: a survey of methods

A comprehensive survey of image segmentation: clustering methods, performance parameters, and benchmark datasets

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Navigation

MapReduce-based clustering for near-duplicate image identification

Abstract

Access this article

Similar content being viewed by others

Microsoft COCO: Common Objects in Context

Image segmentation evaluation: a survey of methods

A comprehensive survey of image segmentation: clustering methods, performance parameters, and benchmark datasets

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation