A Novel Approach for Detecting Near-Duplicate Web Documents by Considering Images, Text, Size of the Document and Domain

  • Conference paper
ICCCE 2020

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 698)

Abstract

Web mining is the branch of data mining that deals with the enormous amount of data on the web. Search engines face serious problems because near-duplicate documents on the web lead to irrelevant answers, and the presence of these documents critically affects search-engine performance and reliability. Two approaches to detecting near-duplicate web documents are found in the literature: the former considers the domain and size of the document, and the latter considers text and images as the search parameters. This article proposes a novel approach that combines the text, images, size, and domain of the document to detect near-duplicate documents. The approach extracts the keywords and images of the crawled document and compares them with those of existing documents to compute a similarity measure. If the similarity score is less than 19.5 and the image comparison value is greater than 70%, the document is detected as a near duplicate.
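The decision rule stated at the end of the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how the similarity score or the image comparison value is computed, so both are taken as precomputed inputs. The names `Comparison`, `is_near_duplicate`, and the two constants are hypothetical; only the thresholds 19.5 and 70% come from the abstract.

```python
from dataclasses import dataclass

# Thresholds quoted in the abstract; the constant names are illustrative.
SIMILARITY_SCORE_MAX = 19.5      # text similarity score must be below this
IMAGE_MATCH_MIN_PERCENT = 70.0   # image comparison value must exceed this


@dataclass(frozen=True)
class Comparison:
    """Precomputed comparison of a crawled document against a stored one.

    How these two values are derived is not described in the abstract,
    so this sketch treats them as given (an assumption).
    """
    similarity_score: float      # lower means closer, per the "less than" rule
    image_match_percent: float   # 0 to 100


def is_near_duplicate(c: Comparison) -> bool:
    """Apply the two-threshold rule: both conditions must hold."""
    return (c.similarity_score < SIMILARITY_SCORE_MAX
            and c.image_match_percent > IMAGE_MATCH_MIN_PERCENT)


if __name__ == "__main__":
    print(is_near_duplicate(Comparison(12.0, 85.0)))  # both conditions hold
    print(is_near_duplicate(Comparison(12.0, 60.0)))  # image value too low
```

Both inequalities are strict, mirroring the wording "less than" and "greater than", so a pair sitting exactly on a threshold is not flagged.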



Author information

Corresponding author

Correspondence to M. Bhavani.


Copyright information

© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Bhavani, M., Narayana, V.A., Sreevani, G. (2021). A Novel Approach for Detecting Near-Duplicate Web Documents by Considering Images, Text, Size of the Document and Domain. In: Kumar, A., Mozar, S. (eds) ICCCE 2020. Lecture Notes in Electrical Engineering, vol 698. Springer, Singapore. https://doi.org/10.1007/978-981-15-7961-5_123


  • DOI: https://doi.org/10.1007/978-981-15-7961-5_123


  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-15-7960-8

  • Online ISBN: 978-981-15-7961-5

  • eBook Packages: Engineering, Engineering (R0)
