Skip to main content

Performance Assessment of Selected Techniques and Methods Detecting Duplicates in Data Warehouses

  • Conference paper
  • First Online:

Part of the book series: Advances in Intelligent Systems and Computing ((AISC,volume 1173))

Abstract

A significant and current research problem, as well as a practical one, is the problem of deduplication in databases. The solution of this problem is applicable, e.g., in the context of the following situations in which are stored apparently different records, which actually refer to the same entity (objects, individuals, etc.) in the real world. In such cases, the purpose is to identify and reconcile such records or to eliminate duplication. The paper describes algorithms for finding duplicates and implements them in the developed data warehouse. Efficiency and effectiveness tests were also carried out for sample data contained in individual tables of the warehouse. The work aims to analyze the existing methodologies for detecting similarities and duplicates in data warehouses, to implement algorithms physically, and to test their effectiveness and efficiency. A large scale of data created by IoT devices leads to the consumption of communication bandwidth and disk space because the data is highly redundant. Therefore, correct deduplication of information is necessary to eliminate redundant data.

This is a preview of subscription content, log in via an institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD   129.00
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD   169.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

References

  1. Dymora, P., Mazurek, M.: Anomaly detection in IoT communication network based on spectral analysis and hurst exponent. Appl. Sci. 9(24), 5319 (2019). https://doi.org/10.3390/app9245319

    Article  Google Scholar 

  2. Yan, H., Li, X., Wang, Y., Jia, Ch.: Centralized duplicate removal video storage system with privacy preservation in IoT. Sensors 18(6), 1814 2018

    Article  Google Scholar 

  3. González-Serrano, L., Talón-Ballestero, P., Muñoz-Romero, S., Soguero-Ruiz, C., Rojo-Álvarez, J.L.: Entropic statistical description of big data quality in hotel customer relationship management. Entropy 21(4), 419 (2019)

    Article  Google Scholar 

  4. Bahmani, Z., Bertossi, L., Vasiloglou, N.: ERBlox: combining matching dependencies with machine learning for entity resolution. Int. J. Approx. Reason. 83, 118–141 (2017)

    Article  MathSciNet  Google Scholar 

  5. Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: a survey. IEEE Trans. Knowl. Data Eng. 19, 1–16 (2007)

    Article  Google Scholar 

  6. Pinto, F., Santos, M.F., Cortez, P., Quintela, H.: Data pre-processing for database marketing. In: Data Gadgets, Workshop: Malaga, Spain, pp. 76–84 (2004)

    Google Scholar 

  7. Saberi, M., Theobald, M., Hussain, O.K., Chang, E., Hussain, F.K.: Interactive feature selection for efficient customer recognition in contact centers: dealing with common names. Expert Syst. Appl. 113, 356–376 (2018)

    Article  Google Scholar 

  8. Papadakis, G., Svirsky, J., Gal, A., Palpanas, T.: Comparative analysis of approximate blocking techniques for entity resolution. Proc. VLDB Endow. 9, 684–695 (2016)

    Article  Google Scholar 

  9. Lin, M.J., Yang, C.Z., Lee, C.Y., Chen, C.C.: Enhancements for duplication detection in bug reports with manifold correlation features. J. Syst. Softw. 121, 223–233 (2016)

    Article  Google Scholar 

  10. Adil, S.H., Ebrahim, M., Ali, S.S.A., Raza, K.: Performance analysis of duplicate record detection techniques. Eng. Technol. Appl. Sci. Res. 9, 4755–4758 (2019)

    Google Scholar 

  11. Shah, Y.A., Zade, S.S., Raut, S.M., Shirbhate, S.P., Khadse, V.U., Date, A.P.: A survey on data extraction and data duplication detection. Int. J. Recent Innovation Trends Comput. Commun. 6(5), 77–82 (2018)

    Google Scholar 

  12. Guo, L., Wang, W., Chen, F., Tangi, X., Wang, W.: A similar duplicate data detection method based on fuzzy clustering for topology formation. Przegląd Elektrotechniczny (Electr. Rev.) 88(1), 26–30 (2012). ISSN 0033-2097, R. 88 NR 1b/2012

    Google Scholar 

  13. Yujian, L., Bo, L.: A normalized Levenshtein distance metric. IEEE Trans. Pattern Anal. Mach. Intell. 29, 1091–1095 (2007)

    Article  Google Scholar 

  14. Babar, N.: https://dzone.com/articles/the-levenshtein-algorithm-1?source=post_page. Accessed 14 Dec 2019

  15. Wang, Y., Qin, J., Wang, W.: Efficient approximate entity matching using Jaro-Winkler distance. In: Bouguettaya, A., et al. (eds.) Web Information Systems Engineering – WISE 2017, WISE 2017. Lecture Notes in Computer Science, vol. 10569. Springer, Cham (2017)

    Google Scholar 

  16. Pandya, S.D., Virparia, P.V.: Context free data cleaning and its application in mechanism for suggestive data cleaning. Int. J. Inf. Sci. 1(1), 32–35 (2011). https://doi.org/10.5923/j.ijis.20110101.05

    Article  Google Scholar 

  17. Angeles, M.P., Espino-Gamez, A., Gil-Moncada, J.: Comparison of a Modified Spanish phonetic, Soundex, and Phonex coding functions during data matching process. In: Conference Paper, June 2015. https://doi.org/10.1109/iciev.2015.7334028

  18. Mandal, A.K., Hossain, M.D., Nadim, M.: Developing an efficient search suggestion generator, ignoring spelling error for high speed data retrieval using Double Metaphone Algorithm. In: Proceedings of 13th International Conference on Computer and Information Technology (ICCIT 2010) (2010). https://doi.org/10.1109/iccitechn.2010.5723876

  19. Uddin, M.P., et. al.: High speed data retrieval from National Data Center (NDC) reducing time and ignoring spelling error in search key based on double Metaphone algorithm. Int. J. Comput. Sci. Eng. Appl. (IJCSEA) 3(6) (2013). https://doi.org/10.5121/ijcsea.2013.3601

    Article  Google Scholar 

Download references

Acknowledgments

We are thankful to the graduate student Andrzej Wilusz of Rzeszów University of Technology, for supporting us in the collection of useful information.

Funding

This work is financed by the Minister of Science and Higher Education of the Republic of Poland within the “Regional Initiative of Excellence” program for years 2019–2022. Project number 027/RID/2018/19, the amount granted 11 999 900 PLN.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Mirosław Mazurek .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2020 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Dymora, P., Mazurek, M. (2020). Performance Assessment of Selected Techniques and Methods Detecting Duplicates in Data Warehouses. In: Zamojski, W., Mazurkiewicz, J., Sugier, J., Walkowiak, T., Kacprzyk, J. (eds) Theory and Applications of Dependable Computer Systems. DepCoS-RELCOMEX 2020. Advances in Intelligent Systems and Computing, vol 1173. Springer, Cham. https://doi.org/10.1007/978-3-030-48256-5_22

Download citation

Publish with us

Policies and ethics