Performance Assessment of Selected Techniques and Methods Detecting Duplicates in Data Warehouses

Dymora, Paweł; Mazurek, Mirosław

doi:10.1007/978-3-030-48256-5_22


Conference paper
First Online: 22 May 2020

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1173)

Abstract

Deduplication in databases is a significant and current problem, both in research and in practice. It arises whenever a database stores apparently different records that actually refer to the same real-world entity (an object, an individual, etc.). In such cases, the goal is to identify and reconcile these records or to eliminate the duplicates. The paper describes algorithms for finding duplicates and implements them in a data warehouse developed for this purpose. Efficiency and effectiveness tests were also carried out on sample data held in individual tables of the warehouse. The work aims to analyze existing methodologies for detecting similarities and duplicates in data warehouses, to implement the algorithms, and to test their effectiveness and efficiency. The large volume of data generated by IoT devices consumes communication bandwidth and disk space because the data is highly redundant. Correct deduplication is therefore necessary to eliminate redundant data.
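As an illustration of what "apparently different records that refer to the same entity" can mean in practice, here is a minimal sketch of approximate duplicate detection. The abstract does not name the algorithms the paper evaluates, so the use of a normalized edit (Levenshtein) distance is an assumption made for this example only. The function names, the 0.85 threshold, and the sample customer records are hypothetical.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical after trimming and lowercasing."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def find_duplicates(records, threshold=0.85):
    """Return (id_a, id_b, score) for every pair scoring at or above threshold.

    `records` is a list of (id, text) tuples; the threshold is an assumed value.
    """
    pairs = []
    for (id_a, text_a), (id_b, text_b) in combinations(records, 2):
        score = similarity(text_a, text_b)
        if score >= threshold:
            pairs.append((id_a, id_b, round(score, 3)))
    return pairs


if __name__ == "__main__":
    # Hypothetical sample rows, not data from the paper.
    customers = [
        (1, "Jan Kowalski"),
        (2, "Jan Kowalsky"),
        (3, "Anna Nowak"),
    ]
    print(find_duplicates(customers))
```

Comparing every pair of records costs quadratic time in the number of rows, which is one reason efficiency matters alongside effectiveness when such checks run against warehouse tables.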

Acknowledgments

We thank Andrzej Wilusz, a graduate student at Rzeszów University of Technology, for supporting us in collecting useful information.

Funding

This work is financed by the Minister of Science and Higher Education of the Republic of Poland within the “Regional Initiative of Excellence” program for the years 2019–2022, project number 027/RID/2018/19; the amount granted is 11 999 900 PLN.

Author information

Authors and Affiliations

Faculty of Electrical and Computer Engineering, Rzeszów University of Technology, al. Powstańców Warszawy 12, 35-959, Rzeszów, Poland
Paweł Dymora & Mirosław Mazurek

Corresponding author

Correspondence to Mirosław Mazurek.

Editor information

Editors and Affiliations

Wrocław University of Science and Technology, Wrocław, Poland
Wojciech Zamojski
Wrocław University of Science and Technology, Wrocław, Poland
Jacek Mazurkiewicz
Wrocław University of Science and Technology, Wrocław, Poland
Jarosław Sugier
Wrocław University of Science and Technology, Wrocław, Poland
Tomasz Walkowiak
Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Janusz Kacprzyk

About this paper

Cite this paper

Dymora, P., Mazurek, M. (2020). Performance Assessment of Selected Techniques and Methods Detecting Duplicates in Data Warehouses. In: Zamojski, W., Mazurkiewicz, J., Sugier, J., Walkowiak, T., Kacprzyk, J. (eds) Theory and Applications of Dependable Computer Systems. DepCoS-RELCOMEX 2020. Advances in Intelligent Systems and Computing, vol 1173. Springer, Cham. https://doi.org/10.1007/978-3-030-48256-5_22

DOI: https://doi.org/10.1007/978-3-030-48256-5_22
Published: 22 May 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-48255-8
Online ISBN: 978-3-030-48256-5
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)
