Abstract
Deduplication in databases is a significant current research problem as well as a practical one. It arises whenever a database stores apparently different records that actually refer to the same real-world entity (an object, an individual, etc.). In such cases, the goal is to identify and reconcile these records or to eliminate the duplicates. This work analyzes existing methodologies for detecting similarities and duplicates in data warehouses, implements algorithms for finding duplicates in the data warehouse developed for this work, and tests their effectiveness and efficiency on sample data stored in individual tables of the warehouse. The large volume of data generated by IoT devices is highly redundant and therefore consumes communication bandwidth and disk space. Correct deduplication of information is thus necessary to eliminate redundant data.
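The abstract does not name the specific algorithms that were implemented, so the following is only a minimal sketch of one widely used approach to approximate duplicate detection: comparing records with a normalized Levenshtein (edit) distance and flagging pairs above a similarity threshold. The function names, the sample records, and the threshold of 0.85 are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 minus the distance divided by the longer length.
    Case and surrounding whitespace are ignored (an illustrative choice)."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def find_duplicates(records: List[str], threshold: float = 0.85) -> List[Tuple[int, int]]:
    """Return index pairs whose similarity reaches the (assumed) threshold.
    Compares every pair, so the cost grows quadratically with the record count."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i], records[j]) >= threshold:
                pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    sample = ["Jan Kowalski", "Jan Kowalsky", "Anna Nowak"]  # hypothetical records
    print(find_duplicates(sample))
```

Because this sketch compares all pairs of records, it is practical only for small tables; on larger data, candidate pairs are usually narrowed down first (for example by grouping records on a shared key) before a string comparison of this kind is applied.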
Acknowledgments
We thank Andrzej Wilusz, a graduate student at Rzeszów University of Technology, for support in collecting useful information.
Funding
This work is financed by the Minister of Science and Higher Education of the Republic of Poland within the “Regional Initiative of Excellence” program for the years 2019–2022, project number 027/RID/2018/19, amount granted 11 999 900 PLN.
Copyright information
© 2020 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Dymora, P., Mazurek, M. (2020). Performance Assessment of Selected Techniques and Methods Detecting Duplicates in Data Warehouses. In: Zamojski, W., Mazurkiewicz, J., Sugier, J., Walkowiak, T., Kacprzyk, J. (eds) Theory and Applications of Dependable Computer Systems. DepCoS-RELCOMEX 2020. Advances in Intelligent Systems and Computing, vol 1173. Springer, Cham. https://doi.org/10.1007/978-3-030-48256-5_22
DOI: https://doi.org/10.1007/978-3-030-48256-5_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-48255-8
Online ISBN: 978-3-030-48256-5
eBook Packages: Intelligent Technologies and Robotics (R0)