A system proposal for automated data cleaning environment

Carlos Roberto Valêncio; Toni Jardini; Victor Hugo Penhalves Martins; Angelo Cesar Colombini; Marcio Zamboti Fortes

doi:10.5935/jetia.v6i25.685

Carlos Roberto Valêncio, São Paulo State University - UNESP. São José do Rio Preto–São Paulo, Brazil http://orcid.org/0000-0002-9325-3159
Toni Jardini, São Paulo State University - UNESP. São José do Rio Preto–São Paulo, Brazil http://orcid.org/0000-0002-0686-6791
Victor Hugo Penhalves Martins, São Paulo State University - UNESP. São José do Rio Preto–São Paulo, Brazil http://orcid.org/0000-0003-4507-0128
Angelo Cesar Colombini, Federal University of Fluminense - UFF. Niterói–Rio de Janeiro, Brazil http://orcid.org/0000-0002-8906-4128
Marcio Zamboti Fortes, Federal University of Fluminense - UFF. Niterói–Rio de Janeiro, Brazil http://orcid.org/0000-0003-4040-8126

Abstract

One of the great challenges in obtaining knowledge from data sources is ensuring the consistency and non-duplication of stored information. Many techniques have been proposed to reduce the cost of this work and to allow data to be analyzed and properly corrected. However, other aspects essential to the success of the data cleaning process, spanning several technological areas, remain open: performance, semantics, and autonomy of the process. Against this backdrop, we developed an automated, configurable data cleaning environment based on training and physical-semantic data similarity. It aims to provide a more efficient and extensible tool for correcting information, covering problems not yet explored, such as the semantics and autonomy of the cleaning process. One objective of the work is to reduce user interaction in analyzing and correcting data inconsistencies and duplications. With a properly calibrated environment, the tool covers approximately 90% of the inconsistencies in the database with a false-positive rate of 0%. Beyond detecting and treating inconsistencies and duplications in true-positive cases, the work also addresses detected false positives and the negative impact they may have on the data cleaning process, whether manual or automated, a topic not yet widely discussed in the literature. The most significant contribution is the developed tool, which, without user interaction, automatically analyzes and eliminates 90% of the inconsistencies and duplications contained in a database, with no false positives. Tests confirmed the effectiveness of the features developed for each module of the proposed architecture, and experiments in several scenarios demonstrated the effectiveness of the tool.
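
The abstract does not state which similarity measures or thresholds the environment uses. As a rough illustration only of how a calibrated similarity threshold can flag likely duplicates, the sketch below uses the ratio from Python's standard `difflib.SequenceMatcher` as a stand-in measure. The function names, the normalization step, and the default threshold of 0.9 are assumptions made for this example, not the authors' method.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Return a score in [0, 1] after trimming whitespace and lowercasing."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def find_duplicates(values, threshold=0.9):
    """Return index pairs (i, j), i < j, whose similarity reaches the threshold.

    The threshold is a hypothetical calibration knob: raising it reduces
    false positives at the cost of missing some true duplicates.
    """
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(values), 2):
        if similarity(a, b) >= threshold:
            pairs.append((i, j))
    return pairs


records = ["John Smith", "john smith ", "Mary Jones"]
print(find_duplicates(records))
```

This pairwise comparison is quadratic in the number of records, so it is suitable only for small inputs; it is meant to show the role of the threshold, not to suggest how the proposed tool is built.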
