A New Efficient Data Cleansing Method

Zhao, Li; Yuan, Sung Sam; Peng, Sun; Wang, Ling Tok

doi:10.1007/3-540-46146-9_48


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2453)

Included in the following conference series:

International Conference on Database and Expert Systems Applications


Abstract

One of the most important tasks in data cleansing is detecting and removing duplicate records, a task that consists of two main components: detection and comparison. A detection method decides which records will be compared, and a comparison method determines whether two compared records are duplicates. Comparisons account for a large share of data cleansing time. We find that if a comparison method satisfies certain properties, many unnecessary and expensive comparisons can be avoided. In this paper, we first propose a new comparison method, LCSS, based on the longest common subsequence, and show that it possesses the desired properties. We then propose two new detection methods, SNM-IN and SNM-INOUT, which are variants of the popular detection method SNM. A performance study on real and synthetic databases shows that integrating SNM-IN (SNM-INOUT) with LCSS saves about 39% (56%) of comparisons.
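To make the split between detection and comparison concrete, the following is a minimal sketch, not the paper's implementation. It pairs a plain sorted-neighborhood pass (the basic SNM idea of sorting records and comparing only those within a sliding window) with a similarity score derived from the longest common subsequence. The abstract does not define LCSS precisely, nor the properties that SNM-IN and SNM-INOUT exploit, so the normalization in `lcs_similarity`, the function names, and the `window` and `threshold` parameters are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (classic DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """Assumed normalization: LCS length divided by the longer string's length."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


def sorted_neighborhood(
    records: List[str],
    key: Callable[[str], str],
    window: int,
    threshold: float,
) -> Tuple[List[Tuple[int, int]], int]:
    """Basic SNM sketch: sort by key, then compare each record only with
    the window - 1 records that precede it in sorted order.

    Returns the matched index pairs and the number of comparisons made.
    """
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs: List[Tuple[int, int]] = []
    comparisons = 0
    for pos, i in enumerate(order):
        for j in order[max(0, pos - window + 1):pos]:
            comparisons += 1
            if lcs_similarity(records[i], records[j]) >= threshold:
                pairs.append((min(i, j), max(i, j)))
    return pairs, comparisons


if __name__ == "__main__":
    data = ["john smith", "jon smith", "mary jones", "john smyth"]
    matches, n = sorted_neighborhood(data, key=lambda r: r, window=2, threshold=0.75)
    print(matches, n)
```

The comparison count returned by `sorted_neighborhood` is the quantity the abstract reports savings on: any detection variant that can skip a pair without calling the comparison function lowers it.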



Author information

Authors and Affiliations

Dept. of Computer Science, National Univ. of Singapore, 3 Science Drive 2, 117543, Singapore
Li Zhao, Sung Sam Yuan, Sun Peng & Ling Tok Wang


Editor information

Editors and Affiliations

Université Paul Sabatier, IRIT, 118 route de Narbonne, 31062, Toulouse Cedex, France
Abdelkader Hameurlain
Département Informatique, Université Aix-Marseille II, IUT, 413 Avenue Gaston Berger, 13625, Aix-en-Provence Cedex 1, France
Rosine Cicchetti
Institute of Applied Computer Science, University of Linz, Altenbergerstr. 69, 4040, Linz, Austria
Roland Traunmüller


About this paper

Cite this paper

Zhao, L., Yuan, S.S., Peng, S., Wang, L.T. (2002). A New Efficient Data Cleansing Method. In: Hameurlain, A., Cicchetti, R., Traunmüller, R. (eds) Database and Expert Systems Applications. DEXA 2002. Lecture Notes in Computer Science, vol 2453. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46146-9_48


DOI: https://doi.org/10.1007/3-540-46146-9_48
Published: 20 August 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44126-7
Online ISBN: 978-3-540-46146-3
eBook Packages: Springer Book Archive
