Cleanix: a Parallel Big Data Cleaning System

Authors:
Hongzhi Wang (Harbin Institute of Technology)
Mingda Li (Harbin Institute of Technology)
Yingyi Bu (University of California, Irvine)
Jianzhong Li (Harbin Institute of Technology)
Hong Gao (Harbin Institute of Technology)
Jiacheng Zhang (Tsinghua University)

ACM SIGMOD Record, Volume 44, Issue 4, December 2015, pp 35–40, https://doi.org/10.1145/2935694.2935702

Published: 09 May 2016

Abstract

For big data, data quality problems are more serious. A big data cleaning system requires scalability and the ability to handle mixed errors. Motivated by this, we develop Cleanix, a prototype system for cleaning relational Big Data. Cleanix takes data integrated from multiple data sources and cleans them on a shared-nothing machine cluster. The backend system is built on top of the Hyracks framework, an extensible and flexible data-parallel substrate. Cleanix supports various data cleaning tasks such as abnormal value detection and correction, incomplete data filling, de-duplication, and conflict resolution. In this paper, we present the organization, the data cleaning algorithms, and the design of Cleanix.
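To make the four task types named in the abstract concrete, here is a minimal single-process sketch in Python. It is not Cleanix's implementation and does not use Hyracks: the function names, the median-based outlier rule, the normalized-name duplicate key, and the majority-vote merge are all assumptions chosen for illustration only.

```python
from collections import Counter, defaultdict
from statistics import median


def detect_abnormal(rows, field, k=3.0):
    """Indices of rows whose value lies more than k MADs from the median.
    (Assumed rule; the paper's detection method may differ.)"""
    vals = [r[field] for r in rows if r[field] is not None]
    med = median(vals)
    mad = median(abs(v - med) for v in vals)
    if mad == 0:
        return []  # no spread to judge against in this toy rule
    return [i for i, r in enumerate(rows)
            if r[field] is not None and abs(r[field] - med) > k * mad]


def correct_and_fill(rows, field, abnormal):
    """Replace abnormal and missing values with the median of the rest."""
    bad = set(abnormal)
    good = [r[field] for i, r in enumerate(rows)
            if r[field] is not None and i not in bad]
    fill = median(good)
    return [dict(r, **{field: fill}) if (r[field] is None or i in bad) else dict(r)
            for i, r in enumerate(rows)]


def deduplicate(rows, key):
    """Group rows that share the same normalized key value."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[key].strip().lower()].append(r)
    return list(groups.values())


def resolve(group):
    """Merge one duplicate group by majority vote on each attribute."""
    return {f: Counter(r[f] for r in group).most_common(1)[0][0]
            for f in group[0]}


def clean(rows):
    """Run the four steps in sequence on a small in-memory relation."""
    abnormal = detect_abnormal(rows, "age")
    repaired = correct_and_fill(rows, "age", abnormal)
    return [resolve(g) for g in deduplicate(repaired, "name")]


# Hypothetical records integrated from several sources.
records = [
    {"name": "Alice", "city": "Harbin", "age": 30},
    {"name": "alice ", "city": "Harbin", "age": 31},
    {"name": "ALICE", "city": "Beijing", "age": 30},
    {"name": "Bob", "city": "Irvine", "age": None},
    {"name": "bob", "city": "Irvine", "age": 420},
]
cleaned = clean(records)
```

In this toy run the age 420 is flagged and replaced, the missing age is filled, and the five input rows collapse into one record per person. A system running on a shared-nothing cluster would additionally have to partition the data so that candidate duplicates end up on the same machine, which this sketch does not attempt.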

Published in
ACM SIGMOD Record Volume 44, Issue 4, December 2015, 59 pages
ISSN: 0163-5808
DOI: 10.1145/2935694
Editors: Yanlei Diao (University of Massachusetts Amherst), Pablo Barceló (Universidad de Chile), Vanessa Braganholo (Universidade Federal Fluminense), Marco Brambilla (Politecnico di Milano), Chee Yong Chan (National University of Singapore), Rada Chirkova (North Carolina State University), Anastasios Kementsietsidis (Google Research), Olga Papaemmanouil (Brandeis University), Aditya Parameswaran (University of Illinois), Anish Das Sarma (Google Research), Alkis Simitsis (HP Labs), Nesime Tatbul (ETH Zurich), Marianne Winslett (University of Illinois), Jun Yang (Duke University)
Copyright © 2016 Authors
Publisher: Association for Computing Machinery, New York, NY, United States