Cleanix: a Parallel Big Data Cleaning System

Published: 09 May 2016

Abstract

For big data, data quality problems are more serious. A big data cleaning system must be scalable and able to handle mixed errors. Motivated by this, we develop Cleanix, a prototype system for cleaning relational big data. Cleanix takes data integrated from multiple data sources and cleans them on a shared-nothing machine cluster. The backend system is built on top of the Hyracks framework, an extensible and flexible data-parallel substrate. Cleanix supports various data cleaning tasks, such as abnormal value detection and correction, incomplete data filling, de-duplication, and conflict resolution. In this paper, we present the organization, the data cleaning algorithms, and the design of Cleanix.
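To make the four task types named in the abstract concrete, the following is a minimal single-process sketch on a toy table. It is not Cleanix's algorithm and does not use Hyracks; the function names, the sample records, the age range of 0 to 120, the median fill, and the majority-vote rule are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict
from statistics import median

# Hypothetical sample: three spellings of one person plus two other records.
SAMPLE = [
    {"name": "Alice Smith", "city": "Boston", "age": 34},
    {"name": "alice smith ", "city": "Boston", "age": 340},
    {"name": "Alice Smith", "city": "Bostn", "age": None},
    {"name": "Bob Lee", "city": "Austin", "age": None},
    {"name": "Carol Wu", "city": "Denver", "age": 52},
]


def null_out_of_range(rows, field, low, high):
    """Abnormal value detection: mark values outside [low, high] as missing."""
    cleaned = []
    for row in rows:
        value = row[field]
        if value is not None and not (low <= value <= high):
            row = {**row, field: None}
        cleaned.append(row)
    return cleaned


def merge_duplicates(rows, key_field):
    """De-duplication and conflict resolution (assumed rule: majority vote).

    Rows whose key matches after lowercasing and whitespace normalization
    are treated as the same entity; each field takes its most common
    non-missing value within the group.
    """
    groups = defaultdict(list)
    for row in rows:
        key = " ".join(row[key_field].lower().split())
        groups[key].append(row)
    merged = []
    for group in groups.values():
        record = {}
        for field in group[0]:
            votes = Counter(r[field] for r in group if r[field] is not None)
            record[field] = votes.most_common(1)[0][0] if votes else None
        merged.append(record)
    return merged


def fill_missing(rows, field):
    """Incomplete data filling (assumed rule: median of the known values)."""
    known = [r[field] for r in rows if r[field] is not None]
    if not known:
        return rows
    fill = median(known)
    return [{**r, field: fill} if r[field] is None else r for r in rows]


def clean(rows):
    """Run the three steps in sequence on an in-memory list of records."""
    rows = null_out_of_range(rows, "age", 0, 120)
    rows = merge_duplicates(rows, "name")
    return fill_missing(rows, "age")
```

Calling `clean(SAMPLE)` merges the three Alice Smith rows into one record and fills the missing age for Bob Lee. A shared-nothing system such as the one the abstract describes would instead partition the records across machines, which this sketch does not attempt.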
