Abstract
For Big Data, data quality problems are more serious. A Big Data cleaning system must be scalable and able to handle mixed errors. Motivated by this, we develop Cleanix, a prototype system for cleaning relational Big Data. Cleanix takes data integrated from multiple data sources and cleans them on a shared-nothing machine cluster. The backend system is built on top of Hyracks, an extensible and flexible data-parallel substrate. Cleanix supports various data cleaning tasks, such as abnormal value detection and correction, incomplete data filling, de-duplication, and conflict resolution. In this paper, we present the organization, data cleaning algorithms, and design of Cleanix.