Automatic Web-based relational data imputation

Liu, Hailong; Li, Zhanhuai; Chen, Qun; Chen, Zhaoqiang

doi:10.1007/s11704-016-6319-3

Research Article
Published: 07 February 2018

Volume 12, pages 1125–1139, (2018)
Frontiers of Computer Science

Hailong Liu¹, Zhanhuai Li¹, Qun Chen¹ & Zhaoqiang Chen¹

Abstract

Data incompleteness is one of the most important data quality problems in enterprise information systems. Most existing data imputation techniques only deduce approximate values for incomplete attributes, using specific data quality rules or mathematical methods. Unfortunately, such approximations may be far from the true values, and these techniques work poorly when the observed data is inadequate. The World Wide Web (WWW) has become the most important and most widely used information source, and several recent works have shown that Web data can improve the quality of databases. In this paper, we propose a Web-based relational data imputation framework that tries to automatically retrieve real values from the WWW for the incomplete attributes. We take full advantage of the relations among different kinds of objects, based on the idea that things of the same kind must have the same kinds of relations with their relatives in a specific world. Our proposed techniques consist of two automatic query formulation algorithms and one graph-based candidate extraction model. We evaluate the approaches on two high-quality real datasets and one poor-quality real dataset to demonstrate their effectiveness.

Acknowledgments

The authors would like to thank the anonymous referees for their valuable comments and the recommendation of ICYCSEE 2016. The work was supported by the Ministry of Science and Technology of China, National Key Research and Development Program (2016YFB1000700), the National Natural Science Foundation of China (Grant Nos. 61502390, 61472321, 61402370, 61502392), and the Basic Research Fund of Northwestern Polytechnical University (3102015JSJ0004, 3102014JSJ0013, 3102014JSJ0005).

Author information

Authors and Affiliations

School of Computer Science and Technology, Northwestern Polytechnical University, Xi’an, 710072, China
Hailong Liu, Zhanhuai Li, Qun Chen & Zhaoqiang Chen

Corresponding author

Correspondence to Hailong Liu.

Additional information

Hailong Liu is a lecturer at the School of Computer Science and Technology, Northwestern Polytechnical University (NPU), China. He received his MS and PhD degrees from NPU. He is a member of the China Computer Federation and ACM SIGMOD China. His research interests include data management and data quality.

Zhanhuai Li is a professor at the School of Computer Science and Technology, Northwestern Polytechnical University (NPU), China. He is the vice-chairman of the Database Technical Committee of the China Computer Federation. He received his MS and PhD degrees from NPU. His research interests include data management and data quality.

Qun Chen is currently a professor at the School of Computer Science and Technology, Northwestern Polytechnical University, China. He received his PhD degree from the National University of Singapore, Singapore. Between 2004 and 2006, he was a research associate at the Hong Kong University of Science and Technology, China. His research interests include data management and data quality.

Zhaoqiang Chen is a PhD candidate at the School of Computer Science and Technology, Northwestern Polytechnical University (NPU), China. He received his MS degree from NPU. His research interests include data management and data quality.

Electronic supplementary material

Supplementary material, approximately 304 KB.

About this article

Cite this article

Liu, H., Li, Z., Chen, Q. et al. Automatic Web-based relational data imputation. Front. Comput. Sci. 12, 1125–1139 (2018). https://doi.org/10.1007/s11704-016-6319-3

Received: 16 June 2016
Accepted: 08 December 2016
Published: 07 February 2018
Issue Date: December 2018
DOI: https://doi.org/10.1007/s11704-016-6319-3
