COSSET+: Crowdsourced Missing Value Imputation Optimized by Knowledge Base

Wang, Hong-Zhi; Qi, Zhi-Xin; Shi, Ruo-Xi; Li, Jian-Zhong; Gao, Hong

doi:10.1007/s11390-017-1768-1

Regular Paper
Published: 20 September 2017

Volume 32, pages 845–857, (2017)
Journal of Computer Science and Technology

Hong-Zhi Wang¹, Zhi-Xin Qi¹, Ruo-Xi Shi¹, Jian-Zhong Li¹ & Hong Gao¹

Abstract

Crowdsourced missing value imputation is a recent data-cleaning technique for capturing missing values that automatic approaches can hardly fill. However, crowdsourcing incurs high time cost and overhead, so the cost must be reduced while the accuracy of crowdsourced imputation is guaranteed. To this end, we present COSSET+, a crowdsourcing framework optimized by a knowledge base. It combines the advantages of a knowledge-based filter and a crowdsourcing platform to capture missing values. Since the number of crowd values affects the cost of COSSET+, we select only part of the missing values to be crowdsourced. We prove that the crowd value selection problem is NP-hard and develop an approximation algorithm for it. Extensive experimental results demonstrate the efficiency and effectiveness of the proposed approaches.
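
The abstract does not state how the crowd value selection problem is formulated, so the following is only an illustrative sketch and not the algorithm of the paper. It assumes a set-cover-like setting: each missing value, once answered by the crowd, lets a knowledge-based filter fill a (hypothetical) set of other missing values, and the goal is to crowdsource as few values as possible. The function name `greedy_crowd_selection` and the `infers` mapping are assumptions introduced for this example. The greedy rule shown is the standard heuristic for set cover, which is known to give a logarithmic approximation factor for that problem.

```python
from typing import Dict, Hashable, List, Set


def greedy_crowd_selection(infers: Dict[Hashable, Set[Hashable]]) -> List[Hashable]:
    """Pick missing values to crowdsource with a greedy set-cover heuristic.

    infers[v] is an ASSUMED input: the set of other missing values that
    could be filled automatically once v has been answered by the crowd.
    """
    # A crowdsourced value always resolves itself as well.
    coverage = {v: set(others) | {v} for v, others in infers.items()}

    # Everything that still needs a value.
    uncovered: Set[Hashable] = set()
    for covered in coverage.values():
        uncovered |= covered

    chosen: List[Hashable] = []
    while uncovered:
        # Choose the candidate that resolves the most unresolved values.
        best = max(coverage, key=lambda v: len(coverage[v] & uncovered))
        chosen.append(best)
        uncovered -= coverage[best]
    return chosen


if __name__ == "__main__":
    # Hypothetical example: answering "a" also lets "b" and "c" be inferred.
    example = {"a": {"b", "c"}, "b": {"c"}, "c": set(), "d": set()}
    print(greedy_crowd_selection(example))
```

In this toy input only two of the four missing values would be sent to the crowd; the remaining two are assumed to be inferable by the filter.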

Author information

Authors and Affiliations

School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, China
Hong-Zhi Wang, Zhi-Xin Qi, Ruo-Xi Shi, Jian-Zhong Li & Hong Gao

Corresponding author

Correspondence to Hong-Zhi Wang.

About this article

Cite this article

Wang, HZ., Qi, ZX., Shi, RX. et al. COSSET+: Crowdsourced Missing Value Imputation Optimized by Knowledge Base. J. Comput. Sci. Technol. 32, 845–857 (2017). https://doi.org/10.1007/s11390-017-1768-1

Received: 01 April 2017
Revised: 15 August 2017
Published: 20 September 2017
Issue Date: September 2017
DOI: https://doi.org/10.1007/s11390-017-1768-1
