COSSET+: Crowdsourced Missing Value Imputation Optimized by Knowledge Base

  • Regular Paper
Journal of Computer Science and Technology

Abstract

Crowdsourced missing value imputation is a recent data-cleaning method for filling in missing values that automatic approaches can hardly recover. However, crowdsourcing incurs high time cost and overhead, so the cost must be reduced while the accuracy of crowdsourced imputation is preserved. To this end, we present COSSET+, a crowdsourced framework optimized by a knowledge base. It combines the advantages of a knowledge-based filter and a crowdsourcing platform to capture missing values. Because the number of values sent to the crowd determines the cost of COSSET+, we select only a subset of the missing values to be crowdsourced. We prove that this crowd value selection problem is NP-hard and develop an approximation algorithm for it. Extensive experimental results demonstrate the efficiency and effectiveness of the proposed approaches.
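The abstract does not spell out the approximation algorithm, so the following is only an illustrative sketch under stated assumptions, not the method of COSSET+. It assumes the selection problem can be viewed as a covering problem: answering one missing cell through the crowd lets a knowledge-base step fill some other missing cells, and the goal is to resolve every cell with few crowd questions. The function name `greedy_crowd_selection` and the `infers` mapping are hypothetical names introduced here.

```python
from typing import Dict, Hashable, List, Set


def greedy_crowd_selection(infers: Dict[Hashable, Set[Hashable]]) -> List[Hashable]:
    """Greedy covering heuristic (an assumption, not the paper's algorithm).

    infers[c] is the hypothetical set of other missing cells that could be
    filled automatically once cell c has been answered by the crowd.
    Every cell is assumed to resolve at least itself when crowdsourced.
    """
    unresolved = set(infers)          # all missing cells still to be filled
    chosen: List[Hashable] = []       # cells to send to the crowd, in pick order

    while unresolved:
        # Pick the cell whose answer resolves the most still-unresolved cells.
        best = max(infers, key=lambda c: len((infers[c] | {c}) & unresolved))
        chosen.append(best)
        unresolved -= infers[best] | {best}

    return chosen


# Toy instance: asking about "a" also lets "b" and "c" be inferred.
example = {"a": {"b", "c"}, "b": {"c"}, "c": set(), "d": set()}
selected = greedy_crowd_selection(example)
print(selected)
```

On this toy instance the heuristic sends only "a" and "d" to the crowd, which shows how selecting a subset of the missing values can lower the crowdsourcing cost under the covering assumption.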

Author information

Corresponding author

Correspondence to Hong-Zhi Wang.

About this article

Cite this article

Wang, HZ., Qi, ZX., Shi, RX. et al. COSSET+: Crowdsourced Missing Value Imputation Optimized by Knowledge Base. J. Comput. Sci. Technol. 32, 845–857 (2017). https://doi.org/10.1007/s11390-017-1768-1

  • DOI: https://doi.org/10.1007/s11390-017-1768-1
