DOI: 10.1145/2213836.2213878

CrowdScreen: algorithms for filtering data with humans

Published: 20 May 2012

ABSTRACT

Given a large set of data items, we consider the problem of filtering them based on a set of properties that can be verified by humans. This problem is commonplace in crowdsourcing applications, and yet, to our knowledge, no one has considered the formal optimization of this problem. (Typical solutions use heuristics to solve the problem.) We formally state a few different variants of this problem. We develop deterministic and probabilistic algorithms to optimize the expected cost (i.e., number of questions) and expected error. We experimentally show that our algorithms provide definite gains with respect to other strategies. Our algorithms can be applied in a variety of crowdsourcing scenarios and can form an integral part of any query processor that uses human computation.
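To make the cost/error trade-off concrete, below is a minimal sketch (not the paper's optimization algorithm) of one fixed filtering strategy: keep asking humans about an item until one answer leads by a margin or a per-item question budget is exhausted, then estimate the expected cost and error by simulation. The selectivity, error rate, budget, and threshold values are illustrative assumptions, not figures from the paper.

```python
# Illustrative sketch only: evaluate a fixed margin-based filtering strategy
# by simulation. Assumptions (not from the abstract): each item truly passes
# the filter with probability SELECTIVITY, and each human answer is wrong
# independently with probability ERROR_RATE.

import random

SELECTIVITY = 0.3    # assumed prior that an item satisfies the filter
ERROR_RATE = 0.2     # assumed per-answer human error probability
MAX_QUESTIONS = 7    # question budget per item
THRESHOLD = 3        # decide once one side leads by this margin

def ask_human(truth: bool) -> bool:
    """Simulate one noisy human answer about whether the item passes."""
    return truth if random.random() > ERROR_RATE else not truth

def filter_item(truth: bool):
    """Ask humans until the vote margin is reached or the budget runs out."""
    yes = no = 0
    while yes + no < MAX_QUESTIONS:
        if ask_human(truth):
            yes += 1
        else:
            no += 1
        if yes - no >= THRESHOLD:
            return True, yes + no     # decide Pass
        if no - yes >= THRESHOLD:
            return False, yes + no    # decide Fail
    return yes > no, yes + no         # budget exhausted: majority vote

def evaluate(trials: int = 100_000):
    """Estimate expected cost (questions per item) and expected error."""
    cost = errors = 0
    for _ in range(trials):
        truth = random.random() < SELECTIVITY
        decision, asked = filter_item(truth)
        cost += asked
        errors += (decision != truth)
    return cost / trials, errors / trials

if __name__ == "__main__":
    expected_cost, error = evaluate()
    print(f"expected cost ~ {expected_cost:.2f} questions, error ~ {error:.3f}")
```

A strategy like this trades extra questions for lower error by hand-tuning the margin and budget; the paper's contribution, as described in the abstract, is to optimize the stopping decisions directly so that expected cost and expected error are minimized rather than fixed heuristically.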

Published in

SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
May 2012, 886 pages
ISBN: 9781450312479
DOI: 10.1145/2213836
Copyright © 2012 ACM


Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 20 May 2012


        Qualifiers

        • research-article

        Acceptance Rates

SIGMOD '12 paper acceptance rate: 48 of 289 submissions, 17%. Overall acceptance rate: 785 of 4,003 submissions, 20%.
