ABSTRACT
The TREC 2012 Crowdsourcing track asked participants to crowdsource relevance assessments, with the goal of replicating costly expert judgments using relatively fast, inexpensive, but less reliable judgments from anonymous online workers. The track used 10 "ad hoc" queries, which are highly specific and complex compared to typical web search queries. The crowdsourced assessments were evaluated against expert judgments made by highly trained and capable human analysts in 1999 as part of the ad hoc track collection construction. Since most crowdsourcing approaches submitted to the TREC 2012 track produced assessment sets that fell far short of the expert judgments, we analyze the crowdsourcing mistakes made on this task using data we collected via Amazon's Mechanical Turk service. We investigate two types of crowdsourcing approaches: one that asks for nominal relevance grades for each document, and another that asks for preferences on many (but not all) pairs of documents.
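The two assessment styles differ in how worker input is collected and combined. The sketch below is only an illustrative assumption, not the track's or the authors' actual consensus method: it contrasts aggregating nominal grades by simple majority vote with deriving a rough ranking from pairwise preference win counts; all document identifiers and votes are made up.

```python
# Minimal sketch (assumed aggregation rules, not the paper's method) contrasting
# the two crowd assessment styles: nominal grades per document vs. pairwise
# preferences over a subset of document pairs.
from collections import Counter, defaultdict

# Nominal grades: several workers assign a label to each document.
grade_votes = {
    "doc1": ["relevant", "relevant", "non-relevant"],
    "doc2": ["non-relevant", "non-relevant", "relevant"],
}

def majority_grade(votes):
    """Aggregate nominal grades by simple majority vote."""
    return Counter(votes).most_common(1)[0][0]

consensus = {doc: majority_grade(v) for doc, v in grade_votes.items()}
print(consensus)  # {'doc1': 'relevant', 'doc2': 'non-relevant'}

# Pairwise preferences: workers pick the more relevant document in a pair;
# only some pairs are judged. A crude ranking follows from win counts.
preference_votes = [("doc1", "doc2"), ("doc1", "doc3"), ("doc3", "doc2")]

wins = defaultdict(int)
for winner, loser in preference_votes:
    wins[winner] += 1
    wins[loser] += 0  # ensure losing documents appear in the tally

ranking = sorted(wins, key=wins.get, reverse=True)
print(ranking)  # ['doc1', 'doc3', 'doc2']
```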