Growing interest in online collections of digital books and video content motivates the development and optimization of suitable retrieval systems. However, traditional methods for collecting relevance assessments to tune system performance are challenged by the nature of the digital items in such collections: assessors face considerable effort to review and assess content through extensive reading, browsing, and within-document searching. The extra strain is caused by the length and cohesion of each digital item and the dispersion of topics within it. We propose a method for the collective gathering of relevance assessments that uses a social game model to stimulate participants' engagement. The game provides incentives for assessors to follow a predefined review procedure and includes provisions for quality control of the collected relevance judgments. We discuss the approach in detail and present the results of a pilot study conducted on a book corpus to validate it. Our analysis reveals intricate relationships between the affordances of the system, the incentives of the social game, and the behavior of the assessors. We show that the proposed game design achieves its two design goals: the incentive structure motivates assessors to persevere, and the review process encourages truthful assessment.
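
Since the abstract mentions quality control of the collected judgments only at a high level, the following is a minimal sketch of one common way such control is implemented for crowdsourced or game-based relevance assessment: scoring each assessor against a small set of gold items with known labels and keeping only those who meet an accuracy threshold. The function names, data layout, and 0.7 cutoff are illustrative assumptions, not the mechanism used in the paper.

    # A minimal sketch, assuming a gold-item accuracy check as the quality-control
    # step; all names, the data layout, and the 0.7 threshold are assumptions.
    from collections import defaultdict

    def assessor_accuracy(judgments, gold):
        """judgments: iterable of (assessor, item, label); gold: dict item -> known label."""
        hits = defaultdict(int)
        totals = defaultdict(int)
        for assessor, item, label in judgments:
            if item in gold:                      # only gold items are scored
                totals[assessor] += 1
                hits[assessor] += int(label == gold[item])
        return {a: hits[a] / totals[a] for a in totals}

    def trusted_assessors(judgments, gold, threshold=0.7):
        """Keep assessors whose accuracy on the gold items meets the threshold."""
        return {a for a, acc in assessor_accuracy(judgments, gold).items() if acc >= threshold}

    # Example usage with made-up judgments on two gold book pages:
    gold = {"book12:page3": "relevant", "book40:page9": "not_relevant"}
    judgments = [
        ("alice", "book12:page3", "relevant"),
        ("alice", "book40:page9", "not_relevant"),
        ("bob", "book12:page3", "not_relevant"),
        ("bob", "book40:page9", "not_relevant"),
    ]
    print(trusted_assessors(judgments, gold))     # alice (1.0) passes; bob (0.5) does not

In a game setting, the same accuracy signal could also feed the incentive structure (e.g., awarding points only for judgments from trusted assessors), but how the paper combines the two is not specified in the abstract.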