MapReduce for information retrieval evaluation: "let's quickly test this on 12 TB of data"

Authors:
Djoerd Hiemstra; Claudia Hauff
Affiliations:
University of Twente, The Netherlands; University of Twente, The Netherlands
Venue:
CLEF'10 Proceedings of the 2010 international conference on Multilingual and multimodal information access evaluation: cross-language evaluation forum
Year:
2010

Abstract

We propose using MapReduce to quickly test new retrieval approaches on a cluster of machines by sequentially scanning all documents. We present a small case study in which a cluster of 15 low-cost machines searches a web crawl of 0.5 billion pages, showing that sequential scanning is a viable approach to running large-scale information retrieval experiments with little effort. The code is available to other researchers at: http://mirex.sourceforge.net
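
To make the idea of sequential scanning concrete, the following is a minimal single-machine sketch in Python. It is not the authors' code, which is published at the URL above, and it does not use Hadoop. The toy documents, the queries, the term-counting `score` function, and the names `map_doc`, `reduce_query` and `run` are all assumptions made for illustration. The map step scans every document against every query without an index, and the reduce step keeps the top-k documents per query.

```python
import heapq
from collections import defaultdict

# Hypothetical toy data; the case study in the abstract used a web crawl of 0.5 billion pages.
DOCS = {
    "d1": "mapreduce makes large scale data processing simple",
    "d2": "information retrieval evaluation with test collections",
    "d3": "sequential scanning of documents for retrieval experiments",
}
QUERIES = {"q1": "retrieval evaluation", "q2": "mapreduce processing"}


def score(query, text):
    """Illustrative scoring (an assumption): count query-term occurrences."""
    terms = text.split()
    return sum(terms.count(t) for t in query.split())


def map_doc(doc_id, text, queries):
    """Map step: scan one document, emit (query_id, (score, doc_id)) for matches."""
    for qid, query in queries.items():
        s = score(query, text)
        if s > 0:
            yield qid, (s, doc_id)


def reduce_query(qid, scored_docs, k=2):
    """Reduce step: keep the k highest-scoring documents for one query."""
    return qid, heapq.nlargest(k, scored_docs)


def run(docs, queries, k=2):
    """Simulate the shuffle in memory: group map output by query id, then reduce."""
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for qid, pair in map_doc(doc_id, text, queries):
            grouped[qid].append(pair)
    return dict(reduce_query(qid, pairs, k) for qid, pairs in grouped.items())


RESULTS = run(DOCS, QUERIES)
print(RESULTS)
```

Trying a different retrieval approach in this sketch only means replacing `score`; no index has to be rebuilt, which is the appeal of scanning when the goal is a quick experiment. On a real cluster the grouping done in `run` would be handled by the MapReduce framework's shuffle phase.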