On single-pass indexing with MapReduce

Authors:
Richard M. C. McCreadie;Craig Macdonald;Iadh Ounis
Affiliations:
University of Glasgow, Glasgow, Scotland, UK;University of Glasgow, Glasgow, Scotland, UK;University of Glasgow, Glasgow, Scotland, UK
Venue:
Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Year:
2009

Citing 4

Efficient single-pass index construction for text databases
Journal of the American Society for Information Science and Technology

MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6

Dryad: distributed data-parallel programs from sequential building blocks
Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007

Validity of the single processor approach to achieving large scale computing capabilities
AFIPS '67 (Spring) Proceedings of the April 18-20, 1967, spring joint computer conference

Cited 6

Max-cover in map-reduce
Proceedings of the 19th international conference on World wide web

Scalable, statistical storage allocation for extensible inverted file construction
Journal of Systems and Software

An innovative approach to the development of E-government search services
EGOVIS'11 Proceedings of the Second international conference on Electronic government and the information systems perspective

MapReduce indexing strategies: Studying scalability and efficiency
Information Processing and Management: an International Journal

Computing scientometrics in large-scale academic search engines with mapreduce
WISE'12 Proceedings of the 13th international conference on Web Information Systems Engineering

Indexing Word Sequences for Ranked Retrieval
ACM Transactions on Information Systems (TOIS)

Abstract

Indexing is an important Information Retrieval (IR) operation, which must be parallelised to support large-scale document corpora. We propose a novel adaptation of the state-of-the-art single-pass indexing algorithm in terms of the MapReduce programming model. We then experiment with this adaptation, in the context of the Hadoop MapReduce implementation. In particular, we explore the scale of improvements that can be achieved when using firstly more processing hardware and secondly larger corpora. Our results show that indexing speed increases in a close to linear fashion when scaling corpus size or number of processing machines. This suggests that the proposed indexing implementation is viable to support upcoming large-scale corpora.
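
The general idea of expressing single-pass index construction as map and reduce steps can be illustrated with a small in-memory sketch. This is not the authors' Hadoop implementation: the function names (map_split, shuffle, reduce_term, build_index), the whitespace tokeniser, and the (doc_id, term frequency) posting format are all assumptions made for illustration. Each map call reads one input split once and emits partial posting lists; the reduce step merges the partial lists for each term.

```python
from collections import Counter, defaultdict


def map_split(docs):
    """Map step (hypothetical): one pass over a split of (doc_id, text) pairs.

    Builds partial posting lists in memory and emits (term, postings) pairs,
    where postings is a list of (doc_id, term_frequency) tuples.
    """
    partial = defaultdict(list)
    for doc_id, text in docs:
        # Assumed tokeniser: lowercase and split on whitespace.
        for term, tf in Counter(text.lower().split()).items():
            partial[term].append((doc_id, tf))
    return list(partial.items())


def shuffle(map_outputs):
    """Group the partial posting lists by term, as a MapReduce framework would."""
    grouped = defaultdict(list)
    for output in map_outputs:
        for term, postings in output:
            grouped[term].append(postings)
    return grouped


def reduce_term(term, posting_lists):
    """Reduce step (hypothetical): merge partial lists into one list sorted by doc_id."""
    merged = sorted(p for postings in posting_lists for p in postings)
    return term, merged


def build_index(splits):
    """Run map, shuffle, and reduce sequentially over all splits."""
    grouped = shuffle(map_split(split) for split in splits)
    return dict(reduce_term(term, lists) for term, lists in grouped.items())


# Toy corpus: two splits, as if handled by two map tasks.
splits = [
    [(1, "map reduce index"), (2, "index index build")],
    [(3, "reduce build")],
]
index = build_index(splits)
```

In this sketch the splits are processed one after another; in a real MapReduce deployment each split would be handled by a separate map task, and the terms would be partitioned across reduce tasks.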