MapReduce indexing strategies: Studying scalability and efficiency

Authors:
Richard McCreadie;Craig Macdonald;Iadh Ounis
Affiliations:
Department of Computing Science, University of Glasgow, Glasgow G12 8QQ, United Kingdom;Department of Computing Science, University of Glasgow, Glasgow G12 8QQ, United Kingdom;Department of Computing Science, University of Glasgow, Glasgow G12 8QQ, United Kingdom
Venue:
Information Processing and Management: an International Journal
Year:
2012

Abstract

In Information Retrieval (IR), the efficient indexing of terabyte-scale and larger corpora is still a difficult problem. MapReduce has been proposed as a framework for distributing data-intensive operations across multiple processing machines. In this work, we provide a detailed analysis of four MapReduce indexing strategies of varying complexity. Moreover, we evaluate these indexing strategies by implementing them in an existing IR framework and performing experiments using the Hadoop MapReduce implementation in combination with several large standard TREC test corpora. In particular, we examine the efficiency of the indexing strategies and, for the most efficient strategy, how it scales with respect to corpus size and processing power. Our results attest to both the importance of minimising data transfer between machines for IO-intensive tasks like indexing and the suitability of the per-posting list MapReduce indexing strategy, in particular for indexing at terabyte scale. Hence, we conclude that MapReduce is a suitable framework for the deployment of large-scale indexing.
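
As a rough illustration of why emitting whole posting lists can reduce data transfer, the following minimal in-memory sketch assumes that "per-posting list" means each map task builds partial posting lists for its input split and emits one (term, partial list) pair per term, which the reducer then merges. This reading, the function names (`map_split`, `reduce_term`, `build_index`), the whitespace tokeniser, and the toy documents are all assumptions for illustration; this is not the authors' implementation and it does not use Hadoop.

```python
from collections import Counter, defaultdict


def map_split(docs):
    """Map step (assumed): index one input split in memory and emit
    one (term, partial posting list) pair per distinct term."""
    partial = defaultdict(list)
    for doc_id, text in docs:
        # Term frequencies for this document (naive whitespace tokeniser).
        for term, tf in Counter(text.lower().split()).items():
            partial[term].append((doc_id, tf))
    return list(partial.items())


def reduce_term(term, partial_lists):
    """Reduce step (assumed): merge the partial posting lists for one
    term into a single list ordered by document id."""
    merged = [posting for plist in partial_lists for posting in plist]
    merged.sort()
    return term, merged


def build_index(splits):
    """Simulate the shuffle locally: group map output by term, then reduce."""
    shuffled = defaultdict(list)
    for split in splits:
        for term, postings in map_split(split):
            shuffled[term].append(postings)
    return dict(reduce_term(t, pl) for t, pl in shuffled.items())


# Toy corpus: two input splits of (doc_id, text) pairs.
splits = [
    [(1, "map reduce index"), (2, "index index build")],
    [(3, "reduce build")],
]
index = build_index(splits)
```

In this sketch each map call emits one pair per distinct term in its split rather than one pair per token, so fewer records cross the simulated shuffle; in a real cluster that shuffle is the machine-to-machine transfer the abstract refers to.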