In Information Retrieval (IR), the efficient indexing of terabyte-scale and larger corpora remains a difficult problem. MapReduce has been proposed as a framework for distributing data-intensive operations across multiple processing machines. In this work, we provide a detailed analysis of four MapReduce indexing strategies of varying complexity. We evaluate these strategies by implementing them in an existing IR framework and performing experiments using the Hadoop MapReduce implementation, in combination with several large standard TREC test corpora. In particular, we examine the efficiency of the indexing strategies and, for the most efficient strategy, how it scales with respect to corpus size and processing power. Our results attest both to the importance of minimising data transfer between machines for I/O-intensive tasks like indexing, and to the suitability of the per-posting list MapReduce indexing strategy, in particular for indexing at terabyte scale. Hence, we conclude that MapReduce is a suitable framework for the deployment of large-scale indexing.
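To make the per-posting list idea concrete, the following is a minimal in-memory sketch in Python, not the implementation evaluated in this work. It assumes that each map task builds a partial posting list per term for its own split of the corpus and emits one (term, partial posting list) pair per term, and that the reduce step merges the partial lists for each term. The function names (`map_split`, `shuffle`, `reduce_term`, `build_index`), the whitespace tokeniser, and the toy documents are all hypothetical; no Hadoop API is used.

```python
from collections import defaultdict


def map_split(docs):
    """Map task (assumed behaviour): index one split of the corpus in memory.

    docs is an iterable of (doc_id, text) pairs. One (term, postings) pair is
    emitted per distinct term, where postings is a sorted list of
    (doc_id, term_frequency) tuples.
    """
    partial = defaultdict(dict)
    for doc_id, text in docs:
        for token in text.lower().split():  # naive tokeniser, for illustration
            partial[token][doc_id] = partial[token].get(doc_id, 0) + 1
    for term, postings in partial.items():
        yield term, sorted(postings.items())


def shuffle(emitted):
    """Group map output by key, standing in for the framework's shuffle phase."""
    grouped = defaultdict(list)
    for term, postings in emitted:
        grouped[term].append(postings)
    return grouped


def reduce_term(term, partial_lists):
    """Reduce task: merge the partial posting lists of a single term."""
    merged = [posting for plist in partial_lists for posting in plist]
    return term, sorted(merged)


def build_index(splits):
    """Run map, shuffle and reduce sequentially over all splits."""
    emitted = [pair for split in splits for pair in map_split(split)]
    return dict(reduce_term(t, pls) for t, pls in shuffle(emitted).items())


# Two hypothetical splits, as if handled by two map tasks.
splits = [
    [(1, "map reduce map"), (2, "reduce index")],
    [(3, "index map")],
]
index = build_index(splits)
```

In this toy run the map phase emits five (term, posting list) pairs for seven tokens, which illustrates why grouping postings before emission can reduce the data moved between machines; the size of that saving on real corpora is not something this sketch measures.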