Indexing is a core Information Retrieval (IR) operation that must be parallelised to support large-scale document corpora. We propose a novel adaptation of the state-of-the-art single-pass indexing algorithm, expressed in terms of the MapReduce programming model. We then evaluate this adaptation using the Hadoop MapReduce implementation. In particular, we explore the scale of improvements that can be achieved when using, firstly, more processing hardware and, secondly, larger corpora. Our results show that indexing speed increases close to linearly when scaling either the corpus size or the number of processing machines. This suggests that the proposed indexing implementation is viable for upcoming large-scale corpora.
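
To make the general idea of MapReduce-style index construction concrete, the following is a minimal in-memory sketch in Python. It is not the adaptation proposed here and does not use Hadoop: it is a generic illustration in which a map step emits per-document term frequencies and a reduce step merges them into posting lists. The function names (`map_document`, `shuffle`, `reduce_postings`, `build_index`), the whitespace tokenisation, and the toy corpus are all assumptions made for illustration. The memory-bounded flushing that characterises single-pass indexing is deliberately left out.

```python
from collections import Counter, defaultdict


def map_document(doc_id, text):
    """Map step (hypothetical): emit one (term, (doc_id, tf)) pair per distinct term."""
    counts = Counter(text.lower().split())  # naive whitespace tokenisation (assumption)
    for term, tf in counts.items():
        yield term, (doc_id, tf)


def shuffle(pairs):
    """Group emitted postings by term, standing in for the framework's shuffle phase."""
    grouped = defaultdict(list)
    for term, posting in pairs:
        grouped[term].append(posting)
    return grouped


def reduce_postings(term, postings):
    """Reduce step (hypothetical): merge one term's postings, sorted by document id."""
    return term, sorted(postings)


def build_index(corpus):
    """Run map, shuffle, and reduce sequentially over a {doc_id: text} corpus."""
    pairs = (
        pair
        for doc_id, text in corpus.items()
        for pair in map_document(doc_id, text)
    )
    return dict(reduce_postings(term, plist) for term, plist in shuffle(pairs).items())


# Toy corpus, invented for this example.
corpus = {
    1: "hadoop indexing with mapreduce",
    2: "single pass indexing",
    3: "mapreduce mapreduce",
}
index = build_index(corpus)
```

In this sketch each posting is a `(doc_id, term_frequency)` tuple, so `index["mapreduce"]` holds one entry for document 1 and one for document 3. In a real cluster the map calls would run in parallel over corpus splits and the shuffle would be handled by the framework rather than by a local dictionary.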