Parallel Generation of Inverted Files for Distributed Text Collections

Authors:
Berthier A. Ribeiro-Neto; Joao Paulo Kitajima; Gonzalo Navarro; Cláudio R. G. Sant'Ana; Nivio Ziviani
Venue:
SCCC '98 Proceedings of the XVIII International Conference of the Chilean Computer Science Society
Year:
1998

Citing 0
Cited 8

Efficient distributed algorithms to build inverted files. Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Searching large text collections. Handbook of massive data sets
Optimizing result prefetching in web search engines with segmented indices. ACM Transactions on Internet Technology (TOIT)
Efficient in-memory extensible inverted file. Information Systems
Optimizing result prefetching in web search engines with segmented indices. VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Load and storage balanced posting file partitioning for parallel information retrieval. Journal of Systems and Software
Assigning documents to master sites in distributed search. Proceedings of the 20th ACM international conference on Information and knowledge management
A term-based inverted index partitioning model for efficient distributed query processing. ACM Transactions on the Web (TWEB)

Abstract

We present a scalable algorithm for the parallel computation of inverted files for large text collections. The algorithm targets a high-bandwidth network of workstations with a shared-nothing memory organization. The text collection is assumed to be evenly distributed among the disks of the workstations. Compression is used to save space in main memory (where the inverted lists are kept) and to save time when data have to be moved across the network. The average running cost of the algorithm is O(t/p), where t is the size of the whole text collection and p is the number of available processors. We implemented the algorithm and report experimental results. On a 100 Mbit/s switched Ethernet network of four 200 MHz Pentium Pro machines, each with 128 megabytes of RAM, we were able to invert 2 gigabytes of TREC documents in 15 minutes. We also propose an analytical model for the execution time of the algorithm.
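
The abstract does not spell out the steps of the algorithm, so the following is only a minimal sketch of one plausible shared-nothing scheme, not the authors' method. It assumes three phases: each processor inverts its local documents, sends each compressed inverted list to a processor chosen by term, and the receiving processor merges the lists. The processors are simulated sequentially in one process. All function names (`invert_local`, `owner`, `encode_gaps`, `decode_gaps`, `parallel_invert`, `lookup`), the term-to-processor rule, and the gap plus variable-byte compression are assumptions made for illustration.

```python
from collections import defaultdict


def invert_local(docs):
    """Local inversion: map each term to the ids of local documents containing it."""
    index = defaultdict(list)
    for doc_id, text in docs:
        for term in set(text.lower().split()):
            index[term].append(doc_id)
    return index


def owner(term, p):
    """Assumed partitioning rule: a deterministic byte-sum hash picks the processor."""
    return sum(term.encode("utf-8")) % p


def encode_gaps(doc_ids):
    """Compress a sorted list of non-negative doc ids as variable-byte coded gaps."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        gap = doc_id - prev
        prev = doc_id
        while gap >= 128:
            out.append(gap & 0x7F)   # low 7 bits, continuation byte
            gap >>= 7
        out.append(gap | 0x80)       # high bit marks the last byte of a gap
    return bytes(out)


def decode_gaps(data):
    """Inverse of encode_gaps: recover the sorted doc ids."""
    ids, cur, shift, prev = [], 0, 0, 0
    for byte in data:
        if byte & 0x80:
            cur |= (byte & 0x7F) << shift
            prev += cur
            ids.append(prev)
            cur, shift = 0, 0
        else:
            cur |= byte << shift
            shift += 7
    return ids


def parallel_invert(partitions):
    """Simulate p shared-nothing processors, one per document partition."""
    p = len(partitions)
    # Phase 1: every processor inverts only the documents on its own disk.
    local = [invert_local(docs) for docs in partitions]
    # Phase 2: compressed lists travel to the processor that owns each term.
    inbox = [defaultdict(list) for _ in range(p)]
    for index in local:
        for term, ids in index.items():
            inbox[owner(term, p)][term].append(encode_gaps(sorted(ids)))
    # Phase 3: each owner merges the lists it received and recompresses them.
    final = []
    for box in inbox:
        merged = {}
        for term, chunks in box.items():
            ids = sorted(i for chunk in chunks for i in decode_gaps(chunk))
            merged[term] = encode_gaps(ids)
        final.append(merged)
    return final


def lookup(final, term):
    """Fetch the doc ids for a term from the processor that owns it."""
    data = final[owner(term, len(final))].get(term)
    return decode_gaps(data) if data is not None else []
```

With two partitions holding documents 0 to 1 and 2 to 3, calling `lookup(parallel_invert(partitions), "inverted")` returns the ids of every document that contains the term, regardless of which partition originally stored it. In this sketch each processor handles roughly t/p of the text in the first phase, which is the intuition behind an O(t/p) average cost when the collection is evenly distributed.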