A parallel indexed algorithm for information retrieval

  • Authors:
  • C. Stanfill;R. Thau;D. Waltz

  • Affiliations:
  • Thinking Machines Corporation, 245 First Street, Cambridge MA;Thinking Machines Corporation, 245 First Street, Cambridge MA;Thinking Machines Corporation, 245 First Street, Cambridge MA

  • Venue:
  • SIGIR '89 Proceedings of the 12th annual international ACM SIGIR conference on Research and development in information retrieval
  • Year:
  • 1989

Quantified Score

Hi-index 0.00

Visualization

Abstract

In this paper we present a parallel document ranking algorithm suitable for use on databases of 1-1000 GB, resident on primary or secondary storage. The algorithm is based on inverted indexes, and has two advantages over a previously published parallel algorithm for retrieval based on signature files. First, it permits the employment of ranking strategies which cannot be easily implemented using signature files, specifically methods which depend on document-term weighting. Second, it permits the interactive searching of databases resident on secondary storage. The algorithm is evaluated via a mixture of analytic and simulation techniques, with a particular focus on how cost-effectiveness and efficiency change as the size of the database, number of processors, and cost of memory are altered. In particular, we find that if the ratio of the number of processors and/or disks to the size of the database is held constant, then the cost-effectiveness of the resulting system remains constant. Furthermore, for a given size of database, there is a number of processors which optimizes cost-effectiveness. Estimated response times are also presented. Using these methods, it appears that cost-effective interactive access to databases in the 100-1000 GB range can be achieved using current technology.