PowerDB-IR: information retrieval on top of a database cluster

  • Authors:
  • Torsten Grabs;Klemens Böhm;Hans-Jörg Schek

  • Affiliations:
  • ETH Zurich, Zurich, Switzerland;ETH Zurich, Zurich, Switzerland;ETH Zurich, Zurich, Switzerland

  • Venue:
  • Proceedings of the tenth international conference on Information and knowledge management
  • Year:
  • 2001

Quantified Score

Hi-index 0.00

Visualization

Abstract

Our current concern is a scalable infrastructure for information retrieval (IR) with up-to-date retrieval results in the presence of frequent, continuous updates. Timely processing of updates is important with novel application domains, e.g., e-commerce. We want to use off-the-self hardware and software as much as possible. These issues are challenging, given the additional requirement that the resulting system must scale well. We have built PowerDB-IR, a system that has the characteristics sought. This paper describes its design, implementation, and evaluation. PowerDB-IR is a coordination layer for a database cluster. The rationale behind a database cluster is to 'scale-out', i.e., to add further cluster nodes, whenever necessary for better performance. We build on IR-to-database mappings and service decomposition to support high-level parallelism. We follow a three-tier architecture with the database cluster as the bottom layer for storage management. The middle tier provides IR-specific processing and update services. PowerDB-IR has the following features: It allows to insert and retrieve documents concurrently, and it ensures freshness with almost no overhead. Alternative physical data organization schemes provide adequate performance for different workloads. Query processing techniques for the different data organizations efficiently integrate the ranked retrieval results from the cluster nodes. We have run extensive experiments with our prototype using commercial database systems and middleware software products. The main result is that PowerDB-IR shows surprisingly ideal scalability and low response times.