PowerDB-IR – Scalable Information Retrieval and Storage with a Cluster of Databases

  • Authors:
  • Torsten Grabs;Klemens Böhm;Hans-Jörg Schek

  • Affiliations:
  • ETH Zürich, Database Research Group, Institute of Information Systems, Zürich, Switzerland;Otto-von-Guericke-Universität Magdeburg, Department of Computer Science, Magdeburg, Germany;ETH Zentrum IFW C49.2, Institut für Informationssysteme, Haldeneggsteig 4, 8092 Zürich, Switzerland

  • Venue:
  • Knowledge and Information Systems
  • Year:
  • 2004

Quantified Score

Hi-index 0.00

Visualization

Abstract

Our objective is a scalable infrastructure for information retrieval (IR) with up-to-date retrieval results in the presence of updates. Timely processing of updates is important with novel application domains such as e-commerce. These issues are challenging, given the additional requirement that the system must scale well. We have built PowerDB-IR, a system that has the characteristics sought. This article describes its design, implementation, and evaluation. We follow a three-tier architecture with a database cluster as the bottom layer for storage management. The rationale for a database cluster is to ‘scale out’, i.e., to add further cluster nodes, whenever necessary for better performance. The middle tier provides IR-specific retrieval and update services. We deploy state-of-the-art middleware software to coordinate the cluster and to invoke IR-specific components. PowerDB-IR extends the middleware layer with service decomposition and parallelisation. PowerDB-IR has the following features: It supports state-of-the-art retrieval models such as vector space retrieval. It allows documents to be inserted and retrieved concurrently and ensures up-to-date retrieval results with almost no overhead. PowerDB-IR ensures the correctness of global concurrency and recovery. Alternative physical data organisation schemes and respective query processing techniques provide adequate performance for different workloads and database sizes. Scaling out the database cluster yields higher throughput and lower response times. We have run extensive experiments with PowerDB-IR using several commercial database systems as well as different middleware products. Further experiments have quantified the effect of transactional guarantees on performance. The main result is that PowerDB-IR shows surprisingly good scalability and low response times.