PowerDB-IR: information retrieval on top of a database cluster

Authors:
Torsten Grabs; Klemens Böhm; Hans-Jörg Schek
Affiliations:
ETH Zurich, Zurich, Switzerland; ETH Zurich, Zurich, Switzerland; ETH Zurich, Zurich, Switzerland
Venue:
Proceedings of the tenth international conference on Information and knowledge management
Year:
2001

Abstract

Our current concern is a scalable infrastructure for information retrieval (IR) with up-to-date retrieval results in the presence of frequent, continuous updates. Timely processing of updates is important in novel application domains, e.g., e-commerce. We want to use off-the-shelf hardware and software as much as possible. These issues are challenging, given the additional requirement that the resulting system must scale well. We have built PowerDB-IR, a system that has the characteristics sought. This paper describes its design, implementation, and evaluation. PowerDB-IR is a coordination layer for a database cluster. The rationale behind a database cluster is to 'scale out', i.e., to add further cluster nodes whenever necessary for better performance. We build on IR-to-database mappings and service decomposition to support high-level parallelism. We follow a three-tier architecture with the database cluster as the bottom layer for storage management. The middle tier provides IR-specific processing and update services. PowerDB-IR has the following features: It allows documents to be inserted and retrieved concurrently, and it ensures freshness with almost no overhead. Alternative physical data organization schemes provide adequate performance for different workloads. Query processing techniques for the different data organizations efficiently integrate the ranked retrieval results from the cluster nodes. We have run extensive experiments with our prototype using commercial database systems and middleware software products. The main result is that PowerDB-IR shows surprisingly ideal scalability and low response times.
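
The abstract does not say how the ranked results from the cluster nodes are integrated. As a purely illustrative sketch, and not the PowerDB-IR implementation, one common approach is a k-way merge at the coordination layer. The example assumes that each node holds a disjoint subset of the documents, that each node returns (document id, score) pairs already sorted by descending score, and that scores are comparable across nodes. The function name `merge_ranked` and the sample data are hypothetical.

```python
import heapq
from itertools import islice


def merge_ranked(node_results, k):
    """Merge per-node ranked lists into one global top-k list.

    node_results: iterable of lists of (doc_id, score) pairs, each list
    already sorted by descending score (an assumption of this sketch).
    """
    # heapq.merge performs a lazy k-way merge; reverse=True tells it that
    # the inputs are sorted from largest to smallest key.
    merged = heapq.merge(*node_results, key=lambda hit: hit[1], reverse=True)
    # Only the first k hits of the merged stream are consumed.
    return list(islice(merged, k))


# Hypothetical partial results from three cluster nodes.
node_a = [("d3", 0.92), ("d7", 0.41)]
node_b = [("d12", 0.88), ("d15", 0.10)]
node_c = [("d20", 0.57)]

top3 = merge_ranked([node_a, node_b, node_c], k=3)
print(top3)
```

Because the merge is lazy, the coordinator reads only as many hits from each node's list as it needs to fill the top k. If the nodes instead partitioned the index by term, a document's partial scores from several nodes would have to be summed before ranking, which this sketch does not cover.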