Reducing storage costs for federated search of text databases

  • Authors: Jie Lu; Jamie Callan
  • Affiliations: Carnegie Mellon University, Pittsburgh, PA (both authors)
  • Venue: dg.o '03 Proceedings of the 2003 annual national conference on Digital government research
  • Year: 2003

Abstract

In environments containing many text search engines, a federated search system provides people with a single point of access. When the search engines are managed by independent organizations, two key problems are discovering and representing the contents of each text database. Query-based sampling is a recent technique for discovering the contents of uncooperative databases in order to create database resource descriptions that support a variety of necessary capabilities. However, when the documents obtained by query-based sampling are very long, as is common in some government environments, disk storage costs can be surprisingly large. This paper investigates methods of pruning sampled documents to reduce storage costs. The experimental results demonstrate that disk storage costs can be reduced by 54-93% while causing only minor losses in federated search accuracy.
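
The abstract does not say which pruning methods the paper evaluates, so the following is only an illustrative sketch of the general idea: shorten each sampled document before storing it, then measure how much disk space is saved. The prefix-truncation rule and the names `prune_by_prefix` and `storage_reduction` are assumptions made for this example, not the authors' method.

```python
def prune_by_prefix(text: str, max_terms: int) -> str:
    """Keep only the first max_terms whitespace-delimited terms.

    Assumption: prefix truncation is one simple way to prune a document;
    it is not taken from the paper.
    """
    if max_terms < 0:
        raise ValueError("max_terms must be non-negative")
    terms = text.split()
    return " ".join(terms[:max_terms])


def storage_reduction(original: str, pruned: str) -> float:
    """Return the fraction of UTF-8 bytes saved by storing the pruned text."""
    original_size = len(original.encode("utf-8"))
    if original_size == 0:
        return 0.0  # nothing to save on an empty document
    pruned_size = len(pruned.encode("utf-8"))
    return 1.0 - pruned_size / original_size


if __name__ == "__main__":
    # Hypothetical long sampled document (3000 terms).
    sampled = "federal register notice " * 1000
    pruned = prune_by_prefix(sampled, 600)
    print(f"{storage_reduction(sampled, pruned):.0%} of storage saved")
```

In a real system the savings would have to be weighed against any loss in federated search accuracy, which is the trade-off the paper reports on.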