Query-driven document partitioning and collection selection

  • Authors:
  • Diego Puppin; Fabrizio Silvestri; Domenico Laforenza

  • Affiliations:
  • Istituto di Scienza e Tecnologie dell'Informazione (A. Faedo), Pisa, Italy (all authors)

  • Venue:
  • InfoScale '06 Proceedings of the 1st international conference on Scalable information systems
  • Year:
  • 2006

Abstract

We present a novel strategy to partition a document collection across several servers and to perform effective collection selection. The method is based on the analysis of query logs. We propose a novel document representation called the query-vector model. Each document is represented as a list recording the queries for which the document is a match, along with the rank at which it is returned. To both partition the collection and build the collection selection function, we co-cluster queries and documents. The document clusters are then assigned to the underlying IR servers, while the query clusters represent queries that return similar results and are used for collection selection. First, we show that this document partitioning strategy greatly boosts the performance of standard collection selection algorithms, including CORI, compared with a round-robin assignment. Second, we show that by performing collection selection through matching the query to the existing query clusters and then choosing only one server, we reach an average precision-at-5 of up to 1.74 and consistently improve CORI's precision by between 11% and 15%. As a side result, we show a way to identify rarely requested documents. Separating these documents from the rest of the collection allows the indexer to produce a more compact index containing only relevant documents that are likely to be requested in the future. In our tests, around 52% of the documents (3,128,366) are not returned among the top 100 results of any query.
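
The query-vector representation described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names `build_query_vectors` and `never_returned`, the input format (a dictionary mapping each logged query to its ranked result list), and the toy data are all assumptions made for illustration. The sketch only builds the per-document lists of (query, rank) pairs and flags documents that no logged query returns; the co-clustering step is not shown.

```python
from collections import defaultdict


def build_query_vectors(query_results, top_k=100):
    """Map each document id to a list of (query, rank) pairs.

    query_results: dict mapping a query string to its ranked list of
    document ids (hypothetical input format, best result first).
    Only the first top_k results of each query are recorded.
    """
    vectors = defaultdict(list)
    for query, ranked_docs in query_results.items():
        for rank, doc_id in enumerate(ranked_docs[:top_k], start=1):
            vectors[doc_id].append((query, rank))
    return dict(vectors)


def never_returned(all_doc_ids, vectors):
    """Return the documents that no logged query retrieved within top_k."""
    return sorted(set(all_doc_ids) - set(vectors))


# Toy query log (made-up data, not taken from the paper's experiments).
log = {
    "pisa tower": ["d1", "d3"],
    "italian research": ["d3", "d2"],
}
collection = ["d1", "d2", "d3", "d4"]

vectors = build_query_vectors(log)
silent = never_returned(collection, vectors)
```

In this toy example, `d3` is represented by two (query, rank) pairs because both queries return it, while `d4` ends up in `silent` because no query returns it. Documents of this second kind correspond to the rarely requested documents that the abstract suggests separating from the main index.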