TopX: efficient and versatile top-k query processing for semistructured data

  • Authors:
  • Martin Theobald;Holger Bast;Debapriyo Majumdar;Ralf Schenkel;Gerhard Weikum

  • Affiliations:
  • Max-Planck Institute for Informatics, Saarbruecken, Germany;Max-Planck Institute for Informatics, Saarbruecken, Germany;Max-Planck Institute for Informatics, Saarbruecken, Germany;Max-Planck Institute for Informatics, Saarbruecken, Germany;Max-Planck Institute for Informatics, Saarbruecken, Germany

  • Venue:
  • The VLDB Journal — The International Journal on Very Large Data Bases
  • Year:
  • 2008

Abstract

Recent IR extensions to XML query languages such as XPath 1.0 Full-Text or the NEXI query language of the INEX benchmark series reflect the emerging interest in IR-style ranked retrieval over semistructured data. TopX is a top-k retrieval engine for text and semistructured data. It terminates query execution as soon as it can safely determine the k top-ranked result elements according to a monotonic score aggregation function with respect to a multidimensional query. It efficiently supports vague search on both content- and structure-oriented query conditions for dynamic query relaxation with controllable influence on the result ranking. This paper makes four main contributions: (1) fully implemented models and algorithms for ranked XML retrieval with XPath Full-Text functionality, (2) efficient and effective top-k query processing for semistructured data, (3) support for integrating thesauri and ontologies with statistically quantified relationships among concepts, leveraged for word-sense disambiguation and query expansion, and (4) a comprehensive description of the TopX system, with performance experiments on large-scale corpora like TREC Terabyte and INEX Wikipedia.
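
The early-termination idea mentioned in the abstract can be illustrated with a minimal sketch of a generic threshold-style top-k algorithm. This is not the TopX implementation: it assumes plain summation as the monotonic aggregation function, non-negative scores, one score-sorted list per query dimension, and cheap random access to every list. The function name `threshold_top_k` and the sample data are hypothetical.

```python
import heapq


def threshold_top_k(index_lists, k):
    """Return (top-k results, scan depth) for score-sorted index lists.

    index_lists: one list per query dimension, each holding (doc_id, score)
    pairs sorted by descending, non-negative score.
    Assumption: scores are aggregated by summation (a monotonic function).
    """
    # Random-access tables, one per dimension (missing entries score 0).
    lookup = [dict(lst) for lst in index_lists]
    seen = set()
    top = []  # min-heap of (total_score, doc_id), at most k entries
    depth = 0
    max_depth = max((len(lst) for lst in index_lists), default=0)

    while depth < max_depth:
        last_scores = []
        for lst in index_lists:
            if depth >= len(lst):
                # Exhausted list: unseen documents contribute 0 here.
                last_scores.append(0.0)
                continue
            doc_id, score = lst[depth]
            last_scores.append(score)
            if doc_id in seen:
                continue
            seen.add(doc_id)
            # Fetch the document's score in every dimension and aggregate.
            total = sum(table.get(doc_id, 0.0) for table in lookup)
            if len(top) < k:
                heapq.heappush(top, (total, doc_id))
            elif total > top[0][0]:
                heapq.heapreplace(top, (total, doc_id))
        depth += 1
        # Upper bound on the total score of any document not yet seen.
        threshold = sum(last_scores)
        if len(top) == k and top[0][0] >= threshold:
            break  # no unseen document can enter the top k

    ranked = sorted(top, key=lambda entry: (-entry[0], entry[1]))
    return ranked, depth


if __name__ == "__main__":
    # Hypothetical per-term index lists for a two-term query.
    term_a = [("d1", 0.9), ("d2", 0.8), ("d3", 0.1), ("d4", 0.05)]
    term_b = [("d2", 0.7), ("d1", 0.5), ("d4", 0.2), ("d3", 0.1)]
    results, scanned = threshold_top_k([term_a, term_b], k=1)
    print(results[0][1], "found after scanning", scanned, "of 4 rows")
```

Because the aggregation is monotonic, the sum of the scores at the current scan position bounds the score of every document not yet encountered, so the scan can stop once the k-th best total reaches that bound. In the sample data this happens after two of four rows.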