Skip-and-prune: cosine-based top-k query processing for efficient context-sensitive document retrieval

  • Authors:
  • Jong Wook Kim;K. Selçuk Candan

  • Affiliations:
  • Arizona State University, Tempe, AZ, USA;Arizona State University, Tempe, AZ, USA

  • Venue:
  • Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
  • Year:
  • 2009

Quantified Score

Hi-index 0.00

Visualization

Abstract

Keyword search and ranked retrieval together emerged as popular data access paradigms for various kinds of data, from web pages to XML and relational databases. A user can submit keywords without knowing much (sometimes nothing) about the complex structure underlying a data collection, yet the system can identify, rank, and return a set of relevant matches by exploiting statistics about the distribution and structure of the data. Keyword-based data models are also suitable for capturing user's search context in terms of weights associated to the keywords in the query. Given a search context, the data in the database can also be re-interpreted for semantically correct retrieval. This option, however, is often ignored as the cost of re-assessing the content in the database naively tends to be prohibitive. In this paper, we first argue that top-k query processing can help tackle this challenge by re-assessing only the relevant parts of the database, efficiently. A road-block in this process, however, is that most efficient implementations of top-k query processing assume that the scoring function is monotonic, whereas the cosine-based scoring function needed for re-interpretation of content based on user context is not. In this paper, we develop an efficient top-k query processing algorithm, skip-and-prune (SnP), which is able to process top-k queries under cosine-based non-monotonic scoring functions. We compare the use of proposed algorithm against the alternative implementations of the context-aware retrieval, including naive top-k, accumulator-based inverted files, and full-scan. The experiment results show that while being fast, naive top-k is not an effective solution due to the non-monotonicity of underlying scoring function. The proposed technique, SnP, however, matches the precision of accumulator-based inverted files and full-scan, yet it is orders of magnitude faster than these.