High-performance processing of text queries with tunable pruned term and term pair indexes

Authors:
Andreas Broschart;Ralf Schenkel
Affiliations:
Universität des Saarlandes and Max-Planck-Institut für Informatik, Germany;Universität des Saarlandes and Max-Planck-Institut für Informatik, Germany
Venue:
ACM Transactions on Information Systems (TOIS)
Year:
2012

Citing 39
Cited 3

The anatomy of a large-scale hypertextual Web search engine

WWW7 Proceedings of the seventh international conference on World Wide Web 7
Effective document presentation with a locality-based similarity heuristic

Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Static index pruning for information retrieval systems

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Efficient phrase querying with an auxiliary index

SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Minimal probing: supporting expensive predicates for top-k queries

Proceedings of the 2002 ACM SIGMOD international conference on Management of data
Cumulated gain-based evaluation of IR techniques

ACM Transactions on Information Systems (TOIS)
Optimal aggregation algorithms for middleware

Journal of Computer and System Sciences - Special issu on PODS 2001
Fast phrase querying with combined indexes

ACM Transactions on Information Systems (TOIS)
Three-level caching for efficient query processing in large Web search engines

WWW '05 Proceedings of the 14th international conference on World Wide Web
A Markov random field model for term dependencies

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Indexing time vs. query time: trade-offs in dynamic information retrieval systems

Proceedings of the 14th ACM international conference on Information and knowledge management
Inverted files for text search engines

ACM Computing Surveys (CSUR)
Pruned query evaluation using pre-computed impacts

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Term proximity scoring for ad-hoc retrieval on very large text collections

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
IO-Top-k: index-access optimized top-k query processing

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
A document-centric approach to static index pruning in text retrieval systems

CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
Efficient document retrieval in main memory

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
An exploration of proximity measures in information retrieval

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Heavy-tailed distributions and multi-keyword queries

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Top-k query evaluation with probabilistic guarantees

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Effective top-k computation in retrieving structured documents with term-proximity support

Proceedings of the sixteenth ACM conference on Conference on information and knowledge management
MapReduce: simplified data processing on large clusters

Communications of the ACM - 50th anniversary issue: 1958 - 2008
A survey of top-k query processing techniques in relational database systems

ACM Computing Surveys (CSUR)
Introduction to Information Retrieval

Introduction to Information Retrieval
Can phrase indexing help to process non-phrase queries?

Proceedings of the 17th ACM conference on Information and knowledge management
Top-k aggregation using intersections of ranked inputs

Proceedings of the Second ACM International Conference on Web Search and Data Mining
Search Engines: Information Retrieval in Practice

Search Engines: Information Retrieval in Practice
Best-Effort Top-k Query Processing Under Budgetary Constraints

ICDE '09 Proceedings of the 2009 IEEE International Conference on Data Engineering
Compressing term positions in web indexes

Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
A proximity language model for information retrieval

Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Positional language models for information retrieval

Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
The Probabilistic Relevance Framework: BM25 and Beyond

Foundations and Trends in Information Retrieval
Term proximity scoring for keyword-based retrieval systems

ECIR'03 Proceedings of the 25th European conference on IR research
Efficient text proximity search

SPIRE'07 Proceedings of the 14th international conference on String processing and information retrieval
Viewing term proximity from a different perspective

ECIR'08 Proceedings of the IR research, 30th European conference on Advances in information retrieval
Efficient term proximity search with term-pair indexes

CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Faster top-k document retrieval using block-max indexes

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Boosting web retrieval through query operations

ECIR'05 Proceedings of the 27th European conference on Advances in Information Retrieval Research
Efficient phrase querying with common phrase index

ECIR'06 Proceedings of the 28th European conference on Advances in Information Retrieval

Exploring the magic of WAND

Proceedings of the 18th Australasian Document Computing Symposium
Indexing Word Sequences for Ranked Retrieval

ACM Transactions on Information Systems (TOIS)
Document vector representations for feature extraction in multi-stage document ranking

Information Retrieval

Quantified Score

Hi-index	0.00

Visualization

Abstract

Term proximity scoring is an established means in information retrieval for improving result quality of full-text queries. Integrating such proximity scores into efficient query processing, however, has not been equally well studied. Existing methods make use of precomputed lists of documents where tuples of terms, usually pairs, occur together, usually incurring a huge index size compared to term-only indexes. This article introduces a joint framework for trading off index size and result quality, and provides optimization techniques for tuning precomputed indexes towards either maximal result quality or maximal query processing performance under controlled result quality, given an upper bound for the index size. The framework allows to selectively materialize lists for pairs based on a query log to further reduce index size. Extensive experiments with two large text collections demonstrate runtime improvements of more than one order of magnitude over existing text-based processing techniques with reasonable index sizes.